# troubleshooting
**or_barda**
Hi, I am experiencing that Meltano gets stuck in the RUNNING state when I run it with the CLI as an Airflow task and the task is interrupted for some reason. In that case I have to manually add `--force` to override the state, which I prefer not to do. I am using v1.77.0. I also saw a discussion on this in the following thread.
I am getting the following error:
```text
Another 'elt_my_meltano' pipeline is already running which started at 2022-04-05 06:23:52.597229. To ignore this check use the '--force' option.
```
**nick_hamlin**
@or_barda, as far as I know, this is still an outstanding issue. It doesn’t come up too often for us, but when it does, I’ve been able to resolve it using the workaround I described in the original thread: connect to the underlying Postgres database and manually update the job record as if it had completed normally. This definitely isn’t ideal, but I can confirm that it at least unblocks subsequent runs without needing to use `--force`.
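For reference, a minimal sketch of that manual cleanup, assuming Meltano’s default systemdb schema (a `job` table with `job_id`, `state`, and `ended_at` columns) and a placeholder connection URI and job name; verify the table, column, and state names against your own systemdb before running anything like this:

```bash
# Hypothetical cleanup of a stuck job record, assuming the default
# Meltano systemdb schema. $MELTANO_DATABASE_URI and the job_id
# 'elt_my_meltano' are placeholders; state values may differ by version,
# so inspect existing rows first.
psql "$MELTANO_DATABASE_URI" -c "
  UPDATE job
     SET state = 'SUCCESS', ended_at = NOW()
   WHERE job_id = 'elt_my_meltano'
     AND state = 'RUNNING';"
```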
**or_barda**
@nick_hamlin thank you for the update. @douwe_maan, is there any other resolution for this scenario? Is it something you are working on?
**douwe_maan**
@aaronsteers Can you please have a look here?
**aaronsteers**
@or_barda - Just catching up here. A couple of quick questions:
1. Is there a reason you do not want to use `--force`?
2. Is there a specific way the jobs are aborting that is causing the job record to be left in an orphaned/abandoned state?
**nick_hamlin**
@aaronsteers, I can’t speak for @or_barda, but I can answer those two questions from our perspective:
1. My understanding is that using `--force` for regularly scheduled jobs is an antipattern, since it has the potential for jobs to “step on each other’s toes” in unexpected ways. It makes sense to me why that would be something to avoid, but please correct me if that’s not the case.
2. Yes, as far as I can tell, this happens when a job runs “directly” via Airflow (as opposed to the Meltano UI’s wrapping of Airflow) and something disrupts the job while it’s in progress. For us, this most recently happened when an issue with the AWS hardware running the underlying Meltano Postgres DB caused it to lose its connection to the running Airflow service (it’s been pretty infrequent - maybe once or twice since I filed the original issue).
**douwe_maan**
Sounds like the stale job detection I introduced in https://gitlab.com/meltano/meltano/-/merge_requests/2000 isn’t working properly, or isn’t running at all.
**or_barda**
Thanks @nick_hamlin, I totally agree with what you wrote. @aaronsteers, the `--force` flag is a backdoor for special cases like this, but those should only be rare. My concern is that when a task is stopped in Airflow, Meltano is not shutting down properly and I have to use this flag, even though Airflow does send a SIGTERM when shutting down the task.
**douwe_maan**
@or_avidov The SIGTERM should mark the job as failed immediately, but even when that doesn’t work, the stale detection should still pick it up within 5 minutes, because a running job updates a heartbeat in the DB every second or so.
We’d have to debug why that query against the system DB isn’t returning the job row with the old heartbeat.
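To see what that check is looking at, here is a rough inspection query, assuming the default systemdb schema with a `job` table and a `last_heartbeat_at` column (names may differ between Meltano versions, so verify against your own database):

```bash
# Hypothetical inspection query against the Meltano systemdb.
# Lists jobs still marked RUNNING whose heartbeat is older than 5 minutes,
# i.e. the rows that stale-job detection should be marking as failed.
# $MELTANO_DATABASE_URI is a placeholder connection string.
psql "$MELTANO_DATABASE_URI" -c "
  SELECT id, job_id, state, started_at, last_heartbeat_at
    FROM job
   WHERE state = 'RUNNING'
     AND last_heartbeat_at < NOW() - INTERVAL '5 minutes';"
```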
**nick_hamlin**
FWIW, I do know that the stale job detection is running in at least some capacity, since I’ve had other situations (not in prod) where I started a job locally, stopped it manually for some reason, immediately tried again and got the “you need to use `--force`” message, waited a few minutes, and had everything work fine.
**aaronsteers**
Thanks for this context. A few takeaways, I think: First, we should probably check the Airflow wrappers to see whether job canceling (SIGTERM) failing to shut Meltano down properly is an exception or the norm. Perhaps there are specific conditions that cause the job abort not to be logged? As noted:
> My concern is that when a task is stopped in Airflow, Meltano is not shutting down properly
Second, we can check whether stale detection is still working correctly to ignore/clear jobs with a heartbeat older than 5 minutes. (Might need to check timezone logic to make sure that's not a factor here.) And lastly, just to confirm: are we still aligned that within five minutes of a job being canceled, the `--force` flag is still okay to use in these cases? (The case being one where the person running the job is confident that their last run is no longer active, although the stale detection timeframe may not yet have been reached.) Does that sound right? I'll log (or dig up!) an issue on those first two points if so.
**douwe_maan**
> And lastly, just to confirm: are we still aligned that within five minutes of a job being canceled, the `--force` flag is still okay to use in these cases?
Agreed, it’s a valid workaround when stale detection hasn’t triggered yet.
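For concreteness, a hedged example of that workaround on Meltano 1.x, using placeholder extractor/loader names and the job id from the error above (adjust to your own pipeline):

```bash
# Hypothetical invocation; tap-foo / target-bar are placeholders.
# Only pass --force when you are confident the previous run is truly dead.
meltano elt tap-foo target-bar --job_id=elt_my_meltano --force
```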
**or_barda**
Can you please explain how the stale detection works? Let’s say I trigger a `meltano run elt` using a DockerOperator and the task is shut down by Airflow somewhere in the middle. What is the expected behavior?
**douwe_maan**
@or_avidov While a job is running, the heartbeat timestamp in the jobs table is updated every second or so. At the beginning of `meltano elt`, `meltano schedule --list` and a few others (possibly `meltano run`, @aaronsteers?), Meltano then marks all jobs with a heartbeat older than 5 minutes as failed. So it does depend on one of those other commands running semi-regularly.
It’s possible that the issue is that stale detection isn’t triggered by `meltano run`, but I haven’t checked.
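Based on that description, a rough recovery sequence without `--force` might look like the sketch below; the exact set of commands that run the stale-job sweep (and their spelling) can vary between Meltano versions, and the tap/target names are placeholders:

```bash
# Hypothetical recovery flow, assuming the behaviour described above.
# 1. Wait until the orphaned job's heartbeat is more than 5 minutes old.
sleep 300

# 2. Run a command that performs the stale-job sweep, e.g. listing schedules
#    (quoted above as `meltano schedule --list`; spelling may differ by version).
meltano schedule list

# 3. Re-run the pipeline normally, without --force.
meltano elt tap-foo target-bar --job_id=elt_my_meltano
```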
**or_barda**
Actually it is my mistake - we are running only `meltano elt`, not `meltano run elt`.
**aaronsteers**
@or_barda - No problem. I was assuming `meltano elt`, but thanks for confirming.
@douwe_maan - Good questions, and I will open an issue to follow up and get confirmation here on the stale detection.
**douwe_maan**
@or_barda OK, `meltano elt` definitely runs the stale job check, so that’s not the issue here, although it’s still good to ensure `meltano run` does it too.
**or_barda**
Do you know which version this was introduced in?
**douwe_maan**
My MR was merged over a year ago, so I hope you’re not that far behind the release schedule 🙂
**or_barda**
The oldest version I am using is v1.77.0.
**douwe_maan**
This went into 1.66.0.
**or_barda**
Got it. Thanks!