# troubleshooting
**or_barda**
Hi, I am experiencing that Meltano gets stuck in the RUNNING state when I run it with the CLI as an Airflow task and the task is interrupted for some reason. In that case I have to manually add `--force` to override the state, which I prefer not to do. I am using v1.77.0. I also saw a discussion on this in the following thread.
I am getting the following error:
```text
Another 'elt_my_meltano' pipeline is already running which started at 2022-04-05 06:23:52.597229. To ignore this check use the '--force' option.
```
**nick_hamlin**
@or_barda, as far as I know, this is still an outstanding issue. It doesn’t come up too often for us, but when it does, I’ve been able to resolve it using the workaround I described in the original thread: connect to the underlying Postgres database and manually update the job record as if it had completed normally. This definitely isn’t ideal, but I can confirm that it at least unblocks subsequent runs without needing to use `--force`.
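For reference, a minimal sketch of that manual cleanup, assuming Meltano’s default systemdb schema (a `job` table with `job_id`, `state`, and `ended_at` columns) and a placeholder connection URI and job name; verify the table, column, and state names against your own systemdb before running anything like this:

```bash
# Hypothetical cleanup of a stuck job record, assuming the default
# Meltano systemdb schema. $MELTANO_DATABASE_URI and the job_id
# 'elt_my_meltano' are placeholders; state values may differ by version,
# so inspect existing rows first.
psql "$MELTANO_DATABASE_URI" -c "
  UPDATE job
     SET state = 'SUCCESS', ended_at = NOW()
   WHERE job_id = 'elt_my_meltano'
     AND state = 'RUNNING';"
```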
**or_barda**
@nick_hamlin thank you for the update. @douwe_maan, is there any other resolution for this scenario? Is it something you are working on?
**douwe_maan**
@aaronsteers Can you please have a look here?
**aaronsteers**
@or_barda - Just catching up here. A couple of quick questions:
1. Is there a reason you do not want to use `--force`?
2. Is there a specific way the jobs are aborting that is causing the job record to be left in an orphaned/abandoned state?
**nick_hamlin**
@aaronsteers, I can’t speak for @or_barda, but I can answer those two questions from our perspective:
1. My understanding is that using `--force` for regularly scheduled jobs is an antipattern, since it has the potential for jobs to “step on each other’s toes” in unexpected ways. It makes sense to me why that would be something to avoid, but please correct me if that’s not the case.
2. Yes, as far as I can tell, this happens when a job runs “directly” via Airflow (as opposed to the Meltano UI’s wrapping of Airflow) and something disrupts the job while it’s in progress. For us, this most recently happened when an issue with the AWS hardware running the underlying Meltano Postgres DB caused it to lose its connection to the running Airflow service (it’s been pretty infrequent - maybe once or twice since I filed the original issue).
**douwe_maan**
Sounds like the stale job detection I introduced in https://gitlab.com/meltano/meltano/-/merge_requests/2000 isn’t working properly, or isn’t running at all.
**or_barda**
Thanks @nick_hamlin, I totally agree with what you wrote. @aaronsteers, the `--force` flag is a backdoor for special cases like this, but those should only be rare. My concern is that when a task is stopped in Airflow, Meltano is not shutting down properly and I have to use this flag, even though Airflow does send a SIGTERM when shutting down the task.
**douwe_maan**
@or_avidov The SIGTERM should mark the job as failed immediately, but even when that doesn’t work, the stale detection should still pick it up within 5 minutes, because a running job updates a heartbeat in the DB every second or so.
We’d have to debug why that query against the system DB isn’t returning the job row with the old heartbeat.
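To see what that check is looking at, here is a rough inspection query, assuming the default systemdb schema with a `job` table and a `last_heartbeat_at` column (names may differ between Meltano versions, so verify against your own database):

```bash
# Hypothetical inspection query against the Meltano systemdb.
# Lists jobs still marked RUNNING whose heartbeat is older than 5 minutes,
# i.e. the rows that stale-job detection should be marking as failed.
# $MELTANO_DATABASE_URI is a placeholder connection string.
psql "$MELTANO_DATABASE_URI" -c "
  SELECT id, job_id, state, started_at, last_heartbeat_at
    FROM job
   WHERE state = 'RUNNING'
     AND last_heartbeat_at < NOW() - INTERVAL '5 minutes';"
```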
**nick_hamlin**
FWIW, I do know that the stale job detection is running in at least some capacity, since I’ve had other situations (not in prod) where I started a job locally, stopped it manually for some reason, immediately tried again and got the “you need to use `--force`” message, waited a few minutes, and had everything work fine.
**aaronsteers**
Thanks for this context. A few takeaways, I think: First, we should probably check the Airflow wrappers to see whether job canceling (SIGTERM) failing to shut Meltano down properly is an exception or the norm. Perhaps there are specific conditions that cause the job abort not to be logged? As noted:
> My concern is that when a task is stopped in Airflow, Meltano is not shutting down properly
Second, we can check whether stale detection is still working correctly to ignore/clear jobs with a heartbeat older than 5 minutes. (Might need to check timezone logic to make sure that's not a factor here.) And lastly, just to confirm: are we still aligned that within five minutes of a job being canceled, the `--force` flag is still okay to use in these cases? (The case being one where the person running the job is confident that their last run is no longer active, although the stale detection timeframe may not yet have been reached.) Does that sound right? I'll log (or dig up!) an issue on those first two points if so.
**douwe_maan**
> And lastly, just to confirm: are we still aligned that within five minutes of a job being canceled, the `--force` flag is still okay to use in these cases?
Agreed, it’s a valid workaround when stale detection hasn’t triggered yet.
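For concreteness, a hedged example of that workaround on Meltano 1.x, using placeholder extractor/loader names and the job id from the error above (adjust to your own pipeline):

```bash
# Hypothetical invocation; tap-foo / target-bar are placeholders.
# Only pass --force when you are confident the previous run is truly dead.
meltano elt tap-foo target-bar --job_id=elt_my_meltano --force
```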
**or_barda**
Can you please explain how the stale detection works? Let’s say I trigger a `meltano run elt` using a DockerOperator and the task is shut down by Airflow somewhere in the middle. What is the expected behavior?
**douwe_maan**
@or_avidov While a job is running, the heartbeat timestamp in the jobs table is updated every second or so. At the beginning of `meltano elt`, `meltano schedule --list` and a few others (possibly `meltano run`, @aaronsteers?), Meltano then marks all jobs with a heartbeat older than 5 minutes as failed. So it does depend on one of those other commands running semi-regularly.
It’s possible that the issue is that stale detection isn’t triggered by `meltano run`, but I haven’t checked.
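Based on that description, a rough recovery sequence without `--force` might look like the sketch below; the exact set of commands that run the stale-job sweep (and their spelling) can vary between Meltano versions, and the tap/target names are placeholders:

```bash
# Hypothetical recovery flow, assuming the behaviour described above.
# 1. Wait until the orphaned job's heartbeat is more than 5 minutes old.
sleep 300

# 2. Run a command that performs the stale-job sweep, e.g. listing schedules
#    (quoted above as `meltano schedule --list`; spelling may differ by version).
meltano schedule list

# 3. Re-run the pipeline normally, without --force.
meltano elt tap-foo target-bar --job_id=elt_my_meltano
```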
**or_barda**
Actually it is my mistake - we are running only `meltano elt`, not `meltano run elt`.
**aaronsteers**
@or_barda - No problem. I was assuming `meltano elt`, but thanks for confirming.
@douwe_maan - Good questions, and I will open an issue to follow up and get confirmation here on the stale detection.
**douwe_maan**
@or_barda OK, `meltano elt` definitely runs the stale job check, so that’s not the issue here, although it’s still good to ensure `meltano run` does it too.
**or_barda**
Do you know which version this was introduced in?
**douwe_maan**
My MR was merged over a year ago, so I hope you’re not that far behind the release schedule 🙂
**or_barda**
The oldest version I am using is v1.77.0.
**douwe_maan**
This went into 1.66.0.
**or_barda**
Got it. Thanks!