# troubleshooting
g
I am having this issue where a certain pipeline shows as running, but has no logs and isn't doing anything. It shows as running in both the Meltano UI and the Airflow UI. I have tried deleting the DAG, the pipeline, and any associated logs; however, when I recreate the scheduled pipeline, both UIs just repopulate what was there before and continue to show it as running. I looked through the issue board and it doesn't seem like there is currently a way to cancel or end a pipeline's run.
n
I just ran into this and got around it by manually deleting the job record in the meltano DB. More context here: https://meltano.slack.com/archives/C01TCRBBJD7/p1620654893103500?thread_ts=1620221119.064500&cid=C01TCRBBJD7
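For anyone landing here later, the workaround above boils down to removing the stuck `RUNNING` rows from the `job` table in the system database. A minimal self-contained sketch (in-memory SQLite with a mocked-up row; against a real project you would connect to the actual system database, which by default is a SQLite file under the project's `.meltano/` directory):

```python
import sqlite3

# Mocked-up stand-in for the system database; the `job` table and its
# `state` column match what is shown later in this thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, job_id TEXT, state TEXT)")
conn.execute("INSERT INTO job (job_id, state) VALUES ('my-pipeline', 'RUNNING')")

# Delete the stuck 'RUNNING' rows so the UIs stop reporting a phantom run.
conn.execute("DELETE FROM job WHERE state = 'RUNNING'")
conn.commit()

remaining = conn.execute(
    "SELECT COUNT(*) FROM job WHERE state = 'RUNNING'"
).fetchone()[0]
print(remaining)  # 0
```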
d
Stale "running" jobs in Meltano (rows in the `job` table in the system database) should automatically be marked as "failed" after 5 minutes: https://gitlab.com/meltano/meltano/-/merge_requests/2000
If that's not happening, we may be looking at a bug! @gunnar Can you share the contents of the `job` table row for this stuck pipeline? That will help me figure out why it's not being detected as stale.
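The staleness rule described above can be sketched as a small predicate. The constants are assumptions, not Meltano's actual code: the 5-minute heartbeat window comes from the message above, and the ~24-hour window for heartbeat-less jobs is inferred from the repl output later in this thread.

```python
from datetime import datetime, timedelta

# Assumed constants (see lead-in): heartbeat goes stale after 5 minutes,
# heartbeat-less jobs after roughly 24 hours.
HEARTBEAT_VALID_MINUTES = 5
HEARTBEATLESS_JOB_VALID_HOURS = 24

def is_stale(started_at, last_heartbeat_at, now=None):
    """Return True if a RUNNING job should be marked as failed."""
    now = now or datetime.utcnow()
    if last_heartbeat_at is not None:
        return last_heartbeat_at < now - timedelta(minutes=HEARTBEAT_VALID_MINUTES)
    return started_at < now - timedelta(hours=HEARTBEATLESS_JOB_VALID_HOURS)

now = datetime(2021, 6, 18, 2, 10)
print(is_stale(datetime(2021, 6, 18, 1, 50), datetime(2021, 6, 18, 1, 58), now))  # True: heartbeat 12 min old
print(is_stale(datetime(2021, 6, 18, 2, 0), datetime(2021, 6, 18, 2, 8), now))    # False: heartbeat 2 min old
```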
g
It wasn't ending after 5 minutes (had been over a day), but @nick_hamlin’s solution seemed to fix the issue! Thank you! The only other thing I was curious about was if Meltano supports multiple pipelines / dags to be run at the same time. I assumed yes and that this would be more related to Airflow. However, I wanted to quickly ask just because I don't want to mess around with anything and break any integrations if it is not yet supported.
d
> It wasn't ending after 5 minutes (had been over a day), but @nick_hamlin’s solution seemed to fix the issue!

I would've loved to see the `job` record before you deleted it so that I could figure out why it wasn't getting marked stale after 5 min 😄
> The only other thing I was curious about was if Meltano supports multiple pipelines / dags to be run at the same time.

Yep, definitely, you can have multiple distinct pipelines running at the same time.
n
but they must have unique `job_id` values (this was part of my problem)
and if it’s helpful, I can confirm that the rows I deleted in my fix had been hanging around longer than 5m. if @gunnar isn’t able to, let me know and I bet I can reproduce it locally and get you the full entry
d
@nick_hamlin That'd be great, please do.
n
well, I may have spoken too soon - I’m following the same steps I was over the weekend, but (at least for now), nothing seems to be getting stuck anymore!
Not sure what changed (clearly something has), but I’ll save the entry for you if I’m able to get it to happen again
g
Ah sorry! I think I still have the printout in my console, I can check in a second. However, I think the issue was because of multiple changes I made to the project and re-creating pipelines with the same setup and naming convention.
Would running multiple pipelines best be done in the Meltano UI (/Meltano CLI) or the Airflow UI? I noticed that Airflow seems to have set up a queue for the DAGs, which is why I am asking. (Some DAGs show up as either queued or scheduled.) Far right (slightly brown) is scheduled; third from the right, in grey, is queued.
I tried triggering via the command line using `meltano invoke airflow dags trigger ___`; I will let you know if it works. So far in Airflow: externally triggered = true. However, the Meltano UI has not updated.
c
I'm running into the same issue using 1.76. Here's the list of running jobs from the `job` table:
```
meltano=> select id, run_id, state, started_at, ended_at, trigger, last_heartbeat_at from job where state ='RUNNING';
 id |                run_id                |  state  |         started_at         | ended_at | trigger |     last_heartbeat_at      
----+--------------------------------------+---------+----------------------------+----------+---------+----------------------------
 21 | 6476cd38-f490-4df5-b3e8-e45f1ba324da | RUNNING | 2021-06-18 01:50:24.108557 |          | ui      | 2021-06-18 01:58:49.481304
 20 | 017c90f8-696e-4b6c-8c75-063c0b1e34dc | RUNNING | 2021-06-18 01:50:23.99533  |          | ui      | 2021-06-18 01:58:50.118105
 22 | b3b2d854-acaa-40e5-9888-7c3115126321 | RUNNING | 2021-06-18 01:50:24.231653 |          | ui      | 2021-06-18 01:58:50.224987
```
d
@casey If you run `meltano schedule list`, do you see anything logged about stale jobs being marked as such?
c
Here's the output:
```
root@ledp-vm:/project# meltano schedule list
[@once] ip-hash: tap-postgres--telemetry → pipelinewise-target-bigquery--telemetry → transforms
[@once] kdp-el: tap-postgres--kdp → pipelinewise-target-bigquery--kdp → transforms
[@once] studio-el: tap-postgres--studio → pipelinewise-target-bigquery--studio → transforms
```
d
@casey OK, let’s debug this a little. Can you run `meltano repl` and then:
```python
from meltano.core.job.job_finder import JobFinder
JobFinder.all_stale()
```
c
Hmm, it can't find that module.
```
root@ledp-vm:/project# meltano repl
Python 3.7.10 (default, May 12 2021, 16:05:48) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

Booting import Meltano REPL


In [1]: from meltano.core.job.job_finder import JobFinder
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-d78d2d414e8f> in <module>
----> 1 from meltano.core.job.job_finder import JobFinder

ModuleNotFoundError: No module named 'meltano.core.job.job_finder'
```
From the way the archive is packaged, it looks like it might be `meltano.core.job.finder`
d
Ah yes, my bad
c
`all_stale()` takes a `session` parameter... what should I provide it with?
d
That’s what I get for suggesting some code to run without testing it 🙂
There’s a local `session` variable you can use already
c
hahaha... no worries
it returns a reference to a SQLAlchemy Query object... sorry, my SQLAlchemy knowledge is weak, so I'm not sure what to do with this
Here's the pretty-printed form:
```sql
SELECT job.id AS job_id_1, job.job_id AS job_job_id, job.run_id AS job_run_id, job.state AS job_state, job.started_at AS job_started_at, job.last_heartbeat_at AS job_last_heartbeat_at, job.ended_at AS job_ended_at, job.payload AS job_payload, job.payload_flags AS job_payload_flags, job."trigger" AS job_trigger 
FROM job 
WHERE job.state = ? AND (job.last_heartbeat_at IS NOT NULL AND job.last_heartbeat_at < ? OR job.last_heartbeat_at IS NULL AND job.started_at < ?)
```
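That generated WHERE clause can be exercised by hand against a mocked-up `job` table. A self-contained sketch (in-memory SQLite, columns trimmed to the ones the filter uses, cutoff windows assumed to be the 5-minute / 24-hour values discussed in this thread):

```python
import sqlite3
from datetime import datetime, timedelta

# Mocked-up `job` table with one stale RUNNING row (timestamps stored as
# ISO-format text, as SQLite does by default).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT,"
    " started_at TEXT, last_heartbeat_at TEXT)"
)
conn.execute(
    "INSERT INTO job (state, started_at, last_heartbeat_at)"
    " VALUES ('RUNNING', '2021-06-18 01:50:24', '2021-06-18 01:58:50')"
)

now = datetime(2021, 6, 21, 16, 55)
last_valid_heartbeat_at = str(now - timedelta(minutes=5))  # assumed heartbeat window
last_valid_started_at = str(now - timedelta(hours=24))     # assumed heartbeat-less window

# Same two-branch condition as the generated SQL, with bound parameters.
stale = conn.execute(
    """
    SELECT id FROM job
    WHERE state = ?
      AND (last_heartbeat_at IS NOT NULL AND last_heartbeat_at < ?
        OR last_heartbeat_at IS NULL AND started_at < ?)
    """,
    ("RUNNING", last_valid_heartbeat_at, last_valid_started_at),
).fetchall()
print(stale)  # [(1,)] — the mocked stale row matches
```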
d
Ok, that’s what I’d expect. Let’s see if that query returns any rows: `list(JobFinder.all_stale(session))`
c
Nope...
```
In [10]: list(JobFinder.all_stale(session))
Out[10]: []

In [11]:
```
d
All right, now that is odd!
@casey Can you run the query directly against the DB?
```sql
SELECT job.id AS job_id_1, job.job_id AS job_job_id, job.run_id AS job_run_id, job.state AS job_state, job.started_at AS job_started_at, job.last_heartbeat_at AS job_last_heartbeat_at, job.ended_at AS job_ended_at, job.payload AS job_payload, job.payload_flags AS job_payload_flags, job."trigger" AS job_trigger 
FROM job 
WHERE job.state = 'RUNNING' AND (job.last_heartbeat_at IS NOT NULL AND job.last_heartbeat_at < '2021-06-20' OR job.last_heartbeat_at IS NULL AND job.started_at < '2021-06-20')
```
c
for reasons I don't understand, Slack won't let me send the results in a message. Let me try and put them in a snippet.
Sorry about the formatting
d
Hmm, so now it can find the stale jobs, but it couldn’t from `JobFinder`
The only difference I see is that I hard-coded the dates instead of using Python to determine those
In `meltano repl`, can you run:
```python
from datetime import datetime, timedelta
now = datetime.utcnow()
last_valid_heartbeat_at = now - timedelta(minutes=HEARTBEAT_VALID_MINUTES)
last_valid_started_at = now - timedelta(hours=HEARTBEATLESS_JOB_VALID_HOURS)
```
c
do I need to import a module that declares/defines `HEARTBEAT_VALID_MINUTES`?
nm, I think it's in `job.py`
```
In [15]: print(last_valid_heartbeat_at)
2021-06-21 16:50:07.654660

In [16]: print(last_valid_started_at)
2021-06-20 16:55:07.654660
```
the heartbeat value differs substantially from what's in the DB
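For readers hitting the same "SQLAlchemy finds 0 rows, raw SQL finds 3" symptom: one classic (purely illustrative, not a confirmed diagnosis of this particular bug) cause is mixing naive and timezone-aware datetimes, or computing the cutoff with local-time `now()` instead of `utcnow()`:

```python
from datetime import datetime, timezone

naive = datetime.utcnow()           # naive datetime, implicitly UTC
aware = datetime.now(timezone.utc)  # timezone-aware datetime

# Ordering comparisons between naive and aware datetimes raise TypeError,
# and a cutoff computed in local time shifts the staleness window by the
# UTC offset — either way, rows can be silently excluded on the Python side.
try:
    naive < aware
    comparable = True
except TypeError:
    comparable = False

print(comparable)  # False
```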
d
If you enter those values into the query in https://meltano.slack.com/archives/C01TCRBBJD7/p1624292529338800?thread_ts=1620661818.107900&cid=C01TCRBBJD7, do you get 0 records or more?
c
This query:
```sql
SELECT job.id AS job_id_1, job.job_id AS job_job_id, job.run_id AS job_run_id, job.state AS job_state, job.started_at AS job_started_at, job.last_heartbeat_at AS job_last_heartbeat_at, job.ended_at AS job_ended_at, job.payload AS job_payload, job.payload_flags AS job_payload_flags, job.trigger AS job_trigger
FROM job
WHERE job.state = 'RUNNING'
  AND (job.last_heartbeat_at IS NOT NULL AND job.last_heartbeat_at < '2021-06-21 16:50:07.654660'
    OR job.last_heartbeat_at IS NULL AND job.started_at < '2021-06-20 16:55:07.654660')
```
returns 3 records
d
Ok. But when run from Python, SQLAlchemy found 0 😕
Can you please file an issue for this with everything we’ve found so far? This’ll require some deeper debugging
c
Sure
I hope I described/characterized things correctly
d
@casey Thanks a lot, I’ve called in @aaronsteers to help debug this. For debugging purposes, it would help if you left the DB in the current (broken) state, but to solve the issue you can manually change the state on these jobs from `RUNNING` to `FAIL`.
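That manual fix can be sketched as a single UPDATE. A self-contained example (in-memory SQLite standing in for the system database; the `SUCCESS` row is mocked up for contrast):

```python
import sqlite3

# Flip stuck jobs from RUNNING to FAIL instead of deleting them, so the
# job history is preserved for later debugging.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO job (state) VALUES (?)",
    [("RUNNING",), ("RUNNING",), ("SUCCESS",)],
)

conn.execute("UPDATE job SET state = 'FAIL' WHERE state = 'RUNNING'")
conn.commit()

states = [row[0] for row in conn.execute("SELECT state FROM job ORDER BY id")]
print(states)  # ['FAIL', 'FAIL', 'SUCCESS']
```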
c
I'm sorry man. I just deleted the jobs entirely. I did, however, run a SQL export of the database before doing so. I'll attach it to the ticket.
d
No worries, thanks!