# troubleshooting
victor_lindell
Hi (sorry for the repost, but got no answers last time). We are having some issues with a failing job. Every morning at 04:00 we run a
--full-refresh
and for about a week now it has been failing with the error below. The error takes about 8 minutes to occur. The next scheduled run (this time without the full refresh) runs for about 15 minutes and ends with the same error. After 3-5 retries it finally "catches up" and manages to run the whole job.
2023-01-19 05:09:06.508 CET
Run invocation could not be completed as block failed: Another 'prod:tap-postgres-to-target-bigquery' pipeline is already running which started at 2023-01-19 04:02:29.710681. To ignore this check use the '--force' option.
2023-01-19 05:09:06.597 CET
Client closed local connection on 127.0.0.1:5432
2023-01-19 05:09:06.903 CET
Another 'prod:tap-postgres-to-target-bigquery' pipeline is already running which started at 2023-01-19 04:02:29.710681. To ignore this check use the '--force' option.
2023-01-19 05:09:06.903 CET
Block run completed.
2023-01-19 05:09:07.777 CET
Received TERM signal. Waiting up to 0s before terminating.
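(For reference, the 04:00 invocation presumably looks roughly like the sketch below; the exact command isn't shown in the thread, and the plugin names and environment are inferred from the state ID `prod:tap-postgres-to-target-bigquery` in the error.)

```sh
# Hypothetical sketch of the nightly job -- the real command isn't shown above.
# Plugin names and environment are inferred from the state ID
# 'prod:tap-postgres-to-target-bigquery' in the error message.
meltano --environment=prod run --full-refresh tap-postgres target-bigquery

# The '--force' option mentioned in the error would skip the
# "another pipeline is already running" check entirely.
```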
So today I went digging in the database and found the "runs" log, excellent. "Fun fact": there is no job starting between the first 4 AM job (sorry for the timezone difference) and the point where it dies.
state     started_at                    ended_at                      last_heartbeat_at
SUCCESS,  2023-01-19 12:01:08.914036,   2023-01-19 12:02:13.786125,   2023-01-19 12:02:12.945424
SUCCESS,  2023-01-19 11:45:09.444758,   2023-01-19 11:46:08.446500,   2023-01-19 11:46:08.381997
SUCCESS,  2023-01-19 11:31:07.112609,   2023-01-19 11:32:07.804656,   2023-01-19 11:32:06.933957
SUCCESS,  2023-01-19 11:28:38.761015,   2023-01-19 11:29:40.734920,   2023-01-19 11:29:40.691249
SUCCESS,  2023-01-19 10:16:09.787328,   2023-01-19 11:27:04.277290,   2023-01-19 11:27:04.179192
FAIL,     2023-01-19 08:30:38.460719,   2023-01-19 10:16:09.712441,   2023-01-19 09:58:09.176293
FAIL,     2023-01-19 08:02:44.723517,   2023-01-19 08:30:38.385950,   2023-01-19 08:17:06.586861
FAIL,     2023-01-19 04:02:29.710681,   2023-01-19 08:02:44.642756,   2023-01-19 04:08:56.203939
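(In case it helps anyone else digging: a hedged sketch of how that view can be pulled out of Meltano's system database. The table and column names here — `runs`, `job_name`, `state` — assume a Meltano 2.x systemdb on a Postgres backend and may differ by version.)

```sh
# Hypothetical query against Meltano's system database (Postgres backend assumed).
# Table/column names (runs, job_name, state, last_heartbeat_at) follow the
# Meltano 2.x schema and may differ in other versions.
psql "$MELTANO_DATABASE_URI" -c "
  SELECT state, started_at, ended_at, last_heartbeat_at
  FROM runs
  WHERE job_name = 'prod:tap-postgres-to-target-bigquery'
  ORDER BY started_at DESC
  LIMIT 10;"
```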
Any suggestions on how to find where the mysterious state comes from?
pat_nadolny
Hey @victor_lindell, sorry you didn't get a response last time - I'm not sure I fully understand the situation, can you tell me more? It sounds like at 4am you're trying to do a full sync of your source db and it fails to finish. Then the next incremental run also fails. Then the one after that succeeds. And you're guessing that each of those failed runs makes some progress, so it ultimately catches up and starts succeeding again? Do I have that right?
How are you running it? Do you have any time limit on how long it can run for?
victor_lindell
Thanks for your reply @pat_nadolny. Yes, that was my theory at first, but I did find one correlation. We run it in GCP on a Kubernetes cluster via CronJobs, and the cluster is in Autopilot mode. So when another job finishes, the cluster autoscales down and "moves" the job to another machine. Since Meltano doesn't know it has been restarted on another node, it fails. That's the theory at least; trying to verify it now.
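(A few hedged checks that might help verify that theory; the namespace and CronJob names below are placeholders, not taken from the thread.)

```sh
# Hypothetical checks for the autoscaling/rescheduling theory.
# 'pipelines' and 'meltano-nightly' are placeholder namespace/CronJob names.

# See which node each attempt landed on and whether pods were recreated:
kubectl get pods -n pipelines -l job-name -o wide

# Look for scale-down / eviction / preemption events around 04:00-08:00:
kubectl get events -n pipelines --sort-by=.lastTimestamp | grep -iE 'scale|evict|preempt'

# Check whether the CronJob/Job spec allows retries and concurrent runs
# (concurrencyPolicy, backoffLimit, restartPolicy):
kubectl describe cronjob meltano-nightly -n pipelines
```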