# troubleshooting
n
Hi everyone, we recently ran into the bug described in this thread, which I was able to resolve with the same approach described there: reducing `batch_size_rows` for pipelinewise-target-redshift. Interestingly, since I made that change, we’ve started seeing situations where meltano jobs report that they’ve completed successfully but exit with return code 1, implying they’ve failed when I don’t think they actually have. Has anyone else run into this? Here’s the relevant tail of our airflow logs:
```
[2021-10-28 08:18:56,028] {bash_operator.py:157} INFO - meltano                      | elt       | Incremental state has been updated at 2021-10-28 08:18:56.027750.
[2021-10-28 08:18:58,706] {bash_operator.py:157} INFO - meltano                      | elt       | Extract & load complete!
[2021-10-28 08:18:58,707] {bash_operator.py:157} INFO - meltano                      | elt       | Transformation skipped.
[2021-10-28 08:18:58,727] {bash_operator.py:157} INFO - (<State.FAIL: (3, ('RUNNING',))>, <State.SUCCESS: (2, ())>)
[2021-10-28 08:19:00,227] {bash_operator.py:161} INFO - Command exited with return code 1
[2021-10-28 08:19:00,385] {taskinstance.py:1150} ERROR - Bash command failed
Traceback (most recent call last):
  File "/project/.meltano/orchestrators/airflow/venv/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/project/.meltano/orchestrators/airflow/venv/lib/python3.6/site-packages/airflow/operators/bash_operator.py", line 165, in execute
    raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
[2021-10-28 08:19:00,387] {taskinstance.py:1194} INFO - Marking task as FAILED. dag_id=meltano_secondary_full_table, task_id=extract_load, execution_date=20211027T070000, start_date=20211028T074156, end_date=20211028T081900
[2021-10-28 08:19:01,488] {local_task_job.py:102} INFO - Task exited with return code 1
```
It seems noteworthy that the state being logged here looks like a tuple of tuples? I don’t remember seeing that before.
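A quick note on that tuple-of-tuples look: it seems consistent with the repr of two `Enum` members whose values happen to be tuples, rather than a literal nested state tuple. A minimal sketch, assuming (as a guess, not meltano’s actual definition) that the `State` enum assigns each member a tuple value:

```python
from enum import Enum


class State(Enum):
    """Hypothetical sketch; the real meltano State enum may differ."""

    SUCCESS = (2, ())         # assumed meaning: (numeric code, related states)
    FAIL = (3, ("RUNNING",))  # assumed: failing out of a RUNNING job


print((State.FAIL, State.SUCCESS))
# (<State.FAIL: (3, ('RUNNING',))>, <State.SUCCESS: (2, ())>)
```

So the log line may just be a pair of enum members being printed, not a nested state structure.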
Also noteworthy: the corresponding job record in the meltano DB is marked as failed, with a payload starting with:
{"error": "No heartbeat recorded for 5 minutes. The process was likely killed unceremoniously.", "
a
cc @florian.hines - any ideas?
f
Offhand I don’t have any immediate ideas. @nick_hamlin any chance you can bump the log level to debug for a run and see if the meltano logs bubble up anything else suspicious?
n
I can try, but the weird thing is this only happens in prod, and even then it’s inconsistent
I also haven’t spotted a pattern that correlates with the runs where it happens, beyond that it tends to be the longer-running ones
f
I’m not super familiar with the Jobs part of the codebase yet, but it looks like it’s maybe marking the job as failed because it thinks it’s gone stale (based on https://gitlab.com/meltano/meltano/-/blob/master/src/meltano/core/job/job.py#L176)
I wonder if it’s actually completing and tripping a bug/race where the state never gets updated to finished (or something along those lines)
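To make that concrete, here is a minimal sketch of the kind of heartbeat-based staleness check the linked line appears to implement; the names (`HEARTBEAT_VALID_MINUTES`, `last_heartbeat_at`, `is_stale`) and the 5-minute threshold are illustrative guesses based on the error payload above, not meltano’s actual API:

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold matching the "No heartbeat recorded for 5 minutes" payload above.
HEARTBEAT_VALID_MINUTES = 5


class Job:
    """Hypothetical sketch of a job that records a heartbeat while it runs."""

    def __init__(self) -> None:
        self.state = "RUNNING"
        self.last_heartbeat_at = datetime.now(timezone.utc)

    def heartbeat(self) -> None:
        # Called periodically while the ELT process is alive.
        self.last_heartbeat_at = datetime.now(timezone.utc)

    def is_stale(self) -> bool:
        # If the job still looks RUNNING but the heartbeat is too old, assume the
        # process died without recording a final state and treat it as failed.
        cutoff = datetime.now(timezone.utc) - timedelta(minutes=HEARTBEAT_VALID_MINUTES)
        return self.state == "RUNNING" and self.last_heartbeat_at < cutoff


# The suspected race: the pipeline finishes, but if the transition to SUCCESS never
# lands (or lands after the staleness check), a watcher that only sees a stale
# RUNNING job would mark it FAIL even though the extract/load itself completed.
```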
Let me get an issue written up for this - I’m honestly not sure what might be causing it at a glance, so I’ll have to poke around a bit. I think you’re right though, they probably are actually completing successfully. Otherwise I don’t think you’d get the
Extract & load complete!
log entry.
n
Thanks @florian.hines - that’s how I interpreted that “extract and load complete” line as well. Please let me know if there’s anything specific I can dig up from my end to help, beyond trying to get it to happen again with debug logs on
f
👍 will do, and thanks for reporting!