# troubleshooting
n
Hi everyone, we recently ran into the bug described in this thread, which I was able to resolve with the same approach described there: reducing `batch_size_rows` for pipelinewise-target-redshift. Interestingly, since I made that change, we’ve started seeing situations where meltano jobs report that they’ve completed successfully but exit with return code 1, implying they’ve failed when I don’t think they actually have. Has anyone else run into this? Here’s the relevant tail of our airflow logs:
```
[2021-10-28 08:18:56,028] {bash_operator.py:157} INFO - meltano                      | elt       | Incremental state has been updated at 2021-10-28 08:18:56.027750.
[2021-10-28 08:18:58,706] {bash_operator.py:157} INFO - meltano                      | elt       | Extract & load complete!
[2021-10-28 08:18:58,707] {bash_operator.py:157} INFO - meltano                      | elt       | Transformation skipped.
[2021-10-28 08:18:58,727] {bash_operator.py:157} INFO - (<State.FAIL: (3, ('RUNNING',))>, <State.SUCCESS: (2, ())>)
[2021-10-28 08:19:00,227] {bash_operator.py:161} INFO - Command exited with return code 1
[2021-10-28 08:19:00,385] {taskinstance.py:1150} ERROR - Bash command failed
Traceback (most recent call last):
  File "/project/.meltano/orchestrators/airflow/venv/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/project/.meltano/orchestrators/airflow/venv/lib/python3.6/site-packages/airflow/operators/bash_operator.py", line 165, in execute
    raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
[2021-10-28 08:19:00,387] {taskinstance.py:1194} INFO - Marking task as FAILED. dag_id=meltano_secondary_full_table, task_id=extract_load, execution_date=20211027T070000, start_date=20211028T074156, end_date=20211028T081900
[2021-10-28 08:19:01,488] {local_task_job.py:102} INFO - Task exited with return code 1
```
It seems noteworthy that the state being logged here looks like a tuple of tuples? I don’t remember seeing that before.
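A quick note on that tuple-of-tuples look: it seems consistent with the repr of two `Enum` members whose values happen to be tuples, rather than a literal nested state tuple. A minimal sketch, assuming (as a guess, not meltano’s actual definition) that the `State` enum assigns each member a tuple value:

```python
from enum import Enum


class State(Enum):
    """Hypothetical sketch; the real meltano State enum may differ."""

    SUCCESS = (2, ())         # assumed meaning: (numeric code, related states)
    FAIL = (3, ("RUNNING",))  # assumed: failing out of a RUNNING job


print((State.FAIL, State.SUCCESS))
# (<State.FAIL: (3, ('RUNNING',))>, <State.SUCCESS: (2, ())>)
```

So the log line may just be a pair of enum members being printed, not a nested state structure.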
Also noteworthy: the corresponding job record in the meltano DB is marked as failed, with a payload starting with:
{"error": "No heartbeat recorded for 5 minutes. The process was likely killed unceremoniously.", "
a
cc @florian.hines - any ideas?
f
Offhand I don’t have any immediate ideas. @nick_hamlin any chance you can bump the log level to debug for a run and see if the meltano logs bubble up anything else suspicious?
n
I can try, but the weird thing is this only happens in prod, and even then it’s inconsistent
I also haven’t spotted a pattern that correlates with the runs where it happens, beyond that it tends to be the longer-running ones
f
I’m not super familiar with the Jobs part of the codebase yet, but it looks like it’s maybe marking the job as failed because it thinks it’s gone stale (based on https://gitlab.com/meltano/meltano/-/blob/master/src/meltano/core/job/job.py#L176)
I wonder if it’s actually completing and tripping a bug/race where the state never gets updated to finished (or something along those lines)
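To make that concrete, here is a minimal sketch of the kind of heartbeat-based staleness check the linked line appears to implement; the names (`HEARTBEAT_VALID_MINUTES`, `last_heartbeat_at`, `is_stale`) and the 5-minute threshold are illustrative guesses based on the error payload above, not meltano’s actual API:

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold matching the "No heartbeat recorded for 5 minutes" payload above.
HEARTBEAT_VALID_MINUTES = 5


class Job:
    """Hypothetical sketch of a job that records a heartbeat while it runs."""

    def __init__(self) -> None:
        self.state = "RUNNING"
        self.last_heartbeat_at = datetime.now(timezone.utc)

    def heartbeat(self) -> None:
        # Called periodically while the ELT process is alive.
        self.last_heartbeat_at = datetime.now(timezone.utc)

    def is_stale(self) -> bool:
        # If the job still looks RUNNING but the heartbeat is too old, assume the
        # process died without recording a final state and treat it as failed.
        cutoff = datetime.now(timezone.utc) - timedelta(minutes=HEARTBEAT_VALID_MINUTES)
        return self.state == "RUNNING" and self.last_heartbeat_at < cutoff


# The suspected race: the pipeline finishes, but if the transition to SUCCESS never
# lands (or lands after the staleness check), a watcher that only sees a stale
# RUNNING job would mark it FAIL even though the extract/load itself completed.
```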
Let me get an issue written up for this - I’m honestly not sure what might be causing it at a glance, so I’ll have to poke around a bit. I think you’re right though, they probably are actually completing successfully. Otherwise I don’t think you’d get the
Extract & load complete!
log entry.
n
Thanks @florian.hines - that’s how I interpreted that “extract and load complete” line as well. Please let me know if there’s anything specific I can dig up from my end to help, beyond trying to get it to happen again with debug logs on
f
👍 will do, and thanks for reporting!