# troubleshooting
**silverbullet1**
Hi folks, I am noticing that Meltano is updating my state.json to an empty JSON object whenever it encounters an error, and then the next run does a full load since it has forgotten the last bookmark. I am using S3 as the state backend. What could be wrong here?
**edgar_ramirez_mondragon**
Hi @silverbullet1! Can you share some details about your pipeline?
• Operating system
• Meltano version (`meltano --version`)
• Tap and target, if they're public (on the Hub or in a public repo)
• The command that failed
**silverbullet1**
@edgar_ramirez_mondragon I am using a dockerized build with python-3.11 as the base image. The Meltano version is 2.19.1. The target is target-postgres and the tap is tap-redshift (a custom SDK connector). The command that failed was a simple `meltano run`. The pipeline was working fine initially, but all of a sudden we got this error: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 122: unexpected end of data"... and then my state.json became empty. The next runs did a full load of the data and failed because of resource scarcity (I got a -9 exit code on the output). Is this an issue with the way the tap/target handles state, or general Meltano behaviour? This might be a critical bug for us 😕
Full error:
```
[2023-09-01, 09:31:37 UTC] {pod_launcher.py:159} INFO - 2023-09-01T09:31:37.878597Z [info] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 122: unexpected end of data cmd_type=elb consumer=False name=tap-redshift producer=True stdio=stderr string_id=tap-redshift
```
Regardless of any error, I would never want my state to be reset.
**edgar_ramirez_mondragon**
Thanks for the details. I don't think this is a widespread bug, or at least I can't replicate it with other SDK-based taps and target-jsonl. A failure in the tap doesn't produce an empty state at the end of the pipeline, so I think it's either a very specific error in the tap, or the tap emitting some weird bytes to stdout. Note that you can recover state from the system database, in the `runs` table. Which variant of target-postgres are you using?
**silverbullet1**
@edgar_ramirez_mondragon thanks for the info. I am using the MeltanoLabs SDK connector for target-postgres. Recovering via the system database might be tricky, as the DB is not stateful; every execution is fresh. I don't think it's an issue with the tap either, because I made very few changes beyond following the tutorial, and nothing related to state as such.
@edgar_ramirez_mondragon any more thoughts here?
Another weird thing: when I restored my state.json to the last working one, I was not able to reproduce this error; it just ran fine. My current setup is Meltano running in Airflow, so we have some retries at the job level as well. I can see that the state got emptied right after the first failure. Subsequent runs then read the empty state and do a full refresh, and they fail because we don't have that many resources.
```
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.405343Z [info] smart_open.s3.MultipartWriter('mybucket', 'meltano-state/dev:tap-redshift-to-target-postgres/lock'): uploading part_num: 1, 17 bytes (total 0.000GB)
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.552417Z [info] smart_open.s3.MultipartWriter('mybucket', 'meltano-state/dev:tap-redshift-to-target-postgres/state.json'): uploading part_num: 1, 50 bytes (total 0.000GB)
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.837343Z [info] Incremental state has been updated at 2023-09-09 08:03:01.837211.
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.852097Z [error] Extractor failed
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.852510Z [error] Block run completed. block_type=ExtractLoadBlocks err=RunnerError('Extractor failed') exit_codes={<PluginType.EXTRACTORS: 'extractors'>: -9} set_number=0 success=False
```
I can handle this with a hack, by backing up state.json before each run and uploading it back on failure, but I'm trying to understand the root cause here.
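Roughly what I have in mind (an untested sketch; the bucket and state key are assumptions taken from the log above, and this is not an official Meltano mechanism):
```python
# Rough sketch of the backup/restore workaround (assumed bucket/key from the logs).
import subprocess

import boto3

BUCKET = "mybucket"
STATE_KEY = "meltano-state/dev:tap-redshift-to-target-postgres/state.json"
BACKUP_KEY = STATE_KEY + ".bak"

s3 = boto3.client("s3")

# Back up the current bookmark before the run.
s3.copy_object(
    Bucket=BUCKET,
    Key=BACKUP_KEY,
    CopySource={"Bucket": BUCKET, "Key": STATE_KEY},
)

result = subprocess.run(["meltano", "run", "tap-redshift", "target-postgres"])

if result.returncode != 0:
    # Restore the last known-good state so the next attempt stays incremental.
    s3.copy_object(
        Bucket=BUCKET,
        Key=STATE_KEY,
        CopySource={"Bucket": BUCKET, "Key": BACKUP_KEY},
    )
```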
**edgar_ramirez_mondragon**
The way I'd try to reproduce this is by sending an invalid message after the last state is sent by the tap, e.g.
```
{"type": "SCHEMA", "schema": ...}
{"type": "RECORD", "record": ...}
{"type": "STATE", ...}
Not valid JSON
```
If that results in the target generating a bad bookmark, then it's definitely a bug in the target (and maybe the tap, if it's outputting a similar invalid line under some circumstances).
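If you want to test the loader side in isolation, one way (a sketch with assumed file and plugin names; `meltano invoke` just runs the plugin, and a target reads Singer messages from stdin) is to pipe a crafted message file straight into the target and see what bookmark it writes:
```python
# Feed hand-crafted Singer messages (including the bad trailing line) to the
# target and inspect the STATE it emits. File name and plugin name are assumed.
import subprocess

with open("messages.jsonl", "rb") as fake_tap_output:
    result = subprocess.run(
        ["meltano", "invoke", "target-postgres"],
        stdin=fake_tap_output,
        capture_output=True,
    )

# Targets emit their final STATE message(s) on stdout; check whether the
# bookmark survived the invalid line at the end of the input.
print(result.returncode)
print(result.stdout.decode("utf-8", errors="replace"))
```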
**silverbullet1**
How can I simulate a message? And why would the message get corrupted?
**edgar_ramirez_mondragon**
E.g. by calling `print(...)` in your tap.
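Something like this, as a rough illustration (the STATE payload is made up; the point is just to emit garbage after a valid bookmark and see what the target does with it):
```python
# Emit a valid STATE message first, then deliberately broken output, to see how
# the target handles it. The bookmark payload below is purely illustrative.
import sys

print('{"type": "STATE", "value": {"bookmarks": {"my_table": {"replication_key_value": "2023-09-01"}}}}')
print("Not valid JSON")  # a line the target cannot parse as a Singer message
sys.stdout.flush()
sys.stdout.buffer.write(b"\xf0\n")  # or a stray, truncated multi-byte UTF-8 sequence
sys.stdout.buffer.flush()
```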
**silverbullet1**
I think I found the issue. It has something to do with the initial connection establishment with Redshift. I noticed this behaviour locally as well, intermittently getting `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 122: unexpected end of data`. Just to clarify, I did not have any state locally. Here is the complete log:
Logs
**edgar_ramirez_mondragon**
OK, that I can replicate:
```
>>> str(b"\xf0", "utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: unexpected end of data
```
https://github.com/aws/amazon-redshift-python-driver/blob/a898554b4dde2cdc6a1c6214eb43b92e36540bcf/redshift_connector/utils/type_utils.py#L60-L61
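For context, 0xf0 is the lead byte of a four-byte UTF-8 sequence (most emoji, for example), so decoding any buffer that was cut off mid-character reproduces the same error:
```
>>> data = "🙂".encode("utf-8")
>>> data
b'\xf0\x9f\x99\x82'
>>> data[:1].decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: unexpected end of data
```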
**silverbullet1**
Dang! So what would be the easiest fix for this 🤔
Looks like an issue is already open on their repo. https://github.com/aws/amazon-redshift-python-driver/issues/157
They mentioned that the latest Redshift patch should fix it, but I don't think it does. The other alternative is to add retries, which can go in two places. The first is in the Meltano SDK functions or the tap code, but the problem is that the connection can fail anywhere: sometimes it happens when running `discovery`, other times while fetching a table from the connector, so we would have to patch it in a lot of places. The other place is at the job level, which we already have. The only thing that's not clear to me is why the state is being written as `{}` when this problem occurs. If that is fixed, then the job-level retry will solve the problem!
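To make that first option concrete, this is roughly what retrying in the tap code would look like (just a sketch; the decorator and where to apply it are my own invention, not an existing SDK hook):
```python
# Sketch of a retry helper the tap could wrap around its Redshift calls so an
# intermittent UnicodeDecodeError doesn't kill the whole run. Not an SDK feature.
import functools
import time


def retry_on_decode_error(retries=3, delay=5.0):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except UnicodeDecodeError:
                    if attempt == retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


# Usage (hypothetical): decorate whichever tap method actually talks to Redshift.
# @retry_on_decode_error(retries=3, delay=5.0)
# def get_records(self, context):
#     ...
```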
**edgar_ramirez_mondragon**
So it fails even on Redshift 1.0.54899+?
**silverbullet1**
Yup. People have just implemented retry logic so maybe they haven't noticed. But I got the error intermittently today too.
What do you suggest as the next course of action?
**edgar_ramirez_mondragon**
I personally need to give https://docs.sqlalchemy.org/en/20/faq/connections.html#how-do-i-retry-a-statement-execution-automatically a good read. Could be a good solution somewhere in there.
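In the meantime, a naive version might look like this (a sketch under assumptions: it simply disposes the pool and retries the whole statement on driver errors, and the connection URL is illustrative, not the tap's actual engine setup):
```python
# Naive statement-level retry around a SQLAlchemy engine. This is not the exact
# pattern from the FAQ above; it just retries the whole execution after
# disposing of possibly-broken pooled connections.
import time

from sqlalchemy import create_engine, text
from sqlalchemy.exc import DBAPIError

# Illustrative URL; the real tap builds its own engine/connection.
engine = create_engine("redshift+redshift_connector://user:password@host:5439/dev")


def execute_with_retry(stmt, retries=3, delay=5.0):
    for attempt in range(1, retries + 1):
        try:
            with engine.connect() as conn:
                return conn.execute(text(stmt)).fetchall()
        except (DBAPIError, UnicodeDecodeError):
            engine.dispose()  # drop pooled connections that may be in a bad state
            if attempt == retries:
                raise
            time.sleep(delay)


rows = execute_with_retry("SELECT 1")
```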