# troubleshooting
**silverbullet1**
Hi folks, I am noticing that Meltano is updating my state.json to an empty JSON object whenever it encounters an error, and then the next run does a full load since it has forgotten the last bookmark. I am using S3 as the state backend. What could be wrong here?
**edgar_ramirez_mondragon**
Hi @silverbullet1! Can you share some details about your pipeline?
• Operating system
• Meltano version (`meltano --version`)
• Tap and target, if they're public (on the Hub or in a public repo)
• The command that failed
**silverbullet1**
@edgar_ramirez_mondragon I am using a dockerized build with python-3.11 as the base image. The Meltano version is 2.19.1. The target is target-postgres and the tap is tap-redshift (a custom SDK connector). The command that failed was a simple `meltano run`. The pipeline was working fine initially, but all of a sudden we got this error: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 122: unexpected end of data"... and then my state.json became empty. The next runs did a full load of the data and failed because of resource scarcity (I got a -9 exit code on the output). Is this an issue with the way the tap/target handles state, or general Meltano behaviour? This might be a critical bug for us 😕
Full error:
```
[2023-09-01, 09:31:37 UTC] {pod_launcher.py:159} INFO - 2023-09-01T09:31:37.878597Z [info] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 122: unexpected end of data cmd_type=elb consumer=False name=tap-redshift producer=True stdio=stderr string_id=tap-redshift
```
Regardless of any error, I would never want my state to be reset.
**edgar_ramirez_mondragon**
Thanks for the details. I don't think this is a widespread bug, or at least I can't replicate it with other SDK-based taps and target-jsonl. A failure in the tap doesn't produce an empty state at the end of the pipeline, so I think it's either a very specific error in the tap, or the tap emitting some weird bytes to stdout. Note that you can recover state from the system database, in the `runs` table. Which variant of target-postgres are you using?
**silverbullet1**
@edgar_ramirez_mondragon thanks for the info. I am using the MeltanoLabs SDK connector for target-postgres. Recovering via the system database might be tricky, as the DB is not stateful; every execution is fresh. I don't think it's an issue with the tap either, because I made very few changes beyond following the tutorial, and nothing related to state as such.
@edgar_ramirez_mondragon any more thoughts here?
Another weird thing: when I restored my state.json to the last working one, I was not able to reproduce this error; it just ran fine. My current setup is Meltano running in Airflow, so we have some retries at the job level as well. I can see that the state got emptied right after the first failure. Subsequent runs then read the empty state and do a full refresh, and they fail because we don't have that many resources.
```
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.405343Z [info] smart_open.s3.MultipartWriter('mybucket', 'meltano-state/dev:tap-redshift-to-target-postgres/lock'): uploading part_num: 1, 17 bytes (total 0.000GB)
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.552417Z [info] smart_open.s3.MultipartWriter('mybucket', 'meltano-state/dev:tap-redshift-to-target-postgres/state.json'): uploading part_num: 1, 50 bytes (total 0.000GB)
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.837343Z [info] Incremental state has been updated at 2023-09-09 08:03:01.837211.
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.852097Z [error] Extractor failed
[2023-09-09, 08:03:01 UTC] {pod_launcher.py:159} INFO - 2023-09-09T08:03:01.852510Z [error] Block run completed. block_type=ExtractLoadBlocks err=RunnerError('Extractor failed') exit_codes={<PluginType.EXTRACTORS: 'extractors'>: -9} set_number=0 success=False
```
I can handle this with a hack, by backing up state.json before each run and uploading it back on failure, but I'm trying to understand the root cause here.
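Roughly what I have in mind (an untested sketch; the bucket and state key are assumptions taken from the log above, and this is not an official Meltano mechanism):
```python
# Rough sketch of the backup/restore workaround (assumed bucket/key from the logs).
import subprocess

import boto3

BUCKET = "mybucket"
STATE_KEY = "meltano-state/dev:tap-redshift-to-target-postgres/state.json"
BACKUP_KEY = STATE_KEY + ".bak"

s3 = boto3.client("s3")

# Back up the current bookmark before the run.
s3.copy_object(
    Bucket=BUCKET,
    Key=BACKUP_KEY,
    CopySource={"Bucket": BUCKET, "Key": STATE_KEY},
)

result = subprocess.run(["meltano", "run", "tap-redshift", "target-postgres"])

if result.returncode != 0:
    # Restore the last known-good state so the next attempt stays incremental.
    s3.copy_object(
        Bucket=BUCKET,
        Key=STATE_KEY,
        CopySource={"Bucket": BUCKET, "Key": BACKUP_KEY},
    )
```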
**edgar_ramirez_mondragon**
The way I'd try to reproduce this is by sending an invalid message after the last state is sent by the tap, e.g.
```
{"type": "SCHEMA", "schema": ...}
{"type": "RECORD", "record": ...}
{"type": "STATE", ...}
Not valid JSON
```
If that results in the target generating a bad bookmark, then it's definitely a bug in the target (and maybe the tap, if it's outputting a similar invalid line under some circumstances).
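If you want to test the loader side in isolation, one way (a sketch with assumed file and plugin names; `meltano invoke` just runs the plugin, and a target reads Singer messages from stdin) is to pipe a crafted message file straight into the target and see what bookmark it writes:
```python
# Feed hand-crafted Singer messages (including the bad trailing line) to the
# target and inspect the STATE it emits. File name and plugin name are assumed.
import subprocess

with open("messages.jsonl", "rb") as fake_tap_output:
    result = subprocess.run(
        ["meltano", "invoke", "target-postgres"],
        stdin=fake_tap_output,
        capture_output=True,
    )

# Targets emit their final STATE message(s) on stdout; check whether the
# bookmark survived the invalid line at the end of the input.
print(result.returncode)
print(result.stdout.decode("utf-8", errors="replace"))
```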
**silverbullet1**
How can I simulate a message? And why would the message get corrupted?
**edgar_ramirez_mondragon**
E.g. by calling `print(...)` in your tap.
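Something like this, as a rough illustration (the STATE payload is made up; the point is just to emit garbage after a valid bookmark and see what the target does with it):
```python
# Emit a valid STATE message first, then deliberately broken output, to see how
# the target handles it. The bookmark payload below is purely illustrative.
import sys

print('{"type": "STATE", "value": {"bookmarks": {"my_table": {"replication_key_value": "2023-09-01"}}}}')
print("Not valid JSON")  # a line the target cannot parse as a Singer message
sys.stdout.flush()
sys.stdout.buffer.write(b"\xf0\n")  # or a stray, truncated multi-byte UTF-8 sequence
sys.stdout.buffer.flush()
```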
**silverbullet1**
I think I found the issue. It has something to do with the initial connection establishment with Redshift. I noticed this behaviour locally as well, intermittently getting `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 122: unexpected end of data`. Just to clarify, I did not have any state locally. Here is the complete log:
Logs
**edgar_ramirez_mondragon**
OK, that I can replicate:
```
>>> str(b"\xf0", "utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: unexpected end of data
```
https://github.com/aws/amazon-redshift-python-driver/blob/a898554b4dde2cdc6a1c6214eb43b92e36540bcf/redshift_connector/utils/type_utils.py#L60-L61
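For context, 0xf0 is the lead byte of a four-byte UTF-8 sequence (most emoji, for example), so decoding any buffer that was cut off mid-character reproduces the same error:
```
>>> data = "🙂".encode("utf-8")
>>> data
b'\xf0\x9f\x99\x82'
>>> data[:1].decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: unexpected end of data
```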
**silverbullet1**
Dang! So what would be the easiest fix for this 🤔
Looks like an issue is already open on their repo. https://github.com/aws/amazon-redshift-python-driver/issues/157
They mentioned that the latest Redshift patch should fix it, but I don't think it does. The other alternative is to add retries, which can go in two places. The first is in the Meltano SDK functions or the tap code, but the problem is that the connection can fail anywhere: sometimes it happens when running `discovery`, other times while fetching a table from the connector, so we would have to patch it in a lot of places. The other place is at the job level, which we already have. The only thing that's not clear to me is why the state is being written as `{}` when this problem occurs. If that is fixed, then the job-level retry will solve the problem!
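To make that first option concrete, this is roughly what retrying in the tap code would look like (just a sketch; the decorator and where to apply it are my own invention, not an existing SDK hook):
```python
# Sketch of a retry helper the tap could wrap around its Redshift calls so an
# intermittent UnicodeDecodeError doesn't kill the whole run. Not an SDK feature.
import functools
import time


def retry_on_decode_error(retries=3, delay=5.0):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except UnicodeDecodeError:
                    if attempt == retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


# Usage (hypothetical): decorate whichever tap method actually talks to Redshift.
# @retry_on_decode_error(retries=3, delay=5.0)
# def get_records(self, context):
#     ...
```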
**edgar_ramirez_mondragon**
So it fails even on Redshift 1.0.54899+?
**silverbullet1**
Yup. People have just implemented retry logic so maybe they haven't noticed. But I got the error intermittently today too.
What do you suggest as the next course of action?
**edgar_ramirez_mondragon**
I personally need to give https://docs.sqlalchemy.org/en/20/faq/connections.html#how-do-i-retry-a-statement-execution-automatically a good read. Could be a good solution somewhere in there.
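In the meantime, a naive version might look like this (a sketch under assumptions: it simply disposes the pool and retries the whole statement on driver errors, and the connection URL is illustrative, not the tap's actual engine setup):
```python
# Naive statement-level retry around a SQLAlchemy engine. This is not the exact
# pattern from the FAQ above; it just retries the whole execution after
# disposing of possibly-broken pooled connections.
import time

from sqlalchemy import create_engine, text
from sqlalchemy.exc import DBAPIError

# Illustrative URL; the real tap builds its own engine/connection.
engine = create_engine("redshift+redshift_connector://user:password@host:5439/dev")


def execute_with_retry(stmt, retries=3, delay=5.0):
    for attempt in range(1, retries + 1):
        try:
            with engine.connect() as conn:
                return conn.execute(text(stmt)).fetchall()
        except (DBAPIError, UnicodeDecodeError):
            engine.dispose()  # drop pooled connections that may be in a bad state
            if attempt == retries:
                raise
            time.sleep(delay)


rows = execute_with_retry("SELECT 1")
```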