# troubleshooting
jobert:
Hi everyone, I wanted to drop a learning here that had us confused earlier this week: we use logical replication with wal2json and pipelinewise-tap-postgres to replicate an RDS instance into Snowflake, and we saw text columns disappearing in Snowflake as soon as a row was updated on the RDS instance. It turned out that wal2json doesn’t always include large(r) text blobs in the WAL message: unchanged TOASTed values are left out. When the wal2json message is processed by the singer / meltano pipeline, it sets those columns to NULL instead of leaving them untouched. For anyone who wants to read more about this: https://github.com/eulerto/wal2json/pull/99 and https://www.postgresql.org/docs/9.5/storage-toast.html.
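To make the failure mode concrete, here’s a simplified sketch (made-up table and column names, not the actual tap/target code) of what happens when an UPDATE doesn’t touch a large TOASTed column: wal2json omits it from the message, and a naive “fill in every known column” step turns the missing value into NULL.

```python
# Simplified sketch of the failure mode (hypothetical code, not pipelinewise's).
# wal2json describes an UPDATE with parallel columnnames/columnvalues lists; a
# large TOASTed column that didn't change is simply not listed.

TABLE_COLUMNS = ["id", "title", "body"]  # columns the tap knows for this table

wal2json_update = {
    "kind": "update",
    "columnnames": ["id", "title"],      # "body" is TOASTed and unchanged, so omitted
    "columnvalues": [42, "new title"],
}

def build_record_naive(change):
    """Fill every known column; absent ones default to None -- the bug we hit."""
    present = dict(zip(change["columnnames"], change["columnvalues"]))
    return {col: present.get(col) for col in TABLE_COLUMNS}

def build_record_sparse(change):
    """Only emit the columns wal2json actually sent, leaving the rest untouched."""
    return dict(zip(change["columnnames"], change["columnvalues"]))

print(build_record_naive(wal2json_update))
# {'id': 42, 'title': 'new title', 'body': None}  -> body gets NULLed in Snowflake
print(build_record_sparse(wal2json_update))
# {'id': 42, 'title': 'new title'}                -> body left as-is
```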
jozef_reginac:
Hey Jobert, thanks a lot for the message. I just debugged our issue using your post. Have you fixed this somehow? The only thing I can think of now is to switch to incremental loads, since the issue on wal2json hasn’t been addressed for about 3 years.
jobert:
@jozef_reginac Glad we’re not the only ones that ran into this 😅 @martijn_wouters and @maarten_van_gijssel can speak to what we did to solve it. I do know that we’re running a fork of the tap right now, but they know all the details. @maarten_van_gijssel is currently on PTO, so perhaps @martijn_wouters can pitch in.
martijn_wouters:
@jozef_reginac we’ve experimented with a couple of different changes. Currently it’s a change in the target https://github.com/transferwise/pipelinewise-target-snowflake/compare/master...Hacker0x01:master
The idea behind the change is that we create separate “buckets” for the same resource based on the different possible schemas.
`tap-postgres` sends records over to `target-snowflake`, and sometimes fields are missing due to the TOAST mechanism.
We generate possible schemas based on the different sets of fields within the same resource
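Roughly like this, as a simplified sketch (hypothetical record values, not the code from the fork linked above): records for one resource are grouped by the exact set of fields they contain, and each group gets its own schema/bucket.

```python
# Hypothetical sketch of the bucketing idea: group records of one resource by
# the exact set of fields present, so every bucket has a consistent column set.
from collections import defaultdict

records = [
    {"id": 1, "title": "a", "body": "full row"},
    {"id": 2, "title": "b"},                      # "body" missing (TOASTed, unchanged)
    {"id": 3, "title": "c", "body": "another full row"},
]

buckets = defaultdict(list)
for record in records:
    # The bucket key doubles as the schema variant these records belong to.
    buckets[tuple(sorted(record))].append(record)

for fields, rows in buckets.items():
    print(fields, "->", len(rows), "record(s)")
# ('body', 'id', 'title') -> 2 record(s)
# ('id', 'title') -> 1 record(s)
```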
This only works reliably with the Parquet file format. Using the CSV file format results in errors because the generated CSV has fewer columns than the target table.
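For intuition, a toy example (not the target’s actual load path): a CSV file is purely positional, so a bucket that lacks a column produces rows with fewer fields than the table expects, while a Parquet file carries its own schema and column names.

```python
# Toy illustration of the CSV vs Parquet difference (not the target's code).
import csv
import io

import pyarrow as pa  # assuming pyarrow is available for the Parquet side

bucket_rows = [{"id": 2, "title": "b"}]  # the bucket where "body" is absent

# CSV: positional columns only; this file has 2 columns, the table has 3.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "title"])
writer.writeheader()
writer.writerows(bucket_rows)
print(buf.getvalue().splitlines())      # ['id,title', '2,b']

# Parquet: field names travel with the data, so the loader can match by name.
table = pa.Table.from_pylist(bucket_rows)
print(table.schema.names)               # ['id', 'title']
```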
Let me know if that makes sense! Happy to explain in more detail if needed 🙂