# best-practices
d
Hi. We have a few large Postgres databases (about 2TB each) and are trying to import them into Redshift using Meltano. We're currently using pipelinewise-tap-postgres in incremental mode against a read replica. This is proving slow and is causing issues with the read replica: we keep hitting the `max_standby_archive_delay`/`max_standby_streaming_delay` limit. While we can increase this, it's not really desirable. Furthermore, the Meltano run doesn't seem to give the replication a chance to catch up: in monitoring, the streaming delay increases constantly for the duration of the Meltano run. I'm wondering if anyone has advice on the best way to get the initial data import happening in our scenario?
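For reference, a minimal sketch (not our actual monitoring; psycopg2 with a placeholder DSN and an arbitrary polling interval) of the kind of query that shows the replay lag climbing on the replica during a run:

```python
# Sketch only: watch the replay lag on the read replica while a sync runs.
# psycopg2 with a hypothetical DSN; the 30s interval is arbitrary.
import time

import psycopg2

REPLICA_DSN = "host=replica.example.com dbname=mydb user=monitor"  # hypothetical

conn = psycopg2.connect(REPLICA_DSN)
conn.autocommit = True  # keep each check in its own short transaction

with conn.cursor() as cur:
    while True:
        # How far behind the primary the standby's replay currently is.
        cur.execute("SELECT now() - pg_last_xact_replay_timestamp()")
        (lag,) = cur.fetchone()
        # The setting whose limit keeps cancelling our long-running queries.
        cur.execute("SHOW max_standby_streaming_delay")
        (limit,) = cur.fetchone()
        print(f"replay lag: {lag}  (max_standby_streaming_delay={limit})")
        time.sleep(30)
```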
a
It will take literally forever 😞
Currently Singer is extremely slow for large datasets, and 2TB is extremely large for this.
There's a suggestion for BATCH records, which should improve throughput, but it is not implemented yet.
I hate to say this, but in your case the best solution would be to use pipelinewise directly; it has a shortcut fastsync implementation for the initial load.
After that, with some trickery, it's possible to move the synchronisation state over to Meltano and continue doing incremental syncs with it.
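Roughly, the state the tap keeps for an incremental stream looks like the sketch below (stream id, replication key and value are placeholder examples, and the exact keys can vary by tap version); the trickery is mostly about getting an equivalent payload into Meltano's saved state for the job:

```python
# Rough sketch of a Singer incremental bookmark, i.e. the kind of state you'd
# seed into Meltano after fastsync has done the initial load. Stream id,
# replication key and value are placeholders.
import json

state = {
    "bookmarks": {
        "public-orders": {  # pipelinewise-style <schema>-<table> stream id
            "replication_key": "last_updated",
            "replication_key_value": "2022-01-01T00:00:00+00:00",
        }
    }
}

# Hand an equivalent JSON payload to Meltano as the job's saved state, so the
# first incremental run resumes from the fastsync cutoff instead of from zero.
print(json.dumps(state, indent=2))
```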
e
Gotta mention that we’ll start implementing BATCH messages in the SDK this week 😄
a
@edgar_ramirez_mondragon oh nice! I would love to help/contribute
d
One problem with pipelinewise-tap-postgres is that it runs one query over the table and iterates over the whole thing with a cursor. If it just had a LIMIT and looped over the query, it'd be much better.
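Very roughly, the difference between the two approaches, sketched with psycopg2 (DSN, table, columns and batch size are illustrative, not the tap's actual code):

```python
# Sketch of the two approaches with psycopg2; DSN, table and column names are
# illustrative, not the tap's actual code.
import psycopg2

conn = psycopg2.connect("host=replica.example.com dbname=mydb")  # hypothetical DSN

# Current behaviour: one query, one long-lived server-side cursor over the whole
# table. The snapshot has to stay valid for the entire sync, which is what runs
# into max_standby_*_delay on the replica.
with conn.cursor(name="full_scan") as cur:
    cur.itersize = 20000
    cur.execute(
        "SELECT id, last_updated, payload FROM public.orders ORDER BY last_updated"
    )
    for row in cur:
        pass  # emit a Singer RECORD message per row
conn.rollback()  # end the transaction the named cursor was using

# With a LIMIT: many short queries, each resuming from the last bookmark, so no
# single query has to outlive the standby's conflict limits.
bookmark = None
while True:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, last_updated, payload
            FROM public.orders
            WHERE %s::timestamptz IS NULL OR last_updated > %s
            ORDER BY last_updated
            LIMIT 20000
            """,
            (bookmark, bookmark),
        )
        rows = cur.fetchall()
    conn.rollback()  # release the snapshot between pages
    if not rows:
        break
    for row in rows:
        pass  # emit a Singer RECORD message per row
    bookmark = rows[-1][1]  # last last_updated value seen
```

(Note that this naive single-column bookmark has exactly the tie-breaking problem that comes up further down the thread.)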
j
cool idea with the limit. I might just implement that in my fork of pipelinewise-tap-postgres.
we've been having similar issues with our replica
https://github.com/spacecowboy/pipelinewise-tap-postgres/pull/1 Not sure if it will behave better with the read replica yet however
d
Yeah, that looks pretty cool.
I suspect some people may still want the old behaviour to be available because it's more transactional: it gets all the rows at once. I also wonder if there are edge cases with the limit and offset (e.g. if the underlying rows change, the order isn't stable, or the db connection drops out or errors) 🤔.
If there are more than `itersize` rows that match a single bookmark value, then singlepage will never progress. Say `itersize` is 10k, my incremental state field is "last_updated", and there are 20k records with the same last_updated value; singlepage will just get the first 10k each time. One of my workmates foresaw this and suggested adding a secondary key (e.g. the primary key) so that the bookmark is always a unique point. I think a scenario like this would be rare, though.
At this stage, your code looks pretty much exactly how I imagined it. Nice work.
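A sketch of what that secondary-key idea could look like (illustrative table/column names, start bookmark and page size, not the actual PR): compare on the (replication key, primary key) pair so the bookmark always strictly advances, even when many rows share one last_updated value.

```python
# Sketch of keyset pagination with a composite bookmark (replication key + pk),
# so a page of rows sharing one last_updated value can never stall the sync.
# Table, column names, start bookmark and page size are illustrative.
import psycopg2

PAGE_SIZE = 10000


def fetch_page(cur, bookmark):
    """Fetch the next page strictly after (last_updated, id) = bookmark."""
    last_updated, last_id = bookmark
    cur.execute(
        """
        SELECT id, last_updated, payload
        FROM public.orders
        WHERE (last_updated, id) > (%s, %s)  -- row-wise comparison breaks ties by id
        ORDER BY last_updated, id
        LIMIT %s
        """,
        (last_updated, last_id, PAGE_SIZE),
    )
    return cur.fetchall()


conn = psycopg2.connect("host=replica.example.com dbname=mydb")  # hypothetical DSN
bookmark = ("1970-01-01T00:00:00+00:00", 0)  # (last_updated, id) starting point
with conn.cursor() as cur:
    while True:
        rows = fetch_page(cur, bookmark)
        if not rows:
            break
        for row in rows:
            pass  # emit a Singer RECORD message per row
        bookmark = (rows[-1][1], rows[-1][0])  # always strictly advances
```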
j
I gave it a shot and noticed no improvement sadly 😞 so this particular implementation is not the way to go