# best-practices
d
Hi. We have a few large Postgres databases (about 2TB each) and are trying to import them into Redshift using Meltano. We're currently using pipelinewise-tap-postgres in incremental mode against a read replica. This is proving slow and is causing issues with the read replica: we keep hitting the `max_standby_archive_delay`/`max_standby_streaming_delay` limit. While we can increase this, it's not really desirable. Furthermore, the Meltano run doesn't seem to give the replication a chance to catch up: in monitoring, the streaming delay increases constantly for the duration of the Meltano run. I'm wondering if anyone has advice on the best way to get the initial data import happening in our scenario?
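For reference, a minimal sketch (not our actual monitoring; psycopg2 with a placeholder DSN and an arbitrary polling interval) of the kind of query that shows the replay lag climbing on the replica during a run:

```python
# Sketch only: watch the replay lag on the read replica while a sync runs.
# psycopg2 with a hypothetical DSN; the 30s interval is arbitrary.
import time

import psycopg2

REPLICA_DSN = "host=replica.example.com dbname=mydb user=monitor"  # hypothetical

conn = psycopg2.connect(REPLICA_DSN)
conn.autocommit = True  # keep each check in its own short transaction

with conn.cursor() as cur:
    while True:
        # How far behind the primary the standby's replay currently is.
        cur.execute("SELECT now() - pg_last_xact_replay_timestamp()")
        (lag,) = cur.fetchone()
        # The setting whose limit keeps cancelling our long-running queries.
        cur.execute("SHOW max_standby_streaming_delay")
        (limit,) = cur.fetchone()
        print(f"replay lag: {lag}  (max_standby_streaming_delay={limit})")
        time.sleep(30)
```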
a
It will take literally forever 😞
Currently Singer is extremely slow for large datasets, and 2TB is extremely large for this.
There's a suggestion for BATCH records, which should improve throughput, but it is not implemented yet.
I hate to say this, but in your case the best solution would be to use pipelinewise directly; it has a shortcut fastsync implementation for the initial load.
After that, with some trickery, it's possible to move the synchronisation state over to Meltano and continue doing incremental syncs with it.
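Roughly, the state the tap keeps for an incremental stream looks like the sketch below (stream id, replication key and value are placeholder examples, and the exact keys can vary by tap version); the trickery is mostly about getting an equivalent payload into Meltano's saved state for the job:

```python
# Rough sketch of a Singer incremental bookmark, i.e. the kind of state you'd
# seed into Meltano after fastsync has done the initial load. Stream id,
# replication key and value are placeholders.
import json

state = {
    "bookmarks": {
        "public-orders": {  # pipelinewise-style <schema>-<table> stream id
            "replication_key": "last_updated",
            "replication_key_value": "2022-01-01T00:00:00+00:00",
        }
    }
}

# Hand an equivalent JSON payload to Meltano as the job's saved state, so the
# first incremental run resumes from the fastsync cutoff instead of from zero.
print(json.dumps(state, indent=2))
```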
e
Gotta mention that we’ll start implementing BATCH messages in the SDK this week 😄
a
@edgar_ramirez_mondragon oh nice! I would love to help/contribute
d
One problem with pipelinewise-tap-postgres is that it runs one query over the table and iterates over the whole thing with a cursor. If it just had a LIMIT and looped over the query, it'd be much better.
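Very roughly, the difference between the two approaches, sketched with psycopg2 (DSN, table, columns and batch size are illustrative, not the tap's actual code):

```python
# Sketch of the two approaches with psycopg2; DSN, table and column names are
# illustrative, not the tap's actual code.
import psycopg2

conn = psycopg2.connect("host=replica.example.com dbname=mydb")  # hypothetical DSN

# Current behaviour: one query, one long-lived server-side cursor over the whole
# table. The snapshot has to stay valid for the entire sync, which is what runs
# into max_standby_*_delay on the replica.
with conn.cursor(name="full_scan") as cur:
    cur.itersize = 20000
    cur.execute(
        "SELECT id, last_updated, payload FROM public.orders ORDER BY last_updated"
    )
    for row in cur:
        pass  # emit a Singer RECORD message per row
conn.rollback()  # end the transaction the named cursor was using

# With a LIMIT: many short queries, each resuming from the last bookmark, so no
# single query has to outlive the standby's conflict limits.
bookmark = None
while True:
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, last_updated, payload
            FROM public.orders
            WHERE %s::timestamptz IS NULL OR last_updated > %s
            ORDER BY last_updated
            LIMIT 20000
            """,
            (bookmark, bookmark),
        )
        rows = cur.fetchall()
    conn.rollback()  # release the snapshot between pages
    if not rows:
        break
    for row in rows:
        pass  # emit a Singer RECORD message per row
    bookmark = rows[-1][1]  # last last_updated value seen
```

(Note that this naive single-column bookmark has exactly the tie-breaking problem that comes up further down the thread.)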
j
cool idea with the limit. I might just implement that in my fork of pipelinewise-tap-postgres.
we've been having similar issues with our replica
https://github.com/spacecowboy/pipelinewise-tap-postgres/pull/1 Not sure if it will behave better with the read replica yet however
d
Yeah, that looks pretty cool.
I suspect some people may still want the old behaviour to be available because it's more transactional: it gets all the rows at once. I also wonder if there are edge cases with the limit and offset (e.g. if the underlying rows change, the order isn't stable, or the db connection drops out or errors) 🤔.
If there are more than `itersize` rows that match a single bookmark value, then singlepage will never progress. Say `itersize` is 10k, my incremental state field is "last_updated", and there are 20k records with the same last_updated value; singlepage will just get the first 10k each time. One of my workmates foresaw this and suggested adding a secondary key (e.g. the primary key) so that the bookmark is always a unique point. I think a scenario like this would be rare, though.
At this stage, your code looks pretty much exactly how I imagined it. Nice work.
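A sketch of what that secondary-key idea could look like (illustrative table/column names, start bookmark and page size, not the actual PR): compare on the (replication key, primary key) pair so the bookmark always strictly advances, even when many rows share one last_updated value.

```python
# Sketch of keyset pagination with a composite bookmark (replication key + pk),
# so a page of rows sharing one last_updated value can never stall the sync.
# Table, column names, start bookmark and page size are illustrative.
import psycopg2

PAGE_SIZE = 10000


def fetch_page(cur, bookmark):
    """Fetch the next page strictly after (last_updated, id) = bookmark."""
    last_updated, last_id = bookmark
    cur.execute(
        """
        SELECT id, last_updated, payload
        FROM public.orders
        WHERE (last_updated, id) > (%s, %s)  -- row-wise comparison breaks ties by id
        ORDER BY last_updated, id
        LIMIT %s
        """,
        (last_updated, last_id, PAGE_SIZE),
    )
    return cur.fetchall()


conn = psycopg2.connect("host=replica.example.com dbname=mydb")  # hypothetical DSN
bookmark = ("1970-01-01T00:00:00+00:00", 0)  # (last_updated, id) starting point
with conn.cursor() as cur:
    while True:
        rows = fetch_page(cur, bookmark)
        if not rows:
            break
        for row in rows:
            pass  # emit a Singer RECORD message per row
        bookmark = (rows[-1][1], rows[-1][0])  # always strictly advances
```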
j
I gave it a shot and noticed no improvement sadly 😞 so this particular implementation is not the way to go