# troubleshooting
d
I'm copying lots of data from Postgres to Redshift. Any advice on how to speed this up? It's taking a long time.
e
Might be worth trying BATCH messages: https://sdk.meltano.com/en/latest/batch.html
d
Yep. I don't think target-redshift supports them, though. I'm not very good at Python, so I'm not sure I'd be able to contribute this. Have suggested to the maintainer. Any other ideas?
v
Break down what exactly is taking a while. Normally it's a few tables, so you could split those individual tables out into separate runs and run them in parallel.
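Rough sketch of one way to do that - the table names, the `--select` pattern, and the `--state-id` flag are all placeholders here, so adjust them for your Meltano version and tap-postgres selection syntax:
```python
# Hypothetical sketch: run one Meltano EL job per large table, in parallel.
# Stream names and CLI flags are assumptions - older Meltano versions use
# --job_id instead of --state-id, and your select patterns will differ.
import subprocess
from concurrent.futures import ThreadPoolExecutor

TABLES = ["public-orders", "public-events", "public-customers"]  # placeholders

def run_table(stream: str) -> int:
    cmd = [
        "meltano", "elt", "tap-postgres", "target-redshift",
        "--select", f"{stream}.*",            # only this stream
        "--state-id", f"pg_to_rs_{stream}",   # separate state per table
    ]
    return subprocess.run(cmd, check=False).returncode

# One worker per table; each worker is its own tap/target pair, so the
# single-process bottleneck applies per table rather than to the whole run.
with ThreadPoolExecutor(max_workers=len(TABLES)) as pool:
    results = list(pool.map(run_table, TABLES))

print(dict(zip(TABLES, results)))
```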
c
how many nodes in your redshift cluster?
d
The speed issue is just in the initial ingestion. I've got about 10 large tables with a few million rows each. The Redshift cluster is 32 RPU serverless.
From the logs, the time from when it says "sink full" until it's merged into Redshift seems pretty quick - within a second or two.
c
The flow should look something like this behind the scenes: tap-postgres should be extracting data as a single file. Because the tap and target are single instances, you're limited by the network throughput between Postgres and the worker, and between the worker and S3, whatever that happens to be (so run on a machine within AWS with good network connectivity).

What I'm unsure of here is whether you can configure the level of parallelism so you're outputting to N separate files, where N is the equivalent number of nodes (I don't know how 32 RPU translates to nodes - I'm a 2014 big data guy, we communicate in nodes). I imagine that because tap-postgres has no idea about target-redshift, it's just creating a single dump, and target-redshift - at least the ticketswap variant - is also working with only a single file, which it copies to S3 before performing a Redshift COPY statement. For really large tables, if you're not splitting the data into multiple files, you get no parallelism on unload/load and therefore limited benefit from having a multi-node Redshift cluster. (Again, this is mid-stage big data stuff - the kind of thing Spark and Hadoop are really good at, and something Redshift supports.)
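To make the COPY side concrete, here's roughly the shape of a prefix-based parallel load - the bucket, role ARN, endpoint, and table are placeholders, and I'm not claiming this is literally what target-redshift runs internally:
```python
# Illustrative only: a Redshift COPY from an S3 key prefix. Every object whose
# key starts with the prefix is loaded, and Redshift spreads those files across
# slices, so N files give you N-way parallelism on the load.
import psycopg2  # Redshift speaks the Postgres wire protocol

COPY_SQL = """
    COPY analytics.big_table
    FROM 's3://my-bucket/exports/big_table/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    FORMAT AS CSV
    GZIP;
"""

conn = psycopg2.connect(
    host="my-workgroup.123456789012.eu-west-1.redshift-serverless.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)
with conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.commit()
conn.close()
```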
So you can tweak things, but ultimately you're limited to single-node performance in Redshift.
Looking at the transferwise target-redshift, I see they define a `max_parallelism` property and set it to 1, and I remember going through this source code before… I don't think the target does anything with that value (ideally it'd break up the file from the tap into N files).
Even if you do break things up into N files, you're still bottlenecked by whatever the single tap's performance is reading from Postgres. If it's a one-time load, okay, just bite the bullet and deal with the extended wall-clock run time - not really worth engineering around. If this is a daily full-table load and has an SLA you're not meeting, you'll need to parallelize using something like Spark (Amazon's EMR would handle this use case very well).
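If you did go the Spark route, the sketch looks roughly like this - hosts, bounds, partition column, and bucket paths are all placeholders:
```python
# Hedged sketch of the Spark/EMR approach: a partitioned JDBC read from
# Postgres, written out as many gzipped files on S3 so a Redshift COPY can
# load them in parallel. All connection details and bounds are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-full-load").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://pg-host:5432/sourcedb")
    .option("dbtable", "public.big_table")
    .option("user", "etl_user")
    .option("password", "...")
    # Split the read into 32 parallel queries over a numeric key range.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "32")
    .load()
)

# 32 partitions in, ~32 files out on S3 - which is what the COPY step wants.
(
    df.write.mode("overwrite")
    .option("compression", "gzip")
    .csv("s3://my-bucket/exports/big_table/")
)
```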
d
Thanks for the insight there. Really appreciate it.