# troubleshooting
d
I'm copying lots of data from Postgres to Redshift. Any advice on how to speed this up? It's taking a long time.
e
Might be worth trying BATCH messages: https://sdk.meltano.com/en/latest/batch.html
d
Yep. I don't think target-redshift supports them, though. I'm not very good at Python, so I'm not sure I'd be able to contribute this. Have suggested to the maintainer. Any other ideas?
v
Break down what exactly is taking a while. Normally it's a few tables, so you could split those individual tables out into separate runs and run them in parallel.
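Rough sketch of one way to do that - the table names, the `--select` pattern, and the `--state-id` flag are all placeholders here, so adjust them for your Meltano version and tap-postgres selection syntax:
```python
# Hypothetical sketch: run one Meltano EL job per large table, in parallel.
# Stream names and CLI flags are assumptions - older Meltano versions use
# --job_id instead of --state-id, and your select patterns will differ.
import subprocess
from concurrent.futures import ThreadPoolExecutor

TABLES = ["public-orders", "public-events", "public-customers"]  # placeholders

def run_table(stream: str) -> int:
    cmd = [
        "meltano", "elt", "tap-postgres", "target-redshift",
        "--select", f"{stream}.*",            # only this stream
        "--state-id", f"pg_to_rs_{stream}",   # separate state per table
    ]
    return subprocess.run(cmd, check=False).returncode

# One worker per table; each worker is its own tap/target pair, so the
# single-process bottleneck applies per table rather than to the whole run.
with ThreadPoolExecutor(max_workers=len(TABLES)) as pool:
    results = list(pool.map(run_table, TABLES))

print(dict(zip(TABLES, results)))
```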
c
how many nodes in your redshift cluster?
d
The speed issue is just in the initial ingestion. I've got about 10 large tables with a few million rows each. The Redshift cluster is 32 RPU serverless.
From the logs, the time from when it says "sink full" until it's merged into Redshift seems pretty quick - within a second or two.
c
The flow should look something like this behind the scenes: tap-postgres should be extracting data as a single file. Because the tap and target are single instances, you're limited by the network throughput between Postgres and the worker, and between the worker and S3, whatever that happens to be (so run on a machine within AWS with good network connectivity).

What I'm unsure of here is whether you can configure the level of parallelism so you're outputting to N separate files, where N is the equivalent number of nodes (I don't know how 32 RPU translates to nodes - I'm a 2014 big data guy, we communicate in nodes). I imagine that because tap-postgres has no idea about target-redshift, it's just creating a single dump, and target-redshift - at least the ticketswap variant - is also working with only a single file, which it copies to S3 before performing a Redshift COPY statement. For really large tables, if you're not splitting the data into multiple files, you get no parallelism on unload/load and therefore limited benefit from having a multi-node Redshift cluster. (Again, this is mid-stage big data stuff - the kind of thing Spark and Hadoop are really good at, and something Redshift supports.)
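To make the COPY side concrete, here's roughly the shape of a prefix-based parallel load - the bucket, role ARN, endpoint, and table are placeholders, and I'm not claiming this is literally what target-redshift runs internally:
```python
# Illustrative only: a Redshift COPY from an S3 key prefix. Every object whose
# key starts with the prefix is loaded, and Redshift spreads those files across
# slices, so N files give you N-way parallelism on the load.
import psycopg2  # Redshift speaks the Postgres wire protocol

COPY_SQL = """
    COPY analytics.big_table
    FROM 's3://my-bucket/exports/big_table/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
    FORMAT AS CSV
    GZIP;
"""

conn = psycopg2.connect(
    host="my-workgroup.123456789012.eu-west-1.redshift-serverless.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="...",
)
with conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.commit()
conn.close()
```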
So you can tweak things, but ultimately you're limited to single-node performance in Redshift.
Looking at the transferwise target-redshift, I see they define a `max_parallelism` property and set it to 1, and I remember going through this source code before… I don't think the target does anything with that value (ideally it'd break up the file from the tap into N files).
Even if you do break things up into N files, you're still bottlenecked by whatever the single tap's performance is reading from Postgres. If it's a one-time load, okay, just bite the bullet and deal with the extended wall-clock run time - not really worth engineering around. If this is a daily full-table load and has an SLA you're not meeting, you'll need to parallelize using something like Spark (Amazon's EMR would handle this use case very well).
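If you did go the Spark route, the sketch looks roughly like this - hosts, bounds, partition column, and bucket paths are all placeholders:
```python
# Hedged sketch of the Spark/EMR approach: a partitioned JDBC read from
# Postgres, written out as many gzipped files on S3 so a Redshift COPY can
# load them in parallel. All connection details and bounds are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-full-load").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://pg-host:5432/sourcedb")
    .option("dbtable", "public.big_table")
    .option("user", "etl_user")
    .option("password", "...")
    # Split the read into 32 parallel queries over a numeric key range.
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "32")
    .load()
)

# 32 partitions in, ~32 files out on S3 - which is what the COPY step wants.
(
    df.write.mode("overwrite")
    .option("compression", "gzip")
    .csv("s3://my-bucket/exports/big_table/")
)
```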
d
Thanks for the insight there. Really appreciate it.