# getting-started
s
we've got around 2 TB in a Postgres DB right now, and a co-worker is worried about the stability/maturity of tap-postgres (we'd be using either full replication or incremental replication). Can anyone speak to this project?
a
Hi, @steve_ivy. On the plus side, there are actually a few different forks of tap-postgres which you can pull from that are fairly mature. Certainly the volumes you mention are doable. As I see it, your biggest challenges would be the following:
1. Determine incremental strategy. That is, can you support log-based replication, or do you need column-based incremental?
   a. You'll have difficulties with traditional incremental (column-based) replication because inevitably at some point someone upstream will "fix" a column or perform DBA maintenance that misses your `updated_on` timestamp columns. (Some tables might not have them.) You can avoid future headaches by starting with a plan that includes at least some tables synced via log-based replication. (Rough sketch below.)
2. Streamed vs. batch. Do daily incremental volumes fit in your daily processing window using a streaming-records-from-select approach, or do you need (as of now, upcoming) batch functionality to reach your daily SLAs?
3. Initial backfill. Same questions as (2) for the initial sync: that much data flowing record-by-record will take a while on the initial backfill. It's certainly not an unreasonably sized DB, but it'll take some patience on the first run.
4. Target loader speed. Can you say what your target database is? Is it Redshift? There are a few different forks of Redshift targets, and some may be more performant than others. If you do experience slowness, keep in mind that it might be loader slowness. (It might not become an issue at all, but I'm bringing it up for completeness.)
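Rough sketch of what the traditional column-based pull amounts to, for reference (table and column names here are just placeholders):

```sql
-- Hypothetical column-based (key-based) incremental extract.
-- "orders" and "updated_on" are placeholder names; the bookmark value
-- would come from the state saved at the end of the previous run.
SELECT *
FROM orders
WHERE updated_on > '2024-01-01 00:00:00+00'  -- last saved bookmark
ORDER BY updated_on;

-- If upstream maintenance rewrites rows without touching updated_on,
-- those changes are silently skipped on the next run, which is the main
-- argument for log-based replication on the tables that matter most.
```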
s
Hi @aaronsteers! good comments, let me address them in order…
1. We are trying to avoid log replication for a few reasons; however, my plan was to set up multiple taps for subsets of tables rather than always trying to sync everything. And yes, we have tables without timestamps that are a real pain.
2. Currently, using subsets of tables does fit within our processing windows. I'm not familiar with "streamed-from-select" as a term, so not sure how to answer you there.
Skipping to 4: we're using Snowflake, which has been VERY positive for us vs. Redshift.
And 3: yeah, we're doing a mix of windowed backfills plus RabbitMQ streaming (from an app) right now for our largest table, but our DBA has brought up log-based replication again lately, so…
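By windowed backfills I just mean pulling bounded slices per run, roughly like this (placeholder table/column names, not our actual schema):

```sql
-- Hypothetical windowed backfill: one bounded slice per run so the big
-- table never has to stream in a single pass. Names are placeholders.
SELECT *
FROM events
WHERE created_at >= '2024-01-01'
  AND created_at <  '2024-01-08'
ORDER BY created_at;
```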
where would I look for a list of forks of tap-postgres?
a
Nice, @steve_ivy. Great to see you've already got some momentum and what sounds like a good path forward!
> I’m not familiar with “streamed-from-select” as a term so not sure how to answer you there
I just made that up, sorry. It's just the difference between sending records downstream via the tried-and-true `select .. from ...` versus its more modern file-based alternatives, `unload`/`copy from ... to ...`. If a normal select isn't fast enough, the bulk export path is something we're looking at for sources which can accept it.
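On the Postgres side the two paths look roughly like this (just a sketch - table and column names are placeholders, and the exact batch mechanics in the tap are still being worked out):

```sql
-- Row-by-row path: the tap streams records from a plain select.
SELECT id, name, updated_on
FROM orders
WHERE updated_on > '2024-01-01';

-- Bulk-export path: Postgres writes the whole result set out in one pass,
-- which is usually much faster for large backfills. COPY ... TO STDOUT
-- works from any client; writing to a server-side file needs extra
-- privileges.
COPY (
    SELECT id, name, updated_on
    FROM orders
    WHERE updated_on > '2024-01-01'
) TO STDOUT WITH (FORMAT csv, HEADER);
```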
s
ah gotcha. our current tools do a `copy` to S3, then a `copy into` in Snowflake
that’s definitely a feature we’d be interested in
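On the Snowflake side it's the standard stage-plus-copy pattern, something along these lines (placeholder names, not our actual objects):

```sql
-- Hypothetical external stage over the S3 bucket the exporter writes to.
CREATE OR REPLACE STAGE raw_exports
  URL = 's3://example-bucket/exports/'
  STORAGE_INTEGRATION = s3_int;  -- or CREDENTIALS = (...) in older setups

-- Load the staged files into the target table.
COPY INTO analytics.orders
FROM @raw_exports/orders/
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);
```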
a
Do you know if they are using pgloader or similar, or some other method?
> where would I look for a list of forks of tap-postgres?
The hub has a listing of known taps and variants/forks. I see the recommended/default variant for Postgres is pipelinewise, with Singer having one also.
s
our current exporter is a bit hacky but optimized to reduce file sizes: we do selects and write out JSON arrays (rather than JSON object records) to files in S3
we might have moved away from `copy into` for Snowflake, though - that code has been in flux recently
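For the array-shaped records, the load looks roughly like this if we keep `copy into` (heavily simplified, and the stage/table names are the same placeholders as above, not our real objects):

```sql
-- Staging table with a single VARIANT column to hold each raw record.
CREATE OR REPLACE TABLE raw_orders (v VARIANT);

-- Each file is a JSON array of row-arrays; STRIP_OUTER_ARRAY makes every
-- inner array land as its own row in the VARIANT column.
COPY INTO raw_orders
FROM @raw_exports/orders/
FILE_FORMAT = (TYPE = JSON STRIP_OUTER_ARRAY = TRUE);

-- Positional access on the VARIANT turns the arrays back into columns.
INSERT INTO analytics.orders (id, name, updated_on)
SELECT v[0]::NUMBER, v[1]::VARCHAR, v[2]::TIMESTAMP_TZ
FROM raw_orders;
```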