# getting-started
s
we've got around 2 TB in a Postgres DB right now, and a co-worker is worried about the stability/maturity of tap-postgres (we'd be using either full replication or incremental replication). Can anyone speak to this project?
a
Hi, @steve_ivy. On the plus side, there are actually a few different forks of tap-postgres which you can pull from that are fairly mature. Certainly the volumes you mention are doable. As I see it, your biggest challenges would be the following:
1. Determine incremental strategy. That is, can you support log-based replication, or do you need column-based incremental?
   a. You'll have difficulties with traditional incremental (column-based) replication because inevitably at some point someone upstream will "fix" a column or perform DBA maintenance that misses your `updated_on` timestamp columns. (Some tables might not have them.) You can avoid future headaches by starting with a plan that includes at least some tables synced via log-based replication. (Rough sketch below.)
2. Streamed vs. batch. Do daily incremental volumes fit in your daily processing window using a streaming-records-from-select approach, or do you need (as of now, upcoming) batch functionality to reach your daily SLAs?
3. Initial backfill. Same questions as (2) for the initial sync: that much data flowing record-by-record will take a while on the initial backfill. It's certainly not an unreasonably sized DB, but it'll take some patience on the first run.
4. Target loader speed. Can you say what your target database is? Is it Redshift? There are a few different forks of Redshift targets, and some may be more performant than others. If you do experience slowness, keep in mind that it might be loader slowness. (It might not become an issue at all, but I'm bringing it up for completeness.)
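Rough sketch of what the traditional column-based pull amounts to, for reference (table and column names here are just placeholders):

```sql
-- Hypothetical column-based (key-based) incremental extract.
-- "orders" and "updated_on" are placeholder names; the bookmark value
-- would come from the state saved at the end of the previous run.
SELECT *
FROM orders
WHERE updated_on > '2024-01-01 00:00:00+00'  -- last saved bookmark
ORDER BY updated_on;

-- If upstream maintenance rewrites rows without touching updated_on,
-- those changes are silently skipped on the next run, which is the main
-- argument for log-based replication on the tables that matter most.
```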
s
Hi @aaronsteers! good comments, let me address them in order…
1. We are trying to avoid log replication for a few reasons; however, my plan was to set up multiple taps for subsets of tables rather than always trying to sync everything. And yes, we have tables without timestamps that are a real pain.
2. Currently, using subsets of tables does fit within our processing windows. I'm not familiar with "streamed-from-select" as a term, so not sure how to answer you there.
Skipping to 4: we're using Snowflake, which has been VERY positive for us vs. Redshift.
And 3: yeah, we're doing a mix of windowed backfills plus RabbitMQ streaming (from an app) right now for our largest table, but our DBA has brought up log-based replication again lately, so…
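By windowed backfills I just mean pulling bounded slices per run, roughly like this (placeholder table/column names, not our actual schema):

```sql
-- Hypothetical windowed backfill: one bounded slice per run so the big
-- table never has to stream in a single pass. Names are placeholders.
SELECT *
FROM events
WHERE created_at >= '2024-01-01'
  AND created_at <  '2024-01-08'
ORDER BY created_at;
```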
where would I look for a list of forks of tap-postgres?
a
Nice, @steve_ivy. Great to see you've already got some momentum and what sounds like a good path forward!
> I’m not familiar with “streamed-from-select” as a term so not sure how to answer you there
I just made that up, sorry. It's just the difference between sending records downstream via the tried-and-true `select .. from ...` versus its more modern file-based alternatives, `unload`/`copy from ... to ...`. If a normal select isn't fast enough, the bulk export path is something we're looking at for sources which can accept it.
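On the Postgres side the two paths look roughly like this (just a sketch - table and column names are placeholders, and the exact batch mechanics in the tap are still being worked out):

```sql
-- Row-by-row path: the tap streams records from a plain select.
SELECT id, name, updated_on
FROM orders
WHERE updated_on > '2024-01-01';

-- Bulk-export path: Postgres writes the whole result set out in one pass,
-- which is usually much faster for large backfills. COPY ... TO STDOUT
-- works from any client; writing to a server-side file needs extra
-- privileges.
COPY (
    SELECT id, name, updated_on
    FROM orders
    WHERE updated_on > '2024-01-01'
) TO STDOUT WITH (FORMAT csv, HEADER);
```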
s
ah gotcha. our current tools do a `copy` to S3, then a `copy into` in Snowflake
that’s definitely a feature we’d be interested in
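On the Snowflake side it's the standard stage-plus-copy pattern, something along these lines (placeholder names, not our actual objects):

```sql
-- Hypothetical external stage over the S3 bucket the exporter writes to.
CREATE OR REPLACE STAGE raw_exports
  URL = 's3://example-bucket/exports/'
  STORAGE_INTEGRATION = s3_int;  -- or CREDENTIALS = (...) in older setups

-- Load the staged files into the target table.
COPY INTO analytics.orders
FROM @raw_exports/orders/
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);
```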
a
Do you know if they are using pgloader or similar, or some other method?
> where would I look for a list of forks of tap-postgres?
The hub has a listing of known taps and variants/forks. I see the recommended/default variant for Postgres is pipelinewise, with Singer having one also.
s
our current exporter is a bit hacky but optimized to reduce file sizes: we do selects and write out JSON arrays (rather than JSON object records) to files in S3
we might have moved away from `copy into` for Snowflake, though - that code has been in flux recently
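For the array-shaped records, the load looks roughly like this if we keep `copy into` (heavily simplified, and the stage/table names are the same placeholders as above, not our real objects):

```sql
-- Staging table with a single VARIANT column to hold each raw record.
CREATE OR REPLACE TABLE raw_orders (v VARIANT);

-- Each file is a JSON array of row-arrays; STRIP_OUTER_ARRAY makes every
-- inner array land as its own row in the VARIANT column.
COPY INTO raw_orders
FROM @raw_exports/orders/
FILE_FORMAT = (TYPE = JSON STRIP_OUTER_ARRAY = TRUE);

-- Positional access on the VARIANT turns the arrays back into columns.
INSERT INTO analytics.orders (id, name, updated_on)
SELECT v[0]::NUMBER, v[1]::VARCHAR, v[2]::TIMESTAMP_TZ
FROM raw_orders;
```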