# plugins-general
m
When using the Postgres tap and target (transferwise) with FULL_TABLE sync, I'm realizing it's keeping all the data every time the sync happens. Is there some best practice to keep around the last few syncs or remove old data? I can't use log-based replication because of GCP CloudSQL, and I'd need to get all the columns in for incremental. I tried looking through the Meltano docs and stitchdata https://www.stitchdata.com/docs/replication/replication-methods/full-table#limitations but couldn't find anything.
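One possible pruning sketch, assuming your target writes a Singer metadata load-timestamp column such as `_sdc_batched_at` (that column name and the table name are assumptions; check what your target actually adds):

```sql
-- Hypothetical cleanup: keep only the rows from the three most recent
-- sync batches, assuming each batch shares one _sdc_batched_at value.
DELETE FROM my_schema.my_table
WHERE _sdc_batched_at NOT IN (
    SELECT DISTINCT _sdc_batched_at
    FROM my_schema.my_table
    ORDER BY _sdc_batched_at DESC
    LIMIT 3
);
```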
d
@mark_poole Do your source tables have primary keys that could be used in the target to UPDATE existing rows instead of always INSERTing new ones?
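For reference, if the primary keys do make it into the destination tables, a Postgres-style upsert behaves like this (table and column names here are placeholders, not taken from the actual pipeline):

```sql
-- Hypothetical upsert: with a primary key on id, re-syncing the same row
-- updates it in place instead of inserting a duplicate.
INSERT INTO my_schema.my_table (id, name, updated_at)
VALUES (1, 'example', now())
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name,
    updated_at = EXCLUDED.updated_at;
```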
m
@douwe_maan the majority if not all of the tables have primary keys
d
Are those primary keys making it into the tables created in the destination DB? Or do we end up with duplicate records there as the graph suggests?
m
```sql
SELECT
    pg_database.datname,
    pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database;
```
shows something completely different
I'm going to raise a GCP support ticket; their UI says 1 TB in use, but the psql server (via that command) shows <50 GB
sorry about bringing that here first, I couldn't find duplicate rows and the tables match closely in size
d
No worries, glad it doesn't appear to be a real issue with the tap or target!
m
In case anyone else is seeing this: GCP uses WAL for recovery, which takes up a ton of space if you are re-writing your largest tables to the DB once an hour 🙂
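One way to check whether WAL is the culprit (PostgreSQL 10+; note that managed instances like Cloud SQL may restrict this function to certain roles):

```sql
-- Sum the sizes of the files currently in the WAL directory (pg_wal).
SELECT pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```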
a
@mark_poole - I haven’t with GCP specifically, but in other platforms, yes, this definitely comes up often. For Snowflake, for instance, we would change the retention time (aka “time travel”) to zero or 24 hours in order to reduce the redundant disk space consumption. Do you know if GCP has any similar configurability?
m
Yes, I think that would cover it. I turned it off because I'm happy with daily backups only in development, and in production there is a dedicated analytics server that can be rebuilt from other data automatically (removing the need for point-in-time backups).
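For anyone finding this later, the Cloud SQL setting in question can be toggled roughly like this (the instance name is a placeholder; double-check the flag against the current gcloud docs before running it):

```shell
# Disable point-in-time recovery (WAL log retention) on a Cloud SQL
# for PostgreSQL instance; daily automated backups stay configurable
# independently of this setting.
gcloud sql instances patch my-instance --no-enable-point-in-time-recovery
```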
Appreciate your help
a
Happy to help! And I wanted to better understand the GCP side anyway, so thank you for circling back to confirm.