hans_ienasescu
11/23/2022, 6:10 AM
I'm using target-postgres as the loader. I'm currently in a situation where I extract data from Amazon Athena and load it into Postgres.
The Athena data doesn't really have a nice one-column primary key, but overall the rows are unique. As a result, in the Postgres loader I have primary_key_required: false. I only have one table.
My problem is that if I run the Athena extractor + Postgres loader multiple times, it duplicates the entire dataset and adds the same data again and again to the table.
I've been through the documentation and I cannot find a way to avoid the data being duplicated every time I run Meltano.
I hoped there would be a simple option in the target to say "refresh, i.e. truncate the table before inserting new rows", but perhaps that's the wrong way to think about it.
Any thoughts / tips on how to "refresh" the data in the target every time? Thanks

christoph
11/23/2022, 6:21 AM
With target-postgres in the pipelinewise variant, I enable the add_metadata_columns option so that I know which rows have been loaded at what point in time.
I would then build the appropriate dbt models on top of that loaded data to do the required analysis.

hans_ienasescu
11/23/2022, 6:50 AM
With add_metadata_columns I could use dbt to only take the rows that have been added in the current EL run, but my source table created by Meltano would grow and grow... and I would be forced to truncate it eventually, as the previously loaded data isn't useful in any way.

abhishek_ajmera
11/23/2022, 6:55 AM
Quoting: "the data doesn't really change that much, so I would be duplicating most if not all of it every time"
If that's the case, wouldn't it be better for you to replicate only the data that has changed via an incremental update? Next, what we do with our data is run a scheduled query periodically to drop old duplicate rows from the table.
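As an aside for readers of this thread, the periodic "drop old duplicate rows" query mentioned above could be sketched as follows. This is a hedged illustration, not anyone's actual setup: the table and column names (athena_data, col_a, col_b) are made up, and it uses sqlite3 so the logic is runnable end to end. On a real Postgres table with no primary key, the hidden ctid column would play the role SQLite's implicit rowid plays here.

```python
import sqlite3

# Toy stand-in for the Postgres table: no primary key, so repeated
# Meltano runs leave fully duplicated rows behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athena_data (col_a TEXT, col_b INTEGER)")
rows = [("x", 1), ("y", 2)]
for _ in range(3):  # simulate three identical EL runs
    conn.executemany("INSERT INTO athena_data VALUES (?, ?)", rows)

# Dedupe: keep only the first physical copy of each distinct row.
# A roughly equivalent scheduled statement in Postgres (untested sketch)
# would compare ctid instead of rowid:
#   DELETE FROM athena_data a USING athena_data b
#   WHERE a.ctid > b.ctid AND a.col_a = b.col_a AND a.col_b = b.col_b;
conn.execute(
    """
    DELETE FROM athena_data
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM athena_data GROUP BY col_a, col_b
    )
    """
)
print(conn.execute("SELECT COUNT(*) FROM athena_data").fetchone()[0])  # → 2
```

Deleting by rowid/ctid sidesteps the missing primary key entirely, which is why this pattern fits the situation described earlier in the thread.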
hans_ienasescu
11/23/2022, 7:02 AMchristoph
11/23/2022, 7:17 AM
Adding a helper utility to the tap-athena target-postgres block might be a way, but I'm not sure how that helper utility would know which table in which schema in which Postgres cluster to truncate.

christoph
11/23/2022, 7:21 AM

hans_ienasescu
11/23/2022, 7:42 AM

abhishek_ajmera
11/23/2022, 8:02 AM
"A full list of supported settings and capabilities is available by running: target-postgres --about"
Maybe check that out as well, might find the config you need.
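For readers following along, the settings discussed in this thread would sit in the loader section of meltano.yml. The following is a minimal sketch only, assuming the pipelinewise (transferwise) variant; the exact setting names should be verified against the target-postgres --about output mentioned above:

```yaml
loaders:
  - name: target-postgres
    variant: transferwise          # the pipelinewise variant mentioned earlier
    config:
      primary_key_required: false  # the Athena rows have no one-column PK
      add_metadata_columns: true   # adds _sdc_* load-timestamp columns to each row
```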
hans_ienasescu
11/23/2022, 8:24 AM
truncate is exactly what I would need, but unfortunately that's not something target-postgres has.
Could you post the link of the text you quoted above? I can't seem to find it, thanks!

abhishek_ajmera
11/23/2022, 8:25 AM

hans_ienasescu
11/23/2022, 8:33 AM

abhishek_ajmera
11/23/2022, 8:38 AM

hans_ienasescu
11/23/2022, 10:45 AM

abhishek_ajmera
11/23/2022, 10:49 AM

hans_ienasescu
11/23/2022, 11:28 AM

hans_ienasescu
11/23/2022, 11:54 AM

thomas_briggs
11/23/2022, 2:00 PM

aaronsteers
11/23/2022, 5:02 PM

aaronsteers
11/23/2022, 5:03 PM

hans_ienasescu
11/29/2022, 12:33 PM
There is also the datamill-co version of target-postgres that allows for execution of SQL statements at connection time and before the load is run. One can truncate / drop things there as well, but the datamill-co variant is missing other things.
I've also created a feature request:
https://github.com/transferwise/pipelinewise-target-postgres/issues/109
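For completeness, the datamill-co workaround described above might look roughly like this in meltano.yml. This is a sketch under assumptions: the before_run_sql setting name is my recollection of that variant's config (check its README), and the table name is hypothetical:

```yaml
loaders:
  - name: target-postgres
    variant: datamill-co
    config:
      # assumption: the variant's hook for SQL executed before the load;
      # verify the exact setting name in the datamill-co documentation
      before_run_sql: "TRUNCATE TABLE public.athena_data;"  # hypothetical table name
```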