# troubleshooting
h
Hello! I've been using Meltano for a couple of months with a number of different extractors and usually `target-postgres` as the loader. I'm currently in a situation where I extract data from Amazon Athena and load it into Postgres. The Athena data doesn't really have a nice one-column primary key, but overall the rows are unique, so in the Postgres loader I have set `primary_key_required: false`. I only have one table. My problem is that if I run the Athena extractor + Postgres loader multiple times, it duplicates the entire dataset and adds the same data again and again to the table. I've been through the documentation and I cannot find a way to avoid duplicating the data every time I run Meltano. I hoped there would be a simple option in the target to say "refresh, i.e. truncate, the table before inserting new rows", but perhaps that's the wrong way to think about it. Any thoughts / tips on how to "refresh" the data in the target every time? Thanks
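For context, the setup described above corresponds roughly to a `meltano.yml` like the following sketch (the option name is the one from the thread; everything else is illustrative):

```yaml
# Illustrative excerpt of the setup described above, not an exact copy of the project.
plugins:
  extractors:
    - name: tap-athena
  loaders:
    - name: target-postgres
      config:
        # rows have no single-column primary key, so key enforcement is disabled
        primary_key_required: false
```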
c
My initial reaction would be to not delete any data. What I would do in this scenario is keep the data from every EL job run. For `target-postgres` in the pipelinewise variant, I enable the `add_metadata_columns` option, so that I know which rows have been loaded at what point in time. I would then build the appropriate dbt models on top of that loaded data to do the required analysis.
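A minimal sketch of that option in `meltano.yml`, assuming the pipelinewise (transferwise) variant of the loader:

```yaml
# Illustrative excerpt: with this enabled, the pipelinewise loader adds metadata
# columns such as _sdc_extracted_at / _sdc_batched_at to every loaded row.
plugins:
  loaders:
    - name: target-postgres
      variant: transferwise
      config:
        add_metadata_columns: true
```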
h
Thanks @christoph, the issue is that the data doesn't really change that much, so I would be duplicating most if not all of it every time. I don't really want an incremental load, just to overwrite everything every time I run Meltano. Given that there isn't a nice one- or two-column key to look at, it would be hard even in dbt to identify duplicates without comparing all columns. I am quite surprised that there is no option to do a fresh load every time. If I enable `add_metadata_columns` I could use dbt to only take the rows that were added in the current EL run, but my source table created by Meltano would grow and grow, and I would eventually be forced to truncate it anyway, since the previously loaded data isn't useful in any way.
a
> the data doesn't really change that much, so I would be duplicating most if not all of it every time
If that's the case, wouldn't it be better for you to replicate only the data that has changed, via an incremental update? What we do with our data is run a scheduled query periodically to drop old duplicate rows from the table.
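For reference, incremental replication is typically configured via stream metadata in `meltano.yml`; a sketch, assuming the stream had a usable replication key (which, per the thread, this dataset may not have):

```yaml
# Illustrative excerpt: replicate only new/changed rows for one stream,
# assuming a column such as updated_at exists to act as the replication key.
plugins:
  extractors:
    - name: tap-athena
      metadata:
        my_athena_table:
          replication-method: INCREMENTAL
          replication-key: updated_at
```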
h
@abhishek_ajmera Yes, that would be a solution, but the data I'm working with is around 50K rows, which isn't that much to load all at once. I don't really want to introduce a lot of complexity to handle incremental loading plus periodic clean-up. As I see it, your suggestion would be a good solution for large volumes of not-so-often-changing data, where there is a big cost to processing all of it every time. In my case I want to keep it simple and refresh the data every time, kind of like what dbt does by default. Thanks
c
Makes perfect sense. It sounds like you have a very good case for simply truncating the target table on every load. I can't think of a surefire way to implement this with current capabilities. I'm thinking that running a 'utility' that performs the truncate prior to the `tap-athena target-postgres` block might be a way, but I'm not sure how that helper utility would know which table in which schema in which Postgres cluster to truncate.
You can read up on the topic in this thread as well: https://meltano.slack.com/archives/C01UTUSP34M/p1668706145001149
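One hedged way to sketch that 'utility' idea is a custom plugin in `meltano.yml` that hard-codes the connection and table (all names and credentials below are placeholders, and this assumes `psql` is available on the path):

```yaml
# Hypothetical custom utility that empties the landing table before the EL block.
plugins:
  utilities:
    - name: truncate-landing-table
      namespace: truncate_landing_table
      commands:
        truncate:
          executable: psql
          args: >-
            postgresql://user:password@warehouse-host:5432/warehouse
            -c "TRUNCATE TABLE tap_athena.my_athena_table"
```

It could then be run ahead of the EL block, e.g. `meltano run truncate-landing-table:truncate tap-athena target-postgres`.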
h
Thanks @christoph, will look into it
a
@hans_ienasescu I haven't used the Postgres loader, but in the BigQuery loader there is another configuration setting, the replication method, which can be append or truncate. The Postgres repo has functions to drop data from a table (which would be called before inserting new data) as well as to upsert data (done via an upsert query). As per their repo:
> A full list of supported settings and capabilities is available by running: `target-postgres --about`
Maybe check that out as well, you might find the config you need.
h
@abhishek_ajmera A replication_method of `truncate` is exactly what I would need, but unfortunately that's not something `target-postgres` has. Could you post the link to the text you quoted above? I can't seem to find it, thanks!
a
Sure. Just above installation.
h
oooh ok, this is the MeltanoLabs version of target-postgres, I was looking at the one from transferwise
a
Oh sorry, my bad. I just went to the first search result. 😅
h
@abhishek_ajmera I installed the MeltanoLabs version of target-postgres and there are no options to drop data from the table. You wrote above: "The Postgres repo has functions to drop data from a table"; what did you mean by this?
a
The codebase had functions written for it; I'm not sure about the commands or configuration required to invoke those functions.
h
@abhishek_ajmera I looked at the codebase as well; there is a drop_table function, but it's probably used to delete temporary tables.
I ran Meltano using tap-csv and target-postgres with a small 10-row CSV file:
• first run: I have 10 rows in the table, all good
• I delete 3 rows in the middle of the CSV file, run Meltano again, look at the table: still 10 rows
How a fresh load is not a thing, I do not understand...
t
I haven't tried it, but I wonder if you could accomplish this using a dbt post_hook. Set the model materialization to table, such that it gets rebuilt each time; then in the post_hook delete all the rows from the source table. 🤨 🤔
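A rough sketch of that idea in `dbt_project.yml`, with project, model, and source table names made up for illustration:

```yaml
# Hypothetical excerpt: rebuild the model as a table on every dbt run, then
# empty the Meltano-loaded source table so the next EL run starts clean.
models:
  my_project:
    athena_data:
      +materialized: table
      +post-hook: "delete from tap_athena.my_athena_table"
```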
a
There's an unofficial but still widely implemented ACTIVATE_VERSION feature which may help...
h
Thanks all for the help. I've decided to write my own Python scripts to truncate all the sources before I run Meltano. There is also the `datamill-co` variant of target-postgres that allows executing SQL statements at connection time, before the load is run; one can truncate / drop things there as well, but the `datamill-co` variant is missing other things. I've also created a feature request: https://github.com/transferwise/pipelinewise-target-postgres/issues/109
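For completeness, a minimal sketch of the kind of pre-run truncation script described above, assuming `psycopg2` and with connection details and table names as placeholders:

```python
# Hypothetical pre-run script: truncate the tables that target-postgres loads into,
# so the following `meltano run tap-athena target-postgres` starts from empty tables.
import psycopg2

# Illustrative values; real DSN and table names would come from your own config.
DSN = "postgresql://user:password@warehouse-host:5432/warehouse"
TABLES = ["tap_athena.my_athena_table"]

def truncate_tables(dsn: str, tables: list[str]) -> None:
    with psycopg2.connect(dsn) as conn:  # commits the transaction on success
        with conn.cursor() as cur:
            for table in tables:
                cur.execute(f"TRUNCATE TABLE {table}")
    conn.close()

if __name__ == "__main__":
    truncate_tables(DSN, TABLES)
```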