# troubleshooting
h
Hello! I've been using Meltano for a couple of months with a number of different extractors and usually `target-postgres` as the loader. I'm currently in a situation where I extract data from Amazon Athena and load it into Postgres. The Athena data doesn't really have a nice one-column primary key, but overall the rows are unique, so in the Postgres loader I have set `primary_key_required: false`. I only have one table. My problem is that if I run the Athena extractor + Postgres loader multiple times, it duplicates the entire dataset and adds the same data again and again to the table. I've been through the documentation and I cannot find a way to avoid duplicating the data every time I run Meltano. I hoped there would be a simple option in the target to say "refresh, i.e. truncate, the table before inserting new rows", but perhaps that's the wrong way to think about it. Any thoughts / tips on how to "refresh" the data in the target every time? Thanks
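For context, the setup described above corresponds roughly to a `meltano.yml` like the following sketch (the option name is the one from the thread; everything else is illustrative):

```yaml
# Illustrative excerpt of the setup described above, not an exact copy of the project.
plugins:
  extractors:
    - name: tap-athena
  loaders:
    - name: target-postgres
      config:
        # rows have no single-column primary key, so key enforcement is disabled
        primary_key_required: false
```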
c
My initial reaction would be to not delete any data. What I would do in this scenario is keep the data from every EL job run. For `target-postgres` in the pipelinewise variant, I enable the `add_metadata_columns` option, so that I know which rows have been loaded at what point in time. I would then build the appropriate dbt models on top of that loaded data to do the required analysis.
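A minimal sketch of that option in `meltano.yml`, assuming the pipelinewise (transferwise) variant of the loader:

```yaml
# Illustrative excerpt: with this enabled, the pipelinewise loader adds metadata
# columns such as _sdc_extracted_at / _sdc_batched_at to every loaded row.
plugins:
  loaders:
    - name: target-postgres
      variant: transferwise
      config:
        add_metadata_columns: true
```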
h
Thanks @christoph, the issue is that the data doesn't really change that much, so I would be duplicating most if not all of it every time. I don't really want an incremental load, just to overwrite everything every time I run Meltano. Given that there isn't a nice one- or two-column key to look at, it would be hard even in dbt to identify duplicates without comparing all columns. I am quite surprised that there is no option to do a fresh load every time. If I enable `add_metadata_columns` I could use dbt to only take the rows that were added in the current EL run, but my source table created by Meltano would grow and grow, and I would eventually be forced to truncate it anyway, since the previously loaded data isn't useful in any way.
a
> the data doesn't really change that much, so I would be duplicating most if not all of it every time
If that's the case, wouldn't it be better for you to replicate only the data that has changed, via an incremental update? What we do with our data is run a scheduled query periodically to drop old duplicate rows from the table.
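For reference, incremental replication is typically configured via stream metadata in `meltano.yml`; a sketch, assuming the stream had a usable replication key (which, per the thread, this dataset may not have):

```yaml
# Illustrative excerpt: replicate only new/changed rows for one stream,
# assuming a column such as updated_at exists to act as the replication key.
plugins:
  extractors:
    - name: tap-athena
      metadata:
        my_athena_table:
          replication-method: INCREMENTAL
          replication-key: updated_at
```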
h
@abhishek_ajmera Yes, that would be a solution, but the data I'm working with is around 50K rows, which isn't that much to load all at once. I don't really want to introduce a lot of complexity to handle incremental loading plus periodic clean-up. As I see it, your suggestion would be a good solution for large volumes of not-so-often-changing data, where there is a big cost to processing all of it every time. In my case I want to keep it simple and refresh the data every time, kind of like what dbt does by default. Thanks
c
Makes perfect sense. It sounds like you have a very good case for simply truncating the target table on every load. I can't think of a surefire way to implement this with current capabilities. I'm thinking that running a 'utility' that performs the truncate prior to the `tap-athena target-postgres` block might be a way, but I'm not sure how that helper utility would know which table in which schema in which Postgres cluster to truncate.
You can read up on the topic in this thread as well: https://meltano.slack.com/archives/C01UTUSP34M/p1668706145001149
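One hedged way to sketch that 'utility' idea is a custom plugin in `meltano.yml` that hard-codes the connection and table (all names and credentials below are placeholders, and this assumes `psql` is available on the path):

```yaml
# Hypothetical custom utility that empties the landing table before the EL block.
plugins:
  utilities:
    - name: truncate-landing-table
      namespace: truncate_landing_table
      commands:
        truncate:
          executable: psql
          args: >-
            postgresql://user:password@warehouse-host:5432/warehouse
            -c "TRUNCATE TABLE tap_athena.my_athena_table"
```

It could then be run ahead of the EL block, e.g. `meltano run truncate-landing-table:truncate tap-athena target-postgres`.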
h
Thanks @christoph, will look into it
a
@hans_ienasescu I haven't used the Postgres loader, but in the BigQuery loader there is another configuration setting, the replication method, which can be append or truncate. The Postgres repo has functions to drop data from a table (which would be called before inserting new data) as well as to upsert data (done via an upsert query). As per their repo:
> A full list of supported settings and capabilities is available by running: `target-postgres --about`
Maybe check that out as well, you might find the config you need.
h
@abhishek_ajmera A replication_method of `truncate` is exactly what I would need, but unfortunately that's not something `target-postgres` has. Could you post the link to the text you quoted above? I can't seem to find it, thanks!
a
Sure. Just above installation.
h
oooh ok, this is the MeltanoLabs version of target-postgres, I was looking at the one from transferwise
a
Oh sorry, my bad. I just went to the first search result. 😅
h
@abhishek_ajmera I installed the MeltanoLabs version of target-postgres and there are no options to drop data from the table. You wrote above: "The Postgres repo has functions to drop data from a table"; what did you mean by this?
a
The codebase had functions written for it; I'm not sure about the commands or configuration required to invoke those functions.
h
@abhishek_ajmera I looked at the codebase as well; there is a drop_table function, but it's probably used to delete temporary tables.
I ran Meltano using tap-csv and target-postgres with a small 10-row CSV file:
• first run: I have 10 rows in the table, all good
• I delete 3 rows in the middle of the CSV file, run Meltano again, look at the table: still 10 rows
How a fresh load is not a thing, I do not understand...
t
I haven't tried it, but I wonder if you could accomplish this using a dbt post_hook. Set the model materialization to table, such that it gets rebuilt each time; then in the post_hook delete all the rows from the source table. 🤨 🤔
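A rough sketch of that idea in `dbt_project.yml`, with project, model, and source table names made up for illustration:

```yaml
# Hypothetical excerpt: rebuild the model as a table on every dbt run, then
# empty the Meltano-loaded source table so the next EL run starts clean.
models:
  my_project:
    athena_data:
      +materialized: table
      +post-hook: "delete from tap_athena.my_athena_table"
```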
a
There's an unofficial but still widely implemented ACTIVATE_VERSION feature which may help...
h
Thanks all for the help. I've decided to write my own Python scripts to truncate all the sources before I run Meltano. There is also the `datamill-co` variant of target-postgres that allows executing SQL statements at connection time, before the load is run; one can truncate / drop things there as well, but the `datamill-co` variant is missing other things. I've also created a feature request: https://github.com/transferwise/pipelinewise-target-postgres/issues/109
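For completeness, a minimal sketch of the kind of pre-run truncation script described above, assuming `psycopg2` and with connection details and table names as placeholders:

```python
# Hypothetical pre-run script: truncate the tables that target-postgres loads into,
# so the following `meltano run tap-athena target-postgres` starts from empty tables.
import psycopg2

# Illustrative values; real DSN and table names would come from your own config.
DSN = "postgresql://user:password@warehouse-host:5432/warehouse"
TABLES = ["tap_athena.my_athena_table"]

def truncate_tables(dsn: str, tables: list[str]) -> None:
    with psycopg2.connect(dsn) as conn:  # commits the transaction on success
        with conn.cursor() as cur:
            for table in tables:
                cur.execute(f"TRUNCATE TABLE {table}")
    conn.close()

if __name__ == "__main__":
    truncate_tables(DSN, TABLES)
```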