jacob_matson
10/17/2022, 8:36 PM
tap-spreadsheets-anywhere & target-parquet, but it might be more generally applicable. I'm hitting a CSV file on the web, where the data inside updates ~once per day. When I execute meltano run tap-spreadsheets-anywhere target-parquet for a CSV file on the web, it fails to find a "modified date" and thus executes a full load, duplicating the entire CSV. This does not occur if it is located in an S3 bucket (or on my local machine, anywhere where Meltano can find a modified date). In this case (web-sourced CSV), I want to execute an upsert instead of an insert. Is there any way to do this?
My current workaround is dropping the dataset from the file (adding a dupe check into my transform step), but that is kludgy at best because there is no import date timestamp in the table built by Meltano (if there is an import date, I can simply grab the latest record for a given key).
visch
10/18/2022, 2:13 AM
GET /csvfile
(we'll ignore the header and other bits)
The CSV file doesn't have an API that allows you to filter it. That would look something like
GET /data?modifieddategt=yesterday
Therefore your tap is pulling all of the "csv" regardless. There are lots of tricks, but generally a CSV doesn't give you the capacity to do a true "incremental" load.
For S3 you'll see multiple CSV files with different modified dates, and you'll query the latest S3 files with a modified date beyond what you're looking for. This works because it's multiple GET /csvfile requests.
---
To your question: normally for CSV files you can still do an upsert by specifying the primary key for that file.
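For context, a minimal sketch of what that might look like in meltano.yml for tap-spreadsheets-anywhere (the URL, stream name, and id column are placeholders; check the tap's README for the exact option names):
plugins:
  extractors:
  - name: tap-spreadsheets-anywhere
    config:
      tables:
      - path: https://example.com/exports    # placeholder location of the web-hosted CSV
        name: daily_csv
        pattern: "report\\.csv"
        start_date: "2022-10-01T00:00:00Z"
        key_properties: [id]                 # primary key the target can upsert on instead of appending
        format: csv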
> if it is located in an S3 bucket (or on my local machine, anywhere where Meltano can find a modified date)
Are you sure about this? It's the file modified date, not the row modified date.
visch
10/18/2022, 2:15 AM
> My current workaround is dropping the dataset from the file (adding a dupe check into my transform step), but that is kludgy at best because there is no import date timestamp in the table built by Meltano (if there is an import date, I can simply grab the latest record for a given key).
This is what I do fwiw
jacob_matson
10/18/2022, 2:22 AM
> normally for CSV files you can still do an upsert by specifying the primary key for that file.
Will do some testing on this; it could be an issue with the target if the upsert is being dumb (i.e. not handling the key).
> Are you sure about this? It's the file modified date, not the row modified date
Right, file modified date. Sorry if that was unclear. In regards to the workaround, is there any way to add a column that holds "last import date" by row? I swear Meltano used to include that by default but I can't find it.
jacob_matson
10/18/2022, 2:45 AM
I guess I can implement a "source pruner" as a pre-hook in dbt but don't feel great about it tbh
visch
10/18/2022, 1:06 PM
> In regards to the workaround, is there any way to add a column that holds "last import date" by row? I swear Meltano used to include that by default but I can't find it.
Yes, it's per tap. Most taps implement _sdc_* metadata columns. _sdc_extracted_at might be what you're looking for here.
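Whether those columns show up can also depend on the loader: targets built on the Meltano SDK typically expose an add_record_metadata setting that stamps _sdc_extracted_at and friends onto every row. A sketch, assuming the target-parquet variant in use is SDK-based (worth confirming against its settings):
plugins:
  loaders:
  - name: target-parquet
    config:
      add_record_metadata: true   # assumed SDK setting; adds _sdc_extracted_at, _sdc_received_at, etc.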
> I guess I can implement a "source pruner" as a pre-hook in dbt but don't feel great about it tbh
For a single CSV that you want to import over and over, that's pretty much the way to go: since the CSV is the only authoritative source, you only want to keep the current CSV file's data. Some folks like to keep a history; it depends on what you need. Most of the time I don't care about the history, so I make a little dbt macro, something like:
{%- macro autoidm_dropy() -%}
{%- set drop_query -%}
DROP SCHEMA IF EXISTS tap_name CASCADE
{%- endset -%}
{% do run_query(drop_query) %}
{%- endmacro -%}
meltano run dbt:dropy tap-csv target-whatever dbt:run
In your meltano.yml you add a command for the dbt plugin:
dropy: run-operation autoidm_dropy
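Roughly, that wiring might look like this in meltano.yml, assuming the dbt transformer plugin is named dbt so the dbt: prefix in meltano run resolves (the plugin type and description are illustrative):
plugins:
  transformers:
  - name: dbt
    commands:
      dropy:
        args: run-operation autoidm_dropy   # runs the drop-schema macro defined above
        description: Drop the raw tap schema before re-extracting the CSV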
visch
10/18/2022, 1:09 PM
jacob_matson
10/18/2022, 2:06 PM