# plugins-general
j
hey everyone, have a silly question with regard to `tap-spreadsheets-anywhere` & `target-parquet`, but it might be more generally applicable. I'm hitting a CSV file on the web, where the data inside updates ~once per day. When I execute `meltano run tap-spreadsheets-anywhere target-parquet` for a CSV file on the web, it fails to find a "modified date" and thus executes a full load, duplicating the entire CSV. This does not occur if the file is located in an S3 bucket (or on my local machine, anywhere Meltano can find a modified date). In this case (web-sourced CSV), I want to execute an upsert instead of an insert. Is there any way to do this? My current workaround is dropping the dataset from the file (adding a dupe check into my transform step), but that is kludgy at best because there is no import-date timestamp in the table built by Meltano (if there were an import date, I could simply grab the latest record for a given key).
v
Problem comes down to the API. I'm going to bastardize this example to try to get the point across; there's a lot of nuance here, as not everything is an HTTP call, among a host of other reasons. So when you want a CSV file, what are you actually doing? It's an HTTP request, so something like
GET /csvfile
(we'll ignore the headers and other bits). The CSV file doesn't have an API that allows you to filter it; that would look something like
GET /data?modifieddategt=yesterday
Therefore your tap is pulling all of the "csv" regardless. There are lots of tricks, but generally a CSV doesn't give you the capacity to do a true "incremental" load. For S3 you'll see multiple CSV files with different modified dates, and you'll query only the S3 files with a modified date beyond what you're looking for. That works because it's multiple
GET /csvfile
requests. To your question: normally for CSV files you can still do an upsert by specifying the primary key for that file.
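Something like this in your meltano.yml, for example (untested sketch; the URL, the `daily_csv` stream name, and the `id` key are placeholders, so double-check the option names against the tap's docs):
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      config:
        tables:
          - path: https://example.com/exports   # placeholder: wherever the file lives
            pattern: "data\\.csv"               # regex matching the file name at that path
            name: daily_csv                     # placeholder stream/table name
            format: csv
            start_date: "2021-01-01T00:00:00Z"  # skip files last modified before this
            key_properties: ["id"]              # placeholder primary key the target can upsert on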
> This does not occur if the file is located in an S3 bucket (or on my local machine, anywhere Meltano can find a modified date)
Are you sure about this? It's the file modified date, not the row modified date
> My current workaround is dropping the dataset from the file (adding a dupe check into my transform step), but that is kludgy at best because there is no import-date timestamp in the table built by Meltano (if there were an import date, I could simply grab the latest record for a given key).
This is what I do fwiw
j
> normally for CSV files you can still do an upsert by specifying the primary key for that file.
Will do some testing on this; it could be an issue with the target if the upsert is being dumb (i.e. not handling the key).
> Are you sure about this? It's the file modified date, not the row modified date
right, file modified date, sorry if that was unclear. Regarding the workaround: is there any way to add a column that holds a "last import date" by row? I swear Meltano used to include that by default but I can't find it.
I guess I can implement a "source pruner" as a pre-hook in dbt but don't feel great about it tbh
v
> Regarding the workaround: is there any way to add a column that holds a "last import date" by row? I swear Meltano used to include that by default but I can't find it.
Yes, it's per tap. Most taps implement `_sdc_*` metadata columns; `_sdc_extracted_at` might be what you're looking for here.
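For example, if your target does write those columns, deduping in dbt takes only a few lines (sketch; `id` and the `spreadsheets`/`daily_csv` source names are placeholders):
-- keep only the most recently extracted row per key
with ranked as (
    select
        *,
        row_number() over (
            partition by id                  -- placeholder primary key
            order by _sdc_extracted_at desc  -- Singer metadata column
        ) as rn
    from {{ source('spreadsheets', 'daily_csv') }}  -- placeholder source
)
select * from ranked where rn = 1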
> I guess I can implement a "source pruner" as a pre-hook in dbt but don't feel great about it tbh
For a single CSV that you want to import over and over, that's pretty much the way: since the CSV is the only thing that is authoritative, you want to keep only the current CSV file's data. Some folks like to keep a history; it depends on what you need. Most of the time I don't care about the history, so I make a little dbt macro, something like
{%- macro autoidm_dropy() -%}
    {#- drop the schema the tap loads into so the next run starts clean -#}
    {%- set drop_query -%}
        DROP SCHEMA IF EXISTS tap_name CASCADE
    {%- endset -%}
    {% do run_query(drop_query) %}
{%- endmacro -%}
meltano run dbt:dropy tap-csv target-whatever dbt:run
In your meltano.yml you add a command for dbt:
dropy: run-operation autoidm_dropy
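In context, the full nesting would be something like (sketch; assumes your transformer plugin is named `dbt`):
plugins:
  transformers:
    - name: dbt
      commands:
        dropy: run-operation autoidm_dropy   # exposes `dbt:dropy` for `meltano run`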
Spreadsheets anywhere should really be populating the metadata fields, as that would give you the column-level stuff you're after, but dropping is just as easy 🤷
j
Ok, really good tip on the run-operations. I'll add those. I'm going to see if the other variants of the tap have implemented the timestamp column and go from there.