# plugins-general
j
hey everyone, have a silly question with regard to `tap-spreadsheets-anywhere` & `target-parquet`, but it might be more generally applicable. I'm hitting a CSV file on the web, where the data inside updates ~once per day. When I execute `meltano run tap-spreadsheets-anywhere target-parquet` for a CSV file on the web, it fails to find a "modified date" and thus executes a full load, duplicating the entire CSV. This does not occur if the file is located in an S3 bucket (or on my local machine, anywhere Meltano can find a modified date). In this case (web-sourced CSV), I want to execute an upsert instead of an insert. Is there any way to do this? My current workaround is dropping the dataset from the file (adding a dupe check into my transform step), but that is kludgy at best because there is no import-date timestamp in the table built by Meltano (if there were an import date, I could simply grab the latest record for a given key).
v
Problem comes down to the API. I'm going to bastardize this example to try to get the point across; there's a lot of nuance here, as not everything is an HTTP call, among a host of other reasons. So when you want a CSV file, what are you actually doing? It's an HTTP request, so something like
GET /csvfile
(we'll ignore the headers and other bits). The CSV file doesn't have an API that allows you to filter it; that would look something like
GET /data?modifieddategt=yesterday
Therefore your tap is pulling all of the "csv" regardless. There are lots of tricks, but generally a CSV doesn't give you the capacity to do a true "incremental" load. For S3 you'll see multiple CSV files with different modified dates, and you'll query only the S3 files with a modified date beyond what you're looking for. That works because it's multiple
GET /csvfile
requests. To your question: normally for CSV files you can still do an upsert by specifying the primary key for that file.
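Something like this in your meltano.yml, for example (untested sketch; the URL, the `daily_csv` stream name, and the `id` key are placeholders, so double-check the option names against the tap's docs):
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      config:
        tables:
          - path: https://example.com/exports   # placeholder: wherever the file lives
            pattern: "data\\.csv"               # regex matching the file name at that path
            name: daily_csv                     # placeholder stream/table name
            format: csv
            start_date: "2021-01-01T00:00:00Z"  # skip files last modified before this
            key_properties: ["id"]              # placeholder primary key the target can upsert on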
> This does not occur if the file is located in an S3 bucket (or on my local machine, anywhere Meltano can find a modified date)
Are you sure about this? It's the file modified date, not the row modified date
> My current workaround is dropping the dataset from the file (adding a dupe check into my transform step), but that is kludgy at best because there is no import-date timestamp in the table built by Meltano (if there were an import date, I could simply grab the latest record for a given key).
This is what I do fwiw
j
> normally for CSV files you can still do an upsert by specifying the primary key for that file.
Will do some testing on this; it could be an issue with the target if the upsert is being dumb (i.e. not handling the key).
> Are you sure about this? It's the file modified date, not the row modified date
right, file modified date, sorry if that was unclear. Regarding the workaround: is there any way to add a column that holds a "last import date" by row? I swear Meltano used to include that by default but I can't find it.
I guess I can implement a "source pruner" as a pre-hook in dbt but don't feel great about it tbh
v
> Regarding the workaround: is there any way to add a column that holds a "last import date" by row? I swear Meltano used to include that by default but I can't find it.
Yes, it's per tap. Most taps implement `_sdc_*` metadata columns; `_sdc_extracted_at` might be what you're looking for here.
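For example, if your target does write those columns, deduping in dbt takes only a few lines (sketch; `id` and the `spreadsheets`/`daily_csv` source names are placeholders):
-- keep only the most recently extracted row per key
with ranked as (
    select
        *,
        row_number() over (
            partition by id                  -- placeholder primary key
            order by _sdc_extracted_at desc  -- Singer metadata column
        ) as rn
    from {{ source('spreadsheets', 'daily_csv') }}  -- placeholder source
)
select * from ranked where rn = 1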
> I guess I can implement a "source pruner" as a pre-hook in dbt but don't feel great about it tbh
For a single CSV that you want to import over and over, that's pretty much the way: since the CSV is the only thing that is authoritative, you want to keep only the current CSV file's data. Some folks like to keep a history; it depends on what you need. Most of the time I don't care about the history, so I make a little dbt macro, something like
{%- macro autoidm_dropy() -%}
    {#- drop the schema the tap loads into so the next run starts clean -#}
    {%- set drop_query -%}
        DROP SCHEMA IF EXISTS tap_name CASCADE
    {%- endset -%}
    {% do run_query(drop_query) %}
{%- endmacro -%}
meltano run dbt:dropy tap-csv target-whatever dbt:run
In your meltano.yml you add a command for dbt:
dropy: run-operation autoidm_dropy
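In context, the full nesting would be something like (sketch; assumes your transformer plugin is named `dbt`):
plugins:
  transformers:
    - name: dbt
      commands:
        dropy: run-operation autoidm_dropy   # exposes `dbt:dropy` for `meltano run`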
Spreadsheets anywhere should really be populating the metadata fields, as that would give you the column-level stuff you're after, but dropping is just as easy 🤷
j
Ok, really good tip on the run-operations. I'll add those. I'm going to see if the other variants of the tap have implemented the timestamp column and go from there.