Hello everybody, I'm having trouble understanding...
# troubleshooting
a
Hello everybody, I'm having trouble understanding the incremental loading. We use Meltano and Docker to extract our backend every night and load it into our analytics warehouse. Everything works so far, but recently I added a table that is >100M rows and the time to load everything has completely exploded. When looking at the logs, I can see rows like that :
Copy code
2024-11-04T14:57:02.091508Z [info     ] Incremental state has been updated at 2024-11-04 14:57:02.091402.
Also, all my tables have a
date_modified
column. Since we are using Docker, we build the whole Meltano project from scratch every time we load our data. So the info about
incremental state
is lost, correct ? Is there a way in my
meltano.yml
file to setup my
tap-postgres
so that I can fetch only the data from the last week (for instance), and then upsert the resulting data ?
1
v
There's a lot to unpack here. But I'll try to just answer you. Generally it seems like some choices about architecture may need to change.
So the info about
incremental state
is lost, correct ?
yes, depending on your meltano.yml file specefically how you're managing your engine / state backend.
Is there a way in my
meltano.yml
file to setup my
tap-postgres
so that I can fetch only the data from the last week (for instance), and then upsert the resulting data ?
Yes, check out state backends / backend engine
Everything works so far, but recently I added a table that is >100M rows and the time to load everything has completely exploded.
Easiest fix for you might be to just run this one table on a separate schedule it everything else is working great for you.
a
Yeah I was looking into the state backend documentation, we might need to store the results in our Postgres database. If I wanted to do something simpler, would there be a way for this specific table to fetch only the last 7 days of data ? Something like this ?
Copy code
stream_maps:
        public-metric_value:
          __key_properties__:
          - id
          - instance_id
          comment: sha3(comment) if comment else ''
          instance_id: str('${instance_id}')
          meltano_extracted_at: datetime.datetime.now()
    select:
    - public-metric_value.*
    where:
    - date_update >= current_date - interval '7 days'
I couldn't find doc regarding
where
filters
v
honestly incremental is so much simpler I would do that instead
1
a
Gotcha, I'll look into it. Thanks !
a
Hell yeah ! Thanks @visch
dancingpenguin 1
🔥 1
💯 1
np 1
Last question @visch, do you know what happens if the column I use for my incremental state has null values ?
Will the corresponding rows be fetched and updated on every run ?
v
we deal with it I forget the exact implementation, generally your replication key's shouldn't have nulls would be the first recommendation I'd have for you
e
Hi @Arnaud Stephan! Postgres supports
NULLS FIRST
(see the flag in the tap code) so if the replication key column has null values, they come in first and shouldn't affect the state if there are non-null values.
1
1
a
generally your replication key's shouldn't have nulls would be the first recommendation I'd have for you
Sadly I'm dealing with a shitty backend 🫠
v
Was just pulling up the code, @Edgar Ramírez (Arch.dev)’s got you 😄
a
Thanks to you both !
np 2