# getting-started
j
Good morning (in CZ), Meltano + dbt running in incremental mode. Using Snowflake as the Meltano target, dbt transforms the data there. I need to store the state of pipeline runs, and I could even benefit from joining the Meltano state tables with the loaded tables to perform an efficient incremental transformation in dbt. Unfortunately, I cannot store the Meltano state in Snowflake; the only supported targets are SQLite, PostgreSQL, and MSSQL. Btw. dbt can store its state tables in all supported databases.
Is this something that is on your radar? Would anyone else appreciate the ability to store Meltano state tables in all widely used databases (for which a production-ready Meltano target exists, obviously)?
t
Can you define another pipeline that pushes Meltano's data to Snowflake? I know that's a hack, but I wonder if it would work...
a
Good morning (in CZ),
Good morning! Logged in from Santiago, Chile (UTC-3) 🇨🇱 this week; the Meltano team is just wrapping up our 1-week colocation. 🙂
I need to store the state of pipeline runs.
Have you had a look at the new custom state backends feature? https://docs.meltano.com/concepts/state_backends
c
Hello @jan_soubusta 👋 We currently support the major cloud storage providers as remote state backends. Additional state backends such as Postgres, MSSQL, or even something like DynamoDB have been casually discussed but are not currently on our radar.
As @thomas_briggs suggested, one solution could be to load Meltano state from the state backend (either the SQLite backend db or a remote state backend) into Snowflake via a dedicated pipeline. If you have a specific state backend in mind for your use case, I'd also encourage you to submit a feature request and we can consider it as part of our usual prioritization.
j
Thanks for the responses! The "state backends" feature looks good; at the very least, it lets me get rid of the PostgreSQL instance. But still, if I need to join the Meltano state with the loaded tables in dbt to make the transformation sufficiently incremental, I need to load the state into the target.
t
Do you? If you enable target-snowflake's add_metadata_columns option, doesn't the _sdc_batched_at column give you enough information to know which data is new?
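For illustration, a minimal dbt incremental model along those lines might look like the sketch below; only `_sdc_batched_at` comes from the thread, the source and table names are placeholders.

```sql
-- Hypothetical dbt incremental model that only pulls rows loaded since
-- the last run, using the _sdc_batched_at metadata column added by the
-- target. Source and table names are placeholders.
{{ config(materialized='incremental') }}

select *
from {{ source('meltano_raw', 'issues') }}

{% if is_incremental() %}
  where _sdc_batched_at > (select max(_sdc_batched_at) from {{ this }})
{% endif %}
```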
a
@jan_soubusta - If you want to view latest state bookmarks from Snowflake, one thing you could try would be S3 (or similar) as a backend, and then mapping the S3 directory to an external table in Snowflake. That way, Snowflake will be able to query the latest state directly from S3, with no latency.
I haven't implemented this before but others have asked for similar in the past. If you get it working, I'm sure folks would appreciate the case study and learnings.
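A rough Snowflake SQL sketch of that idea, assuming the state backend writes its JSON state files under an S3 prefix (the bucket, prefix, and storage integration names here are made up):

```sql
-- Stage pointing at the S3 prefix used as the Meltano state backend
-- (bucket path and storage integration are placeholders).
create stage meltano_state_stage
  url = 's3://my-bucket/meltano/state/'
  storage_integration = my_s3_integration
  file_format = (type = json);

-- External table exposing the raw state JSON to SQL / dbt.
create external table meltano_state (
  state_json variant as (value),
  file_name  varchar as (metadata$filename)
)
location = @meltano_state_stage
file_format = (type = json)
auto_refresh = false;
```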
Btw. dbt can store its state tables in all supported databases.
This is new to me and I wasn't able to find the feature with a Google search. A bit of a tangent, but I'd appreciate any link or reference you have on that, as it could be helpful to reference as an example.
j
I really like it when people challenge me - it helps me find out what I am thinking/doing wrong 😉 You are right, dbt does not store ANY state, apart from some JSON files it writes locally, e.g.
run_results.json
. https://docs.getdbt.com/docs/deploy/about-state The trick is that you have to design models to be idempotent. E.g., design an incremental transformation as
select * from source s where not exists (select 1 from target t where t.id = s.id)
. As a database expert, to be honest, I doubt it can perform sufficiently in many cases. But this is what dbt provides...
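Spelled out as a dbt model, that idempotent anti-join pattern might look roughly like this (all object and column names are placeholders):

```sql
-- Hypothetical idempotent incremental model: insert only rows whose key
-- is not already present in the target. Names are placeholders.
{{ config(materialized='incremental') }}

select s.*
from {{ source('meltano_raw', 'orders') }} s

{% if is_incremental() %}
  where not exists (
    select 1
    from {{ this }} t
    where t.order_id = s.order_id
  )
{% endif %}
```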
t
@jan_soubusta I think what you're looking for is here: https://docs.getdbt.com/docs/build/incremental-models
Particularly the bit about
SELECT MAX(dateCol) FROM {{ this }}
to figure out what data in your staging tables you need to pull into the model
But it's mentally a Monday for me so I may have forgotten all the details of your scenario 😛
j
This is correct, but it does not work for all tables. MERGE/UPSERT is quite common - you cannot always rely on source systems producing only new data. Moreover, while the extract part is running, new records can be added to the source system with a timestamp older than the newest one just downloaded. Btw. exactly that happened to me with tap-github. But my main concern relates to performance - in dbt, models represent what should be materialized. So instead of executing a MERGE/UPSERT statement (which can be optimized by the DB engine under the hood), you have to implement a set of insert/update CTEs, which may perform significantly worse.
t
If you identify new rows using _sdc_batched_at rather than a timestamp that exists in the source database you don't have to worry about timestamps getting crossed up.
And while we have had a couple of cases where we needed to write very different SQL for the initial load vs. the incremental load, 99% of the time the difference between the two is simply one
dateField > (SELECT MAX(dateField) FROM {{ this }})
clause.
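As a sketch of what that usually ends up looking like (model and column names are placeholders; adding a unique_key should make dbt emit a MERGE on Snowflake, which also covers the upsert case discussed above):

```sql
-- Hypothetical incremental model: the only incremental-specific logic is
-- the dateField filter; unique_key lets dbt merge on 'id' so updated or
-- late-arriving rows overwrite existing ones instead of duplicating.
{{ config(
    materialized='incremental',
    unique_key='id'
) }}

select *
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  where dateField > (select max(dateField) from {{ this }})
{% endif %}
```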
Honestly I feel like you're overthinking this. If it were really that hard to get right would so many people be using it?