# getting-started
j
Good morning (in CZ), Meltano + dbt running in incremental mode. Using Snowflake as the Meltano target, dbt transforms the data there. I need to store the state of pipeline runs, and I could even benefit from joining the Meltano state tables with the loaded tables to perform an efficient incremental transformation in dbt. Unfortunately, I cannot store the Meltano state in Snowflake; the only supported targets are SQLite, PostgreSQL, and MSSQL. Btw. dbt can store its state tables in all supported databases.
Is this something that is on your radar? Would anyone else appreciate the ability to store Meltano state tables in all widely used databases (for which a production-ready Meltano target exists, obviously)?
t
Can you define another pipeline that pushes Meltano's data to Snowflake? I know that's a hack, but I wonder if it would work...
a
Good morning (in CZ),
Good morning! Logged in from Santiago, Chile (UTC-3) 🇨🇱 this week; the Meltano team is just wrapping up our 1-week colocation. 🙂
I need to store the state of pipeline runs.
Have you had a look at the new custom state backends feature? https://docs.meltano.com/concepts/state_backends
c
Hello @jan_soubusta 👋 We currently support the major cloud storage providers as remote state backends. Additional state backends such as Postgres, MSSQL, or even something like DynamoDB have been casually discussed but are not currently on our radar.
As @thomas_briggs suggested, one solution could be to load Meltano state from the state backend (either the SQLite backend db or a remote state backend) into Snowflake via a dedicated pipeline. If you have a specific state backend in mind for your use case, I'd also encourage you to submit a feature request and we can consider it as part of our usual prioritization.
j
Thanks for the responses! The "state backends" feature looks good; at the very least, it lets me get rid of the PostgreSQL instance. But still, if I need to join the Meltano state with the loaded tables in dbt to make the transformation sufficiently incremental, I need to load the state into the target.
t
Do you? If you enable target-snowflake's add_metadata_columns option, doesn't the _sdc_batched_at column give you enough information to know which data is new?
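For illustration, a minimal dbt incremental model along those lines might look like the sketch below; only `_sdc_batched_at` comes from the thread, the source and table names are placeholders.

```sql
-- Hypothetical dbt incremental model that only pulls rows loaded since
-- the last run, using the _sdc_batched_at metadata column added by the
-- target. Source and table names are placeholders.
{{ config(materialized='incremental') }}

select *
from {{ source('meltano_raw', 'issues') }}

{% if is_incremental() %}
  where _sdc_batched_at > (select max(_sdc_batched_at) from {{ this }})
{% endif %}
```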
a
@jan_soubusta - If you want to view latest state bookmarks from Snowflake, one thing you could try would be S3 (or similar) as a backend, and then mapping the S3 directory to an external table in Snowflake. That way, Snowflake will be able to query the latest state directly from S3, with no latency.
I haven't implemented this before but others have asked for similar in the past. If you get it working, I'm sure folks would appreciate the case study and learnings.
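A rough Snowflake SQL sketch of that idea, assuming the state backend writes its JSON state files under an S3 prefix (the bucket, prefix, and storage integration names here are made up):

```sql
-- Stage pointing at the S3 prefix used as the Meltano state backend
-- (bucket path and storage integration are placeholders).
create stage meltano_state_stage
  url = 's3://my-bucket/meltano/state/'
  storage_integration = my_s3_integration
  file_format = (type = json);

-- External table exposing the raw state JSON to SQL / dbt.
create external table meltano_state (
  state_json variant as (value),
  file_name  varchar as (metadata$filename)
)
location = @meltano_state_stage
file_format = (type = json)
auto_refresh = false;
```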
Btw. dbt can store its state tables in all supported databases.
This is new to me and I wasn't able to find the feature with a Google search. A bit of a tangent, but I'd appreciate any link or reference you have on that, as it could be helpful to reference as an example.
j
I really like it when people challenge me - it helps me find out what I am thinking/doing wrong 😉 You are right, dbt does not store ANY state, apart from some JSON files it writes locally, e.g.
run_results.json
. https://docs.getdbt.com/docs/deploy/about-state The trick is that you have to design models to be idempotent. E.g., design an incremental transformation as
select * from source s where not exists (select 1 from target t where t.id = s.id)
. As a database expert, to be honest, I doubt it can perform sufficiently in many cases. But this is what dbt provides...
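Spelled out as a dbt model, that idempotent anti-join pattern might look roughly like this (all object and column names are placeholders):

```sql
-- Hypothetical idempotent incremental model: insert only rows whose key
-- is not already present in the target. Names are placeholders.
{{ config(materialized='incremental') }}

select s.*
from {{ source('meltano_raw', 'orders') }} s

{% if is_incremental() %}
  where not exists (
    select 1
    from {{ this }} t
    where t.order_id = s.order_id
  )
{% endif %}
```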
t
@jan_soubusta I think what you're looking for is here: https://docs.getdbt.com/docs/build/incremental-models
Particularly the bit about
SELECT MAX(dateCol) FROM {{ this }}
to figure out what data in your staging tables you need to pull into the model
But it's mentally a Monday for me so I may have forgotten all the details of your scenario 😛
j
This is correct, but it does not work for all tables. MERGE/UPSERT is quite common - you cannot always rely on source systems producing only new data. Moreover, while the extract part is running, new records can be added to the source system with a timestamp older than the newest one just downloaded. Btw. exactly that happened to me with tap-github. But my main concern relates to performance - in dbt, models represent what should be materialized. So instead of executing a MERGE/UPSERT statement (which can be optimized by the DB engine under the hood), you have to implement a set of insert/update CTEs, which may perform significantly worse.
t
If you identify new rows using _sdc_batched_at rather than a timestamp that exists in the source database you don't have to worry about timestamps getting crossed up.
And while we have had a couple of cases where we needed to write very different SQL for the initial load vs. the incremental load, 99% of the time the difference between the two is simply one
dateField > (SELECT MAX(dateField) FROM {{ this }})
clause.
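As a sketch of what that usually ends up looking like (model and column names are placeholders; adding a unique_key should make dbt emit a MERGE on Snowflake, which also covers the upsert case discussed above):

```sql
-- Hypothetical incremental model: the only incremental-specific logic is
-- the dateField filter; unique_key lets dbt merge on 'id' so updated or
-- late-arriving rows overwrite existing ones instead of duplicating.
{{ config(
    materialized='incremental',
    unique_key='id'
) }}

select *
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  where dateField > (select max(dateField) from {{ this }})
{% endif %}
```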
Honestly I feel like you're overthinking this. If it were really that hard to get right would so many people be using it?