Meltano

How do you apply software engineering best practices to data ingestion?

Hey, community, who has something to share here? <@U06CCB0EUBC> <@U069EPD13KN> <@U06CQSJ5KFT> <@U06CQ9D5M5X> <@U06C8SN07SR>

<@U06CQSJ5KFT>, your option sounds interesting; enlighten me :smile:

<@U069XD4GVRP> I think the credit originally goes to <@U06C51B75QE> or <@U06CCB0EUBC>, but the basic idea is that *not every API endpoint has the courtesy of sending "deleted_at" messages* so that you can keep track of archived data.

If you're not using a truncate / overwrite method for loading, this means that *you are possibly keeping track of data that is no longer valid*.

We therefore implemented a "parallel stream" method, where we load only the ids for a specific stream using an inherited extractor (for example, hubspot.deals and hubspot_ids.deals) *using the truncate method*.

This way, we can use the append method for our actual data, keeping track of historical values in case we need them (I've become a firm believer in "append only"), *but remove any stale values by joining the latest version of our data to its truncated ids*.
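To make the join concrete, here is a minimal sketch in plain Python rather than SQL. The row shape (`id`, `amount`, `loaded_at`) and the function name `current_records` are assumptions for illustration, not anything from the actual extractor; `deals` stands in for the appended hubspot.deals table and `deal_ids` for the truncated hubspot_ids.deals table.

```python
def current_records(appended_rows, live_ids):
    """Return the latest version of each record whose id still exists upstream."""
    latest = {}
    for row in appended_rows:
        seen = latest.get(row["id"])
        # Keep only the most recently loaded version of each id.
        if seen is None or row["loaded_at"] > seen["loaded_at"]:
            latest[row["id"]] = row
    # The "join": drop every id missing from the freshly truncated id list.
    return [row for rid, row in sorted(latest.items()) if rid in live_ids]


# Hypothetical append-only data: deal 1 was updated, deal 2 was deleted upstream.
deals = [
    {"id": 1, "amount": 100, "loaded_at": "2023-06-01"},
    {"id": 1, "amount": 150, "loaded_at": "2023-06-02"},
    {"id": 2, "amount": 200, "loaded_at": "2023-06-01"},
]
deal_ids = {1}  # contents of the truncate-loaded ids stream

print(current_records(deals, deal_ids))
```

The appended table still holds every historical version, including deal 2, so nothing is lost; only the "current" view filters it out.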

Does this make any sense? I'm happy to go into more detail with an example

<@U06CQSJ5KFT> just to counter your append only: when I want append only, I tend to want history. So instead of append only I use a history table (dbt implements them as an SCD), something like <https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver16>

It depends on the size of your data: below roughly billions of rows, you're good with Temporal

<@U06CCB0EUBC> is this only in SQL Server? Can it be applied in BigQuery using dbt?

I haven't dug deep into dbt's SCD implementation, but <https://docs.getdbt.com/docs/build/snapshots#what-are-snapshots> should do what you're after.

Then it looks like: Extract DB / Schemas (Meltano writes to) -> Stage DB / Schemas (with SCD and history) -> Marts, etc.
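For anyone unfamiliar with what the SCD / history step in the staging layer does, here is a rough sketch of a type-2 snapshot pass in plain Python. This is not dbt's or SQL Server's implementation; the column names (`value`, `valid_from`, `valid_to`) and the choice to close out rows that disappear from the source are assumptions made for the example.

```python
def snapshot(history, current_rows, now):
    """Apply one type-2 snapshot pass and return the new history list."""
    current_by_id = {row["id"]: row for row in current_rows}
    result = []
    still_open = set()
    for old in history:
        old = dict(old)  # copy so the caller's history is not mutated
        if old["valid_to"] is None:
            cur = current_by_id.get(old["id"])
            if cur is None or cur["value"] != old["value"]:
                old["valid_to"] = now  # changed or deleted: close this version
            else:
                still_open.add(old["id"])  # unchanged: leave it open
        result.append(old)
    # Open a new version for every id that is new or has just changed.
    for rid, cur in current_by_id.items():
        if rid not in still_open:
            result.append(
                {"id": rid, "value": cur["value"], "valid_from": now, "valid_to": None}
            )
    return result


h1 = snapshot([], [{"id": 1, "value": "a"}], "2023-06-01")
h2 = snapshot(h1, [{"id": 1, "value": "b"}], "2023-06-02")
for row in h2:
    print(row)
```

Each run only touches rows that changed, so the history table grows with the number of changes rather than with the number of loads, which is the main difference from plain append only.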

There's still a number of folks who prefer append only, so I'm probably missing the full reason why. Maybe SCDs are hard :shrug:

you got me, I was just sharing stuff I've seen at places that waste ungodly amounts of time!

Boy and here I was hoping I could close my "best practices for EL" series with part 2, now I'm not sure I can finish after part 3 :smile:

one thing I wanted to implement at GitLab but ran out of time (_thanks Meltano…_) was a blue/green deployment structure. the SWAP WITH command on Snowflake would’ve made that easy

Oh yes! <@U069EPD13KN>, one of my favourite shares: <https://blog.montrealanalytics.com/blue-green-deployment-with-dbt-and-snowflake-922f1c658011>
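The blue/green idea itself is small enough to sketch in plain Python. Everything here is hypothetical: `schemas` is just a dict standing in for two databases, `build` and `checks` are placeholder callables, and the tuple swap is only a stand-in for what Snowflake's SWAP WITH would do on real objects.

```python
def blue_green_deploy(schemas, build, checks):
    """Build into 'green', validate it, and swap it with 'blue' only on success."""
    schemas["green"] = build()
    if not all(check(schemas["green"]) for check in checks):
        return False  # checks failed: consumers keep reading the old 'blue'
    # Exchange the two names; the previous 'blue' stays around as a rollback copy.
    schemas["blue"], schemas["green"] = schemas["green"], schemas["blue"]
    return True


schemas = {"blue": [1, 2], "green": []}
not_empty = lambda rows: len(rows) > 0

ok = blue_green_deploy(schemas, lambda: [1, 2, 3], [not_empty])
print(ok, schemas)
```

Consumers only ever read "blue", so a load that fails its checks never becomes visible to them.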

I loved the additional practices so much that I created a short Twitter thread! <@U06CCB0EUBC> <https://twitter.com/meltanodata/status/1673675769912909825>