How do you apply software engineering best practic...
# best-practices
s
How do you apply software engineering best practices to data ingestion?
s
Hey, community, who has something to share here? @visch @taylor @Stéphane Burwash @aaron_phethean @daniel_walker
@Stéphane Burwash, Your option sounds interesting; enlighten me 😄
s
@Sven Balnojan I think the credit originally goes to @aj_steers or @visch, but the basic idea is that not every API endpoint has the curtesy of sending "deleted_at" messages so that you can keep track of archived data. If you're not using a truncate / overwrite method for loading, this means that you are possibly keeping track of data that is no longer valid. We therefore implemented a "parallel stream" method, where we load in only the ids for a specific stream using an inherited extractor (example hubspot.deals and hubspot_ids.deals) using the truncate method. This way, we can use append method for our actual data, keeping track of historical values in case we need them (I've become a firm believer in "append only"), but remove any stale values by joining the latest version of our data to it's truncated ids. Does this make any sense? I'm happy to go into more detail with an example
v
@Stéphane Burwash just to counter you append only, when I want append only I tend to want history. So instead of append only I do a history table (dbt implements them as an scd) I use something like https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver16
Depends on size of your data < ~billions of rows and you're good with Temporal
s
@visch is this only in SQL server? It can be applied in bigquery using dbt?
v
I haven't dove deep into dbt's scd implementation but https://docs.getdbt.com/docs/build/snapshots#what-are-snapshots should do what you're after. Then it looks like Extract DB/Schemas (Meltano writes to) -> Stage DB / Schems (With SCD and History) -> Marts /etc
There's still a number of folks who prefer append only so I'm probably missing the full reason why maybe scd's are hard 🤷
s
@visch is going all in!1
v
you got me, I was just sharing stuff I've seen at places that waste ungodly amounts of time!
s
Boy and here I was hoping I could close my "best practices for EL" series with part 2, now I'm not sure I can finish after part 3 😄
s
It never ends :p
t
one thing I wanted to implement at gitlab but ran out of time (thanks Meltano…) was a blue/green deployment structure. the SWAP WITH command on snowflake would’ve made that easy
s
I loved the additional practices so much that I created a short Twitter thread! @visch https://twitter.com/meltanodata/status/1673675769912909825