# singer-tap-development
julian_knight
Hey folks, one of our EL pipelines comes from an application database, so it's always being updated. It doesn't happen often since we run our pipelines on off-hours, but sometimes this can cause race conditions where two related tables get synced in different states depending on the order. For example, maybe we copy `payment_methods` first, and then later `transactions`. `transactions` may have a row with a `payment_method_id` which was inserted after copying the `payment_methods` table. This isn't much of a problem from a data analysis perspective, but it breaks dbt source data tests. Would wrapping the whole tap operation in a database transaction solve this? Would it create performance problems to do a lot of reads in a transaction? Is there a different solution people use for this? We could just remove our dbt foreign key tests, but I'd prefer data to be a little stale but consistent.
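For concreteness, the failing check is roughly the orphan query a dbt `relationships` test compiles to. A minimal sketch (psycopg2, with the table and foreign-key names from the example above; the DSN and the `id` primary-key columns are placeholders/assumptions):

```python
import psycopg2

# Placeholder DSN for the warehouse the tap loads into (assumption).
DSN = "postgresql://user:secret@warehouse:5432/analytics"

# Roughly what a dbt `relationships` test boils down to: child rows whose
# foreign key has no matching parent row. Any hits fail the test.
ORPHAN_CHECK = """
    select t.payment_method_id
    from transactions t
    left join payment_methods pm
        on pm.id = t.payment_method_id   -- `id` column is an assumption
    where t.payment_method_id is not null
      and pm.id is null
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(ORPHAN_CHECK)
    orphans = cur.fetchall()
    if orphans:
        # These are rows inserted into `transactions` after `payment_methods`
        # had already been copied.
        print(f"{len(orphans)} transactions reference a missing payment method")
```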
visch
I haven't hit this directly with dbt (as I don't do a bunch of testing 😕), but I've hit this in other domains; eventual consistency is the term, I think. It's almost like you'd have to make your dbt tests tolerate ids being out of date for some period of time, maybe 1h or 1d. So either tolerate the consistency issue, or don't tolerate it by always being sure everything is wrapped in something like a snapshot isolation level: https://learn.microsoft.com/en-us/sql/t-sql/statements/set-transaction-isolation-level-transact-sql?view=sql-server-ver16 https://learn.microsoft.com/en-us/sql/t-sql/statements/set-transaction-isolation-level-transact-sql?view=sql-server-ver16#:~:text=onl[…]0that%20data That limits your parallelization capacity though (removes it?). idk what the easiest thing to do is; I'm sure I have this happen in my data as well. There's my 2 cents, I'm curious what others think 😄
I go with 🙈 today but that's definitely bad in some situations
aaron_phethean
Hey @julian_knight - checking my understanding here first:
1. Sounds like the application is performing its unit of work in a transaction.
2. When you sync, you first sync a reference table (payment methods), then the business transaction table (transactions).
So this is a late arriving dimensions problem? You have a few options to expect that and deal with it: https://medium.com/dataseries/resolving-late-arriving-dimensions-c0ebc9f818c3 At sync time you could avoid the situation by creating two pipelines - always doing transactions first (facts), followed by references (dimensions), then running your transforms. This would only be feasible in simple scenarios - hard-deleted references could cause issues, for example.
julian_knight
Thanks @visch and @aaron_phethean! It does sound like a late arriving dimensions problem. I could potentially ignore fact tables with missing dimensions, although it would require some more complicated dbt tests or tooling. SNAPSHOT isolation level sounds like what I had in mind; Postgres provides it via the REPEATABLE READ isolation level. Postgres even seems to support a way to use the same snapshot in multiple transactions, in theory allowing parallel reads https://www.postgresql.org/docs/current/sql-set-transaction.html
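Something like the sketch below is what I mean: export a snapshot from one REPEATABLE READ transaction and import it into the transactions doing the per-table reads, so even parallel reads see a single consistent point in time (psycopg2; the DSN is a placeholder and error handling is omitted):

```python
import psycopg2

# Placeholder DSN for the application database the tap reads from.
DSN = "postgresql://replica_user:secret@app-db:5432/app"

# Coordinator connection: opens the REPEATABLE READ transaction that defines
# the snapshot. It must stay open while other connections import the snapshot.
coord = psycopg2.connect(DSN)
coord.set_session(isolation_level="REPEATABLE READ", readonly=True)
coord_cur = coord.cursor()
coord_cur.execute("SELECT pg_export_snapshot()")
snapshot_id = coord_cur.fetchone()[0]


def read_table(table_name):
    """Read one table at the shared snapshot (could run in a worker process)."""
    conn = psycopg2.connect(DSN)
    conn.set_session(isolation_level="REPEATABLE READ", readonly=True)
    cur = conn.cursor()
    # SET TRANSACTION SNAPSHOT has to run before any query in the transaction,
    # and the transaction must be REPEATABLE READ or SERIALIZABLE.
    cur.execute("SET TRANSACTION SNAPSHOT %s", (snapshot_id,))
    cur.execute(f"SELECT * FROM {table_name}")  # table name from trusted config
    rows = cur.fetchall()
    conn.rollback()
    conn.close()
    return rows


# Both reads see the database as of the moment the snapshot was exported, so
# no transaction row can reference a payment method that isn't in the copy.
payment_methods = read_table("payment_methods")
transactions = read_table("transactions")

coord.rollback()  # ends the exporting transaction and releases the snapshot
coord.close()
```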
visch
Nice @aaron_phethean, great article. I don't think about everything in star schema terms, but it's still a great read!! Good luck @julian_knight, let me know if that works for you! Hopefully I didn't send you down the wrong path!
aaron_phethean
@julian_knight - Transactions might help you but they can also create issues occasionally. (I won't try to speak to Postgres's implementation because I just don't know it well.) Would you consider a sync strategy where the 'fact-like' tables always sync first and 'dim-like' tables sync afterwards? For instance, in your example above, you might deliberately sync `payment_methods` after the rest of the tables. Similarly, you could also just sync those few trouble-tables again after the main load completes. In theory, you could use a strategy like this to avoid orphaned fact records that otherwise would fail referential integrity checks.
julian_knight
> Would you consider a sync strategy where the 'fact-like' tables always sync first and 'dim-like' tables sync afterwards?
I don't love this solution. The application is always evolving and adding new relationships, so complicating the pipeline configuration to match sounds like a large maintenance burden. It also seems like a bit of a design smell: quirks like this should be handled in the transform layer, while the EL layer should be largely agnostic to the source data, replicating it as closely as possible and moving on. Additionally, the source data isn't oriented around star schema, so deciding which relations should go where is a hard problem, and likely not limited to just 2 copy cycles, but potentially several. If I can't use transactions, I'd instead lean towards one of these solutions:
• using transforms to filter out data that is newer than when the EL started (see the sketch after this list)
• take a snapshot of the database before running EL
• ignore it
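For the first option, the transforms would need to know when the EL run started. A minimal sketch of recording that watermark in the warehouse (hypothetical `_meta.el_runs` table and placeholder DSN; the actual `updated_at` filter would live in a dbt model or test):

```python
import psycopg2
from datetime import datetime, timezone

# Placeholder DSN for the warehouse the tap loads into.
WAREHOUSE_DSN = "postgresql://loader:secret@warehouse:5432/analytics"

# Hypothetical metadata table; dbt models/tests could filter source rows with
#   where updated_at <= (select max(started_at) from _meta.el_runs)
CREATE_META = """
    create schema if not exists _meta;
    create table if not exists _meta.el_runs (
        run_id     bigserial primary key,
        started_at timestamptz not null
    );
"""


def record_el_start():
    """Write the EL start time so downstream transforms can filter on it."""
    started_at = datetime.now(timezone.utc)
    with psycopg2.connect(WAREHOUSE_DSN) as conn, conn.cursor() as cur:
        cur.execute(CREATE_META)
        cur.execute(
            "insert into _meta.el_runs (started_at) values (%s)",
            (started_at,),
        )
    return started_at


if __name__ == "__main__":
    print("EL run started at", record_el_start())
    # ...kick off the tap/target sync here...
```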
aaron_phethean
The transform layer, I think, is generally where people are solving for this. In the past, I had a 'completed day' and 'completed week' flag in the calendar dimension so I could filter out partial/incomplete time periods that would otherwise lead to incorrect inferences.