# random
k
This is super interesting! Streaming SQL managed by dbt ❤️ Looks like the Materialize platform supports Kafka and a few relational databases as sources. I've always thought it would be possible to run Singer taps and targets in a 'continuous' mode, emulating streams (similar to the polling API for AWS SQS) 🤔 Possible collab? 😅 https://blog.getdbt.com/dbt-materialize-streaming-to-a-dbt-project-near-you/amp/
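By 'continuous' mode I'm picturing something like the loop below - a rough sketch only: the tap/target names and config paths are made up, and a proper setup would persist the STATE message the target emits rather than scraping it off the tap's output.

```python
# Toy 'continuous mode' for a Singer tap -> target pair: re-run the tap on a
# polling loop and carry the emitted STATE forward between runs.
# Tap/target names and config paths are hypothetical placeholders.
import json
import os
import subprocess
import time

TAP_CMD = ["tap-somesource", "--config", "tap_config.json"]              # hypothetical tap
TARGET_CMD = ["target-somewarehouse", "--config", "target_config.json"]  # hypothetical target
STATE_FILE = "state.json"
POLL_INTERVAL_SECONDS = 30  # crude SQS-style poll interval

def run_once() -> None:
    """Pipe one tap run into the target, then save the last STATE message."""
    tap_cmd = list(TAP_CMD)
    if os.path.exists(STATE_FILE):
        tap_cmd += ["--state", STATE_FILE]

    tap = subprocess.Popen(tap_cmd, stdout=subprocess.PIPE, text=True)
    target = subprocess.Popen(TARGET_CMD, stdin=subprocess.PIPE, text=True)

    last_state = None
    for line in tap.stdout:
        target.stdin.write(line)  # forward SCHEMA/RECORD/STATE lines as-is
        msg = json.loads(line)
        if msg.get("type") == "STATE":
            last_state = msg.get("value")

    target.stdin.close()
    tap.wait()
    target.wait()

    if last_state is not None:
        with open(STATE_FILE, "w") as f:
            json.dump(last_state, f)

if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(POLL_INTERVAL_SECONDS)
```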
v
wow, awesome read! I'm super curious about Materialize now. ahhh, there goes my morning
k
Streaming SQL is really interesting, though for now you need to have the right use case for it. Given that the SQL 'materialisation' is just a query that runs for each new message (producing a stream of outputs, one record per run), there are still unsolved questions around: i) how many messages can you keep and still be performant? ii) how are complex joins and successive transformations (esp. of the dbt kind) managed? and iii) what happens when you change your SQL? Does every message get re-processed into the destination stream, or only messages after the change? dbt makes it super easy to re-write history and re-shape tables, but the same isn't necessarily easy with streaming workloads. At least that's been my experience - maybe Materialize will crack it 😅
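To make the 'query runs for each new message' framing concrete, here's a toy model of the naive version - nothing Materialize-specific, just the mental picture where every arriving record triggers a re-evaluation over whatever buffer of messages you've kept, which is exactly where questions i) and iii) bite.

```python
# Toy model of naive streaming 'materialisation': re-run the transform over a
# bounded buffer each time a message arrives. Purely illustrative; real engines
# like Materialize maintain views incrementally rather than replaying a buffer.
from collections import deque

RETENTION = 10_000  # question i): how many messages can you afford to keep?
buffer = deque(maxlen=RETENTION)

def transform(messages):
    """Stand-in for the SQL query, e.g. an aggregate over the kept messages."""
    return {"count": len(messages), "total": sum(m["amount"] for m in messages)}

def on_message(msg):
    buffer.append(msg)
    # One output record per input message: the 'stream of outputs' framing.
    return transform(buffer)

# Question iii): if transform() itself changes, this naive model has to replay
# the whole buffer to rebuild history - the part dbt makes easy on tables.
print(on_message({"amount": 42}))
```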
v
That is interesting, your detailed write-up goes over a ton of issues. I've tended to be able to stay away from most streaming applications, so good to read!

Warning: half-baked idea, I just threw this together, feel free to ignore 😄 I'm sure this isn't a new idea - git-based file systems have been tried - but I haven't heard anyone talk about it for data integrations (other than the GitHub Actions thing that went over uploading your data to a git repo). I'd be really curious if someone is thinking about this. It's 10% of a thought-out idea; I haven't gone farther than this.

Imagine that you specify the data in your source and your target that you're trying to "integrate", i.e. get into your target system. The current system state is represented by a hash on both the source and the target, e.g. the data is currently at hash "abc1232__yadayada", and any change to the data on either side is represented by some change set (think git commits, etc.). Sometimes (just like in git) the data sets diverge completely, say due to a backup/restore from some point in time. In that case you have to do a Full Sync to your target data set again (instead of doing a merge), and changes after that are represented as small change sets from the new starting point.

To accomplish this you'd need some understanding of this framework in both the source and the target: calling `/source/api/integration1?targetstate="hashfromtarget"` would return something like `{ sourcedata_integration1_currentstate: abcde, changes_since_targetstate_data: {} }`.

Both sides would need to share some common implementation, and it's not clear how beneficial this would or would not be. In an ideal world, though, it would get rid of anyone having to pick Full Table, Incremental, or Real-Time replication - you'd set up your integrations as real time and let them eat; they'd figure out what needs to happen and when. It goes farther than APIs that have a `time_changed` field you pass a date into to get everything since that time. Really the date has nothing to do with the data: the data is at some state, I should tell you what state I currently have, and then you should tell me what I need to get to the latest state.
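Purely to pin down that handshake, a throwaway sketch of what the source side might do - the field names are lifted from the example above, while the `full_sync_required` flag and the storage layout are invented:

```python
# Hedged sketch of the source-side handshake: given the target's current state
# hash, return either a small change set or a signal that a full sync is needed.
from dataclasses import dataclass, field

@dataclass
class SourceState:
    current_hash: str
    # map of known historical hashes -> changes made since that hash
    changes_since: dict = field(default_factory=dict)

def get_integration_state(source: SourceState, target_state_hash: str) -> dict:
    if target_state_hash == source.current_hash:
        # Target is already caught up: nothing to send.
        return {"sourcedata_integration1_currentstate": source.current_hash,
                "changes_since_targetstate_data": {}}
    if target_state_hash in source.changes_since:
        # Shared ancestor found: send just the delta, like a git fast-forward.
        return {"sourcedata_integration1_currentstate": source.current_hash,
                "changes_since_targetstate_data": source.changes_since[target_state_hash]}
    # Histories have diverged (the backup/restore case): caller must do a Full Sync.
    return {"sourcedata_integration1_currentstate": source.current_hash,
            "full_sync_required": True}

# Example: target is one change set behind.
source = SourceState(
    current_hash="abc1232__yadayada",
    changes_since={"def456": {"orders": [{"op": "upsert", "id": 7}]}},
)
print(get_integration_state(source, "def456"))
```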
k
I know git-style version control has been applied in this way in the data science world, but this idea is next-level 😅 Sounds like it would allow you to arbitrarily change the entire stream contents, which is cool. In other streaming contexts I have seen versioning used. In this scenario, it would be up to the Streaming SQL maintainer to add `select '1.0' as version, ...` and to increment the version if there are breaking changes to the message shape or field definitions. Consumers then have to be 'version-aware', only consuming messages of a given version. It's a slippery slope though - folks end up publishing `1.0` and `2.0` for an extended period whilst all the `1.0`-only consumers are made `2.0`-aware, ad infinitum 😬 Nightmare.
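For a sense of what 'version-aware' ends up looking like on the consumer side, a toy sketch - the field names and message shapes are assumptions, not from any particular schema registry:

```python
# Toy version-aware consumer: only process messages whose 'version' field we
# understand, keeping a shim around for the old shape while both are published.
SUPPORTED_VERSIONS = {"1.0", "2.0"}

def handle_v1(msg):
    # Old shape: a single flat customer_name field.
    return {"customer": msg["customer_name"]}

def handle_v2(msg):
    # Breaking change: name split into first/last.
    return {"customer": f'{msg["first_name"]} {msg["last_name"]}'}

def consume(msg):
    version = msg.get("version")
    if version not in SUPPORTED_VERSIONS:
        return None  # drop (or dead-letter) messages we don't understand
    return handle_v1(msg) if version == "1.0" else handle_v2(msg)

# Both shapes flow through the same stream while producers publish 1.0 and 2.0
# side by side - the 'ad infinitum' maintenance described above.
print(consume({"version": "1.0", "customer_name": "Ada Lovelace"}))
print(consume({"version": "2.0", "first_name": "Ada", "last_name": "Lovelace"}))
```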
v
Thanks for trying to understand my ranty, barely legible text! Yeah, nightmare is right. I just can't shake the idea for some reason - it feels like an abstraction layer I need to break out of whenever I'm coding / debugging. Always great reading your stuff, it gets my mind going every time