# best-practices
u
Brainstorm time!!! What are the essential practices to adopt for data ingestion pipelines?
s
I'm just going to mention all of you because I already got great feedback from you in the past: @Stéphane Burwash @Matt Menzenski @thomas_briggs @visch @pat_nadolny @aaronsteers @jacob_matson @gary_james @matthew_braddy @magnus_avitsland @jeff_mcmahon @kk @taylor @Henning Holgersen @alexander_butler ... and many more of course!
d
Measuring and estimating data volume, and determining the ingestion strategy based on that, e.g. full_table vs incremental ingestion, or view vs table vs incremental materialization.
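A rough sketch of what a volume-based decision could look like in Python. Everything here is an assumption for illustration: the `StreamStats` fields, the `choose_ingestion_strategy` helper, and the 1,000,000-row cutoff are hypothetical and would need tuning against your own sources and warehouse costs.

```python
from dataclasses import dataclass

# Hypothetical cutoff: at or below this many rows, a full reload is assumed cheap enough.
FULL_TABLE_MAX_ROWS = 1_000_000


@dataclass
class StreamStats:
    name: str
    estimated_rows: int
    has_replication_key: bool  # e.g. a reliable updated_at column


def choose_ingestion_strategy(
    stats: StreamStats, max_full_rows: int = FULL_TABLE_MAX_ROWS
) -> str:
    """Pick 'full_table' or 'incremental' from a measured or estimated volume."""
    # Without a replication key there is nothing to increment on.
    if not stats.has_replication_key:
        return "full_table"
    # Small streams: reloading everything is simpler and self-healing.
    if stats.estimated_rows <= max_full_rows:
        return "full_table"
    # Large streams with a usable key: only fetch what changed.
    return "incremental"


small = StreamStats("countries", 250, has_replication_key=True)
large = StreamStats("events", 500_000_000, has_replication_key=True)
print(small.name, choose_ingestion_strategy(small))
print(large.name, choose_ingestion_strategy(large))
```

The same kind of rule could drive the view vs table vs incremental materialization choice, with the row estimate refreshed on a schedule so the strategy can change as a table grows.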
m
was thinking that too, especially relating to scheduling
d
Monitoring for cost and performance of your warehouse.
RE: least-access - that also applies to source data.
a
Plan for the frequency of data access. I would not store 100s of millions of rows directly in the DWH if I am only planning to query them analytically once a quarter. I would rather keep that data in cold storage and layer an external table over it for infrequent access, at the cost of latency / query time.
j
Good one @alexander_butler - related to that, staffing a team that runs once-per-day batch is very different from staffing one that runs streaming (or micro-batches).
a
If your data pipelines will feed (even peripherally) customer-facing embedded analytics or data apps over your DWH, do not stop at integration testing. Engage in chaos engineering. How does the system behave under unexpected circumstances? Run an EL pipeline where you max out the memory, cut network egress during a sample sync, or simulate an EC2 spot instance interruption. It sounds difficult but it's actually not too bad. But I do work at a software delivery company that enables this without much fuss so there is that 🤷
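To make the "cut network egress during a sample sync" idea concrete, here is a minimal in-process sketch in Python. It is only a toy under stated assumptions: `FlakySource`, `sync`, and `run_with_retries` are hypothetical names, the fault is a raised `ConnectionError` rather than a real network cut, and the checkpoint is a plain dict instead of a real state store. The property being checked is that a sync interrupted mid-stream resumes from its bookmark and lands every record exactly once.

```python
class FlakySource:
    """Hypothetical source that drops the connection once, mid-sync."""

    def __init__(self, records, fail_after):
        self.records = records
        self.fail_after = fail_after
        self.failed = False

    def read(self, start):
        for offset in range(start, len(self.records)):
            if not self.failed and offset == self.fail_after:
                self.failed = True
                raise ConnectionError("simulated network egress cut")
            yield offset, self.records[offset]


def sync(source, target, state):
    """Copy records to target, checkpointing the next offset after each write."""
    for offset, record in source.read(state.get("bookmark", 0)):
        target.append(record)
        state["bookmark"] = offset + 1


def run_with_retries(source, target, state, max_attempts=3):
    """Retry the sync after injected faults; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            sync(source, target, state)
            return attempt
        except ConnectionError:
            continue  # resume from the last bookmark on the next attempt
    raise RuntimeError("sync did not recover from the injected fault")


source = FlakySource(list(range(10)), fail_after=4)
target, state = [], {}
attempts = run_with_retries(source, target, state)

# The chaos assertion: no gaps and no duplicates despite the interruption.
assert target == list(range(10))
print(f"recovered after {attempts} attempts, bookmark={state['bookmark']}")
```

A real experiment would inject the fault at the infrastructure level (memory limits, blocked egress, a terminated instance), but the assertion at the end stays the same: compare what landed in the target with what the source held.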
s
Wow, that's a hell of a lot of responses! To your point, @alexander_butler: I'm a big fan of chaos engineering inside the data world and have written about it more than once. And I think these are my least read pieces 😉 So cool to see that there's progress happening in that area.