# plugins-general
m
Sorry if this is the wrong channel, but I’m working on reconstructing our event storage. We’re currently using an RDBMS, which is expensive and kinda useless. We want to move to S3. Does anyone have any suggestions for how we store that data? It would look something like this:
• Event is created
• Event is written to S3 <-- format?
• Meltano extracts from S3 and loads into our DW
a
Hi Matt. Love the question - maybe #random would be better for the answer. Sounds like you are trying to create an event stream pipeline (or perhaps a data lake). I'd think about a few questions before choosing the best storage format:
• how many events
• what latency is acceptable (between an event and its load into the DW)
• is it ok to lose an event
• how long do you want to keep the events

Mostly I'm wondering if S3 in the middle is a good choice. Possible architectures and off-the-shelf services:
• as you are using taps, maybe you are intending to build something similar to Stitch: https://stackshare.io/stitch/how-stitch-consolidates-a-billion-records-per-day
• S3 in the middle is probably a good idea if you have billions of events and tons of topics; otherwise Kafka has become a popular transactional choice if a simple queue won't do and the events are not meant to be stored long term: https://www.stitchdata.com/blog/100-billion-records-later-refining-our-etl-service/
• a non-Singer/Meltano option - I'd be interested in the cost comparison: https://docs.aws.amazon.com/streams/latest/dev/kinesis-dg.pdf#building-producers
• storing raw data long term needs a good sharding structure; a bulk format like Parquet then works pretty well with a lot of tools: https://aws.amazon.com/blogs/big-data/stream-cdc-into-an-amazon-s3-data-lake-in-parquet-format-with-aws-dms/

What about just storing the event ‘as is’, e.g. in JSON, if that's how it arrives?

TL;DR one-word answer to “what is a good format on S3”: Parquet
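To make the Parquet + sharding idea concrete, here's a minimal sketch (pyarrow and boto3 are assumed, and the bucket name and dt= prefix are just illustrative placeholders) of batching event dicts into a date-partitioned Parquet file on S3:

```python
import io
import uuid
from datetime import datetime, timezone

import boto3
import pyarrow as pa
import pyarrow.parquet as pq


def flush_events_to_s3(events, bucket="my-event-lake"):
    """Write a batch of event dicts to S3 as one Parquet file,
    partitioned by date so downstream taps can scan a day at a time."""
    # Build an Arrow table from plain dicts (assumes a consistent schema).
    table = pa.Table.from_pylist(events)

    # Serialize to an in-memory buffer instead of a temp file.
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="snappy")

    # Date-based sharding: s3://<bucket>/events/dt=YYYY-MM-DD/<uuid>.parquet
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"events/dt={today}/{uuid.uuid4()}.parquet"

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
    return key
```

That way each extraction run can read a single dt= prefix rather than scanning the whole bucket.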
m
I see what you’re getting at. Someone on my team is working on the actual “streaming” part. I think the reason I was considering S3 as a middle layer was so I could use the Redshift target and avoid having to load into the DW in real time (batched loads instead). It sounds like that might actually be the best option though. In fact, the more I think about it, some of the benefits of using Meltano for this particular use case are kind of moot, since I don’t really have to worry about upserts and the like. Thanks for all the reading!
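(For context, the kind of batched load I mean is basically a Redshift COPY pointed at the S3 prefix. A rough sketch, with placeholder table/bucket/role/connection values and no Meltano in the loop:)

```python
import psycopg2

# Placeholder names throughout; FORMAT AS PARQUET matches Parquet files on S3.
COPY_SQL = """
    COPY analytics.events
    FROM 's3://my-event-lake/events/dt=2024-01-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
    FORMAT AS PARQUET;
"""

conn = psycopg2.connect(host="redshift-host", dbname="dw", user="loader", password="...")
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.close()
```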