# best-practices
I’ve been reading lately about ML and streaming data while using Meltano to move data around, and trying to understand how to marry Meltano ELT pipelines with my streaming data sources: https://www.infoq.com/news/2021/12/huyen-realtime-ml/ My guess is that Meltano is not set up to speak to event-bus pub/sub architectures, but perhaps, with some basic coding, I could trigger Meltano runs from events in an enterprise that has that kind of architecture (rough sketch below). Is there anyone out there besides me using Meltano to move classic tabular data, but also trying to figure out how to capture, move, and store streaming data?
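For the “trigger Meltano from an event bus” idea, something like this is what I have in mind: a small consumer that shells out to the Meltano CLI whenever a message lands on a topic. This is only a sketch, assuming the `kafka-python` client; the topic, tap, and target names are hypothetical placeholders, not anything Meltano ships with:

```python
# Hedged sketch: kick off a Meltano pipeline when an event arrives.
# Assumes kafka-python is installed and this runs inside a Meltano project;
# the topic, tap, and target names below are placeholders.
import subprocess

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "new-data-available",             # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="meltano-trigger",
)

for message in consumer:
    print(f"Event received: {message.value!r}, starting Meltano run")
    # `meltano run` is the standard CLI entry point for a pipeline
    # (`meltano elt` on older versions); plugin names are placeholders.
    subprocess.run(
        ["meltano", "run", "tap-example", "target-s3"],
        check=True,
    )
```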
My main concern is this part that Chip Huyen brings up:

> “Moving prediction to streaming processing while leaving model training as a batch process, however, means that there are now two data pipelines which must be maintained, and this is a common source of errors. Huyen pointed out that because static data is bounded, it can be considered a subset of streaming data, and so can also be handled by streaming processing; thus the two pipelines can be unified. To support this, she recommended an event-driven microservice architecture, where instead of using REST API calls to communicate, microservices use a centralized event bus or stream to send and receive messages. Once model training is also converted to a streaming process, the stage is set for continual learning. Huyen pointed out several advantages of frequent model updates. First, it is well known that model accuracy in production tends to degrade over time: as real-world conditions change, data distributions drift. This can be due to seasonal factors, such as holidays, or sudden world-wide events such as the COVID-19 pandemic. There are many available solutions for monitoring accuracy in production, but Huyen claimed these are often “shallow”; they may point out a drop in accuracy but they do not provide a remedy. The solution is to continually update and deploy new models.”
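The “static data is a subset of streaming data” point clicked for me with a toy example: if the processing code only ever sees an iterator of records, a bounded file and an unbounded stream go through the same pipeline, and the batch case is just a stream that happens to end. A minimal, self-contained illustration (nothing here is Meltano- or Kafka-specific, and `socket_like.recv()` is a hypothetical blocking receive):

```python
# Toy illustration of Huyen's point: batch data is just a bounded stream,
# so one processing function can serve both pipelines.
from typing import Iterable, Iterator


def process(records: Iterable[dict]) -> Iterator[dict]:
    """The single pipeline: works on any iterable, bounded or not."""
    for record in records:
        # Stand-in for inference or feature computation.
        yield {**record, "score": len(record.get("payload", ""))}


def batch_source(path: str) -> Iterator[dict]:
    """Bounded source: a static file read line by line, then it ends."""
    with open(path) as f:
        for line in f:
            yield {"payload": line.strip()}


def stream_source(socket_like) -> Iterator[dict]:
    """Unbounded source: keeps yielding as long as messages arrive."""
    while True:
        msg = socket_like.recv()  # hypothetical blocking receive
        if msg is None:
            break
        yield {"payload": msg}


# Either source plugs into the same pipeline:
#   for result in process(batch_source("training_data.txt")): ...
#   for result in process(stream_source(my_socket)): ...
```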
Perhaps I do this: I have a set of N streams of data, each one images being piped in over a socket (sort of like a webcam). For each of the N image streams I can choose to:

1. run inference on it with a model,
2. store that image in an S3 bucket, or
3. store it in memory and trigger retraining of the model.

Only in #2 would Meltano perhaps come into the picture. Detection of new images, or a continuous “fast-batch” job, would move the data out of a landing S3 bucket into something a bit more curated based on step #1: say, if I detect X objects, place the image in the Xth bucket (see the routing sketch below).
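For that #2 routing step, here is a bare-bones sketch with boto3. The bucket names and the `count_objects()` helper are placeholders: in reality, step #1’s inference output would supply the count that drives the copy:

```python
# Hedged sketch: route images from a landing bucket into curated
# buckets keyed by how many objects step #1 detected in each image.
# Bucket names and the count_objects() helper are placeholders.
import boto3

s3 = boto3.client("s3")

LANDING_BUCKET = "images-landing"          # hypothetical
CURATED_BUCKET_FMT = "images-curated-{n}"  # e.g. images-curated-3


def count_objects(bucket: str, key: str) -> int:
    """Placeholder for step #1: run inference, return detected-object count."""
    raise NotImplementedError


def route_image(key: str) -> None:
    n = count_objects(LANDING_BUCKET, key)
    # Copy the image into the bucket matching its detection count,
    # then clear it out of the landing zone.
    s3.copy_object(
        Bucket=CURATED_BUCKET_FMT.format(n=n),
        Key=key,
        CopySource={"Bucket": LANDING_BUCKET, "Key": key},
    )
    s3.delete_object(Bucket=LANDING_BUCKET, Key=key)
```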
I’m wondering aloud whether the “scheduler” swap, Airflow out and this event pub/sub style in, is all that’s maybe needed to bring Meltano into the fray. (I’m a newb here: I know of Kafka and Apache Beam, but are those schedulers, or just compatible with Meltano as ways to set up the plumbing? I dunno.) It’s unclear to me whether I’m doing something totally unexpected by trying to code that or not.
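One middle ground, if anyone else goes down this road: keep Airflow as the Meltano orchestrator but trigger DAG runs from events instead of a cron schedule. Airflow 2’s stable REST API has a `dagRuns` endpoint for exactly this. A sketch, where the host, credentials, topic, and DAG id are all hypothetical placeholders:

```python
# Hedged sketch: an event consumer that triggers an Airflow DAG run
# (e.g. a DAG wrapping a Meltano pipeline) via Airflow 2's REST API.
# Host, credentials, topic, and DAG id are hypothetical placeholders.
import requests
from kafka import KafkaConsumer

AIRFLOW_URL = "http://localhost:8080/api/v1/dags/meltano_elt/dagRuns"
AUTH = ("airflow", "airflow")  # placeholder basic-auth credentials

consumer = KafkaConsumer("new-images", bootstrap_servers="localhost:9092")

for message in consumer:
    # POST to the stable REST API; "conf" is passed through to the DAG run.
    resp = requests.post(
        AIRFLOW_URL,
        auth=AUTH,
        json={"conf": {"source_event": message.value.decode("utf-8", "replace")}},
    )
    resp.raise_for_status()
```

The appeal of this shape is that Meltano and Airflow stay exactly as they are; only the “when does a run start” decision moves from a schedule to the event bus.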