# singer-tap-development
f
OK, I need to take a crack at making taps multiprocessing/parallel/async. Is there any work already started on this? Any suggestions or requirements from a Meltano perspective to have whatever I come up with, if I get it working, accepted into the SDK?
e
IMO, this is something almost beyond the scope of the tap. Why? Well, when I wrote my tap, I wrote it in Python to call out to a C++ Thrift server, and it was in the Thrift server that I wrote the parallelization and async code, because that was where it made sense. For your tap, it'd be hard to recommend or codify parallelization in the SDK without more information from you. What are you particularly trying to parallelize? And if you're in Python, isn't it just up to you how you write your tap? For reference, my tap queries the entire alphabet up to length 4, and we parallelized that query 32 ways with 32 concurrent connections to a client program (socket based).
Perhaps others can chime in. For me, I just batch-called the entire query from Python, with no parallelization there, and let the C++ Thrift server take that entire vector and break it down into chunks. Each chunk gets an async thread to run, process its data, and spit back whatever the API returns to be stored in the target.
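The chunk-per-worker pattern described above can be sketched in Python as well. This is a hypothetical, minimal analogue of the C++ Thrift approach, not the author's actual implementation: `process_chunk` is a stand-in for whatever call each worker makes against the backend.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Hypothetical stand-in for the per-chunk API call;
    # here it just transforms each item so the example is runnable.
    return [item.upper() for item in chunk]

def parallel_query(items, workers=32):
    """Split one big batch of query terms into chunks and process
    the chunks concurrently, one worker per chunk."""
    chunk_size = max(1, len(items) // workers)
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so results come back in input order.
        for chunk_result in pool.map(process_chunk, chunks):
            results.extend(chunk_result)
    return results
```

For I/O-bound work (network round trips), threads are enough despite the GIL; CPU-bound transforms would need processes instead.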
a
I can speak to this question:
Any suggestions or requirements from a Meltano perspective to have whatever I come up with, if I get it working, accepted into the SDK?
Yes, we would definitely be interested in accepting this into the SDK. And we have some exploratory discussion here for starters.
f
I think concurrency belongs in the SDK and the Meltano solution. While we could write a server to take batch requests, the whole idea of Meltano / Singer is ELT. I don't mean to be critical, but it's quite surprising that the SDK does not have a built-in capability to run concurrent child streams, for instance, or partitions. Since APIs are a main source of data, and APIs have known limitations such as round-trip latency and rate limiting, this is precisely where a feature like concurrency is needed. Unfortunately, it does not sound like there is a short-term / easy / quick solution.
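To illustrate the point about round-trip latency: when partitions (or child streams) are independent, their network waits can overlap instead of accumulating serially. This is a hedged sketch of that idea, not SDK code; `fetch_partition` and `sync_partitions` are hypothetical names, and the sleep merely simulates API latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_partition(partition_id):
    # Hypothetical stand-in for one round trip to a rate-limited API.
    time.sleep(0.01)  # simulated network latency
    return {"partition": partition_id, "records": [partition_id * 10]}

def sync_partitions(partition_ids, max_workers=4):
    """Fetch independent partitions concurrently; with N workers,
    total wall time is roughly latency * ceil(len(ids) / N) instead
    of latency * len(ids)."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_partition, p): p for p in partition_ids}
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Capping `max_workers` is also the natural place to respect an API's rate limit.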
a
I don't mean to be critical, but it's quite surprising that the SDK does not have a built-in capability to run concurrent child streams, for instance, or partitions.
Completely fair. I think this hasn't organically risen to "hi pri" status because in most/many cases the target is the bottleneck.
We started out with a parallel implementation in the first iteration, and removed it for stability reasons.