# singer-tap-development
f
OK, I need to take a crack at making taps multiprocessing/parallel/async. Is there any work already started on this? Any suggestions or requirements from a Meltano perspective to have whatever I come up with, if I get it working, accepted into the SDK?
e
IMO, this is something almost beyond the scope of the tap. Why? Well, when I wrote my tap, I wrote it in Python to call out to a C++ Thrift server, and it was in the Thrift server that I wrote the parallelization and async code, because that was where it made sense. For your tap, it'd be hard to recommend or codify parallelization in the SDK without more information from you. What are you particularly trying to parallelize? And if you're in Python, isn't it just up to you how you write your tap? For reference, my tap queries the entire alphabet up to length 4, and we parallelized that query 32 ways with 32 concurrent connections to a client program (socket based).
Perhaps others can chime in. For me, I just batch-called the entire query from Python, with no parallelization there, and let the C++ Thrift server take that entire vector and break it down into chunks. Each chunk gets an async thread to run, process its data, and spit back whatever the API returns to be stored in the target.
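The chunk-per-worker pattern described above can be sketched in Python as well. This is a hypothetical, minimal analogue of the C++ Thrift approach, not the author's actual implementation: `process_chunk` is a stand-in for whatever call each worker makes against the backend.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Hypothetical stand-in for the per-chunk API call;
    # here it just transforms each item so the example is runnable.
    return [item.upper() for item in chunk]

def parallel_query(items, workers=32):
    """Split one big batch of query terms into chunks and process
    the chunks concurrently, one worker per chunk."""
    chunk_size = max(1, len(items) // workers)
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so results come back in input order.
        for chunk_result in pool.map(process_chunk, chunks):
            results.extend(chunk_result)
    return results
```

For I/O-bound work (network round trips), threads are enough despite the GIL; CPU-bound transforms would need processes instead.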
a
I can speak to this question:
Any suggestions or requirements from a Meltano perspective to have whatever I come up with, if I get it working, accepted into the SDK?
Yes, we would definitely be interested in accepting this into the SDK. And we have some exploratory discussion here for starters.
f
I think concurrency belongs in the SDK and the Meltano solution. While we could write a server to take batch requests, the whole idea of Meltano / Singer is ELT. I don't mean to be critical, but it's quite surprising that the SDK does not have a built-in capability to run concurrent child streams, for instance, or partitions. Since APIs are a main source of data, and APIs have known limitations such as round-trip latency and rate limiting, this is precisely where a feature like concurrency is needed. Unfortunately, it does not sound like there is a short-term / easy / quick solution.
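To illustrate the point about round-trip latency: when partitions (or child streams) are independent, their network waits can overlap instead of accumulating serially. This is a hedged sketch of that idea, not SDK code; `fetch_partition` and `sync_partitions` are hypothetical names, and the sleep merely simulates API latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_partition(partition_id):
    # Hypothetical stand-in for one round trip to a rate-limited API.
    time.sleep(0.01)  # simulated network latency
    return {"partition": partition_id, "records": [partition_id * 10]}

def sync_partitions(partition_ids, max_workers=4):
    """Fetch independent partitions concurrently; with N workers,
    total wall time is roughly latency * ceil(len(ids) / N) instead
    of latency * len(ids)."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_partition, p): p for p in partition_ids}
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

Capping `max_workers` is also the natural place to respect an API's rate limit.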
a
I don't mean to be critical, but it's quite surprising that the SDK does not have a built-in capability to run concurrent child streams, for instance, or partitions.
Completely fair. I think this hasn't organically risen to "hi pri" status because in most/many cases the target is the bottleneck.
We started out with a parallel implementation in the first iteration, and removed it for stability reasons.