# getting-started
edward_smith:
Looking for some help on grouping tables into extractors/taps for load balancing. Brief background: hundreds of tables spread over 5 DB servers, and each DB server allows 5 simultaneous connections. Most tables have reasonable volume, but some are doing hundreds of rows per second. I tried listing all tables in a single extractor, but the big tables take too long and starve the little tables. I tried extractor-per-table, but ran into my 5-connection limit. So now I'm going for 5 extractors per DB, with the tables divided up among the extractors (randomly for now; I could take care to split up the busy tables). My question is about state: it seems like state is tracked at the extractor level, so would it be possible to move a table from one extractor to another in this arrangement if needed for balancing? Any other suggestions on how to handle this?
douwe_maan:
State is tracked based on the job ID passed to `meltano elt`, which matches the schedule name if you used `meltano schedule` to define a scheduled pipeline. So state is tracked per scheduled pipeline, and multiple pipelines can use the same extractor. In your case, it sounds like you'd have an extractor for each DB, then 5 inheriting extractors for each of those with a specific subset of selected tables, and then one schedule for each of those inheriting extractors. Since state is tracked per schedule, moving a stream from one schedule to another would require you to also move the state from one schedule to the other, which right now would require you to manually modify the `job` table in the system database: https://meltano.com/docs/project.html#system-database

I'd be interested in exploring automatic parallelization over different tap processes using "subpipelines", so that we could track the state for each DB's tables in one place while the way the tables are balanced over different subprocesses can change over time. @ken_payne has suggested something like that before as well. Maybe it's time to create an issue, Ken?
edward_smith:
You got my situation just right. It seems like an extractor could have a `max_db_connections` setting; then I could just put all of my tables in that one extractor and it would round-robin through them.
Thanks for the info on migrating the state... hopefully I won't have to face it 🙂
Offhand, do you know whether an extractor with multiple tables to sync round-robins batches, or does it try to fully sync each table before moving on to the next?
douwe_maan:
@edward_smith It depends on the specific tap, but typically they just naively loop over the streams/tables and handle them one by one
edward_smith:
@douwe_maan wrong Edward 😉
douwe_maan:
Sorry about that 😄
ken_payne:
Thanks for the nudge @douwe_maan - issue here. @edward_smith I would appreciate your thoughts and comments on it too, especially your experience and requirements around connection pooling and, more generally, taking upstream resource constraints into account when making pipeline sharding decisions 😅