# getting-started
edward_smith:
Looking for some help on grouping tables into extractors/taps for load balancing. Brief background: hundreds of tables spread over 5 DB servers, and each DB server allows 5 simultaneous connections. Most tables have reasonable volume, but some are doing hundreds of rows per second. I tried listing all tables in a single extractor, but the big tables take too long and starve the little tables. I tried extractor-per-table, but ran into my 5-connection limit. So now I'm going for 5 extractors per DB, with the tables divided up among the extractors (randomly for now; I could take care to split up the busy tables). My question is about state: it seems like state is tracked at the extractor level, so would it be possible to move a table from one extractor to another in this arrangement if needed for balancing? Any other suggestions on how to handle this?
douwe_maan:
State is tracked based on the job ID passed to `meltano elt`, which matches the schedule name if you used `meltano schedule` to define a scheduled pipeline. So state is tracked per scheduled pipeline, and multiple pipelines can use the same extractor. In your case, it sounds like you'd have an extractor for each DB, then 5 inheriting extractors for each of those with a specific subset of selected tables, and then one schedule for each of those inheriting extractors. Since state is tracked per schedule, moving a stream from one schedule to another would require you to also move the state from one schedule to the other, which right now would require you to manually modify the `job` table in the system database: https://meltano.com/docs/project.html#system-database

I'd be interested in exploring automatic parallelization over different tap processes using "subpipelines", so that we could track the state for each DB's tables in one place while the way the tables are balanced over different subprocesses can change over time. @ken_payne has suggested something like that before as well. Maybe it's time to create an issue, Ken?
edward_smith:
You got my situation just right. It seems like an extractor could have a `max_db_connections` setting; then I could just put all of my tables in that one extractor and it would round-robin through them.
Thanks for the info on migrating the state... hopefully I won't have to face it 🙂
Offhand, do you know whether an extractor with multiple tables to sync round-robins batches, or does it try to fully sync each table before moving on to the next?
douwe_maan:
@edward_smith It depends on the specific tap, but typically they just naively loop over the streams/tables and handle them one by one
edward_smith:
@douwe_maan wrong Edward 😉
douwe_maan:
Sorry about that 😄
ken_payne:
Thanks for the nudge @douwe_maan - issue here. @edward_smith I would appreciate your thoughts and comments on it too, especially your experience and requirements around connection pooling and, more generally, taking upstream resource constraints into account when making pipeline sharding decisions 😅