akshay_p (01/14/2023, 7:04 AM)
Sven Balnojan (01/14/2023, 8:29 AM)
akshay_p (01/17/2023, 4:33 AM)
Sven Balnojan (01/17/2023, 2:11 PM)

aaronsteers (01/17/2023, 6:02 PM)
> Does Meltano use parallel processing to minimise the data loading time?
Short answer: Meltano itself does not manage parallel processing. Extractors and loaders can be written to support parallel loading, but not all do. (You can always fork a tap/target and create your own if needed.)
Longer answer: Many in the Meltano community find it helpful to create "instances" of the same extractor (using inherit_from) so that they can break large jobs into smaller pieces. In most cases where there are 100s or 1000s of tables, about 5% of those tables carry 90-95% of the runtime. Secondly, there are often 2 or 3 tables which (for whatever reason) are "buggy" and fail a lot due to locks on the table, timeouts, etc. Putting those into their own instance means you can freely and quickly retry them (or ignore their failures) without blocking the rest of your pipeline.
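For illustration, here is a minimal meltano.yml sketch of the "separate instances" pattern described above. The plugin name (tap-postgres), schema, and table names are placeholders, and the exact select patterns depend on how the tap names its streams:

```yaml
# meltano.yml (sketch) -- plugin, schema, and table names are placeholders
plugins:
  extractors:
    - name: tap-postgres
      select:
        - "*.*"                   # sync everything...
        - "!public-orders.*"      # ...except the heavy / flaky tables
        - "!public-events.*"
    - name: tap-postgres--heavy   # a second instance sharing the same connection config
      inherit_from: tap-postgres
      select:                     # select rules set here override the inherited ones
        - "public-orders.*"
        - "public-events.*"
```

Each instance keeps its own state, so `meltano run tap-postgres--heavy target-postgres` can be scheduled, retried, or allowed to fail independently of the main `meltano run tap-postgres target-postgres` pipeline.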
> Assume I have 100 or 1000 tables on the source side, each with millions of records, and I want to load these tables into my data warehouse. My source could be MySQL, PG, Salesforce, or something else. Is Meltano capable of extracting and loading this volume of data and tables in an efficient way?
Yes, the sync process is designed to be very efficient. For MySQL, PG, and other sources which support batch file outputs, it is worth looking into the BATCH message spec. Not all systems support it yet, but more are being added regularly. If/when BATCH is supported, you'll get something very close to "ideal" performance, since those are native bulk operations run against the remote system which automatically take advantage of the parallelism built into those platforms.
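Not every tap/target pair implements BATCH, but for plugins built on the Meltano Singer SDK that do, batching can typically be switched on through the tap's batch_config setting. A hedged sketch, assuming a hypothetical SDK-based tap-postgres and a local staging directory:

```yaml
# meltano.yml (sketch) -- requires BATCH support in both the tap and the target
plugins:
  extractors:
    - name: tap-postgres
      config:
        batch_config:
          encoding:
            format: jsonl        # records are written to batch files...
            compression: gzip    # ...rather than streamed as individual RECORD messages
          storage:
            root: "file:///tmp/meltano-batches"  # staging location; object-store URIs are also possible
            prefix: "batch-"
```

The target then picks up each batch file and loads it through its native bulk-load path, which is where the near-"ideal" performance described above comes from.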