# troubleshooting
**akshay_p:**
Hi Team, I'm new to Meltano and I'm searching for the best open source ELT tool for some of my use cases. I have a few questions; can you help me clarify them? Thanks!
1. Suppose I want to load 100 tables of data from any database to any database. Can Meltano help me with this?
2. Is Meltano distributed, and does it scale with the load?
3. Can it load huge volumes of data?
4. Is it faster than Apache Spark?
Thanks.
**Sven Balnojan:**
Hello @akshay_p I think I would need a bit more context to answer your questions, but I'll give it a try:
1. Yes! Meltano is excellent for moving data from A => B.
2. I'm not sure where you're going with this question. It sounds very much dependent on your use case and your source, though.
3. What's a "huge volume of data" for you?
4. Apache Spark per se isn't a data movement tool, so I'm not sure what exactly you mean here.
Maybe it would be helpful if you gave an example of what you're trying to do? Your use case?
**akshay_p:**
@Sven Balnojan Thanks for replying! 🙂 Here is what I want to implement:
1. Is Meltano distributed or scalable for onboarding data from A => B?
2. Assume I have 100 or 1,000 tables on my source side with millions of records each, and I want to move or load these tables to my data warehouse. My source could be MySQL, PG, Salesforce, or something else. Is Meltano capable of extracting and loading this volume of data and tables in an efficient way?
3. Can we onboard 100 or 1,000 tables using Meltano?
4. Does Meltano use parallel processing to minimise the data loading time?
**Sven Balnojan:**
@aaronsteers mind offering your stance here?
**aaronsteers:**
> Does Meltano use parallel processing to minimise the data loading time?
Short answer: Meltano itself does not manage parallel processing. Extractors and loaders can be written to support parallel loading, but not all do. (You can always fork a tap/target and create your own if needed.)

Longer answer: Many in the Meltano community find it helpful to create 'instances' of the same extractor (using `inherits_from`) so that they can break large jobs into smaller pieces. For instance, in most cases where there are 100s or 1000s of tables, about 5% of those tables carry 90-95% of the runtime. Secondly, there are often 2 or 3 tables which (for whatever reason) are 'buggy' and fail a lot due to locks on the table, timeouts, etc. Putting those into their own instance means that you can freely and quickly retry them (or ignore failures) without blocking the rest of your pipeline.
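To make that `inherits_from` pattern concrete, here's a minimal `meltano.yml` sketch. The `big_events` table and the `public-` stream-name prefix (tap-postgres names streams as `<schema>-<table>`) are hypothetical placeholders for illustration:

```yaml
plugins:
  extractors:
    - name: tap-postgres
      variant: meltanolabs
      pip_url: meltanolabs-tap-postgres

    # Instance for the heavy/flaky table(s), so they can be
    # run and retried on their own without blocking anything else.
    - name: tap-postgres--heavy
      inherits_from: tap-postgres
      select:
        - public-big_events.*

    # Instance for everything else; excludes the heavy table.
    - name: tap-postgres--rest
      inherits_from: tap-postgres
      select:
        - "*.*"
        - "!public-big_events.*"
```

Each instance can then be run independently (and concurrently, e.g. from your orchestrator), along the lines of `meltano run tap-postgres--heavy target-snowflake` and `meltano run tap-postgres--rest target-snowflake`.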
> Assume I have 100 or 1,000 tables on my source side with millions of records each, and I want to move or load these tables to my data warehouse. My source could be MySQL, PG, Salesforce, or something else. Is Meltano capable of extracting and loading this volume of data and tables in an efficient way?
Yes, the sync process is designed to be very efficient. For MySQL, PG, and other sources which support batch file outputs, it might be worth looking into the BATCH message spec. Not all systems support this, but more are being added regularly. If/when BATCH is supported, you'll get something very close to 'ideal' performance, since those are native bulk operations run against the remote system, which automatically take advantage of the parallelism built into those platforms.
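For taps built on the Meltano Singer SDK with batch support, BATCH messaging is typically enabled through a `batch_config` setting. A minimal sketch, assuming the tap supports BATCH and with the S3 bucket/prefix as placeholders you would replace:

```yaml
plugins:
  extractors:
    - name: tap-postgres
      variant: meltanolabs
      pip_url: meltanolabs-tap-postgres
      config:
        # Emit BATCH messages (bulk files) instead of row-by-row RECORDs.
        batch_config:
          encoding:
            format: jsonl        # batch files as JSON Lines
            compression: gzip    # gzip-compress each batch file
          storage:
            root: "s3://my-bucket/meltano-batches"  # placeholder bucket
            prefix: "batch-"
```

Note that both the tap and the target need BATCH support for this to work end to end; otherwise, stick with the standard record-by-record sync.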