# troubleshooting
Hello Meltano people, please share some best practices to speed up or debug performance issues. I have now tested Meltano copying from Oracle to Oracle:
meltano run tap-oracle target-oracle
I have tested most basic data types (VARCHAR, DATE, various numeric types) on small and mid-sized tables: around 30 columns and a few million rows. My tests do not cover all cases: most of my tables are small, with an average of 1 KB of data per row. I have also successfully tested a table with small CLOBs (a CLOB is a UTF-8 string that can potentially be very long, similar to a long VARCHAR).

Observation: the results are reasonable but in general a bit slow. I have observed between 200 and 2,500 rows per second, and about 1 MB/s. That second number is independent of table structure: all my tables are copied at roughly 1 MB/s, which seems very slow. During the run I see rather low CPU usage (sometimes reaching 100%, but usually below 10%) and very low memory usage. I also see that Meltano is pushing data to the target in batches of 10,000 rows, and I could not find out whether this or other parameters are configurable. For comparison, our old SAS ETL is about twice as fast in most cases. What could be the cause of this slow speed, how can I debug it, and how can I change some parameters to experiment?

Then I tested a table with a large CLOB column: each record could reach a few megabytes of data. This was a complete failure: the entire ETL (meltano run tap-oracle target-oracle) took 30 minutes to push the first batch of 10,000 rows. [Update: I later checked that it was slow simply because there was more data, while the average copying speed of 1 MB/s was still maintained. So in fact, copying CLOBs was not slower than other data types.] While this could possibly be reasonable, it would be good to limit the batch size. I noted the record count stayed at 0 and then, after some 25 minutes, jumped to 10,000. For large records it would be reasonable to enforce small batches. How? Is this parameter down to SQLAlchemy, or could I configure it at the Meltano level?
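A minimal sketch of how the batch size might be adjusted, assuming the target-oracle variant in use exposes a batch-size setting. A name like `batch_size_rows` is common in Singer SDK and pipelinewise-style targets, but it is not guaranteed for this plugin, so check the exposed settings first:

```bash
# See which settings your target-oracle variant actually exposes
meltano config target-oracle list

# Hypothetical: if the variant supports a batch-size setting
# (often called batch_size_rows), shrink batches for wide CLOB rows
meltano config target-oracle set batch_size_rows 1000
```

If no such setting is listed, the batch size is hard-coded inside the target itself rather than something Meltano can override.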
Hayden Ness:
I have the same problem with the SQL Server tap. I achieve around 50K-300K rows/s, depending on the table, with around 100% CPU usage. Unfortunately, Airbyte was much worse for me: 7 minutes for 80K rows initially, then 2 minutes after that using CDC. It's all very bad, but at least Meltano hasn't made any performance claims as far as I'm aware.

I suspect the first issue is simply the use of Python. One test I ran compared the ingestion of 80K rows using PyMSSQL vs Java ODBC: 2+ seconds vs milliseconds, a massive difference. Then Meltano and the taps may be doing any amount of additional processing, also in Python.

The second issue is probably the pointless type conversions. Broadly speaking, actual types are converted to generic JSON types like Number, meaning types like int8/16/32 are widened to int64, floats to doubles, and so on. This slows down ingestion, uses more space on the target, and makes querying the target slower because of the increased data size. Worse, it can create functional problems: I've had to implement multiple workarounds and get multiple issues resolved with the tap. I have it worse because my tap also converts to Parquet first, adding another layer of type and performance problems, but that's a matter of writing my own tap for my specific needs, so not really a Meltano problem.

I highly advise writing your own tooling; I would if I could go back in time, and I will get around to it. It will be at least an order of magnitude faster, more reliable, more flexible, and easier to implement. Unfortunately, the modern software world does not care much about performance, so you'll have to do it yourself.

If you end up using Meltano anyway, you'll also want to multiprocess it (Python makes actual threading hard for no good reason, on top of all its other performance problems), and that involves a lot of work because you basically have to undo all the caching that Meltano does. Even then you run into issues like system-database locking errors between workers because of SQLite limitations, so you would then need to generate multiple projects and so on. Not a good experience.
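As a rough illustration of the parallelization described above, here is a sketch of splitting tables across concurrent pipelines. The stream names (`SCHEMA-TABLE_A`, `SCHEMA-TABLE_B`) and the Postgres URI are placeholders, stream naming depends on the tap, and this assumes a Meltano version where `meltano elt` accepts `--select` and `--state-id`:

```bash
# Run two independent pipelines concurrently, each extracting a
# different set of streams and keeping its own state
meltano elt tap-oracle target-oracle --select 'SCHEMA-TABLE_A' --state-id oracle_a &
meltano elt tap-oracle target-oracle --select 'SCHEMA-TABLE_B' --state-id oracle_b &
wait

# If the default SQLite system database throws locking errors under
# concurrent jobs, point Meltano at a different backend instead, e.g.:
# export MELTANO_DATABASE_URI='postgresql://user:pass@host:5432/meltano'
```

Moving the system database off SQLite is one way to avoid the locking errors mentioned above without generating multiple projects.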
Thank you @Hayden Ness. This indeed sheds some light. However, I think I have additional issues, different from yours. You say you copy 50K-300K rows, but in my case it is 2K rows, at a constant speed of 1 MB/s, which is a joke. This average speed is independent of the type of table copied (small, big, VARCHAR, CLOB, DATE, integer) and independent of my location.

However, I established that with most data types it is tap-oracle (the reading speed at the source) that limits the process at 1 MB/s. This is always true except for the CLOB data type, in which case reading is fast but it is target-oracle (the writing speed at the destination) that limits the process, again at 1 MB/s.

In my case the processor is idle most of the time, so I do not think it has anything to do with type conversions, which would show up as CPU usage. I am sure I am getting blocked at some I/O level, possibly by some buffering parameter set by the operating system, by Python, by some environment variable, or by the Meltano plugins.
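One way to test that hypothesis is to time the tap and the target separately, outside of `meltano run`. This is only a sketch: it assumes the generic `pv` pipe-throughput utility is installed, uses an arbitrary file path, and relies on `meltano invoke` running a single plugin with its configuration while Singer targets read their input stream from stdin:

```bash
# Step 1: run the tap alone and watch its raw output rate;
# pv reports the throughput of the Singer messages in the pipe
meltano invoke tap-oracle | pv > /tmp/oracle_stream.jsonl

# Step 2: replay the recorded stream into the target alone,
# so its write speed is measured without the tap in the way
pv /tmp/oracle_stream.jsonl | meltano invoke target-oracle
```

If the tap alone is already capped near 1 MB/s, the bottleneck is likely inside the tap or the Oracle client's fetch settings rather than in Meltano's plumbing; if the replay into the target is the slow step, the batch-size and target settings discussed earlier are the place to experiment.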