# singer-taps
t
Has anyone done any performance tuning on taps? I have a table in MySQL with 1M rows in it. Dumping that to a text file with the MySQL CLI tool takes about 3 seconds. Running the tap manually (e.g. tap-mysql -c config.json --catalog ...) and piping the output to a file takes 2.5 minutes. 😮 I profiled the tap with cProfile and 40% of the time is in cursors.fetchone and 30% of the time is in messages.write_message. I know next to nothing about Python though, so... I don't have any idea how to optimize any of that. I think I know enough about Meltano to dump the table manually, load it, and set the right state in meltano.db, but... that's a little scary.
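For reference, here's roughly how I gathered that profile. This is a sketch, not exactly what I ran: the tap_mysql.main import path and the sys.argv shim are assumptions about how the package exposes its CLI entry point, and the config/catalog file names are placeholders.

```python
# Sketch: run the tap's entry point under cProfile and dump the stats.
# Assumes tap-mysql is installed and exposes tap_mysql.main(); adjust
# the import and argv to match your install.
import cProfile
import pstats
import sys

import tap_mysql

# The tap parses sys.argv, so fake the CLI invocation.
sys.argv = ["tap-mysql", "-c", "config.json", "--catalog", "catalog.json"]

cProfile.run("tap_mysql.main()", "tap.prof")
pstats.Stats("tap.prof").sort_stats("cumulative").print_stats(15)
```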
t
@visch has done quite a bit I think
t
Thanks. I was able to improve the performance of the tap by about 30% by using cursor.fetchmany instead of cursor.fetchone. I got a bit more improvement by switching to a newer version of PipelineWise's support lib, which uses a faster JSON library. Cutting the runtime almost in half feels good, but it's still not even close to what's necessary for moving large tables. 😞 So now I guess I understand why PW built the "fast sync" thing. Ultimately I don't think I have much choice but to dump the tables to CSV with other tools and fake the state data so Meltano can handle incremental loads after that. 😕
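In case it helps anyone else, the fetchmany change was essentially this (a minimal sketch; the cursor is a plain DB-API cursor, e.g. from pymysql, and the batch size is just a knob to tune):

```python
# Before: one Python -> driver round trip per row.
# row = cursor.fetchone()
#
# After: pull rows in chunks to amortize the per-call overhead.
BATCH_SIZE = 10_000  # tuning knob, not a magic number

def iter_rows(cursor):
    while True:
        rows = cursor.fetchmany(BATCH_SIZE)
        if not rows:
            break
        yield from rows
```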
v
I haven't done it specifically with that tap, but yes. There's a good writeup by Edgar on sdk.meltano.com on how to look into performance. When looking at execution time, for me it has come down to per-record processing time; unfortunately it's not the network stack as you'd expect. Almost always, disabling JSON schema validation got me faster than the network. If the network between the two is extremely fast, you'll hit this issue. Batch works by cutting down the number of operations Python has to do and by letting your target do some optimizing for you. If you're after a 2-4x speed boost and that's enough, looking at PyPy might be interesting; I haven't looked myself!
A few other tricks I've used: to see whether it's the tap or the target slowing you down, pass a captured file into the target directly (sketch below). You can also put a queue between the two systems; that's been nice for me, because if the queue starts piling up between the tap and the target, I know the slowness is in the target.
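Something like this, say. The target name and file paths are just examples, not specific to any setup:

```python
# Sketch: replay a previously captured tap output file straight into a
# target, so the tap is out of the picture entirely.
import subprocess
import time

start = time.perf_counter()
with open("records.singer", "rb") as captured:
    subprocess.run(
        ["target-jsonl", "--config", "target_config.json"],
        stdin=captured,
        check=True,
    )
print(f"target consumed the stream in {time.perf_counter() - start:.1f}s")
```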
Generally my slowness has been target-based, but for full DB pulls most people have gone with batch. I just haven't needed it, FWIW.
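For context, the idea behind batch is that instead of one RECORD message per row, the tap writes rows to files and emits a single pointer message. Roughly the shape the Meltano SDK specs for it (stream name and paths made up here):

```python
# Approximate shape of a Singer BATCH message per the Meltano SDK spec.
# The target reads the manifest files itself instead of parsing millions
# of individual RECORD messages off stdin.
batch_message = {
    "type": "BATCH",
    "stream": "my_big_table",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": [
        "file:///tmp/batches/my_big_table-0001.jsonl.gz",
        "file:///tmp/batches/my_big_table-0002.jsonl.gz",
    ],
}
```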
t
Thanks @visch. My issue is definitely the tap. I've been testing by running the tap directly, in fact, so there's no target (or even Meltano) involved. Can you say more about disabling schema validation? I think I've seen options for that in targets but not in taps. (And I'm not sure it makes any sense there...)
v
Instead of starting with JSON validation, definitely start with something like https://sdk.meltano.com/en/latest/dev_guide.html#testing-performance so you know what's actually slowing you down.
Gains may vary as it really depends on the usage pattern 🤷 For you, the batch stuff from PipelineWise already works, so you might want to go there, but if you want to dig into performance I'd do what I just posted.
t
I've been using cProfile to profile the tap and snakeviz to view the results. That's where I got the numbers in my original comment about 40% of the time being in fetchone and 30% in write_message. 😉 Which is fairly logical, really... reading results from the DB and formatting them into the Singer format should be most of what the tap does. It's just too damn slow about it all to be practical for large tables, apparently. 🤷🏻 At this point I'm guessing that's a Python problem, given what I've read about special C libraries for speeding up JSON formatting in Python. Thankfully this is only a problem for us for initial loads, so I think I can work around it. I do have to imagine this is a blocker for anyone with high change volume, though.
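To illustrate the JSON angle, here's a toy micro-benchmark, not the tap's real code path; orjson is one of those compiled (Rust-backed) encoders, and numbers will vary by machine:

```python
# Toy micro-benchmark: stdlib json vs orjson on a Singer-ish RECORD message.
import json
import time

import orjson  # third-party: pip install orjson

record = {"type": "RECORD", "stream": "users",
          "record": {"id": 1, "name": "a" * 50, "score": 3.14}}

# Encode both to bytes so the comparison is apples-to-apples.
for name, dumps in [("json", lambda o: json.dumps(o).encode()),
                    ("orjson", orjson.dumps)]:
    start = time.perf_counter()
    for _ in range(200_000):
        dumps(record)
    print(name, f"{time.perf_counter() - start:.2f}s")
```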