# plugins-general
j
Hello! We are using the MySQL tap to move data to a Snowflake target. It works, but I found something strange that I can't fix; I played with the settings, but none of them seems to fix the issue. We have ~500 tables to extract from MySQL and load into Snowflake. It's a long process (it takes hours to complete) and it runs in 100k-row batches. The issue is that if a table has, for example, 630k rows, we get 600k rows (six 100k batches), but the last 30k rows are added to the table only when the whole sync finishes. Meltano moves on to the next table, starts adding its rows, and stops in the same way (only "full batches"); only when the pipeline ends does it add the missing rows. I don't know why it works this way. If a table has fewer than 100k rows, it stays empty until the pipeline ends and adds all the rows.
t
This isn't really a Meltano thing but a target-snowflake thing. As you noted, it inserts the rows in batches, and it doesn't flush a pending partial batch until the input stream ends (which is a good thing for performance, really). I'm not aware of any way to change that behavior, but I could be wrong.
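For what it's worth, your 100k number lines up with the `batch_size_rows` setting in the pipelinewise variant of target-snowflake, which defaults to 100,000. This is just a sketch assuming that variant; lowering it makes partial tables show up sooner (at some performance cost), though a sub-batch remainder still won't flush until the stream ends:

```yaml
# meltano.yml — sketch assuming the transferwise (pipelinewise) variant of
# target-snowflake; run `meltano config target-snowflake list` to see which
# settings your variant actually supports.
plugins:
  loaders:
    - name: target-snowflake
      variant: transferwise
      config:
        # Rows buffered per stream before flushing to Snowflake.
        # Defaults to 100000, which matches the 100k batches you're seeing.
        batch_size_rows: 20000
```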
I dealt with this by doing the initial load in chunks: add 10 tables to the select list, run the pipeline, add 10 more, run it again, and so on. That gets whole tables pushed over, and then keeps them up to date from there. It only works if you're using incremental or log_based replication, though.
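Roughly like this — the database/table names are made up, and I'm assuming the pipelinewise tap-mysql, where stream IDs look like `<database>-<table>`:

```yaml
# meltano.yml — initial-load chunking sketch; names are placeholders.
plugins:
  extractors:
    - name: tap-mysql
      select:
        # Chunk 1: run `meltano run tap-mysql target-snowflake`, then append
        # the next ~10 tables and run again, until all ~500 are selected.
        - "mydb-orders.*"
        - "mydb-customers.*"
        - "mydb-invoices.*"
      metadata:
        "*":
          replication-method: INCREMENTAL  # or LOG_BASED, so later runs only pick up changes
```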
j
got it, thanks! yes, that's what I'm thinking, I will do it in chunks. It took me a long time to figure out what was happening to those "missing rows" lol
another question: is it good practice to have more than one pipeline for the same source/target? I have 500 tables to sync, and for some of them I need more up-to-date data than for others. My idea is to set up one pipeline for a small number of tables that runs more frequently and finishes quickly (since it has fewer tables), and another for all the other tables that runs less frequently…
t
I don't know if that's a "best practice" but it certainly sounds logical to me 😉 For us, incremental updates run fast enough that it isn't worth separating, but I've thought about doing it. I think it should be pretty easy by defining a virtual plugin with the inherit_from option and simply overriding the select list.
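Something like this sketch — `tap-mysql--hot` and the table names are hypothetical, but `inherit_from` and the schedule format are standard meltano.yml:

```yaml
# meltano.yml — two pipelines over the same source/target; names are placeholders.
plugins:
  extractors:
    - name: tap-mysql
      select:
        - "*.*"                    # the full ~500-table sync
    - name: tap-mysql--hot
      inherit_from: tap-mysql      # reuses tap-mysql's config...
      select:                      # ...but overrides the select list
        - "mydb-orders.*"
        - "mydb-sessions.*"
schedules:
  - name: hot-tables
    extractor: tap-mysql--hot
    loader: target-snowflake
    transform: skip
    interval: "*/15 * * * *"       # small and frequent
  - name: all-tables
    extractor: tap-mysql
    loader: target-snowflake
    transform: skip
    interval: "@daily"             # everything, less often
```

If I remember right, the inherited plugin tracks its own incremental state, so the two schedules won't clobber each other's bookmarks.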
j
yep! thank you! that was super helpful!! 🙂
t
Glad I could help!