guisong_fu
10/31/2023, 3:39 PM

plugins:
  extractors:
    - name: tap-mysql
      variant: transferwise
      pip_url: pipelinewise-tap-mysql
      metadata:
        '*':
          replication-method: LOG_BASED
  loaders:
    - name: target-snowflake
      variant: meltanolabs
      pip_url: meltanolabs-target-snowflake
      config:
        hard_delete: false
However, the performance is quite slow: 1 million rows take about half an hour to load, which means loading the entire table would take around 30 days.
Is there any way I can speed it up? We need to get that down to at most a couple of days to be able to load this amount of data.
I had one idea, which is to create a snapshot of MySQL, load that snapshot into Snowflake, and then somehow hack the Meltano state to point at the right binlogs to read, but my colleagues are not so in favor of this...

matt_cooley
11/28/2023, 9:41 PM
guisong_fu
11/29/2023, 6:35 AM

1. We switched to the transferwise variant for the Snowflake loader. The main benefit we gained is that we can explicitly set batch_size_rows. In our environment, we managed to increase throughput from around 2 million rows per hour to around 5 million rows per hour, which in most cases is enough.
loaders:
  - name: target-snowflake
    variant: transferwise
    pip_url: pipelinewise-target-snowflake
    config:
      hard_delete: false
      file_format: ${TARGET_SNOWFLAKE_FILE_FORMAT}
      add_metadata_columns: True
      parallelism: 2
      batch_size_rows: 300000
2. For the extreme cases, like the one we had with 1.3b rows, we did some hacks. Basically, we created a snapshot of the MySQL database, exported it to S3, and loaded the data into Snowflake manually. Please note that if you take this approach, you'd better let Meltano create the table first; otherwise, if the data types you specify do not match the ones Meltano will later use for the incremental load, it will crash. Then we manually updated the Meltano state: you can do so either by updating something like max_pk / latest_pk, or by just pointing the state at the right binlog position (a rough sketch of that step is below).
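A rough, minimal sketch of what that state update could look like with the meltano state CLI. The state ID, stream ID, binlog file name, and position below are placeholders for illustration only, and the exact bookmark keys depend on the tap variant and replication method, so mirror whatever meltano state get prints for your own pipeline before writing anything back.

# First inspect what Meltano currently has stored for the pipeline.
# The state ID is a placeholder; list the real ones with `meltano state list`.
meltano state list
meltano state get dev:tap-mysql-to-target-snowflake

# Then overwrite the bookmark so LOG_BASED replication resumes from the
# binlog position that matches the manually loaded snapshot. The stream ID
# ("mydb-mytable"), binlog file name, and position are all placeholders;
# copy the structure from the `meltano state get` output for your pipeline.
meltano state set --force dev:tap-mysql-to-target-snowflake '{
  "singer_state": {
    "bookmarks": {
      "mydb-mytable": {
        "log_file": "mysql-bin.000123",
        "log_pos": 45678901,
        "version": 1
      }
    }
  }
}'

# Resume the pipeline; the tap should pick up from the new binlog position.
meltano run tap-mysql target-snowflake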