# best-practices
g
Hello everyone, I may have a bit of an extreme use case 🙂 I need to initially load 1.3 billion rows (around 3 TB) and continuously sync from MySQL (AWS RDS) to Snowflake. Here is the configuration I'm currently using:
plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: pipelinewise-tap-mysql
    metadata:
      '*':
        replication-method: LOG_BASED
  loaders:
  - name: target-snowflake
    variant: meltanolabs
    pip_url: meltanolabs-target-snowflake
    config:
      hard_delete: false
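For reference, a sync with this configuration is kicked off with the standard Meltano CLI, roughly like below; the state ID is just an example and flags vary a bit between Meltano versions:
```
# Run the extract/load pipeline defined in meltano.yml
meltano run tap-mysql target-snowflake

# Or the older elt-style invocation with an explicit state ID
meltano elt tap-mysql target-snowflake --state-id=mysql-to-snowflake
```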
However, the performance is a bit slow: 1 million rows take about 0.5 hours to load, meaning it would take around 30 days to finish loading the entire table. Is there any way I can speed this up? We'd need to get it down to at most a couple of days to load this amount of data. One idea I had was to create a snapshot of MySQL, have Snowflake load from that snapshot, and then somehow hack the Meltano state to point at the right binlogs to read, but my colleagues aren't so in favor of this...
m
Hi! Did you end up coming up with a solution to this? We are forever in search of a way to improve the initial syncs.
g
Hi, we tested a bunch of further configs and found the following two options that work for us: 1. Use the transferwise variant for the Snowflake loader. The main benefit is that we can explicitly set batch_size_rows. In our environment, we managed to increase performance from 2 million rows per hour to around 5 million per hour, which in most cases is enough.
loaders:
  - name: target-snowflake
    variant: transferwise
    pip_url: pipelinewise-target-snowflake
    config:
      hard_delete: false
      file_format: ${TARGET_SNOWFLAKE_FILE_FORMAT}
      add_metadata_columns: True
      parallelism: 2
      batch_size_rows: 300000
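One thing worth flagging with the transferwise variant: file_format must point at a file format object that already exists in Snowflake, which is what the TARGET_SNOWFLAKE_FILE_FORMAT variable above refers to. A minimal sketch, assuming SnowSQL is available and using placeholder object names:
```
# Create a named CSV file format for the loader to use (names are placeholders)
snowsql -q "CREATE FILE FORMAT IF NOT EXISTS ANALYTICS.PUBLIC.MELTANO_CSV_FORMAT TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '\"';"

# Point the loader at it via the env var referenced in meltano.yml
export TARGET_SNOWFLAKE_FILE_FORMAT='ANALYTICS.PUBLIC.MELTANO_CSV_FORMAT'
```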
2. For extreme cases, like the one we had with 1.3b rows, we did some hacks. Basically, we created a snapshot of the MySQL database, exported it to S3, and loaded the data into Snowflake manually (rough sketch below). Please note, if you take this approach, you'd better let Meltano create the table first; otherwise, if the data types you specify don't match the ones Meltano will use for the incremental load, it will crash.
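Roughly, the manual backfill looked like this. All identifiers (bucket, ARNs, stage, table names) are placeholders, and the RDS snapshot export lands in S3 as Parquet:
```
# 1. Export the RDS snapshot to S3 (produces Parquet files)
aws rds start-export-task \
  --export-task-identifier mysql-backfill \
  --source-arn arn:aws:rds:us-east-1:123456789012:snapshot:my-db-snapshot \
  --s3-bucket-name my-backfill-bucket \
  --iam-role-arn arn:aws:iam::123456789012:role/rds-s3-export \
  --kms-key-id alias/my-export-key

# 2. Bulk-load into the table Meltano already created, e.g. via an external
#    stage over that bucket, so the column types line up with the tap's schema
snowsql -q "COPY INTO ANALYTICS.RAW.BIG_TABLE FROM @ANALYTICS.RAW.BACKFILL_STAGE MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE FILE_FORMAT = (TYPE = 'PARQUET');"
```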
And then we manually updated the Meltano state. You can do that by either updating something like max_pk / latest_pk, or by just pointing it at the binlog position.
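Concretely, that state surgery can be done with the meltano state commands. The state ID, stream name, and bookmark keys below are examples from our setup; check the output of meltano state get for your pipeline before writing anything back:
```
# Inspect the state Meltano currently tracks for the pipeline
meltano state list
meltano state get dev:tap-mysql-to-target-snowflake

# Overwrite it so LOG_BASED replication resumes from the binlog position
# captured at snapshot time (stream name, log_file and log_pos are examples;
# Meltano asks for confirmation before overwriting)
meltano state set dev:tap-mysql-to-target-snowflake \
  '{"singer_state": {"bookmarks": {"mydb-big_table": {"log_file": "mysql-bin-changelog.123456", "log_pos": 4, "version": 1}}}}'
```
After that, the next LOG_BASED run picks up from that binlog position instead of re-syncing the whole table.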