# best-practices
g
Hello everyone, I may have a bit of an extreme use case 🙂 I need to initially load 1.3 billion rows (around 3 TB) and continuously sync from MySQL (AWS RDS) to Snowflake. Here is the configuration I'm currently using:
plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: pipelinewise-tap-mysql
    metadata:
      '*':
        replication-method: LOG_BASED
  loaders:
  - name: target-snowflake
    variant: meltanolabs
    pip_url: meltanolabs-target-snowflake
    config:
      hard_delete: false
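For reference, a sync with this configuration is kicked off with the standard Meltano CLI, roughly like below; the state ID is just an example and flags vary a bit between Meltano versions:
```
# Run the extract/load pipeline defined in meltano.yml
meltano run tap-mysql target-snowflake

# Or the older elt-style invocation with an explicit state ID
meltano elt tap-mysql target-snowflake --state-id=mysql-to-snowflake
```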
However, the performance is a bit slow: 1 million rows take about 0.5 hours to load, meaning it would take around 30 days to finish loading the entire table. Is there any way I can speed this up? We'd need to get it down to at most a couple of days to load this amount of data. One idea I had was to create a snapshot of MySQL, have Snowflake load from that snapshot, and then somehow hack the Meltano state to point at the right binlogs to read, but my colleagues aren't so in favor of this...
m
Hi! Did you end up coming up with a solution to this? We are forever in search of a way to improve the initial syncs.
g
Hi, we tested a bunch of further configs and found the following two options that work for us: 1. Use the transferwise variant for the Snowflake loader. The main benefit is that we can explicitly set batch_size_rows. In our environment, we managed to increase performance from 2 million rows per hour to around 5 million per hour, which in most cases is enough.
loaders:
  - name: target-snowflake
    variant: transferwise
    pip_url: pipelinewise-target-snowflake
    config:
      hard_delete: false
      file_format: ${TARGET_SNOWFLAKE_FILE_FORMAT}
      add_metadata_columns: True
      parallelism: 2
      batch_size_rows: 300000
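One thing worth flagging with the transferwise variant: file_format must point at a file format object that already exists in Snowflake, which is what the TARGET_SNOWFLAKE_FILE_FORMAT variable above refers to. A minimal sketch, assuming SnowSQL is available and using placeholder object names:
```
# Create a named CSV file format for the loader to use (names are placeholders)
snowsql -q "CREATE FILE FORMAT IF NOT EXISTS ANALYTICS.PUBLIC.MELTANO_CSV_FORMAT TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '\"';"

# Point the loader at it via the env var referenced in meltano.yml
export TARGET_SNOWFLAKE_FILE_FORMAT='ANALYTICS.PUBLIC.MELTANO_CSV_FORMAT'
```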
2. For extreme cases, like the one we had with 1.3b rows, we did some hacks. Basically, we created a snapshot of the MySQL database, exported it to S3, and loaded the data into Snowflake manually (rough sketch below). Please note, if you take this approach, you'd better let Meltano create the table first; otherwise, if the data types you specify don't match the ones Meltano will use for the incremental load, it will crash.
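Roughly, the manual backfill looked like this. All identifiers (bucket, ARNs, stage, table names) are placeholders, and the RDS snapshot export lands in S3 as Parquet:
```
# 1. Export the RDS snapshot to S3 (produces Parquet files)
aws rds start-export-task \
  --export-task-identifier mysql-backfill \
  --source-arn arn:aws:rds:us-east-1:123456789012:snapshot:my-db-snapshot \
  --s3-bucket-name my-backfill-bucket \
  --iam-role-arn arn:aws:iam::123456789012:role/rds-s3-export \
  --kms-key-id alias/my-export-key

# 2. Bulk-load into the table Meltano already created, e.g. via an external
#    stage over that bucket, so the column types line up with the tap's schema
snowsql -q "COPY INTO ANALYTICS.RAW.BIG_TABLE FROM @ANALYTICS.RAW.BACKFILL_STAGE MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE FILE_FORMAT = (TYPE = 'PARQUET');"
```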
And then we manually updated the Meltano state. You can do that by either updating something like max_pk / latest_pk, or by just pointing it at the binlog position.
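Concretely, that state surgery can be done with the meltano state commands. The state ID, stream name, and bookmark keys below are examples from our setup; check the output of meltano state get for your pipeline before writing anything back:
```
# Inspect the state Meltano currently tracks for the pipeline
meltano state list
meltano state get dev:tap-mysql-to-target-snowflake

# Overwrite it so LOG_BASED replication resumes from the binlog position
# captured at snapshot time (stream name, log_file and log_pos are examples;
# Meltano asks for confirmation before overwriting)
meltano state set dev:tap-mysql-to-target-snowflake \
  '{"singer_state": {"bookmarks": {"mydb-big_table": {"log_file": "mysql-bin-changelog.123456", "log_pos": 4, "version": 1}}}}'
```
After that, the next LOG_BASED run picks up from that binlog position instead of re-syncing the whole table.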