# getting-started
g
Hello everyone, I may have a bit of an extreme use case 🙂 I need to do an initial load of 1.3 billion rows (around 3TB) and then continuously sync from MySQL (AWS RDS) to Snowflake. Here is the configuration I'm currently using:
plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: pipelinewise-tap-mysql
    metadata:
      '*':
        replication-method: LOG_BASED
  loaders:
  - name: target-snowflake
    variant: meltanolabs
    pip_url: meltanolabs-target-snowflake
    config:
      hard_delete: false
However, the performance is a bit slow: 1 million rows take about 0.5 hours to load, which means it would take around 30 days to load the entire table. Is there any way I can speed it up? We need to get it down to at most a couple of days to be able to load this amount of data. Thanks in advance!
m
I'm not super familiar with tap-mysql, but have you tried running it as full_table to get the initial data copied, then switching to log_based after? The documentation even says that log_based requires an initial sync to snapshot the tables and then uses a starting log position.
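For reference, a minimal sketch of what that override could look like in meltano.yml, assuming the same transferwise variant as in the config above (the value would be switched back to LOG_BASED once the initial copy is done):

plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: pipelinewise-tap-mysql
    metadata:
      '*':
        replication-method: FULL_TABLE  # revert to LOG_BASED after the initial sync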
m
You just saved my day (or the past couple of days ;-)) by publishing your config. It looks like I was missing hard_delete: false in my project. Regarding your issue: what file format are you using in Snowflake, CSV or JSON?
g
Hi @mark_estey, thanks for your comment! As I understand it, when using LOG_BASED the tap first checks whether the table has been synced before; if not, it starts with a full load and remembers the binlog position, and once the initial load finishes it starts syncing data from the binlogs. I have indeed tried full_table, and it does not improve the performance..
m
Hrm, well, the only other thought I had would be to try a different format like Parquet that might have better compression/throughput for that initial sync to Snowflake. You could dump the MySQL data to Parquet on S3, then attach it as an external stage in Snowflake and copy the tables in...
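A rough sketch of the Meltano side of that idea; the target-parquet variant and its setting names are assumptions here (the bucket path is a placeholder too), so check the chosen variant on MeltanoHub before relying on this:

loaders:
- name: target-parquet
  config:
    # Assumption: the chosen variant accepts an output path setting along these lines;
    # the Snowflake external stage + COPY INTO step would still be done manually
    destination_path: s3://my-bucket/mysql-export/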
g
That's what I'm going to try 🙂 I'm thinking about creating a snapshot of the RDS, recreating these tables in Snowflake, and then somehow hacking the Meltano state to trick Meltano into starting to load data with LOG_BASED.
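For anyone attempting the same trick: Meltano exposes stored state through meltano state get and meltano state set. The snippet below is only a sketch of roughly what a log-based bookmark for pipelinewise-tap-mysql tends to look like; the stream ID, binlog file name, and position are placeholders, and the exact keys should be copied from a real meltano state get output rather than from here:

singer_state:
  bookmarks:
    my_db-my_table:                          # placeholder stream ID
      log_file: mysql-bin-changelog.000123   # binlog file captured before the bulk copy
      log_pos: 4                             # position within that file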
m
Yeah that's the tricky part. I've had to deal with GoldenGate in the past (thankfully never since) and you could just grab a current log position, start a full sync via whatever method, then start the replication from the old log position and it would just overwrite any overlap that happened during the full sync without too many issues.
j
There's been a PR in the works for a few months now (https://github.com/meltano/sdk/pull/1876) that may help you once it gets merged; it's related to how much can be batched in a single go.
m
That would be nice, though it's hard to tell in this case whether the bottleneck is on the tap or the target side. I've wished I could configure batch sizes per stream, though, as I had one connector with streams that had both really small and really large row sizes, so it was hard to strike a balance in the batch size on the Snowflake target side.
By the way, I realized the other night that if you use a different loader for the initial load (for example, tap-mysql log-based to target-parquet, then manually stage/load the Parquet into Snowflake) and you keep the log-based extraction in the tap-mysql config, then when you switch to the direct load you should still be able to take the state/bookmark from the initial-load pipeline and copy it to the new direct one (tap-mysql log-based to target-snowflake).
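A sketch of how those two pipelines could sit side by side as Meltano jobs; the job names are made up, and after the initial load finishes the log-based bookmark would be copied from the first pipeline's state ID to the second's with meltano state get / meltano state set:

jobs:
- name: initial-bulk-load        # hypothetical: log-based tap-mysql into Parquet, staged into Snowflake manually
  tasks:
  - tap-mysql target-parquet
- name: ongoing-sync             # hypothetical: same tap, loading Snowflake directly
  tasks:
  - tap-mysql target-snowflake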