# getting-started
g
Hello everyone, I may have a bit of an extreme use case 🙂 I need to do an initial load of 1.3 billion rows (around 3TB) and then continuously sync from MySQL (AWS RDS) to Snowflake. Here is the configuration I'm currently using:
plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: pipelinewise-tap-mysql
    metadata:
      '*':
        replication-method: LOG_BASED
  loaders:
  - name: target-snowflake
    variant: meltanolabs
    pip_url: meltanolabs-target-snowflake
    config:
      hard_delete: false
However, the performance is a bit slow: 1 million rows take about 0.5 hours to load, which means it would take around 30 days to load the entire table. Is there any way I can speed it up? We need to get it down to at most a couple of days to be able to load this amount of data. Thanks in advance!
m
I'm not super familiar with tap-mysql, but have you tried running it as full_table to get the initial data copied, then switching to log_based after? The documentation even says that log_based requires an initial sync to snapshot the tables and then uses a starting log position.
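For reference, a minimal sketch of what that override could look like in meltano.yml, assuming the same transferwise variant as in the config above (the value would be switched back to LOG_BASED once the initial copy is done):

plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: pipelinewise-tap-mysql
    metadata:
      '*':
        replication-method: FULL_TABLE  # revert to LOG_BASED after the initial sync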
m
You just saved my day (or the past couple of days ;-)) by publishing your config. It looks like I was missing hard_delete: false in my project. Regarding your issue: what file format are you using in Snowflake, CSV or JSON?
g
Hi @mark_estey, thanks for your comment! As I understand it, when using LOG_BASED the tap first checks whether the table has been synced before; if not, it starts with a full load and remembers the binlog position, and once the initial load finishes it starts syncing data from the binlogs. I have indeed tried full_table, and it does not improve the performance..
m
Hrm, well, the only other thought I had would be to try a different format like Parquet that might have better compression/throughput for that initial sync to Snowflake. You could dump the MySQL data to Parquet on S3, then attach it as an external stage in Snowflake and copy the tables in...
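A rough sketch of the Meltano side of that idea; the target-parquet variant and its setting names are assumptions here (the bucket path is a placeholder too), so check the chosen variant on MeltanoHub before relying on this:

loaders:
- name: target-parquet
  config:
    # Assumption: the chosen variant accepts an output path setting along these lines;
    # the Snowflake external stage + COPY INTO step would still be done manually
    destination_path: s3://my-bucket/mysql-export/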
g
That's what I'm going to try 🙂 I'm thinking about creating a snapshot of the RDS, recreating these tables in Snowflake, and then somehow hacking the Meltano state to trick Meltano into starting to load data with LOG_BASED.
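For anyone attempting the same trick: Meltano exposes stored state through meltano state get and meltano state set. The snippet below is only a sketch of roughly what a log-based bookmark for pipelinewise-tap-mysql tends to look like; the stream ID, binlog file name, and position are placeholders, and the exact keys should be copied from a real meltano state get output rather than from here:

singer_state:
  bookmarks:
    my_db-my_table:                          # placeholder stream ID
      log_file: mysql-bin-changelog.000123   # binlog file captured before the bulk copy
      log_pos: 4                             # position within that file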
m
Yeah that's the tricky part. I've had to deal with GoldenGate in the past (thankfully never since) and you could just grab a current log position, start a full sync via whatever method, then start the replication from the old log position and it would just overwrite any overlap that happened during the full sync without too many issues.
j
There's been a PR in the works for a few months now (https://github.com/meltano/sdk/pull/1876) that may help you once it gets merged; it's related to how much can be batched in a single go.
m
That would be nice, though it's hard to tell in this case whether the bottleneck is on the tap or the target side. I've wished I could configure batch sizes per stream, though, as I had one connector with streams that had both really small and really large row sizes, so it was hard to strike a balance in the batch size on the Snowflake target side.
By the way, I realized the other night that if you use a different loader for the initial load (for example, tap-mysql log-based to target-parquet, then manually stage/load the Parquet into Snowflake) and you keep the log-based extraction in the tap-mysql config, then when you switch to the direct load you should still be able to take the state/bookmark from the initial-load pipeline and copy it to the new direct one (tap-mysql log-based to target-snowflake).
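A sketch of how those two pipelines could sit side by side as Meltano jobs; the job names are made up, and after the initial load finishes the log-based bookmark would be copied from the first pipeline's state ID to the second's with meltano state get / meltano state set:

jobs:
- name: initial-bulk-load        # hypothetical: log-based tap-mysql into Parquet, staged into Snowflake manually
  tasks:
  - tap-mysql target-parquet
- name: ongoing-sync             # hypothetical: same tap, loading Snowflake directly
  tasks:
  - tap-mysql target-snowflake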