# troubleshooting
Hi... I am a very new Meltano user, trying to do an initial bulk data load from Salesforce to Snowflake (45M rows total across 7 tables, ranging from 2.5M to 8.5M+ rows each). I am running Meltano in Docker on an AWS WorkSpace with 8 CPUs and 32 GB RAM (upgraded from 4 CPUs + 16 GB). Currently using tap-salesforce (api_type: BULK) with target-snowflake (transferwise variant).

Watching the machine with GKrellM, I see the 8 CPUs run at only around 20-30% each while data is being extracted. On the 4 CPU + 16 GB workspace I got ~390 rows/second (CPUs at approximately 50% busy, i.e. only 1 of the 2 threads per core in use); on the 8 CPU + 32 GB workspace I get just under 600 rows/second at best, but normally around 500 (CPUs at approximately 20-30% busy). On both configurations, `top` shows tap-salesforce and target-snowflake each running at around 100% of a CPU until a file is due to be sent to Snowflake. Once one table reaches 100,000 rows and the stream is to be flushed, ALL the CPUs become idle except one, which gzips the data, sends it to Snowflake, and loads it. Once the load completes, all the CPUs go back to 20-30% busy. My meltano.yml specifies the columns to be included for each object, so we aren't extracting data we don't need.

My questions:

1. Am I correct in thinking that the tap and target are using RECORD streaming to pass the data, rather than BATCH? If so, would a BATCH-based tap and target be more effective?
2. Is the cessation of extraction during compression by design? Can it be avoided?

Note: Due to environmental issues, we may need to redo this bulk extract at times (e.g. after changes to the Salesforce configuration), so more speed is essential.
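For reference, here is a minimal sketch of the meltano.yml configuration described above, plus the two settings I have been experimenting with. The setting names (`api_type`, `batch_size_rows`, `parallelism`) are taken from the tap-salesforce and pipelinewise (transferwise) target-snowflake docs, and the values shown are illustrative guesses, not recommendations; please verify them against your installed variants.

```yaml
# Sketch of the relevant plugin config (assumed setting names; check
# your variant's documentation before relying on these).
plugins:
  extractors:
    - name: tap-salesforce
      config:
        api_type: BULK            # Salesforce Bulk API for the initial load
  loaders:
    - name: target-snowflake
      variant: transferwise
      config:
        batch_size_rows: 250000   # default is 100000; larger batches mean
                                  # fewer flush pauses but more memory use
        parallelism: 4            # parallel flush threads, if supported by
                                  # the installed target version
```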