Matt Menzenski
04/29/2025, 11:02 PM
When the meltano run command completes, we call a Snowflake stored procedure that routes the newly loaded records from this big landing table into the collection-specific tables (in database-specific schemas - a record from the `payment_service` database's `Transaction` collection lands in a `payment_service.transaction` table in Snowflake), and then truncates the big table. This works, but it means data written during a run doesn't surface in the collection-specific tables (which is where our dbt sources are defined) until the job completes. During periods of high platform activity, our meltano jobs take longer and this delay is problematic.
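For context, the routing step is conceptually something like the sketch below (written in Python with snowflake-connector-python rather than as the actual stored procedure, with made-up names - raw_landing, database_name, collection_name, record - standing in for our real landing table and columns):
```python
import snowflake.connector

def route_landing_table(conn: snowflake.connector.SnowflakeConnection) -> None:
    """Rough sketch of the routing step: fan rows out from the big landing
    table into per-collection tables, then clear the landing table."""
    cur = conn.cursor()
    try:
        # Every database/collection pair that showed up in this batch.
        cur.execute("SELECT DISTINCT database_name, collection_name FROM raw_landing")
        for database_name, collection_name in cur.fetchall():
            # Copy those rows into the collection-specific table,
            # e.g. payment_service.transaction.
            cur.execute(
                f"INSERT INTO {database_name}.{collection_name} "
                "SELECT record FROM raw_landing "
                "WHERE database_name = %s AND collection_name = %s",
                (database_name, collection_name),
            )
        # Cleanup happens only after every collection has been routed.
        cur.execute("TRUNCATE TABLE raw_landing")
    finally:
        cur.close()
```
The whole thing only runs after the meltano run command finishes, which is why nothing written during the run is visible in the collection-specific tables until then.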
We have already increased our target-snowflake max batch size, switched from MERGE to COPY (append-only writes), and switched the stage files from JSONL to Parquet. These changes have helped, but not by much.
Options we have thought of:
• split our one change stream into more than one to gain some parallelization
◦ “database names starting with A-M / database names starting with N-Z” or similar for a naive two-way split
◦ Key off the first character of a UUID field in the document itself (all our documents have a UUID `id`) to split the change stream records into sixteen streams of approximately equal size (rough sketch of the split key after this list)
◦ Define Meltano stream maps for all the database+collection names we know of (so those records can be split into per-collection streams and landed directly in the correct destination table rather than having to go through the routing process), and use the current routing procedure only for unhandled (novel) records. (I think we could do this by splitting records into streams with names like <schema>-<table> per this.) We wouldn't be able to handle everything dynamically because it's hard, and I'm concerned about the overhead of having hundreds of streams in one job if we went this route.
• Move away from target-snowflake - perhaps write to S3 ourselves (target-s3?) and use Snowpipe to load the data into Snowflake from S3. (We run in AWS GovCloud but our Snowflake account is in AWS Commercial, so I'm wondering whether it'd be more performant to move the "write to stage" step into GovCloud.)
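For the UUID-based sixteen-way split, the key derivation itself is simple - something like this sketch (plain Python, names made up, and assuming our id values are random v4 UUIDs so the first hex character is roughly uniform):
```python
import uuid
from collections import Counter

def shard_for(document_id: str) -> str:
    """Map a document's UUID to one of sixteen shard names (0-f),
    keyed off its first hex character."""
    first_char = document_id.strip().lower()[0]
    if first_char not in "0123456789abcdef":
        raise ValueError(f"unexpected UUID: {document_id!r}")
    return f"change_stream_{first_char}"

# Sanity check: random (v4) UUIDs spread evenly across the sixteen shards.
counts = Counter(shard_for(str(uuid.uuid4())) for _ in range(160_000))
print(counts.most_common(3))
```
In practice this logic would live in the tap / stream-map configuration rather than in standalone code, but it shows why the split should come out roughly even.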
Any other ideas we should be thinking about? It doesn’t seem like there are any good ways to do this 😕