# troubleshooting
How do I improve Meltano performance in a “fan-out” scenario? I’m specifically interested in tap-mongodb + target-snowflake performance, and I’m hoping there is a way for me to 10x my current throughput.

I am using tap-mongodb (a private fork) to open a MongoDB change stream against the cluster. This change stream includes data from ~150 databases and ~800 collections, and I need to get this data into one Snowflake table per collection. The single change stream (scoped to the entire cluster) is great because it limits the impact on the source cluster and keeps both the Meltano operations and the Meltano state management simple, but it makes this sort of fan-out hard.

Today, we land all cluster data into a single big Snowflake table during the Meltano job run. Once the `meltano run` command completes, we call a Snowflake stored procedure that routes the newly loaded records from this big landing table into the collection-specific tables in database-specific schemas (a record from the `payment_service` database’s `Transaction` collection lands in a `payment_service.transaction` table in Snowflake), and then truncates the landing table.
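For context, the post-run step is roughly the following (a minimal sketch; the connection details and the `ROUTE_LANDING_TABLE` procedure name are placeholders, but the shape is accurate):

```python
import os
import subprocess

import snowflake.connector  # snowflake-connector-python

# 1. Run the EL job: one cluster-wide change stream -> one big landing table.
subprocess.run(["meltano", "run", "tap-mongodb", "target-snowflake"], check=True)

# 2. Fan the landing table out into per-collection tables, then truncate it.
#    ROUTE_LANDING_TABLE is a placeholder name for our routing stored procedure.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="LOADING_WH",
    database="RAW",
    schema="LANDING",
)
try:
    cur = conn.cursor()
    cur.execute("CALL ROUTE_LANDING_TABLE()")  # routes rows, then truncates the landing table
finally:
    conn.close()
```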
This works, but it means data written during a run doesn’t surface in the collection-specific tables (which is where our dbt sources are defined) until the job completes. During periods of high platform activity our Meltano jobs take longer, and this delay is problematic. We have already increased our target-snowflake max batch size, changed from MERGE to COPY (append-only writes), and changed from JSONL to Parquet format for the stage files. These changes have helped, but not by a ton.

Options we have thought of:
• Split our one change stream into more than one to gain some parallelization (rough pymongo sketch after this list):
  ◦ “database names starting with A-M” / “database names starting with N-Z”, or similar, for a naive two-way split
  ◦ Key off the first character of a UUID field in the document itself (all our documents have a UUID `id`) to split the change stream records into sixteen streams of (approximately) equal size
  ◦ Define Meltano stream maps for all the database + collection names we know of, so that those records can be split into per-collection streams and landed directly into the correct destination table rather than having to go through the routing process, and use the current routing procedure for unhandled (novel) records only. (I think we could do this by splitting records into streams with names like `<schema>-<table>` per this; see the stream-map sketch below.) We wouldn’t be able to handle everything dynamically because it’s hard, and I’m concerned about the overhead of having hundreds of streams in one job if we went this route.
• Move away from target-snowflake - perhaps write to S3 ourselves (target-s3?) and use Snowpipe to load data into Snowflake from S3 (see the Snowpipe sketch below). We run in AWS GovCloud but our Snowflake account is in AWS Commercial, so I’m wondering if it’d be more performant to move the “write to stage” step into GovCloud.
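For the change-stream split, here is roughly the filter each tap instance would apply (a pymongo sketch; it assumes our private fork can accept a `$match` pipeline via config, and the URI and regexes are placeholders):

```python
from pymongo import MongoClient

# One tap instance per shard of the namespace space. Each instance opens a
# cluster-wide change stream but $match-filters it, e.g. databases A-M here.
# A UUID-based 16-way split would instead match on the first hex character of
# fullDocument.id, which needs full_document="updateLookup" for update events.
pipeline = [
    {"$match": {"ns.db": {"$regex": "^[a-m]", "$options": "i"}}}
]

client = MongoClient("mongodb://example-cluster:27017/?replicaSet=rs0")  # placeholder URI
with client.watch(pipeline, full_document="updateLookup") as stream:
    for event in stream:
        ns = event.get("ns", {})  # absent for some event types (e.g. invalidate)
        # ... emit a Singer RECORD for the "<db>-<coll>" stream, checkpoint
        # state with event["_id"] (the resume token), etc.
        print(ns.get("db"), ns.get("coll"), event["operationType"])
```

Each split would presumably become its own tap-mongodb definition in meltano.yml (e.g. via `inherit_from`) so that it keeps its own state and resume token.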
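For the stream-map option, we’d generate the mapping config rather than hand-write hundreds of entries. A sketch of that generator, assuming the Singer SDK inline-mapping keys `__source__` and `__filter__` work the way I understand them, and assuming hypothetical `_sdc_source_db` / `_sdc_source_collection` record fields for the source namespace (whatever our fork actually emits):

```python
import json

# Known (database, collection) pairs; in reality we'd pull this list from the
# source cluster or from a config file.
KNOWN_NAMESPACES = [
    ("payment_service", "Transaction"),
    ("payment_service", "Refund"),
    ("user_service", "Account"),
    # ... hundreds more
]

SOURCE_STREAM = "mongodb_cluster_changes"  # placeholder name of the single source stream

stream_maps = {}
for db, coll in KNOWN_NAMESPACES:
    # New stream named "<schema>-<table>" so target-snowflake lands it in the
    # right schema/table directly, skipping the routing procedure.
    stream_maps[f"{db}-{coll.lower()}"] = {
        "__source__": SOURCE_STREAM,
        "__filter__": (
            f"record['_sdc_source_db'] == '{db}' "
            f"and record['_sdc_source_collection'] == '{coll}'"
        ),
    }

# Note: the original stream would also need a __filter__ that excludes the
# known namespaces, so routed records aren't double-loaded; only novel
# namespaces should keep flowing through the landing-table + routing path.
print(json.dumps({"stream_maps": stream_maps}, indent=2))
```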
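For the target-s3 + Snowpipe option, the Snowflake side would presumably be a one-time setup along these lines (a sketch; the stage, pipe, integration, bucket, and table names are all placeholders):

```python
import os

import snowflake.connector  # snowflake-connector-python

# One-time Snowflake setup: an external stage over the bucket target-s3 writes
# to, plus an auto-ingest pipe that copies new Parquet files as they arrive
# rather than waiting for the end of the meltano run.
DDL = [
    """
    CREATE STAGE IF NOT EXISTS raw.landing_stage
      URL = 's3://example-meltano-landing/mongodb/'   -- placeholder bucket/prefix
      STORAGE_INTEGRATION = s3_landing_integration    -- placeholder integration
      FILE_FORMAT = (TYPE = PARQUET)
    """,
    """
    CREATE PIPE IF NOT EXISTS raw.landing_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO raw.landing
      FROM @raw.landing_stage
      FILE_FORMAT = (TYPE = PARQUET)
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """,
]

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="LOADER",
    warehouse="LOADING_WH",
)
try:
    cur = conn.cursor()
    for stmt in DDL:
        cur.execute(stmt)
finally:
    conn.close()
```

Auto-ingest would also need the bucket’s S3 event notifications pointed at the pipe’s notification channel (its SQS queue).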
Any other ideas we should be thinking about? It doesn’t seem like there are any good ways to do this 😕