# troubleshooting
Not sure of the best place to ask this question, since it stems from troubleshooting a toy example inspired by the modern data stack, so I'm happy to move this elsewhere. How could I improve the performance of a Meltano pipeline that loads a CSV of ~1M records into a DuckDB database? My pipeline uses the Meltano variant of tap-csv and the jwills variant of target-duckdb, and takes ~2 hours to complete. Importing the same CSV with DuckDB directly takes nowhere near as long (seconds to minutes).
This is likely because of the way Singer taps work: they serialize each row into a newline-delimited JSON RECORD message, which the target then has to parse and load one at a time. Since tap-csv is based on our SDK, we should be able to support BATCH messages, which would dramatically speed it up. I opened an issue for that: https://github.com/MeltanoLabs/tap-csv/issues/177. Batch message docs: https://sdk.meltano.com/en/latest/batch.html
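To make the overhead concrete, here is a rough stdlib-only sketch of the two message shapes. The stream name and manifest path are hypothetical; the structures follow the Singer spec and the Meltano SDK batch docs linked above, simplified:

```python
import json

row = {"id": 1, "name": "alice"}

# Per-row path: every record becomes its own JSON message that the
# target must read, parse, and validate individually -- for ~1M rows
# that is ~1M serialize/parse round trips.
record_msg = json.dumps(
    {"type": "RECORD", "stream": "users", "record": row}
)

# BATCH path (Meltano SDK extension): a single message hands the
# target a manifest of already-written JSONL files it can bulk-load.
batch_msg = json.dumps({
    "type": "BATCH",
    "stream": "users",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": ["file:///tmp/users-0001.jsonl.gz"],
})
print(record_msg)
print(batch_msg)
```

With BATCH messages, a DuckDB target could hand the manifest files to a bulk ingest path instead of inserting row by row, which is where the speedup comes from.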