peter_s [03/17/2023, 4:33 PM]

aaronsteers [03/25/2023, 2:24 AM]

peter_s [03/25/2023, 5:45 PM]

aaronsteers [03/25/2023, 5:56 PM]

aaronsteers [03/25/2023, 6:02 PM]

aaronsteers [03/25/2023, 6:05 PM]
… tap-spreadsheets-anywhere > target-snowflake.

peter_s [03/26/2023, 12:31 AM]
… tap-spreadsheets-anywhere > pipelinewise-target-snowflake flow and have been happy with how much EL development work it's been saving us, but its slowness on very large (20 GB) files has become a real problem.
We do use dbt for transformations and tests. Can you say more about how dbt can be used for loading?

aaronsteers [03/26/2023, 1:04 AM]

aaronsteers [03/26/2023, 1:06 AM]

aaronsteers [03/26/2023, 1:07 AM]

aaronsteers [03/26/2023, 1:08 AM]

aaronsteers [03/26/2023, 1:09 AM]

peter_s [03/26/2023, 1:24 AM]

aaronsteers [03/26/2023, 1:32 AM]

aaronsteers [03/26/2023, 1:34 AM]
aaronsteers [03/26/2023, 1:38 AM]
meltano invoke tap-spreadsheets-anywhere > my-output.jsonl
Then something like this to test the target:
cat my-output.jsonl | meltano invoke target-snowflake
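(Not from the thread, just an illustrative sketch.) The intermediate file captured above is plain newline-delimited Singer messages, so it can be inspected before replaying it into the target. The helper function and the sample messages below are hypothetical:

```python
import json
from collections import Counter

def message_type_counts(lines):
    """Count Singer message types (SCHEMA/RECORD/STATE) in captured tap output."""
    return Counter(json.loads(line)["type"] for line in lines if line.strip())

# Hypothetical sample of what my-output.jsonl might contain:
sample = [
    '{"type": "SCHEMA", "stream": "users", "schema": {"type": "object"}, "key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "users", "record": {"id": 1}}',
    '{"type": "RECORD", "stream": "users", "record": {"id": 2}}',
    '{"type": "STATE", "value": {}}',
]
print(message_type_counts(sample))  # Counter({'RECORD': 2, 'SCHEMA': 1, 'STATE': 1})
```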
aaronsteers [03/26/2023, 1:39 AM]

aaronsteers [03/26/2023, 1:42 AM]

peter_s [03/26/2023, 2:05 AM]

aaronsteers [03/27/2023, 3:52 PM]
> I've also tried a different S3 tap (I think it was from pipelinewise) and it was a little faster, but not by a huge factor. So I figured the tap's slowness was something inherent to the Singer spec or Singer libraries and didn't explore it further.
Not Singer-specific per se, but any process that reads one record at a time and writes out one record at a time will necessarily be a lot slower than something operating at the batch-interface level, where the target system just reads the file directly.
> This EL pipeline takes literally 20 times longer than loading the same files using our original process that used Snowflake stages and COPY statements directly.
Not surprising to me at all. When compared with a process that doesn't need to pre-process the files at all because Snowflake is reading them directly (e.g. our Singer BATCH spec, Snowflake-native COPY operations, Snowpipe, etc.), then yes, something like a 20x improvement is exactly what I'd expect in those cases.
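To make the record-level vs. batch-level contrast concrete, here is a hypothetical sketch (the stream name, rows, and S3 path are made up; the BATCH message shape follows the Meltano SDK's batch encoding, but check the SDK docs for specifics):

```python
import json

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]  # stand-in for a 20 GB file

# Record-level flow: one RECORD message per row. The tap serializes every row,
# and the target parses, validates, and buffers every row before loading.
record_messages = [
    json.dumps({"type": "RECORD", "stream": "users", "record": row})
    for row in rows
]

# Batch-level flow: a single message that points at the file itself, so the
# target can hand the location straight to a COPY operation without touching rows.
batch_message = json.dumps({
    "type": "BATCH",
    "stream": "users",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": ["s3://example-bucket/users/part-0001.jsonl.gz"],  # assumed path
})

print(len(record_messages))               # grows with row count
print(json.loads(batch_message)["type"])  # one message regardless of file size
```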
aaronsteers [03/27/2023, 3:57 PM]

peter_s [03/28/2023, 1:03 AM]

aaronsteers [03/28/2023, 1:08 AM]

peter_s [03/28/2023, 2:10 AM]
… tap-spreadsheets-anywhere and pipelinewise-target-snowflake, adding to the tap an option to read just a few records from the file and pass them to the target for schema validation (which the target already supports), and adding to the target an option to retrieve the original S3 location from the passed records (which the tap already includes as metadata) and load from that original location. It's definitely not a clean design, but the loading is very fast, and I still get schema validation.
So one variant you might consider is to have an option for the tap or target to read some of the data for validation purposes, though I don't know whether it's possible to do that cleanly enough to make it part of the spec or the SDK.
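A rough sketch of that hybrid approach (all names here are hypothetical illustrations, not the actual patch): the tap emits only a handful of sampled records tagged with the source file location, and the target derives a bulk COPY from that metadata instead of loading record by record:

```python
import csv
import io

SAMPLE_SIZE = 5  # hypothetical knob: how many records to emit for validation

def sample_records(csv_text, s3_uri, n=SAMPLE_SIZE):
    """Tap side: yield up to n parsed records, each tagged with its source file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        if i >= n:
            break
        row["_sdc_source_file"] = s3_uri  # metadata the target can read back
        yield row

def build_copy_statement(records, table):
    """Target side: recover the original location and bulk-load from it."""
    sources = {r["_sdc_source_file"] for r in records}
    assert len(sources) == 1, "expected one source file per batch"
    # Sketch of a Snowflake COPY from an external location; stage, credentials,
    # and file-format options omitted.
    return f"COPY INTO {table} FROM '{sources.pop()}'"

sampled = list(sample_records("id,name\n1,a\n2,b\n3,c\n", "s3://example-bucket/data.csv"))
print(build_copy_statement(sampled, "raw.users"))
# COPY INTO raw.users FROM 's3://example-bucket/data.csv'
```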
jenna_jordan [04/04/2023, 2:01 PM]

aaronsteers [04/04/2023, 3:58 PM]

peter_s [04/04/2023, 4:05 PM]

aaronsteers [04/04/2023, 5:03 PM]

jenna_jordan [04/04/2023, 5:11 PM]

peter_s [04/04/2023, 5:35 PM]
tap-spreadsheets-anywhere overhead is what made my speed-up only 2x compared to your 6x: you've mentioned that tap-spreadsheets-anywhere may be slow due to smart_open, and I've found that https://github.com/transferwise/pipelinewise-tap-s3-csv is 2-3x faster than record-based tap-spreadsheets-anywhere, so that could account for the difference (though I realize this isn't an apples-to-apples comparison to your case).
For further comparison, I got a 20x speedup of an S3 -> Snowflake flow in a test in which (a hacked version of) the target loads the original CSV files located on S3. But I think it may be worth settling for something like 6x to get the benefits of a JSON-based flow (i.e. abstraction away from the specifics of Snowflake, cleaner implementation of validation if desired, extensibility to non-CSV source files, etc.).

aaronsteers [04/06/2023, 3:53 PM]

peter_s [04/06/2023, 3:58 PM]

aaronsteers [04/06/2023, 5:13 PM]