# troubleshooting
d
What's the syntax for overriding this batch size config in target-snowflake? I've tried
config:  
  batch_config:
    batch_size: <number>
but that doesn't seem to do it
t
That default batch size is only configurable when you’re developing a target. We have https://github.com/meltano/sdk/issues/1626 to track making it configurable and are targeting it for a v1 release. The quick option (for now) would be for you to fork the connector and change that directly.
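If you do fork it, the change is usually small: singer_sdk-based targets drain a sink once its buffer reaches the sink's max_size (10,000 records by default). A minimal sketch, assuming a singer_sdk SQLSink subclass; the class name is illustrative, not target-snowflake's actual code:

# Sketch of overriding the default drain threshold in a forked target.
# Assumes a singer_sdk-based target; the class name is illustrative.
from singer_sdk.sinks import SQLSink


class MySnowflakeSink(SQLSink):
    @property
    def max_size(self) -> int:
        # SDK default is 10,000 records buffered before the sink is drained.
        return 100_000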
d
Do you have a rough idea of when v1 release is going to happen?
t
No firm date beyond that we're aiming for around the end of Q3 (September). https://github.com/meltano/product/issues/1 is our issue for tracking the relevant 1.0 issues. cc @edgar_ramirez_mondragon
e
Yeah, someone from the community is actually trying to tackle this (#1876).
d
Before I go down the fork route, is there a reason that this line isn't reading batch_config from the project yaml? It seems like it should. It's just not clear why overriding the default config in that file would behave differently from supplying it in meltano.yml
u
Oversight 😅. We need to append the batch config schema to the target's own config if the target has that capability. See https://github.com/meltano/sdk/issues/1879 and https://github.com/MeltanoLabs/target-snowflake/issues/99
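For reference, a rough sketch of what "appending the batch config schema" could look like with the SDK's typing helpers; the batch_size property is exactly the piece that isn't in the schema today, so treat the names here as illustrative rather than the final design from #1879:

# Hypothetical sketch only: merge a batch_config schema into a target's
# config_jsonschema when the target declares the BATCH capability.
from singer_sdk import typing as th

BATCH_CONFIG = th.PropertiesList(
    th.Property(
        "batch_config",
        th.ObjectType(
            th.Property(
                "encoding",
                th.ObjectType(
                    th.Property("format", th.StringType),       # e.g. "jsonl"
                    th.Property("compression", th.StringType),  # e.g. "gzip"
                ),
            ),
            th.Property("batch_size", th.IntegerType),  # the missing knob
        ),
    ),
).to_dict()

# e.g. config_jsonschema["properties"].update(BATCH_CONFIG["properties"])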
a
We are using batch_size_rows successfully with tap-salesforce-transferwise to change the batch size going into Snowflake via the S3 bucket. I think batch_config.batch_size is the size of the batches sent between tap and target, and is only really used by the tap (the target gets a batch and has to process all of its rows)? From the Batch Messages page: "When local storage is used, targets do not require special configuration to process BATCH messages." Please correct me if I'm wrong :)
d
Interesting - where exactly are you setting batch_size_rows? A tap shouldn't be writing anything to a file system, since its only job is to emit state, record, and schema messages to stdout. The target, on the other hand, may have to write to a file system as an intermediate step before loading into a sink. In the case of target-snowflake, I think the default behaviour is to create gzipped JSON Lines files in 10k-record batches, put them into a Snowflake stage, then copy them into the target table. You can actually see this happen in real time on your local machine if you run any tap and load to target-snowflake.
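For anyone curious, that stage-and-copy flow looks roughly like the following standalone sketch using snowflake-connector-python; it illustrates the mechanism, not target-snowflake's actual code, and all names and paths are placeholders:

# Illustration of the stage-then-copy flow; not the target's real implementation.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account id>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
cur = conn.cursor()

# 1. Upload a gzipped JSON Lines batch file to the table's stage.
cur.execute("PUT file:///tmp/users-batch-0001.jsonl.gz @%USERS")

# 2. Copy the staged file into the target table.
cur.execute(
    """
    COPY INTO USERS
    FROM @%USERS
    FILE_FORMAT = (TYPE = JSON COMPRESSION = GZIP)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """
)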
a
This may be specific to target-snowflake (Transferwise variant). The loader section is:
loaders:
- name: target-snowflake
  config:
    account: <account id>
    default_target_schema: <schema>
    file_format: <MELTANO_CSV>
    add_metadata_columns: false
    batch_size_rows: 100000
But this defines the size of the files that are staged to S3 and loaded into Snowflake. Note that the default for this parameter is 100,000 rows rather than 10,000. Plugin definition:
loaders:
- name: target-snowflake
  variant: transferwise
  pip_url: pipelinewise-target-snowflake
The batch_config.batch_size you are referring to is, I believe, related to the BATCH message, which is a new feature still in preview according to https://meltano-sdk--1876.org.readthedocs.build/en/1876/batch.html. The BATCH message allows a tap to write a file (or files) to disk, locally by default, then send a list of file names to the target. For example, if the tap receives compressed CSV by default (as tap-salesforce does) it could dump the file directly to disk, then the target (Snowflake) could send the file directly to the stage and load it, avoiding all the row-by-row overhead of translating CSV to JSON and back (noting that only gzipped JSONL is defined at present). I am considering learning enough Python to build a tap-salesforce and target-snowflake that do this, but it's a journey :)
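For context, a BATCH message as described on that page is roughly shaped like this (shown as a Python dict; the stream name and file paths are placeholders):

# Approximate shape of a Singer SDK BATCH message; values are placeholders.
batch_message = {
    "type": "BATCH",
    "stream": "users",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": [
        "file:///tmp/users-batch-0001.jsonl.gz",
        "file:///tmp/users-batch-0002.jsonl.gz",
    ],
}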
d
Yeah, I don't want to use the pipelinewise variant of target-snowflake because it's not built with the modern version of the singer_sdk.
a
One comment: we got much better throughput using the transferwise variant - I can't remember the details, but it was significant. Note also that loading into Snowflake is known to be most efficient with CSV rather than Parquet or ORC; I assume the same would apply to JSON. https://community.snowflake.com/s/article/How-to-Load-Terabytes-Into-Snowflake-Speeds-Feeds-and-Techniques