# troubleshooting
d
What's the syntax for overriding this batch size config in target-snowflake? I've tried
config:  
  batch_config:
    batch_size: <number>
but that doesn't seem to do it
t
That default batch size is only configurable when you’re developing a target. We have https://github.com/meltano/sdk/issues/1626 to track making it configurable and are targeting it for a v1 release. The quick option (for now) would be for you to fork the connector and change that directly.
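If you do fork it, the change is usually small: singer_sdk-based targets drain a sink once its buffer reaches the sink's max_size (10,000 records by default). A minimal sketch, assuming a singer_sdk SQLSink subclass; the class name is illustrative, not target-snowflake's actual code:

# Sketch of overriding the default drain threshold in a forked target.
# Assumes a singer_sdk-based target; the class name is illustrative.
from singer_sdk.sinks import SQLSink


class MySnowflakeSink(SQLSink):
    @property
    def max_size(self) -> int:
        # SDK default is 10,000 records buffered before the sink is drained.
        return 100_000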
d
Do you have a rough idea of when v1 release is going to happen?
t
No firm date beyond that we're aiming for around the end of Q3 (September). https://github.com/meltano/product/issues/1 is our issue for tracking the relevant 1.0 issues. cc @edgar_ramirez_mondragon
e
Yeah, someone from the community is actually trying to tackle this (#1876).
d
Before I go down the fork route, is there a reason that this line isn't reading batch_config from the project yaml? It seems like it should. It's just not clear why overriding the default config in that file would behave differently from supplying it in meltano.yml
u
Oversight 😅. We need to append the batch config schema to the target's own config if the target has that capability. See https://github.com/meltano/sdk/issues/1879 and https://github.com/MeltanoLabs/target-snowflake/issues/99
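For reference, a rough sketch of what "appending the batch config schema" could look like with the SDK's typing helpers; the batch_size property is exactly the piece that isn't in the schema today, so treat the names here as illustrative rather than the final design from #1879:

# Hypothetical sketch only: merge a batch_config schema into a target's
# config_jsonschema when the target declares the BATCH capability.
from singer_sdk import typing as th

BATCH_CONFIG = th.PropertiesList(
    th.Property(
        "batch_config",
        th.ObjectType(
            th.Property(
                "encoding",
                th.ObjectType(
                    th.Property("format", th.StringType),       # e.g. "jsonl"
                    th.Property("compression", th.StringType),  # e.g. "gzip"
                ),
            ),
            th.Property("batch_size", th.IntegerType),  # the missing knob
        ),
    ),
).to_dict()

# e.g. config_jsonschema["properties"].update(BATCH_CONFIG["properties"])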
a
We are using batch_size_rows successfully with tap-salesforce-transferwise to change the batch size going into Snowflake via the S3 bucket. I think batch_config.batch_size is the size of the batches sent between tap and target, and is only really used by the tap (the target gets a batch and has to process all of its rows)? From the Batch Messages page: "When local storage is used, targets do not require special configuration to process BATCH messages." Please correct me if I'm wrong :)
d
Interesting - where exactly are you setting batch_size_rows? A tap shouldn't be writing anything to a file system, since its only job is to emit state, record, and schema messages to stdout. The target, on the other hand, may have to write to a file system as an intermediate step before loading into a sink. In the case of target-snowflake, I think the default behaviour is to create gzipped JSON Lines files in 10k-record batches, put them into a Snowflake stage, then copy them into the target table. You can actually see this happen in real time on your local machine if you run any tap and load to target-snowflake.
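For anyone curious, that stage-and-copy flow looks roughly like the following standalone sketch using snowflake-connector-python; it illustrates the mechanism, not target-snowflake's actual code, and all names and paths are placeholders:

# Illustration of the stage-then-copy flow; not the target's real implementation.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account id>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
cur = conn.cursor()

# 1. Upload a gzipped JSON Lines batch file to the table's stage.
cur.execute("PUT file:///tmp/users-batch-0001.jsonl.gz @%USERS")

# 2. Copy the staged file into the target table.
cur.execute(
    """
    COPY INTO USERS
    FROM @%USERS
    FILE_FORMAT = (TYPE = JSON COMPRESSION = GZIP)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """
)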
a
This may be specific to target-snowflake (Transferwise variant). The loader section is:
loaders:
- name: target-snowflake
  config:
    account: <account id>
    default_target_schema: <schema>
    file_format: <MELTANO_CSV>
    add_metadata_columns: false
    batch_size_rows: 100000
But this defines the size of the files that are staged to S3 and loaded into Snowflake. Note that the default for this parameter is 100,000 rows rather than 10,000. Plugin definition:
loaders:
- name: target-snowflake
  variant: transferwise
  pip_url: pipelinewise-target-snowflake
The batch_config.batch_size you are referring to is, I believe, related to the BATCH message, which is a new feature still in preview according to https://meltano-sdk--1876.org.readthedocs.build/en/1876/batch.html. The BATCH message allows a tap to write a file (or files) to disk, locally by default, then send a list of file names to the target. For example, if the tap receives compressed CSV by default (as tap-salesforce does) it could dump the file directly to disk, then the target (Snowflake) could send the file directly to the stage and load it, avoiding all the row-by-row overhead of translating CSV to JSON and back (noting that only gzipped JSONL is defined at present). I am considering learning enough Python to build a tap-salesforce and target-snowflake that do this, but it's a journey :)
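For context, a BATCH message as described on that page is roughly shaped like this (shown as a Python dict; the stream name and file paths are placeholders):

# Approximate shape of a Singer SDK BATCH message; values are placeholders.
batch_message = {
    "type": "BATCH",
    "stream": "users",
    "encoding": {"format": "jsonl", "compression": "gzip"},
    "manifest": [
        "file:///tmp/users-batch-0001.jsonl.gz",
        "file:///tmp/users-batch-0002.jsonl.gz",
    ],
}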
d
Yeah, I don't want to use the pipelinewise variant of target-snowflake because it's not built with the modern version of the singer_sdk.
a
One comment: we got much better throughput using the transferwise variant - I can't remember the details, but it was significant. Note also that loading into Snowflake is known to be most efficient with CSV rather than Parquet or ORC; I assume the same would apply to JSON. https://community.snowflake.com/s/article/How-to-Load-Terabytes-Into-Snowflake-Speeds-Feeds-and-Techniques