# singer-target-development
v
SDK SQL - Right now we don't seem to support a batch_size config option out of the box, like https://hub.meltano.com/loaders/target-postgres#batch_size_rows-setting . My use case here is pretty silly as I have a specific record failing in a list of ~30 records and I just want to tweak the batch size to 1 so I can find the record that's failing easily. I'll hard code something in for my use case to find this, but I'd imagine this debug option is useful to others? Are we trying to let the tap dictate this instead? Or is this something we should get into the sdk?
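The debugging idea described here (shrinking the batch size to 1 to isolate the bad record) can be sketched as a plain loop. This is a hypothetical illustration, not SDK code; `find_failing_record`, `load_batch`, and the record list are all invented stand-ins for whatever bulk-load call the target actually uses:

```python
def find_failing_record(records, load_batch):
    """Load records one at a time to pinpoint the first one that fails.

    `load_batch` is a stand-in for the target's bulk-load call.
    Returns the index of the first failing record, or None if all succeed.
    """
    for i, record in enumerate(records):
        try:
            load_batch([record])  # effectively a batch size of 1
        except Exception:
            return i
    return None


# Usage: a fake loader that rejects one specific record in a list of ~30.
def fake_loader(batch):
    for rec in batch:
        if rec.get("id") == 17:
            raise ValueError("bad record")


records = [{"id": n} for n in range(30)]
bad_index = find_failing_record(records, fake_loader)
```

With a configurable batch size this bisection wouldn't need any hardcoding in the target itself.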
a
Thanks for raising, @visch! While this use case is admittedly not in the mainstream for batch size, and perhaps a logging-based solution would be more ideal long-term, I do see advantages to letting batch size be overridden by the user and/or by the developer. To clarify, do you mean when the target uses a SQLSink.process_batch(), or when using BATCH messaging? Regardless of which you are referring to, the feature makes sense. Only how it is implemented/configured would differ between these implementations.
v
I mean when the target uses a SQLSink.process_batch() 😄
a
Ok, cool. For this, what do you think of an approach that (1) we make it easier for devs to override this and (2) we let target developers add a config option of their own when they want to let users override this?
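As a minimal sketch of that two-part approach, using plain classes rather than real singer_sdk imports (names like `max_batch_size`, `batch_size_rows`, and `batch_size_limit` are invented here for illustration):

```python
class Sink:
    # (1) A class-level default that target developers can easily override.
    max_batch_size = 10000

    def __init__(self, config):
        self.config = config
        self._pending = []

    @property
    def batch_size_limit(self):
        # (2) An optional user-facing config key, when the developer
        # chooses to expose one, takes precedence over the default.
        return self.config.get("batch_size_rows", self.max_batch_size)

    def is_full(self):
        return len(self._pending) >= self.batch_size_limit


class MySink(Sink):
    max_batch_size = 500  # developer override


sink = MySink({"batch_size_rows": 1})  # user config wins
default_sink = MySink({})              # falls back to developer default
```

The point of the layering: the SDK ships a default, the developer can override it with one attribute, and the user can override both via config where the developer opts in.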
Another approach would be to wrap this in
batch_config
, but then we'd either be merging or creating conflicts between the two meanings of "batch" - and I don't know of any obvious solution right now that gives sufficient clarity.
This would make a really good office hours topic, I think: how, if, and when to merge or rename/split the two meanings of "batch" in the target SDK. Wdyt?
v
Ahh, I get the confusion now. A better configuration name may be
row_flush_count
? I'm bad at names, hmm, ChatGPT let's go.
```
flush_count_per_row
record_flush_size
single_record_debug_mode
flush_rate_per_row
record_debug_size
per_row_flush_amount
record_flush_count
debug_flush_size
single_record_processing
row_flush_amount
record_flush_rate
row_debug_size
per_record_flush_rate
row_flush_count
record_processing_debug
per_row_debug_size
row_flush_size
record_flush_debug
per_record_processing_debug
row_debug_amount
```
In target_base.py there's a sink.is_full check which is currently hardcoded to 10000 via a variable in core.py; we could set this with our config value instead.
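A conceptual mock of the drain loop being described (this is not the real target_base.py code; `MAX_SIZE_DEFAULT`, `handle_record`, and the class shape are simplified stand-ins): records accumulate in a sink, and a batch is processed whenever the sink reports it is full.

```python
MAX_SIZE_DEFAULT = 10000  # the hardcoded threshold being discussed


class Sink:
    def __init__(self, max_size=MAX_SIZE_DEFAULT):
        self.max_size = max_size
        self.records = []
        self.batches_processed = []

    @property
    def is_full(self):
        return len(self.records) >= self.max_size

    def process_batch(self):
        # Stand-in for SQLSink.process_batch(): record what was flushed.
        self.batches_processed.append(list(self.records))
        self.records.clear()


def handle_record(sink, record):
    sink.records.append(record)
    if sink.is_full:  # the check in question
        sink.process_batch()


# With max_size=1, every record is flushed alone -- the debugging use case.
sink = Sink(max_size=1)
for n in range(3):
    handle_record(sink, {"id": n})
```

If `max_size` were fed from a config value rather than the constant, the debugging scenario above would need no code changes at all.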
a
Naming is hard 🤓
But yes, an SDK update to give developers control over "is_full" by overriding the integer threshold would be a nice first step. (As of today, you'd have to override "is_full()" completely.)
That doesn't solve exposing it to the users generically though; that'd still be a "per tap" implementation matter
v
If we make it a config variable, I think it solves it for us, right? This would be at the target level.
a
Not all targets use batch or bulk processing...
So, yes, but some targets wouldn't need/use it, which is why I'm not sure yet how we'd add generically.
Also, the distinction of whether batch/bulk processing is used happens at the Sink level, once removed from the Target class...
I think there may be a solution... I'm just not sure what it is 🙃
v
Got it, I see what you're saying now. It'd be specific to Sink batching being implemented (not the Singer-message kind of batch 😛). Unifying records would be epic for "backup" purposes, like we chatted about in one of the discussions about batch. Not sure if it's worth the headache; my brain goes to a Kafka-style system where each Singer message is sequentially assigned a number that you can replay in whatever way you want. When you're in "singer-batch" mode it'd skip the overhead required 🫣
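A speculative sketch of that Kafka-style replay idea: each incoming Singer message gets a sequential offset, so any range can be replayed later. Every name here (`MessageLog`, `append`, `replay`) is invented purely for illustration; nothing like this exists in the SDK.

```python
class MessageLog:
    """Assign a monotonically increasing offset to each message."""

    def __init__(self):
        self._log = []

    def append(self, message):
        offset = len(self._log)
        self._log.append(message)
        return offset

    def replay(self, start, end=None):
        # Return messages in [start, end) in their original order.
        return self._log[start:end]


log = MessageLog()
for n in range(5):
    log.append({"type": "RECORD", "record": {"id": n}})

replayed = log.replay(2, 4)
```

In "singer-batch" mode, as suggested above, a target could bypass this bookkeeping entirely and take the batch files as-is.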