# singer-target-development
v
SDK SQL - Right now we don't seem to support a batch_size config option out of the box, like https://hub.meltano.com/loaders/target-postgres#batch_size_rows-setting . My use case here is pretty silly as I have a specific record failing in a list of ~30 records and I just want to tweak the batch size to 1 so I can find the record that's failing easily. I'll hard code something in for my use case to find this, but I'd imagine this debug option is useful to others? Are we trying to let the tap dictate this instead? Or is this something we should get into the sdk?
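The debugging idea described here (shrinking the batch size to 1 to isolate the bad record) can be sketched as a plain loop. This is a hypothetical illustration, not SDK code; `find_failing_record`, `load_batch`, and the record list are all invented stand-ins for whatever bulk-load call the target actually uses:

```python
def find_failing_record(records, load_batch):
    """Load records one at a time to pinpoint the first one that fails.

    `load_batch` is a stand-in for the target's bulk-load call.
    Returns the index of the first failing record, or None if all succeed.
    """
    for i, record in enumerate(records):
        try:
            load_batch([record])  # effectively a batch size of 1
        except Exception:
            return i
    return None


# Usage: a fake loader that rejects one specific record in a list of ~30.
def fake_loader(batch):
    for rec in batch:
        if rec.get("id") == 17:
            raise ValueError("bad record")


records = [{"id": n} for n in range(30)]
bad_index = find_failing_record(records, fake_loader)
```

With a configurable batch size this bisection wouldn't need any hardcoding in the target itself.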
a
Thanks for raising, @visch! While this use case is admittedly not in the mainstream for batch size, and perhaps a logging-based solution would be more ideal long-term, I do see advantages to letting batch size be overridden by the user and/or by the developer. To clarify, do you mean when the target uses a SQLSink.process_batch(), or when using BATCH messaging? Regardless of which you are referring to, the feature makes sense. Only how it is implemented/configured would differ between these implementations.
v
I mean when the target uses a SQLSink.process_batch() 😄
a
Ok, cool. For this, what do you think of an approach that (1) we make it easier for devs to override this and (2) we let target developers add a config option of their own when they want to let users override this?
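As a minimal sketch of that two-part approach, using plain classes rather than real singer_sdk imports (names like `max_batch_size`, `batch_size_rows`, and `batch_size_limit` are invented here for illustration):

```python
class Sink:
    # (1) A class-level default that target developers can easily override.
    max_batch_size = 10000

    def __init__(self, config):
        self.config = config
        self._pending = []

    @property
    def batch_size_limit(self):
        # (2) An optional user-facing config key, when the developer
        # chooses to expose one, takes precedence over the default.
        return self.config.get("batch_size_rows", self.max_batch_size)

    def is_full(self):
        return len(self._pending) >= self.batch_size_limit


class MySink(Sink):
    max_batch_size = 500  # developer override


sink = MySink({"batch_size_rows": 1})  # user config wins
default_sink = MySink({})              # falls back to developer default
```

The point of the layering: the SDK ships a default, the developer can override it with one attribute, and the user can override both via config where the developer opts in.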
Another approach would be to wrap this in
batch_config
, but then we'd either be merging or creating conflicts between the two meanings of "batch" - and I don't know of any obvious solution right now that gives sufficient clarity.
This would make a really good office hours topic, I think: how, if, and when to merge or rename/split the two meanings of "batch" in the target SDK. Wdyt?
v
Ahh, I get the confusion now. A better configuration name may be
row_flush_count
? I'm bad at names, hmm, ChatGPT let's go.
```
flush_count_per_row
record_flush_size
single_record_debug_mode
flush_rate_per_row
record_debug_size
per_row_flush_amount
record_flush_count
debug_flush_size
single_record_processing
row_flush_amount
record_flush_rate
row_debug_size
per_record_flush_rate
row_flush_count
record_processing_debug
per_row_debug_size
row_flush_size
record_flush_debug
per_record_processing_debug
row_debug_amount
```
In target_base.py there's a sink.is_full check which is currently hardcoded to 10000 via a variable in core.py; we could set this with our config value instead.
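A conceptual mock of the drain loop being described (this is not the real target_base.py code; `MAX_SIZE_DEFAULT`, `handle_record`, and the class shape are simplified stand-ins): records accumulate in a sink, and a batch is processed whenever the sink reports it is full.

```python
MAX_SIZE_DEFAULT = 10000  # the hardcoded threshold being discussed


class Sink:
    def __init__(self, max_size=MAX_SIZE_DEFAULT):
        self.max_size = max_size
        self.records = []
        self.batches_processed = []

    @property
    def is_full(self):
        return len(self.records) >= self.max_size

    def process_batch(self):
        # Stand-in for SQLSink.process_batch(): record what was flushed.
        self.batches_processed.append(list(self.records))
        self.records.clear()


def handle_record(sink, record):
    sink.records.append(record)
    if sink.is_full:  # the check in question
        sink.process_batch()


# With max_size=1, every record is flushed alone -- the debugging use case.
sink = Sink(max_size=1)
for n in range(3):
    handle_record(sink, {"id": n})
```

If `max_size` were fed from a config value rather than the constant, the debugging scenario above would need no code changes at all.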
a
Naming is hard 🤓
But yes, an SDK update to give developers control over "is_full" by overriding the integer threshold would be a nice first step. (As of today, you'd have to override "is_full()" completely.)
That doesn't solve exposing it to the users generically though; that'd still be a "per tap" implementation matter
v
If we make it a config variable, I think it solves it for us, right? This would be at the target level.
a
Not all targets use batch or bulk processing...
So, yes, but some targets wouldn't need/use it, which is why I'm not sure yet how we'd add generically.
Also, the distinction of whether batch/bulk processing is used happens at the Sink level, once removed from the Target class...
I think there may be a solution... I'm just not sure what it is 🙃
v
Got it, I see what you're saying now. It'd be specific to Sink batching being implemented (not the Singer-message kind of batch 😛). Unifying records would be epic for "backup" purposes, like we chatted about in one of the discussions about batch. Not sure if it's worth the headache; my brain goes to a Kafka-style system where each Singer message is sequentially assigned a number that you can replay in whatever way you want. When you're in "singer-batch" mode it'd skip the overhead required 🫣
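A speculative sketch of that Kafka-style replay idea: each incoming Singer message gets a sequential offset, so any range can be replayed later. Every name here (`MessageLog`, `append`, `replay`) is invented purely for illustration; nothing like this exists in the SDK.

```python
class MessageLog:
    """Assign a monotonically increasing offset to each message."""

    def __init__(self):
        self._log = []

    def append(self, message):
        offset = len(self._log)
        self._log.append(message)
        return offset

    def replay(self, start, end=None):
        # Return messages in [start, end) in their original order.
        return self._log[start:end]


log = MessageLog()
for n in range(5):
    log.append({"type": "RECORD", "record": {"id": n}})

replayed = log.replay(2, 4)
```

In "singer-batch" mode, as suggested above, a target could bypass this bookkeeping entirely and take the batch files as-is.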