# singer-taps
d
I'm trying to figure out how batching works. From what I see in the logs, there are messages of the following formats:
• singer_sdk.metrics with a record count of 92323 from the tap
• draining sink because the target sink size of 100000 is full. Is this due to the batch size on the tap or the target, and what is it actually doing?
• singer_sdk.metrics with a record count of 83975 from the target
The thing is that none of these numbers match up to the `_SDC_BATCHED_AT` values in the target db. How are these numbers related? It's rather confusing that they keep changing. Also, the INCREMENTAL_STATE doesn't get updated after each batch. Maybe `_SDC_BATCHED_AT` refers to when the state gets updated? But how is that being determined? I see various different values of `_SDC_RECEIVED_AT` for the same batch time, so I'm guessing that inserts are happening throughout.
If I increase the batch size, it ends up draining due to some 5 minute policy instead, with log messages on records extracted every minute. From what I can tell, records are being loaded row by row. So when the 5 minutes is hit, what happens in this draining step that also seems to take up to a minute? Is it adding in the records since the last log message? Nothing further gets logged before it's handed off to the target.
Just to be clear, the timing calculation is a different example than the logs above. In that one, it extracts around 40k records per minute, on a table with around 100 columns. Is that kind of time typical?
Overall, I think my confusion stems from being unable to match up the numbers on the batches. I'm trying to educate myself on the process so I can pick an appropriate batch size and make processing more efficient. I'm seeing times of over 10 hours to load 5 million records, but if I do the math on each batch of log messages, it should really be taking 3-4 hours. So it makes me question whether there's some overlap in processing or some other inefficiency.
👀 1
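A rough sanity check on the throughput question above is just total records divided by the observed extraction rate. Here's a minimal sketch; note that the 40k-per-minute and 5 million figures come from different examples in the message, so this only illustrates the calculation, not the 3-4 hour estimate:

```python
# Back-of-the-envelope throughput check; all numbers are illustrative.
total_records = 5_000_000       # size of the load mentioned above
records_per_minute = 40_000     # extraction rate reported by the metric logs

expected_minutes = total_records / records_per_minute
print(f"expected wall time: ~{expected_minutes:.0f} minutes (~{expected_minutes / 60:.1f} hours)")

# A large gap between an estimate like this and the observed run time is
# exactly the mismatch the message above is asking about.
```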
l
what tap and target are you using? I think the batch size depends on the target, at least for the targets I've been using. Also, in my experience the time to load data depends a lot on how fast the source database is. I would also check whether you get the same number of records in the target database as in the source database and use that to confirm it works as expected, rather than reading too much into the logs.
e
Ok, so item by item 😅
How are these numbers related?
Loosely. It might help to imagine the record count metrics in a time series as point values for `records / minute` at that point in time. Only in aggregate would it make sense to see them as a total count of records for the entire run. Does that make sense? `_SDC_BATCHED_AT` is the timestamp when a record batch is committed from memory to the target system.
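To make that concrete, here's a toy sketch (not the Singer SDK's actual implementation; the class and flush logic are made up for illustration) of why every row flushed in the same drain can share one `_SDC_BATCHED_AT` while `_SDC_RECEIVED_AT` varies row by row:

```python
from datetime import datetime, timezone

# Toy model of a target-side sink buffer (illustrative only, not SDK code).
# - _sdc_received_at is stamped per record, as each record arrives
# - _sdc_batched_at is stamped once per drain, so every row flushed together
#   shares the same value; a real sink may also drain on a timer, which is
#   the "5 minute" behaviour mentioned earlier in the thread
class ToySink:
    def __init__(self, max_size: int = 100_000):
        self.max_size = max_size
        self.buffer: list[dict] = []
        self.flushed: list[dict] = []

    def process_record(self, record: dict) -> None:
        record["_sdc_received_at"] = datetime.now(timezone.utc)  # varies per row
        self.buffer.append(record)
        if len(self.buffer) >= self.max_size:
            self.drain()

    def drain(self) -> None:
        batched_at = datetime.now(timezone.utc)  # one value for the whole flush
        for record in self.buffer:
            record["_sdc_batched_at"] = batched_at
        self.flushed.extend(self.buffer)
        self.buffer = []
```

In this picture the record-count metric lines are just periodic counters emitted while records flow through, which is why they don't line up one-to-one with the batch timestamps.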
d
This is what I have set
"TAP_MSSQL_CURSOR_ARRAY_SIZE",
        "TARGET_SNOWFLAKE_BATCH_CONFIG_BATCH_SIZE",
        "TARGET_SNOWFLAKE_BATCH_SIZE_ROWS"
The final result seems correct, but intermediate numbers are just confusing
Only in aggregate would it make sense to see them as a total count of records for the entire run.
Does that mean the total sum of records across all the source logs and all the target logs should add up to the same number? I guess there's some offset being tracked and they're emitted independently then?
`_SDC_BATCHED_AT` is the timestamp when a record batch is committed from memory to the target system.
What would the purpose of `_SDC_RECEIVED_AT` be then? That sounds more like the actual time the record makes it into the target system. Intuitively, "batched at" seems to imply the same value for everything in the same batch, but I see more rows than the max batch size sharing the same value, which adds to the confusion.
l
ah okay, I unfortunately don't have experience with those taps and targets so can't comment on any specifics there. But good to hear the final result seems to work for you.
👍 1