Daniel Luo
09/30/2024, 3:46 PM
_SDC_BATCHED_AT refers to when the state gets updated? But how is that being determined? I see various different values for _SDC_RECEIVED_AT for the same batch time, so I'm guessing that inserts are happening throughout.
If I increase the batch size, it ends up draining due to some 5-minute policy instead, with log messages on records extracted every minute. From what I can tell, records are being loaded row by row. So when the 5-minute mark is hit, what happens in this draining step that also seems to take up to a minute? Is it adding in the records since the last log message? Nothing further gets logged before it hands off to the target. Just to be clear, the timing calculation is a different example than the logs above; in that one, it extracts around 40k records per minute, on a table with around 100 columns. Is that kind of time typical?
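For context on the size-or-age draining being described: a common way to implement that behavior is sketched below. This is only an illustration of the general pattern, not the Singer SDK's or target-snowflake's actual code, and the class and parameter names (`Batcher`, `max_rows`, `max_age_seconds`, `write_to_target`) are invented for the example.

```python
import time


def write_to_target(rows: list[dict]) -> None:
    # Stand-in for the real load step (e.g. a bulk insert into the target).
    print(f"flushing {len(rows)} rows")


class Batcher:
    """Sketch of a size-or-age drain policy (illustrative only)."""

    def __init__(self, max_rows: int = 10_000, max_age_seconds: float = 300.0):
        self.max_rows = max_rows                # e.g. a batch_size_rows setting
        self.max_age_seconds = max_age_seconds  # e.g. a ~5 minute drain policy
        self.rows: list[dict] = []
        self.started_at = time.monotonic()

    def add(self, record: dict) -> None:
        # Buffer each incoming record, then drain once either limit is hit.
        # Real implementations usually also check the age between messages
        # or on a timer rather than only when a record arrives.
        self.rows.append(record)
        if len(self.rows) >= self.max_rows or self._age() >= self.max_age_seconds:
            self.drain()

    def _age(self) -> float:
        return time.monotonic() - self.started_at

    def drain(self) -> None:
        # Everything buffered since the last drain is written to the target in
        # one go, which is why a single drain on a wide table can take a while.
        write_to_target(self.rows)
        self.rows = []
        self.started_at = time.monotonic()
```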
Overall, I think my confusion stems from being unable to match up the numbers on the batches. I'm trying to educate myself on the process to better determine an appropriate batch size and make processing more efficient. I am seeing times of over 10 hours to load 5 million records, but if I do the math on each batch of log messages, it should really be taking 3-4 hours. So it makes me question whether there's some overlap in processing or some other inefficiency.
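The gap being described can be framed as simple arithmetic: sum the time implied by the per-batch log messages and compare it against the wall-clock duration of the run. The figures below are only illustrative placeholders taken loosely from this thread, not measurements.

```python
# Illustrative only: numbers loosely taken from this thread.
records_loaded = 5_000_000
expected_hours = 3.5    # roughly what summing the per-batch log intervals suggests
observed_hours = 10.0   # end-to-end wall-clock time actually seen

throughput = records_loaded / (observed_hours * 3600)
print(f"effective observed throughput: {throughput:,.0f} records/s")
print(f"time unaccounted for by the per-batch math: {observed_hours - expected_hours:.1f} h")
```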
Love Eklund
09/30/2024, 6:21 PM
Edgar Ramírez (Arch.dev)
09/30/2024, 6:23 PM
> How are these numbers related?
Loosely. It might help to imagine the record count metrics in a time series as point values for records / minute at that point. Only in aggregate would it make sense to see them as a total count of records for the entire run. Does that make sense?
_`_SDC_BATCHED_AT`_ is the timestamp when a record batch is committed from memory to the target system.
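To see the "only in aggregate" point in practice, the record_count counter points in the logs can be summed per stream. The sketch below assumes metric lines roughly of the form `METRIC: {"type": "counter", "metric": "record_count", "value": 8066, "tags": {"stream": "..."}}`; the exact log layout depends on the tap/target and logging configuration, so treat the regex and field names as assumptions.

```python
import json
import re
import sys
from collections import defaultdict

# Assumption: metric lines contain a JSON payload after "METRIC: ".
METRIC_RE = re.compile(r"METRIC: (\{.*\})")

totals: dict[str, int] = defaultdict(int)

for line in sys.stdin:
    match = METRIC_RE.search(line)
    if not match:
        continue
    point = json.loads(match.group(1))
    if point.get("metric") == "record_count":
        stream = point.get("tags", {}).get("stream", "<unknown>")
        # Each point is a per-interval count, not a running total, so summing
        # them gives the total records for the run.
        totals[stream] += point["value"]

for stream, total in sorted(totals.items()):
    print(f"{stream}: {total} records")
```

A possible usage, assuming the pipeline's logs are piped through stdin: `meltano run tap-mssql target-snowflake 2>&1 | python sum_record_counts.py` (where the logs actually land depends on your logging config).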
09/30/2024, 6:23 PM"TAP_MSSQL_CURSOR_ARRAY_SIZE",
"TARGET_SNOWFLAKE_BATCH_CONFIG_BATCH_SIZE",
"TARGET_SNOWFLAKE_BATCH_SIZE_ROWS"
The final result seems correct, but the intermediate numbers are just confusing.
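For reference, Meltano generally derives plugin-setting environment variables by prefixing the setting name with the plugin name, upper-casing, and replacing dots and dashes with underscores. The sketch below is a toy illustration of that convention, not Meltano's implementation, and the plugin/setting names are inferred from the variables above.

```python
def setting_env_var(plugin: str, setting: str) -> str:
    """Toy sketch of the plugin-setting -> env var naming convention."""
    return f"{plugin}_{setting}".upper().replace(".", "_").replace("-", "_")

# The three variables from this thread, mapped back to plugin + setting name:
assert setting_env_var("tap-mssql", "cursor_array_size") == "TAP_MSSQL_CURSOR_ARRAY_SIZE"
assert setting_env_var("target-snowflake", "batch_config.batch_size") == "TARGET_SNOWFLAKE_BATCH_CONFIG_BATCH_SIZE"
assert setting_env_var("target-snowflake", "batch_size_rows") == "TARGET_SNOWFLAKE_BATCH_SIZE_ROWS"
```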
Daniel Luo
09/30/2024, 6:28 PM
> Only in aggregate would it make sense to see them as a total count of records for the entire run.
Does that mean that the total sum of records across all the source and target logs should add up to the same number? I guess there's some offset that gets tracked and they run independently then?
> _`_SDC_BATCHED_AT`_ is the timestamp when a record batch is committed from memory to the target system.
What would the purpose of _SDC_RECEIVED_AT be then? That sounds more like the actual time the record makes it into the target system. Intuitively, "batched at" seems to imply it would be the same value for all records in the same batch, but I see more than the max batch size sharing the same value, which adds to the confusion.
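One way to see how the two columns relate in practice is to look, per `_SDC_BATCHED_AT` value, at how many rows share it and how widely their `_SDC_RECEIVED_AT` values are spread. The sketch below assumes the lower-cased `_sdc_*` column names added by target-snowflake; the table name (`my_table`) and all connection details are placeholders.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical connection details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="my_schema",
)

# Per batch: row count and the spread of receive times within that batch.
query = """
    select
        _sdc_batched_at,
        count(*)              as rows_in_batch,
        min(_sdc_received_at) as first_received,
        max(_sdc_received_at) as last_received,
        datediff('second', min(_sdc_received_at), max(_sdc_received_at)) as spread_seconds
    from my_table
    group by _sdc_batched_at
    order by _sdc_batched_at
"""

for row in conn.cursor().execute(query):
    print(row)
```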
Love Eklund
09/30/2024, 6:43 PM