ryan_whitten  01/05/2022, 10:05 PM
I could set `max_size = sys.maxsize`, but I'm trying to avoid loading all records into memory.
• Is there any way to get the batch records to arrive as a generator instead of just using the list in `context["records"]`? Then I could process a single batch, writing to the temp file, and finally do my upload, all within `process_batch`.
• If not, are there any other mechanisms to do something custom after the input stream is exhausted? That would let me process smaller batches and append to the same file each time, finally doing the upload at the end. The only obvious option seems to be overriding `_process_endofpipe`, but I'd rather avoid overriding a private method.
Thanks in advance!
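For context, the default pattern in question looks roughly like the sketch below, assuming the Meltano singer-sdk `BatchSink` (the class name and placeholder body are made up): the SDK hands `process_batch` a fully materialized list in `context["records"]`, so a very large `max_size` means a very large in-memory list.

```python
from singer_sdk.sinks import BatchSink


class InMemorySink(BatchSink):
    """Default shape: each batch is buffered as a list in context["records"]."""

    max_size = 10_000  # pushing this toward sys.maxsize means buffering everything in memory

    def process_batch(self, context: dict) -> None:
        # The batch arrives as a fully materialized list, not a generator.
        for record in context["records"]:
            ...  # e.g. write to a temp file, then upload once at the end
```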
pat_nadolny  01/05/2022, 10:30 PM
I override `start_batch` to create a temp file and add it to the context for later, then override `process_record` to append each record in the batch to that temp file (which I can find by pulling its path from the context), then override `process_batch` to just move the file to its destination once the `max_size` is reached. It just appends to a file on disk rather than a list in memory. Does that sound like it would work for you?
ryan_whitten  01/06/2022, 1:40 AM
pat_nadolny  01/06/2022, 2:23 AM
ryan_whitten  01/06/2022, 2:02 PM
pat_nadolny  01/06/2022, 3:56 PM
ryan_whitten  01/06/2022, 7:56 PM
fred_reimer  01/06/2022, 8:44 PM
ryan_whitten  01/11/2022, 5:41 PM
I ended up opening a stream directly to S3 in `process_record` (with a try/yield/except to always close the file in case of an error) and writing the records to it. Not having a temp file means it's not constrained by disk space, and the only memory constraint is how many records I'm internally buffering before each write. Pretty cool way to send a huge amount of data and sink to S3!
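A rough sketch of that streaming approach. The thread doesn't say which library was used; this version assumes `smart_open` (whose `open()` streams S3 multipart uploads), and the `bucket` config key, the buffer size, and the lazy open inside `process_record` are illustrative guesses:

```python
import json
from uuid import uuid4

from singer_sdk.sinks import BatchSink
from smart_open import open as smart_open  # streams S3 multipart uploads


class StreamingS3Sink(BatchSink):
    """Stream records straight to S3: no temp file, bounded memory."""

    max_size = 1_000_000  # records per uploaded object
    flush_every = 5_000   # records buffered in memory between writes

    def process_record(self, record: dict, context: dict) -> None:
        try:
            if "stream" not in context:
                # Lazily open one write stream per batch; smart_open handles the multipart upload.
                key = f"{self.stream_name}/{uuid4().hex}.jsonl"
                context["stream"] = smart_open(f"s3://{self.config['bucket']}/{key}", "w")
                context["buffer"] = []
            context["buffer"].append(json.dumps(record))
            if len(context["buffer"]) >= self.flush_every:
                context["stream"].write("\n".join(context["buffer"]) + "\n")
                context["buffer"].clear()
        except Exception:
            # Always close the stream so a failed batch doesn't leave the upload dangling.
            stream = context.get("stream")
            if stream is not None:
                stream.close()
            raise

    def process_batch(self, context: dict) -> None:
        # Flush any remaining buffered records and finalize the upload.
        if context.get("buffer"):
            context["stream"].write("\n".join(context["buffer"]) + "\n")
        if "stream" in context:
            context["stream"].close()
```

The try/yield/except mentioned above suggests a generator-based context manager around the stream; the plain try/except here gives the same guarantee that the stream is closed, and the multipart upload finished or abandoned, if a record fails.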