# getting-started
k
Hey y'all! I have a data stream that has PDF files attached to each record, and I'm wondering if it's wise to try to use Meltano to replicate those or if I should take a different approach (like a custom Airflow DAG). Specifically, I need to download the attachments from a REST API (which we're also grabbing the records from), and would like to store them in Azure Blob Storage. Has anyone tackled a use case like this?
I can imagine writing a custom extractor that yields records like
```json
{
  "filename": "string",
  "content": "base64-encoded data"
}
```
but I worry about how the pipeline would handle arbitrarily large field values like that. The files could be on the order of megabytes, and base64 encoding adds roughly a third on top of that.
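For concreteness, here's roughly the tap I'm picturing with the singer-sdk (the listing endpoint, URL, and field names are placeholders I made up):
```python
import base64
from typing import Iterable, Optional

import requests
from singer_sdk import Stream, Tap
from singer_sdk import typing as th  # JSON Schema helper classes


class AttachmentsStream(Stream):
    """Yields one record per PDF attachment."""

    name = "attachments"
    primary_keys = ["filename"]
    schema = th.PropertiesList(
        th.Property("filename", th.StringType),
        th.Property("content", th.StringType),  # base64-encoded file body
    ).to_dict()

    def get_records(self, context: Optional[dict]) -> Iterable[dict]:
        # Hypothetical endpoint that lists attachments with download URLs.
        listing = requests.get("https://api.example.com/attachments").json()
        for item in listing:
            pdf_bytes = requests.get(item["download_url"]).content
            yield {
                "filename": item["filename"],
                "content": base64.b64encode(pdf_bytes).decode("ascii"),
            }


class TapAttachments(Tap):
    name = "tap-attachments"

    def discover_streams(self) -> list[Stream]:
        return [AttachmentsStream(tap=self)]
```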
e
Meltano could handle the message size by adjusting the buffer size, but the
> store them in Azure Blob Storage
part would be a bit hard to fit into the Singer EL paradigm with the existing object storage targets, so you'd probably need a custom target just for that
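Something like this is the bare minimum that custom target would need (the container name and connection-string env var are just assumptions, and a real target would also handle SCHEMA and STATE messages):
```python
import base64
import json
import os
import sys

from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob


def main() -> None:
    # Assumed config: a connection string in the environment and a
    # pre-created container named "pdf-attachments".
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    container = service.get_container_client("pdf-attachments")

    # Singer targets read newline-delimited JSON messages on stdin.
    for line in sys.stdin:
        msg = json.loads(line)
        if msg.get("type") != "RECORD":
            continue  # this sketch ignores SCHEMA and STATE messages
        record = msg["record"]
        container.upload_blob(
            name=record["filename"],
            data=base64.b64decode(record["content"]),
            overwrite=True,
        )


if __name__ == "__main__":
    main()
```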
k
Hmm, and if every message needs to be on a single line, I suppose that limits how I could yield the file content in chunks. It'd need to be split into separate records that the target would have to stitch back together. Or am I misunderstanding the Singer spec?
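If that's right, I guess the reassembly on the target side would look something like this (the chunk_index/is_last fields are hypothetical, and the upload is stubbed out):
```python
import base64
import json
import sys
from collections import defaultdict

# Hypothetical chunked protocol: the tap emits one RECORD per chunk, in order,
# with "filename", "chunk_index", "is_last", and base64 "content" fields
# (chunk_index could be used to validate ordering).
buffers: dict[str, list[bytes]] = defaultdict(list)

for line in sys.stdin:
    msg = json.loads(line)
    if msg.get("type") != "RECORD":
        continue
    rec = msg["record"]
    buffers[rec["filename"]].append(base64.b64decode(rec["content"]))
    if rec["is_last"]:
        blob = b"".join(buffers.pop(rec["filename"]))
        # ...upload `blob` to Azure Blob Storage here...
```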