# getting-started
k
Hey y'all! I have a data stream that has PDF files attached to each record, and I'm wondering if it's wise to try to use Meltano to replicate those or if I should take a different approach (like a custom Airflow DAG). Specifically, I need to download the attachments from a REST API (which we're also grabbing the records from), and would like to store them in Azure Blob Storage. Has anyone tackled a use case like this?
I can imagine writing a custom extractor that yields records like
```json
{
  "filename": "string",
  "content": "base64-encoded data"
}
```
but I worry about how the pipeline would handle arbitrarily large field values like that. The files could be on the order of megabytes, and base64 encoding adds roughly a third on top of that.
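For concreteness, here's roughly the tap I'm picturing with the singer-sdk (the listing endpoint, URL, and field names are placeholders I made up):
```python
import base64
from typing import Iterable, Optional

import requests
from singer_sdk import Stream, Tap
from singer_sdk import typing as th  # JSON Schema helper classes


class AttachmentsStream(Stream):
    """Yields one record per PDF attachment."""

    name = "attachments"
    primary_keys = ["filename"]
    schema = th.PropertiesList(
        th.Property("filename", th.StringType),
        th.Property("content", th.StringType),  # base64-encoded file body
    ).to_dict()

    def get_records(self, context: Optional[dict]) -> Iterable[dict]:
        # Hypothetical endpoint that lists attachments with download URLs.
        listing = requests.get("https://api.example.com/attachments").json()
        for item in listing:
            pdf_bytes = requests.get(item["download_url"]).content
            yield {
                "filename": item["filename"],
                "content": base64.b64encode(pdf_bytes).decode("ascii"),
            }


class TapAttachments(Tap):
    name = "tap-attachments"

    def discover_streams(self) -> list[Stream]:
        return [AttachmentsStream(tap=self)]
```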
e
Meltano could handle the message size by adjusting the buffer size, but the
> store them in Azure Blob Storage
part would be a bit hard to fit into the Singer EL paradigm with the existing object storage targets, so you'd probably need a custom target just for that
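Something like this is the bare minimum that custom target would need (the container name and connection-string env var are just assumptions, and a real target would also handle SCHEMA and STATE messages):
```python
import base64
import json
import os
import sys

from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob


def main() -> None:
    # Assumed config: a connection string in the environment and a
    # pre-created container named "pdf-attachments".
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    container = service.get_container_client("pdf-attachments")

    # Singer targets read newline-delimited JSON messages on stdin.
    for line in sys.stdin:
        msg = json.loads(line)
        if msg.get("type") != "RECORD":
            continue  # this sketch ignores SCHEMA and STATE messages
        record = msg["record"]
        container.upload_blob(
            name=record["filename"],
            data=base64.b64decode(record["content"]),
            overwrite=True,
        )


if __name__ == "__main__":
    main()
```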
k
Hmm, and if every message needs to be on a single line, I suppose that limits how I could yield the file content in chunks. It'd need to be split into separate records that the target would have to stitch back together. Or am I misunderstanding the Singer spec?
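If that's right, I guess the reassembly on the target side would look something like this (the chunk_index/is_last fields are hypothetical, and the upload is stubbed out):
```python
import base64
import json
import sys
from collections import defaultdict

# Hypothetical chunked protocol: the tap emits one RECORD per chunk, in order,
# with "filename", "chunk_index", "is_last", and base64 "content" fields
# (chunk_index could be used to validate ordering).
buffers: dict[str, list[bytes]] = defaultdict(list)

for line in sys.stdin:
    msg = json.loads(line)
    if msg.get("type") != "RECORD":
        continue
    rec = msg["record"]
    buffers[rec["filename"]].append(base64.b64decode(rec["content"]))
    if rec["is_last"]:
        blob = b"".join(buffers.pop(rec["filename"]))
        # ...upload `blob` to Azure Blob Storage here...
```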