# singer-taps
r
I'm writing a tap for reading OpenOffice Calc (ODS) files from a versioned S3 bucket. I know there's tap-spreadsheets-anywhere, but it supports neither ODS nor versioned buckets, it's architecturally incompatible with pyexcel, and it's unmaintained. The ODS files have multiple sheets, but the same schema across files, so all files produce the same streams and every stream spans multiple files. Naturally the tap wouldn't produce records stream by stream, but file by file, with streams interleaved. That's not how the singer-sdk wants me to do things. Is this mode supported by the SDK (I'm assuming Meltano will be fine with it)? Is there an example tap I could look at?
a
How large/numerous are the files in question here? Could you feasibly iterate over the whole file list when you run each stream? So the 'sales' stream iterates over all the files, emitting all/any data in a tab called 'sales'? Then the same process happens for the 'refunds' stream etc?
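A rough sketch of that stream-major pattern (plain Python; `list_files`, `read_sheet`, and the file contents are hypothetical stand-ins, not APIs from any real tap):

```python
from typing import Iterator

# Hypothetical in-memory stand-in for the S3 listing and ODS parsing.
FILES = {
    "2024-01.ods": {"sales": [{"amount": 10}], "refunds": [{"amount": 2}]},
    "2024-02.ods": {"sales": [{"amount": 20}], "refunds": [{"amount": 1}]},
}

def list_files() -> Iterator[str]:
    yield from FILES

def read_sheet(path: str, sheet: str) -> list[dict]:
    return FILES[path].get(sheet, [])

def run_stream(stream_name: str) -> Iterator[dict]:
    # Each stream walks the whole file list and emits only its own tab,
    # so every file ends up being read once per stream.
    for path in list_files():
        for row in read_sheet(path, stream_name):
            yield {"_file": path, **row}

sales = list(run_stream("sales"))
```

The trade-off is exactly the one raised below: with N streams, each file is opened N times unless the parsed contents are cached somewhere.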
r
Almost 2000 files right now. My problem is that the existing tap either leaks memory or hangs on some bad regex (depending on the ODS plugin), so I'm not sure whether loading 2000 files is a large operation or not. Not at the moment. Also, under normal circumstances only the latest version or two is loaded, except for the initial load.
a
What is the 'existing tap' here? One you have built or something from the hub?
r
tap-spreadsheets-anywhere, from the hub, the recommended one
I was hoping to get the most efficient operation by:

```
for version in versions(path):
    for sheet in version:
        for row in sheet:
            write(row)
```
that tap uses the original singer sdk, not meltano's
a
Yep that would be the best way, but I see what you mean, iterating over the files is orthogonal to iterating over the streams. Are the files small enough to read into memory or temporary json files as the first operation before running the streams?
r
tap-spreadsheets-anywhere gets killed by the OOM killer, but it looks like that's because it doesn't release the XML DOM representation of the documents. Reading the spreadsheets as streams and storing a dict[str, list[dict]] might work
a
I would try to whip up a minimal proof of concept for the two variants, see if there are any memory issues etc as you scale.
r
Reading all to memory, and what's the other variant?
would it work to make a stream of files, and each sheet a child stream?
a
1. Read all to memory
2. Read all to a local temporary file store, to reduce calls to your S3 bucket
> would it work to make a stream of files, and each sheet a child stream?
Yes, I think that's possible, but it's easier if you define your children (tabs) statically.
If you really have identically structured files then I would go that way, with a `Files` stream (that you aren't really interested in the data from), and then child streams for each of your tab names.
If your files are small enough / have an in-memory representation, then pass it to the child in the context. What you are trying to avoid is re-reading each file in full just to invoke each of the child streams.
Sorry if I am stating the obvious here!
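A minimal simulation of that parent/child shape in plain Python (no singer-sdk imports; in the actual SDK the parent's `get_child_context` hook would return the per-file context and each tab stream would declare a `parent_stream_type`). The file contents are made up for illustration:

```python
# Parent "files" stream: one record per file; the context carries the
# parsed sheets so child streams never re-read the file from S3.
FILES = {
    "2024-01.ods": {"sales": [{"amount": 10}], "refunds": [{"amount": 2}]},
}

def files_stream():
    for path, sheets in FILES.items():
        record = {"path": path}              # what the Files stream emits
        context = {"path": path, "sheets": sheets}  # handed to children
        yield record, context

def child_stream(tab: str, context: dict):
    # Child stream for one statically defined tab name: emit rows from
    # the in-memory representation passed down in the context.
    for row in context["sheets"].get(tab, []):
        yield {"_file": context["path"], **row}

records = []
for _record, ctx in files_stream():
    for tab in ("sales", "refunds"):
        records.extend(child_stream(tab, ctx))
```

Note the resulting record order is file-major with tabs interleaved per file, which matches the iteration order the original question asked about.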