# singer-taps
r
I'm writing a tap for reading OpenOffice Calc (ODS) files from a versioned S3 bucket. I know there's tap-spreadsheets-anywhere, but it supports neither ODS nor versioned buckets, it's architecturally incompatible with pyexcel, and it's unmaintained. The ODS files have multiple sheets, but the same schema across files, so all files produce the same streams and every stream spans multiple files. Naturally the tap wouldn't produce records stream by stream, but file by file, with streams interleaved. That's not how the singer-sdk wants me to do things. Is this mode supported by the SDK (I'm assuming Meltano will be fine with it)? Is there an example tap I could look at?
a
How large/numerous are the files in question here? Could you feasibly iterate over the whole file list when you run each stream? So the 'sales' stream iterates over all the files, emitting all/any data in a tab called 'sales'? Then the same process happens for the 'refunds' stream etc?
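A rough sketch of that stream-major pattern (plain Python; `list_files`, `read_sheet`, and the file contents are hypothetical stand-ins, not APIs from any real tap):

```python
from typing import Iterator

# Hypothetical in-memory stand-in for the S3 listing and ODS parsing.
FILES = {
    "2024-01.ods": {"sales": [{"amount": 10}], "refunds": [{"amount": 2}]},
    "2024-02.ods": {"sales": [{"amount": 20}], "refunds": [{"amount": 1}]},
}

def list_files() -> Iterator[str]:
    yield from FILES

def read_sheet(path: str, sheet: str) -> list[dict]:
    return FILES[path].get(sheet, [])

def run_stream(stream_name: str) -> Iterator[dict]:
    # Each stream walks the whole file list and emits only its own tab,
    # so every file ends up being read once per stream.
    for path in list_files():
        for row in read_sheet(path, stream_name):
            yield {"_file": path, **row}

sales = list(run_stream("sales"))
```

The trade-off is exactly the one raised below: with N streams, each file is opened N times unless the parsed contents are cached somewhere.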
r
Almost 2000 files right now. My problem is that the existing tap either leaks memory or hangs on some bad regex (depending on the ODS plugin), so I'm not sure whether loading 2000 files is a large operation or not. Not at the moment. Also, under normal circumstances only the latest version or two is loaded, except for the initial load.
a
What is the 'existing tap' here? One you have built or something from the hub?
r
tap-spreadsheets-anywhere, from the hub, the recommended one
I was hoping to get the most efficient operation by:

```
for version in versions(path):
    for sheet in version:
        for row in sheet:
            write(row)
```
that tap uses the original singer sdk, not meltano's
a
Yep that would be the best way, but I see what you mean, iterating over the files is orthogonal to iterating over the streams. Are the files small enough to read into memory or temporary json files as the first operation before running the streams?
r
tap-spreadsheets-anywhere gets killed by the OOM killer, but it looks like that's because it doesn't release the XML DOM representation of the documents. Reading the spreadsheets as streams and storing a dict[str, list[dict]] might work
a
I would try to whip up a minimal proof of concept for the two variants, see if there are any memory issues etc as you scale.
r
Reading all to memory, and what's the other variant?
would it work to make a stream of files, and each sheet a child stream?
a
1. Read all to memory
2. Read all to a local temporary file store, to reduce calls to your S3 bucket
> would it work to make a stream of files, and each sheet a child stream?
Yes, I think that's possible, but it's easier if you define your children (tabs) statically.
If you really have identically structured files then I would go that way, with a `Files` stream (that you aren't really interested in the data from), and then child streams for each of your tab names.
If your files are small enough / have an in-memory representation, then pass it to the child in the context. What you are trying to avoid is re-reading each file in full just to invoke each of the child streams.
Sorry if I am stating the obvious here!
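A minimal simulation of that parent/child shape in plain Python (no singer-sdk imports; in the actual SDK the parent's `get_child_context` hook would return the per-file context and each tab stream would declare a `parent_stream_type`). The file contents are made up for illustration:

```python
# Parent "files" stream: one record per file; the context carries the
# parsed sheets so child streams never re-read the file from S3.
FILES = {
    "2024-01.ods": {"sales": [{"amount": 10}], "refunds": [{"amount": 2}]},
}

def files_stream():
    for path, sheets in FILES.items():
        record = {"path": path}              # what the Files stream emits
        context = {"path": path, "sheets": sheets}  # handed to children
        yield record, context

def child_stream(tab: str, context: dict):
    # Child stream for one statically defined tab name: emit rows from
    # the in-memory representation passed down in the context.
    for row in context["sheets"].get(tab, []):
        yield {"_file": context["path"], **row}

records = []
for _record, ctx in files_stream():
    for tab in ("sales", "refunds"):
        records.extend(child_stream(tab, ctx))
```

Note the resulting record order is file-major with tabs interleaved per file, which matches the iteration order the original question asked about.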