Has anyone here has issues using `tap-spreadsheets...
# troubleshooting
i
Has anyone here had issues using `tap-spreadsheets-anywhere`? We have a case where spreadsheet data is provided to us in a non-standard way. Basically, before the actual data begins there are a number of rows that comprise titles, descriptions and other preamble. There are also blank lines. It seems `tap-spreadsheets-anywhere` chokes on blank lines and fails to process any further lines. Any experience or suggestions with this issue?
a
Hi @ian_lewis I've found this kind of junk in the headers (multiple rows of headers, blank lines, comments, etc) to be pretty common too! Here is an example of how to use 'skip_initial' to skip those and supply your own header: https://github.com/Matatika/matatika-ce/blob/main/plugins/extractors/tap-govuk-weekly-road-fuel-prices--matatika.yml NOTE - this is on our fork, but provided you use the latest default variant with this change you'll be ok: https://github.com/ets/tap-spreadsheets-anywhere/commit/379173323ee14f48bc408fd15a35fe581a51e317
i
Thank you @aaron_phethean we will take a look. Very much appreciated! It would be a huge help if the latest version of `tap-spreadsheets-anywhere` had a release, it would save using specific commits 🤷
a
Agreed! Supported taps, with an automated upgrade bump is the dream. I think we are edging closer to this (the 'royal we' as in the whole singer / meltano ecosystem) - but it's hard to say whether this is viable as a business proposition. Would anyone pay a small fee for a tap with 12-months support, regular patch releases, and an upgrade option in meltano?
c
I've been looking at this a bit deeper for @ian_lewis. The original `skip_initial` changes (https://github.com/ets/tap-spreadsheets-anywhere/pull/37) supported skipping over rows with data in them. To skip over blank rows, it looks like the skip needs to be pushed into the Excel `generator_wrapper` (https://github.com/ets/tap-spreadsheets-anywhere/blob/main/tap_spreadsheets_anywhere/excel_handler.py#L9-L32) before the `header_row` is populated. This avoids the `IndexError` raised when this function parses a blank row. I'm thinking of cleaning up my experiment and raising an issue + PR. Although I will check out your links @aaron_phethean to see if there is a cleaner way.
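For anyone following along, here is a minimal sketch of that idea in plain Python. It is not the tap's actual code: `rows_to_records` and `is_blank` are hypothetical names standing in for the Excel `generator_wrapper`, and rows are assumed to arrive as plain tuples of cell values. The point is only that blank rows get dropped before the header is captured, so nothing ever indexes into an empty row.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Sequence, List


def is_blank(row: Sequence[Any]) -> bool:
    """Return True when the row is empty or every cell is None/whitespace."""
    return all(cell is None or str(cell).strip() == "" for cell in row)


def rows_to_records(
    rows: Iterable[Sequence[Any]], skip_initial: int = 0
) -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the tap's generator_wrapper.

    Skips the first `skip_initial` rows (blank or not), then ignores blank
    rows, treats the first non-blank row as the header, and yields one dict
    per remaining non-blank row.
    """
    header: Optional[List[str]] = None
    for index, row in enumerate(rows):
        if index < skip_initial:
            continue  # preamble rows the user asked to skip
        if is_blank(row):
            continue  # blank rows are dropped before AND after the header
        if header is None:
            header = [str(cell).strip() for cell in row]
            continue
        yield dict(zip(header, row))


if __name__ == "__main__":
    sample = [
        ("Weekly prices",),          # title row, skipped via skip_initial
        (),                          # completely empty row
        (None, " "),                 # row of empty cells
        ("Date", "Price"),           # real header
        ("2022-01-01", 1.5),
        (None, None),                # blank row in the data
        ("2022-01-08", 1.6),
    ]
    for record in rows_to_records(sample, skip_initial=1):
        print(record)
```

Note that in this sketch `skip_initial` counts every row, blank or not; whether the real tap should count blank rows towards `skip_initial` is a design choice this sketch does not settle.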
a
Nice one @craig_astill - I think perhaps we circumvented that header blank row problem by supplying the field names. Hope that helps
"field_names":["Date","ULSP_per_litre","ULSD_per_litre","ULSP_duty","ULSD_duty","ULSP_vat_pc","ULSD_vat_pc"],
c
`field_names` didn't help. The tap blows up during sampling of the file in the discovery phase, instead of later on when `field_names` are used.
a
ah, worth a shot!
c
Busy day, but finally raised: https://github.com/ets/tap-spreadsheets-anywhere/issues/52. Will try to knock up a test PR for people to look at.
I've been digging into https://github.com/ets/tap-spreadsheets-anywhere/pull/56 to figure out why `zipfile.ZipFile(file_handl)` blows up on an S3 sourced file. Any ideas?
Also saw @Matt Menzenski was helpful, when digging into issues on other slack threads. (Hope you don't mind the ping).
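On the `zipfile.ZipFile` question, one possible cause (not confirmed anywhere in this thread) is that the handle is a stream that cannot seek. `zipfile.ZipFile` has to seek to the end of the file to read the archive's central directory, so a non-seekable handle fails on open. The sketch below reproduces that with a hypothetical `NonSeekableStream` class standing in for a streamed remote file, and shows the workaround of buffering the bytes into `io.BytesIO` first. `open_zip_buffered` and `make_sample_zip` are illustrative names, not part of the tap.

```python
import io
import zipfile


class NonSeekableStream(io.RawIOBase):
    """Hypothetical stand-in for a streamed remote file handle that cannot seek."""

    def __init__(self, data: bytes) -> None:
        self._inner = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, buffer) -> int:
        # Copy as many bytes as fit into the caller's buffer; 0 signals EOF.
        chunk = self._inner.read(len(buffer))
        buffer[: len(chunk)] = chunk
        return len(chunk)


def make_sample_zip() -> bytes:
    """Build a tiny in-memory zip archive to experiment with."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sheet.txt", "hello")
    return buffer.getvalue()


def open_zip_buffered(file_handle) -> zipfile.ZipFile:
    """Read the whole stream into memory so ZipFile can seek within it."""
    return zipfile.ZipFile(io.BytesIO(file_handle.read()))


if __name__ == "__main__":
    data = make_sample_zip()

    try:
        zipfile.ZipFile(NonSeekableStream(data))
    except (zipfile.BadZipFile, OSError, ValueError) as error:
        print(f"direct open failed: {error!r}")

    with open_zip_buffered(NonSeekableStream(data)) as archive:
        print(archive.namelist())
```

The trade-off is that the whole file is held in memory, which may not suit very large spreadsheets; spooling to a temporary file would be an alternative.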
m
I have been added as a maintainer of tap-spreadsheets-anywhere, but I haven’t personally used it for any binary files - only JSONL and CSV
c
Ah, no worries, but thanks for replying.
p
@aaron_phethean: "Would anyone pay a small fee for a tap with 12-months support, regular patch releases, and an upgrade option in meltano?"
We’d definitely be interested.