# plugins-general
a
Hi all. Has anyone had to extract data from a zip that contains lots of csv files? Spreadsheets-anywhere looks like it can handle a gzip of a single file, but not a zip of many files. Here's what I'm thinking:
meltano run files-utility tap-spreadsheets-anywhere target-postgres
files-utility: this is the utility I want, which would download the files and decompress them into a local directory structure.
tap-spreadsheets-anywhere: this would work as is with a local file path + pattern.
Anyone have any other clever ideas?
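The decompress step of such a utility could be sketched with the standard library alone. This is only an illustration under assumptions: `extract_csvs` is a hypothetical name, the archive is assumed to be downloaded already, and only members ending in `.csv` are kept.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, dest_dir):
    """Extract every .csv member of zip_path into dest_dir, keeping sub-folders."""
    dest = Path(dest_dir).resolve()
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            # Skip directory entries and anything that is not a CSV file.
            if member.is_dir() or not member.filename.lower().endswith(".csv"):
                continue
            # Refuse members whose path would land outside dest_dir.
            target = (dest / member.filename).resolve()
            if not target.is_relative_to(dest):  # Python 3.9+
                raise ValueError(f"unsafe path in archive: {member.filename}")
            extracted.append(Path(archive.extract(member, dest)))
    return sorted(extracted)
```

The tap could then be pointed at `dest_dir` with a local path and a filename pattern, as described above.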
a
Pyfilesystem supports zip sources. In theory, tap-spreadsheets-anywhere could get support for zip files via that library.
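```python
from fs import open_fs

# Zip file (count_python_loc is assumed to be defined elsewhere, as in the PyFilesystem README example)
with open_fs('zip://projects.zip') as fs:
    print(count_python_loc(fs))
```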
a
Thanks @aaronsteers. tap-spreadsheets-anywhere already supports compression (gzip and bzip2). The .zip file seems to be found and downloaded OK. What it can't seem to handle is a bunch of files inside that zip, because the config for the tap is a pattern for the top-level file.
a
Yeah. I totally get that, @aaron_phethean. Zip files containing multiple data sources would need to basically be treated as multi-sheet directories in themselves.
So, catalog generation from a zip would in theory generate multiple streams.
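As a rough sketch of that idea, each CSV inside the zip could be mapped to its own stream. This is not how tap-spreadsheets-anywhere works today; `discover_streams` and the stream naming rule are hypothetical, and only the header row is read.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath


def discover_streams(zip_path):
    """Map a stream name to the header columns of each CSV inside the zip."""
    streams = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            # Hypothetical naming rule: member path without extension, "/" -> "_".
            stream = "_".join(PurePosixPath(name).with_suffix("").parts)
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                # First row is assumed to be the header; empty files give [].
                streams[stream] = next(csv.reader(text), [])
    return streams
```

A real catalog would also need types and key properties per stream, which this sketch leaves out.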
But to the direction you mention, yeah, a custom utility that can download and extract files would be really cool. Riffing on your example, a custom 'get-sheets' and 'unzip-sheets' command could be written against a generic `files-util`.
meltano run files-util:get-sheets files-util:unzip-sheets tap-spreadsheets-anywhere target-postgres
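One way the script behind those two commands might look, as a sketch only: the subcommand names follow the message above, while the argument names, `get_sheets`, `unzip_sheets`, and `main` are assumptions. How the script is registered as a Meltano utility is not shown here.

```python
import argparse
import shutil
import urllib.request
from pathlib import Path


def get_sheets(url, dest):
    """Download url to the local file path dest."""
    Path(dest).parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


def unzip_sheets(archive, dest_dir):
    """Unpack a zip archive into dest_dir."""
    shutil.unpack_archive(archive, dest_dir, format="zip")


def build_parser():
    parser = argparse.ArgumentParser(prog="files-util")
    commands = parser.add_subparsers(dest="command", required=True)
    get = commands.add_parser("get-sheets")
    get.add_argument("url")
    get.add_argument("dest")
    unzip = commands.add_parser("unzip-sheets")
    unzip.add_argument("archive")
    unzip.add_argument("dest_dir")
    return parser


def main(argv=None):
    """Dispatch to the selected subcommand; argv defaults to sys.argv[1:]."""
    args = build_parser().parse_args(argv)
    if args.command == "get-sheets":
        get_sheets(args.url, args.dest)
    else:
        unzip_sheets(args.archive, args.dest_dir)
```

Calling `main()` from a script entry point would let each subcommand run as its own step before the tap.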
a
'files-util:get-sheets' - I like that. If I got you correctly on the first idea, this would mean enhancing the tap to handle zips as nested directories. That could be quite transparent; my initial thought was that the configuration would become hugely bloated.
a
Untitled.yaml
There might be some bugs in the above, but in theory, this could be a plugin definition that relies on already-installed OS-level tools.
a
Yeah, cool - that might work as is with just the plugin definition! I was thinking about something more complex, like a files util that just moves files, also depending on smart_open: https://github.com/RaRe-Technologies/smart_open
a
Yeah, that'd be a great investment because it makes everything more portable and concisely configurable 👍
My example is more of a POC and not as portable/powerful
A fun problem for sure!
a
Great riffing with you. Cheers!
a
Ditto 🍻 😅