I have a bucket in AWS S3 full of `.zip` and `.gz`...
# getting-started
w
I have a bucket in AWS S3 full of `.zip` and `.gz` files. Each of them will contain `csv` files. I'm wondering if there's a way for Meltano to handle these compressed files prior to using an extractor like `tap-s3-csv`
v
You could use a utility or a bash script to unpack the files first. We've also thought about it some here: https://github.com/MeltanoLabs/tap-universal-file Give it a shot!
😎 1
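A minimal sketch of the "utility or script" route, written in Python rather than bash. It assumes the archives have already been downloaded from S3 into a local directory (for example with `aws s3 sync`) and unpacks them into plain CSV files that an extractor can read. The function name `unpack_archives` and the `archive__member.csv` naming scheme are hypothetical choices for illustration, not part of Meltano or any tap.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def unpack_archives(src_dir: Path, dest_dir: Path) -> list[Path]:
    """Unpack every .zip and .gz file in src_dir into plain files under dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for archive in sorted(src_dir.iterdir()):
        if archive.suffix == ".zip":
            with zipfile.ZipFile(archive) as zf:
                for member in zf.infolist():
                    # Skip directories and anything that is not a CSV.
                    if member.is_dir() or not member.filename.lower().endswith(".csv"):
                        continue
                    # Prefix with the archive stem so members of different zips do not collide.
                    target = dest_dir / f"{archive.stem}__{Path(member.filename).name}"
                    with zf.open(member) as src, open(target, "wb") as dst:
                        shutil.copyfileobj(src, dst)
                    written.append(target)
        elif archive.suffix == ".gz":
            # "data.csv.gz" becomes "data.csv".
            target = dest_dir / archive.stem
            with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

Running something like this before the extractor means the tap only ever sees uncompressed CSVs, at the cost of extra disk space and a separate step in the pipeline.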
e
☝️ that and also https://hub.meltano.com/extractors/tap-spreadsheets-anywhere/ should handle both S3 and gzip compression
😎 1
w
So at present it doesn't handle compressed archives that contain multiple files, correct?
e
Ah I see. No, I don't think zip "balls" are well supported by any tap. I know the tap I linked uses `smart_open`, so it might be worth checking if that library supports extracting multiple files from a zip file with something like a glob expression as input.
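For reference, a rough sketch of what "multiple files from a zip, selected by a glob" could look like. Whether `smart_open` can do this on its own is not settled in this thread, so the sketch uses only the standard library (`zipfile`, `fnmatch`, `csv`). It assumes the archive arrives as a seekable binary file object, and the helper name `iter_zip_csv_rows` is hypothetical.

```python
import csv
import fnmatch
import io
import zipfile
from typing import BinaryIO, Iterator


def iter_zip_csv_rows(
    fileobj: BinaryIO, member_glob: str = "*.csv"
) -> Iterator[tuple[str, dict]]:
    """Yield (member_name, row) for every CSV row in members matching member_glob."""
    with zipfile.ZipFile(fileobj) as zf:
        for name in zf.namelist():
            # fnmatchcase treats "*" as matching any characters, including "/".
            if not fnmatch.fnmatchcase(name, member_glob):
                continue
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    yield name, row
```

Because `zipfile` needs to seek to the central directory at the end of the archive, this only works on streams that support seeking. A purely forward-only stream would have to be buffered to disk or memory first.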
w
This is excellent! Thanks so much for the help. Thankfully 99% of the files are `.gz`. These extractors are awesome
@visch I'm going to try out both extractors. I think the Spreadsheets-Anywhere one is nice because it's closer in alignment with the `tap-s3-csv` one I already use, but I'm curious to see how `tap-universal-file` would handle those zips. Is there a good spot where I can start a thread if I have questions related to your Singer tap? #C06A1MD6A6L?
v
#C06A1MD6A6L works 😄 great to see you try it! I hope it is easier to change and work with but we'll see how it goes for you!
🙌 1
w
Thank you, @visch!
np 1
Hey @Edgar Ramírez (Arch.dev), right now I'm finding my `tap-spreadsheets-anywhere` just kinda hangs when I try to extract a csv from S3.
meltano invoke tap-spreadsheets-anywhere --dev
2024-06-14T18:14:55.056767Z [info     ] Environment 'dev' is active
INFO Generating catalog through sampling.
INFO Found credentials in environment variables.
Any idea what might be happening? I assume this is a fairly common problem related to listing all objects in S3? I've tried to use a very specific search pattern.
e
Have you tried narrowing down the search pattern to match one or at most a few files? That way we can rule out issues with listing objects.
w
Got it, I just hadn't noticed `search_prefix` called out in the `README.md` 🙂 All good
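For anyone landing here with the same hang, a hedged sketch of a table config that sets `search_prefix`, generated from Python as JSON. The bucket name, prefix, table name, and dates are hypothetical, and the key names (`path`, `name`, `search_prefix`, `pattern`, `start_date`, `key_properties`, `format`) are written as I recall them from the tap's README, so check them there before relying on this.

```python
import json

# Hypothetical bucket, prefix, and table names; replace with real values.
config = {
    "tables": [
        {
            "path": "s3://my-example-bucket",
            "name": "orders",
            # Limits which keys get listed at all, before the regex is applied.
            "search_prefix": "exports/2024/06",
            # Regex matched against the object key.
            "pattern": r"exports/2024/06/.*\.csv\.gz",
            "start_date": "2024-06-01T00:00:00Z",
            "key_properties": [],
            "format": "csv",
        }
    ]
}

config_json = json.dumps(config, indent=2)
print(config_json)
```

The idea is that a `pattern` alone still requires listing every object in the bucket and filtering afterwards, while a prefix narrows the listing itself, which is where large buckets tend to stall.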
e
Nice!