# singer-tap-development
p
has anyone used a singer tap to extract gzipped csv files from s3? i’ve used the pipelinewise tap-s3-csv for uncompressed files and see this open PR for zip file support but am not sure how i’d go about extending that to support gzip files. from my understanding, the challenge is to replace this call to get_file_handle() with something that can take an s3 path and return a decompressed file stream without having to first write the decompressed csv file somewhere. any ideas? this SO answer looks promising but i was scared off by the fact that i don’t really understand what it’s doing lol
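A rough sketch of the "decompressed file stream without writing to disk" idea, using only the Python standard library. The helper name iter_gzipped_csv_rows is made up for illustration and is not part of tap-s3-csv. An in-memory io.BytesIO stands in for the S3 object body, and the assumption (not verified here) is that whatever get_file_handle() returns can be read the same way, as a binary file-like object.

```python
import csv
import gzip
import io


def iter_gzipped_csv_rows(binary_stream, encoding="utf-8"):
    """Yield CSV rows as dicts from a gzip-compressed binary file-like object.

    Hypothetical helper for illustration only; not part of any tap.
    """
    # GzipFile decompresses lazily as data is read from the wrapped stream,
    # so nothing is written to disk.
    with gzip.GzipFile(fileobj=binary_stream, mode="rb") as gz:
        # TextIOWrapper decodes the decompressed bytes into text for csv.
        with io.TextIOWrapper(gz, encoding=encoding, newline="") as text:
            yield from csv.DictReader(text)


# Stand-in for the S3 object body (assumption: the real handle is also a
# binary file-like object that supports read()).
fake_s3_body = io.BytesIO(gzip.compress(b"id,name\n1,alice\n2,bob\n"))
rows = list(iter_gzipped_csv_rows(fake_s3_body))
```

Because the rows are yielded one at a time, the decompressed file never has to be held in memory or on disk all at once.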
v
The cheap/easy way to do this is to use your OS / other executables to get the file in the right format for you. Something like 1. aws s3 cp the file 2. gunzip 3. tap-csv from the local directory
p
yeah fair enough - i was hoping to avoid an additional step before the tap is run so that it can just be run with our existing airflow DAG generator instead of a custom DAG with an additional pre-step
v
Hopefully with composable pipelines we can get there with Meltano with a small wrapper for OS functions. Like
meltano run preflight tap-csv target-whatever
with preflight being those commands I just listed 🤷 Could probably do it today as well. I don't know the right approach. That stackoverflow response looks like what you need as well, either way will get you there 🙂
d
Pre-meltano, I think I used smart_open for something similar
p
turns out the gzip library has a decompress() function that seems to do the trick and i can pass it the raw bytes read from the stream
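For reference, a minimal sketch of the gzip.decompress() approach. It takes the compressed data as a bytes object (for example the result of calling read() on the stream) and returns the decompressed bytes, so the whole file sits in memory at once. The helper name rows_from_gzipped_bytes and the sample data are made up for illustration.

```python
import csv
import gzip
import io


def rows_from_gzipped_bytes(compressed, encoding="utf-8"):
    """Decompress gzip bytes in memory and parse them as CSV.

    Hypothetical helper for illustration only.
    """
    # gzip.decompress takes a bytes-like object and returns the
    # decompressed bytes in one go.
    text = gzip.decompress(compressed).decode(encoding)
    # StringIO gives csv a file-like object to read the text from.
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Stand-in for the bytes read from the S3 object.
sample_bytes = gzip.compress(b"id,name\n1,alice\n2,bob\n")
parsed_rows = rows_from_gzipped_bytes(sample_bytes)
```

This is simpler than wrapping the stream, at the cost of holding both the compressed and decompressed data in memory, which may matter for large files.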