# singer-tap-development
b
looking across several taps (`tap-csv`, `tap-s3`, `tap-s3-csv`) and there is quite a bit of duplication related to CSV parsing. i am considering making some improvements to `tap-sftp`, but trying to avoid reinventing that part of the wheel. most taps seem to orient around parsing a file format, but some seem to be more oriented around the transfer mechanism for data files that could be in many different formats. s3, sftp, http are transfer protocols that could all have the same set of file formats that need to be parsed. how is the community thinking about this?
v
Take a look at #C05CNUF699B specifically https://github.com/MeltanoLabs/tap-universal-file
We don't have `sftp` support explicitly added but the library supports it and it shouldn't be a huge lift
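As a rough illustration of "the library supports it", here is a minimal sketch assuming fsspec's built-in `sftp` filesystem (which wraps paramiko); the host, credentials, and path are placeholders:

```python
import fsspec

# "sftp" resolves to fsspec.implementations.sftp.SFTPFileSystem; the extra
# kwargs are passed through to paramiko when the SSH connection is opened.
fs = fsspec.filesystem(
    "sftp",
    host="sftp.example.com",  # placeholder host
    username="ingest",        # placeholder credentials
    password="secret",
)

# List candidate data files on the remote server.
for path in fs.ls("/incoming", detail=False):
    print(path)
```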
b
what do you mean by “the library supports it”?
are you referring to `paramiko`?
have you looked at adding `gpg` decryption support? and/or chaining `gpg -> decompress`?
b
decrypt then unzip
v
`gpg` is a no right now, but definitely could get added
`zip` is a yes right now. No reason we couldn't add that
b
to make sure I am understanding, the tap uses `fsspec` to create a file system mount that is then accessed via standard python file open calls?
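If that reading is roughly right, the pattern would look something like this sketch (placeholder connection details and path, not the tap's actual code): fsspec hands back a filesystem object whose `open()` returns an ordinary Python file-like object.

```python
import csv
import io

import fsspec

fs = fsspec.filesystem(
    "sftp", host="sftp.example.com", username="ingest", password="secret"
)

# fs.open() yields a file-like object, so the standard library's csv
# machinery works on it unchanged.
with fs.open("/incoming/orders.csv", "rb") as raw:
    for row in csv.reader(io.TextIOWrapper(raw, encoding="utf-8")):
        print(row)
```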
v
Would be awesome if you added sftp support 😄 , gpg would be really slick too
b
i can definitely appreciate the separation between transfer protocol and decrypt/extract/parse
v
To add encrypt/decrypt we'd want to abstract this function out a bit https://github.com/MeltanoLabs/tap-universal-file/blob/main/tap_universal_file/files.py#L78 to split the list of files and the "chaining" for decryption. Not certain how that'd look exactly, as it can get pretty complicated if you allow 1 -> many files with chains extending arbitrarily. If we limited the scope a bit to just "unencrypt" -> "uncompress" -> `parse_single_file`, we'd be looking better. Not sure how flexible you are thinking.
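A hedged sketch of that limited scope, just to make the shape concrete: one file, one fixed "decrypt -> decompress -> parse" chain. The helper names here are hypothetical (not functions from tap-universal-file), the paths and passphrase are placeholders, and `python-gnupg` shells out to whatever `gpg` binary is on the PATH.

```python
import csv
import gzip
import io

import gnupg  # python-gnupg


def decrypt_gpg(path: str, out_path: str, passphrase: str) -> str:
    """Decrypt one file with gpg and return the output path."""
    gpg = gnupg.GPG()
    with open(path, "rb") as encrypted:
        result = gpg.decrypt_file(encrypted, output=out_path, passphrase=passphrase)
    if not result.ok:
        raise RuntimeError(f"gpg decryption failed: {result.status}")
    return out_path


def decompress_gzip(path: str):
    """Return a text stream over the decompressed contents."""
    return io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8")


def parse_single_file(stream):
    yield from csv.DictReader(stream)


# One file, one fixed chain: orders.csv.gz.gpg -> orders.csv.gz -> rows
decrypted = decrypt_gpg("orders.csv.gz.gpg", "orders.csv.gz", passphrase="secret")
with decompress_gzip(decrypted) as stream:
    for record in parse_single_file(stream):
        print(record)
```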
I don't think we'd want to go this way, but technically fsspec supports url chaining: https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining. I don't see support for encrypt/decrypt but the rest is there. I think it gets hard to manage here, but there's definitely room for ideas. Generally I think we should make the "normal" cases work out of the box and easily: `file.csv.gzip -> file -> data` should just work.
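For the easy path, a sketch of a gzipped CSV "just working" via fsspec's compression inference; the URL, credentials, and path are placeholders:

```python
import csv
import io

import fsspec

# compression="infer" picks gzip from the .gz suffix and decompresses on read.
# URL chaining (e.g. "zip://inner.csv::sftp://host/archive.zip") is the
# separate fsspec feature linked above.
with fsspec.open(
    "sftp://ingest:secret@sftp.example.com/incoming/orders.csv.gz",
    mode="rb",
    compression="infer",
) as raw:
    for row in csv.reader(io.TextIOWrapper(raw, encoding="utf-8")):
        print(row)
```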
b
limited scope seems fine
aren't we basically just reinventing `sshfs` in python though?
v
That's what this uses 😄
I think, one sec
b
i thought it was `paramiko`
v
> This project has been archived by its developers and is no longer developed. Alternatives include the mount feature of rclone.
v
Looks like you can do either
b
yea. a little scripting around `curl sftp://… | gpg -d | gunzip -c | aws s3 cp - s3://bucket/…` is pretty simple too
v
Yep, easy to throw that in front of the tap and then let the tap rip from there, you just lose out on the nice incremental stuff 🤷
All depends on what you're after; ideally this tap would be good enough that it'd handle 80-90% of cases without needing to do that
b
yea. i was considering just replicating 1-1 from sftp to s3 and then using a meltano tap from there
you can always check s3 to see if you already have the file
that's the debate in my head
build it into the tap or script upstream
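A sketch of the "check s3 to see if you already have the file" step, assuming boto3; the bucket and key are placeholders:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def already_replicated(bucket: str, key: str) -> bool:
    """Return True if the object already exists in S3."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise


if not already_replicated("landing-bucket", "incoming/orders.csv.gz.gpg"):
    # ...fetch from SFTP and upload, e.g. via s3.upload_fileobj(...)
    pass
```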
v
I don't think adding sftp would be a tough lift, `gpg` would be a bit more involved
b
yea. it's a pain
v
🤷 rsync it over would work, just depends on your time frame
b
`python-gnupg` is dependent on the gnupg version in the environment anyway, and it's a PITA to script the decrypt if you can't control the env
but the env is easy enough to manage if you are using docker
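A sketch of what pinning that environment might look like once it's in a container, assuming `python-gnupg`; the binary path, keyring location, and key file are placeholders for whatever the image provides:

```python
import gnupg  # python-gnupg

# Point python-gnupg at the gpg binary and keyring baked into the image so
# the decrypt behaves the same everywhere the container runs.
gpg = gnupg.GPG(
    gpgbinary="/usr/bin/gpg",  # placeholder: the image's gnupg build
    gnupghome="/app/.gnupg",   # placeholder: keyring directory in the image
)

with open("/run/secrets/private.key") as key_file:  # placeholder secret mount
    gpg.import_keys(key_file.read())
```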