# plugins-general
d
Hi all, I have just started looking at Meltano, not sure if this is the right approach. I have a number of applications/DBs, and I want to run some queries, then send the data off to some SFTP site. I can see the Meltano Postgres tap, but how do I get the resulting data to an SFTP target?
t
I believe you’d have to pick a target file type (say target-csv) and then pipe those files to the SFTP site. I don’t believe that’s something that can be done with meltano right now, but if you’d like to open an issue about it, it’s definitely a valid use case I’ve seen!
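For that "produce files, then push them" workflow, a minimal sketch using paramiko after a target-csv run might look like this. The host, credentials, and directories are placeholders, and the output folder depends on how target-csv is configured:

```python
# Sketch: push CSVs produced by a tap-postgres -> target-csv run to an SFTP site.
# Host, credentials, and directories are placeholders -- adjust to your environment.
from pathlib import Path

import paramiko

OUTPUT_DIR = Path("output")   # where target-csv wrote its files (assumption)
REMOTE_DIR = "/incoming"      # destination directory on the SFTP server

transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="me", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)
try:
    for csv_file in OUTPUT_DIR.glob("*.csv"):
        sftp.put(str(csv_file), f"{REMOTE_DIR}/{csv_file.name}")
finally:
    sftp.close()
    transport.close()
```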
j
Would it make sense to have an SFTP target? Even if it were just calling paramiko or a wrapper around it, this would help integrate this workflow. Does a similar target exist for a Snowflake internal stage or an Azure Blob external stage?
v
The Snowflake target uses staging to do things, I believe, but I don't use Snowflake myself, just zoomed through the source a few times: https://github.com/datamill-co/target-snowflake (specifically the S3 section). Files are always interesting because there's a format you want for the file (csv, json, xml, etc.), then basically a final `scp` or `aws` (etc.) command that you need to run to finish the job. The loader seems like the right place to take care of that, but it's almost like you'd want a library they all share to do the "sftp, ftp, s3, azure blob, etc." load, and each file-related target (target-csv, target-json, ...) would use that library to push the file at the end of the job (see the sketch below).
I can think of some other use cases as well, like proprietary applications that take a CSV as an input, maybe `bcp` from MSSQL, that you might also want to call. But then you're getting into the territory of just wanting to run arbitrary commands after the loader is finished. Plugins could do that as well 🤷
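A rough sketch of what that shared "push the file at the end of the job" library could look like, dispatching on the destination URL scheme. `push_file` and the scheme handling are made up for illustration, and credentials/hosts are placeholders:

```python
# Hypothetical shared helper a file target could call once its output is written.
# Dispatches on the destination URL scheme; only two schemes sketched here.
from urllib.parse import urlparse


def push_file(local_path: str, destination: str) -> None:
    """Upload local_path to destination, e.g. 'sftp://user:pass@host/incoming/out.csv'."""
    scheme = urlparse(destination).scheme
    if scheme == "sftp":
        _push_sftp(local_path, destination)
    elif scheme == "s3":
        _push_s3(local_path, destination)
    else:
        raise ValueError(f"Unsupported destination scheme: {scheme!r}")


def _push_sftp(local_path: str, destination: str) -> None:
    import paramiko

    url = urlparse(destination)
    transport = paramiko.Transport((url.hostname, url.port or 22))
    transport.connect(username=url.username, password=url.password)
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        sftp.put(local_path, url.path)
    finally:
        sftp.close()
        transport.close()


def _push_s3(local_path: str, destination: str) -> None:
    import boto3

    url = urlparse(destination)
    # bucket comes from the netloc, key from the path (assumption about URL layout)
    boto3.client("s3").upload_file(local_path, url.netloc, url.path.lstrip("/"))
```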
I have a use case for this as well today. Seems like a target thing more than a meltano thing but thought I'd mention it
SFTP Target is what I need
e
Hey @visch, I was experimenting with file sources in arbitrary file systems with this thing: https://github.com/edgarrmondragon/tap-dbf/pull/1/files. Maybe something like that makes sense for targets too
v
Thanks @edgar_ramirez_mondragon, I'll dive in. You and I should chat some time; we seem to be into the same stuff right now!
https://www.pyfilesystem.org/page/index-of-filesystems/ curious stuff @edgar_ramirez_mondragon. For a tap, it's almost like any file-format tap would want to use some module / abstraction layer that sits on top of pyfilesystem. A tap would want this to run before processing, like a staging or prep step before processing the data as normal. And you might want to chain them, something like:
1. sftp://folderstructure
2. If .zip, then unzip
3. Send to regular tap
Another example I've seen on the target side is something like:
1. Create your files
2. Zip them up with some kind of encryption
3. Send them to an SFTP site
For a file target it's almost the same idea, except it's a post-processing step.
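For that staging/prep chain, a rough sketch with pyfilesystem2 (the `fs` package, plus `fs.sshfs` for the ssh:// opener) might look like this; the host, credentials, and folder names are placeholders:

```python
# Sketch of a pre-processing chain: pull zips off SFTP, unpack them locally,
# then hand the extracted files to the tap as ordinary local files.
# Requires `fs` (pyfilesystem2) and `fs.sshfs`; connection details are placeholders.
import fs
from fs.copy import copy_file
from fs.zipfs import ZipFS

# 1. sftp://folderstructure -- open the remote folder via the ssh:// opener
remote = fs.open_fs("ssh://user:secret@sftp.example.com/exports")

# 2. If .zip, then unzip -- copy each archive down and extract its members
local = fs.open_fs("./staging", create=True)
for path in remote.listdir("/"):
    if path.endswith(".zip"):
        copy_file(remote, path, local, path)
        with ZipFS(f"./staging/{path}") as archive:
            for member in archive.listdir("/"):
                copy_file(archive, member, local, member)

# 3. Send to regular tap -- the tap now just reads ./staging/* as local files
```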
I've seen this go both ways before, and I'm not sure which is better. First way: bake this into the tap / target (filesystem chaining, etc.). Second way: keep the tap/target stupid and put something in front of it that gives the tap/target a file on the local file system (or in memory), like
pullsftpdata | tap-dbase | target-snowflake
There's an argument for Ansible to step in, since these things can get complicated: there's an auth component that can change based on your system type (things like Kerberos, LDAP, Windows Auth, SSH, yada yada). https://github.com/johanmeiring/ansible-sftp https://github.com/melmorabity/ansible-ftp https://docs.ansible.com/ansible/latest/collections/ansible/windows/win_share_module.html
Reminds me a bit of Apache Camel as well (https://camel.apache.org/camel-quarkus/latest/reference/index.html). The space gets kinda deep, but there's no clear winner here for what works long term.
e
thanks for your feedback @visch!
> For a tap, it's almost like any file-format tap would want to use some module / abstraction layer that sits on top of pyfilesystem
that's what I was going for with that approach, making the filesystem pluggable via some pip extras and credentials baked into the file path like here for S3
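As a tiny illustration of that idea with pyfilesystem URLs (the actual extras and URL formats in tap-dbf may differ; `fs-s3fs` provides an s3:// opener and `fs.sshfs` an ssh:// opener, and the credentials below are fake):

```python
# Illustration only: credentials live in the file URL, so the tap needs no
# separate connection settings. Requires the optional `fs-s3fs` / `fs.sshfs`
# backends; bucket, host, and credentials below are placeholders.
import fs

s3 = fs.open_fs("s3://MY_KEY:MY_SECRET@my-bucket")
sftp = fs.open_fs("ssh://user:password@sftp.example.com/exports")

print(s3.listdir("/"))
print(sftp.listdir("/"))
```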
> Second way: keep the tap/target stupid and put something in front of it that gives the tap/target a file on the local file system (or in memory)
an orchestrator like airflow helps a lot with that approach, but you end up with bespoke dependencies for your file taps
> pullsftpdata | tap-dbase | target-snowflake
composable ftw! (and a relevant meltano issue). I've actually been playing around with a "merge" transform for multi-tenant sources here: https://github.com/edgarrmondragon/singer-playground/tree/main/merge_streams
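One way a "merge" step like that could work (this is just a guess at the shape, not the actual singer-playground implementation) is as a stdin/stdout filter that rewrites stream names so records from multiple tenant taps land in the same stream:

```python
# Hypothetical inline "merge" step: read Singer messages on stdin, strip a
# per-tenant prefix from the stream name, and emit the messages on stdout so a
# single target sees one merged stream. Not the actual singer-playground code.
import json
import sys

TENANT_PREFIXES = ("tenant_a__", "tenant_b__")  # example prefixes


def merged_name(stream: str) -> str:
    for prefix in TENANT_PREFIXES:
        if stream.startswith(prefix):
            return stream[len(prefix):]
    return stream


for line in sys.stdin:
    message = json.loads(line)
    if "stream" in message:  # SCHEMA / RECORD messages carry a stream name
        message["stream"] = merged_name(message["stream"])
    sys.stdout.write(json.dumps(message) + "\n")
```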
v
Hmm, I didn't dive too much into the merge stuff you wrote, due to time today. How does that compare with stream maps?
More comprehensive?
Thanks for the pointers by the way, looks like I'm going to implement fs-ssh for target-csv. PR in 30 min? We'll see how slow I am
Having all kinds of problems with fs.sshfs when using setup.py. No idea what I'm doing wrong @edgar_ramirez_mondragon (https://github.com/althonos/fs.sshfs/issues/49). Going to drop this for now; I've dumped way too much time into it.