# singer-taps
p
Does someone here use the sftp-tap? I am trying to get it working with https://ftp.ncbi.nlm.nih.gov/gene/DATA/ but I am unable to. Are there examples with public ftp servers? Thanks!
v
sftp != ftp I'd guess is your issue
👍 1
h
I was going to suggest exactly that. Fsspec has an ftp implementation, the tap references ftp explicitly in the code, so there is a good chance it works.
p
Thanks a bunch!
p
I successfully use https://github.com/ets/tap-spreadsheets-anywhere to extract genomics / proteomics files like these from the web. A few things I’ve learned:
• These are actually https (not ftp) resources, despite the genomics community’s propensity to put `ftp.` in the domain name. This tap recognizes this and handles it correctly.
• For https resources, the tap uses the file’s modification date as the state bookmark. So for “current” views of a data set such as these, for which the contents get periodically updated but the URL doesn’t change, Meltano will (correctly, for a typical ELT use case) retrieve new data from the same URL when its modification date changes.
• Unlike for other protocols such as s3, the tap can’t extract data from all files located in some folder and matching some pattern (since a webserver may not support directory listing). It will only extract data from a specific file. So you need to configure the entire domain + directory as the `path` and the file’s base name as the `pattern`.
• Genomics files tend to be huge, and genomics public domain webservers tend to not be very performant, so extraction can take a long time. So to avoid dropping the http connection, it’s best to configure your target to retrieve the entire file in a single batch, so that the connection doesn’t need to be held open for a later batch while waiting for an earlier batch to load. This means the EL job must also run on a server that can hold the entire JSON-expanded file in memory.
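To make the `path` / `pattern` split concrete, here is a rough sketch that takes a single file URL and builds one table entry from it. The file name `example_table.tsv` and the helpers `split_url` and `build_table_entry` are made up for illustration, and every field other than `path` and `pattern` (`name`, `format`, `delimiter`, `key_properties`, `start_date`) is written from memory of the tap's README, so check those against the README before relying on them.

```python
import json
import posixpath
from urllib.parse import urlsplit


def split_url(url):
    """Split a file URL into (domain + directory, base name)."""
    parts = urlsplit(url)
    directory, base_name = posixpath.split(parts.path)
    return f"{parts.scheme}://{parts.netloc}{directory}", base_name


def build_table_entry(url, table_name):
    """Build one hypothetical entry for the tap's "tables" list."""
    path, pattern = split_url(url)
    return {
        "path": path,          # entire domain + directory
        "pattern": pattern,    # the file's base name
        "name": table_name,    # assumed field: destination table name
        "format": "csv",       # assumed field: tab-delimited text read as csv
        "delimiter": "\t",     # assumed field
        "key_properties": [],  # assumed field
        "start_date": "2000-01-01T00:00:00Z",  # assumed field, arbitrary date
    }


# Hypothetical file name under the directory from the original question.
url = "https://ftp.ncbi.nlm.nih.gov/gene/DATA/example_table.tsv"
config = {"tables": [build_table_entry(url, "example_table")]}
print(json.dumps(config, indent=2))
```

The printed JSON is meant as a starting point to paste into the tap's config; the split itself is the only part that follows directly from the points above.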
👍 1