Plamen Tarkalanov
03/13/2024, 2:10 PM
visch
03/13/2024, 3:55 PM
visch
03/13/2024, 3:59 PM
ftp 🤷
Henning Holgersen
03/13/2024, 4:01 PM
Plamen Tarkalanov
03/13/2024, 5:58 PM
peter_s
03/18/2024, 10:33 PM
“ftp.” in the domain name. This tap recognizes this and handles it correctly.
• For https resources, the tap uses the file’s modification date as the state bookmark. So for “current” views of a data set such as these, for which the contents get periodically updated but the URL doesn’t change, Meltano will (correctly, for a typical ELT use case) retrieve new data from the same URL when its modification date changes.
• Unlike for other protocols such as s3, the tap can’t extract data from all files located in some folder and matching some pattern (since a webserver may not support directory listing). It will only extract data from a specific file. So you need to configure the entire domain + directory as the path and the file’s base name as the pattern.
• Genomics files tend to be huge, and genomics public domain webservers tend to not be very performant, so extraction can take a long time. So to avoid dropping the http connection, it’s best to configure your target to retrieve the entire file in a single batch, so that the connection doesn’t need to be held open for a later batch while waiting for an earlier batch to load. This means the EL job must also run on a server that can hold the entire JSON-expanded file in memory.
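A rough Python sketch of two of the points above, in case it helps. Everything named here is hypothetical and not taken from the tap's code: `split_url_for_tap` shows how a full file URL breaks down into the path (domain + directory) and pattern (base name) values, and `needs_refresh` shows the idea of using the file's modification date as a bookmark, assuming the server reports it in an HTTP `Last-Modified` header. The example URL is made up.

```python
import posixpath
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.parse import urlsplit


def split_url_for_tap(url: str) -> dict:
    """Split a file URL into 'path' (domain + directory) and 'pattern' (base name).

    Hypothetical helper: the key names mirror the settings mentioned above,
    but this is not the tap's own code.
    """
    parts = urlsplit(url)
    directory, base_name = posixpath.split(parts.path)
    if not base_name:
        # A URL ending in "/" names a directory, which can't be listed over https.
        raise ValueError("URL must point at a specific file, not a directory")
    return {
        "path": f"{parts.scheme}://{parts.netloc}{directory}",
        "pattern": base_name,
    }


def needs_refresh(last_modified_header: str, bookmark: Optional[datetime]) -> bool:
    """Return True if the file changed after the stored bookmark.

    Assumes the bookmark is a timezone-aware datetime and the header is an
    HTTP date such as 'Wed, 13 Mar 2024 14:10:00 GMT'.
    """
    if bookmark is None:
        # No bookmark yet, so this is the first extraction.
        return True
    modified = parsedate_to_datetime(last_modified_header)
    return modified > bookmark


if __name__ == "__main__":
    # Made-up URL with "ftp." in an https domain name.
    print(split_url_for_tap("https://ftp.example.org/pub/current/variants.tsv.gz"))
    old_bookmark = datetime(2024, 3, 1, tzinfo=timezone.utc)
    print(needs_refresh("Wed, 13 Mar 2024 14:10:00 GMT", old_bookmark))
```

If the tap treats the pattern as a regular expression (I haven't checked), the base name may need escaping with `re.escape` before it goes into the config.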