# singer-tap-development
m
I would like to get a sanity check for a tap I'm considering developing. I've not built one before, and I wanted to see if there are any red flags, or get a little affirmation that I'm on the right path. I want to load the data from a series of XML files I have locally. Following https://docs.meltano.com/tutorials/custom-extractor, I intend to create a tap using the cookiecutter from the Meltano SDK, with stream type "other" and no auth. Settings should include a path (dir or file), keys, and an optional XSL stylesheet to transform the XML into something 'flat'. The plugin should support key-based replication and incremental replication (by date of the most recently processed file). I will rely heavily on pandas, using its read_xml function to apply the XSLT and shape the data into rows.
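To make the incremental part concrete, here is a minimal sketch of selecting only files newer than the last bookmark. It assumes "date of the most recently processed file" means the file's modification time; `select_new_files` and its `bookmark` argument are hypothetical names for illustration, not part of the Meltano SDK.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def select_new_files(
    directory: Path, bookmark: Optional[float]
) -> Tuple[List[Path], Optional[float]]:
    """Return XML files modified after `bookmark`, oldest first, plus the new bookmark.

    `bookmark` is a POSIX timestamp (assumption: file mtime is the replication key).
    Passing None means "no previous run", so every XML file is returned.
    """
    # Sort by modification time so files are processed oldest first.
    candidates = sorted(
        (p for p in directory.glob("*.xml") if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    if bookmark is not None:
        candidates = [p for p in candidates if p.stat().st_mtime > bookmark]
    # The newest selected file becomes the next bookmark; keep the old one if nothing is new.
    new_bookmark = candidates[-1].stat().st_mtime if candidates else bookmark
    return candidates, new_bookmark
```

In a real tap the bookmark would live in the stream's state rather than being passed around by hand; the sketch only shows the selection logic.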
h
Hmmm… From what I can find, it seems you can define the header row in the XSLT itself, which makes things easier. This can be a two-step process: first convert the file to CSV, then load it. That lets you basically extend an existing tap if you want. It might also be a free-standing utility, but I don’t know enough about utilities to opine on that. A simplifying assumption is that the folder you read from doesn’t contain any CSV files (at least not with the same file names as the XMLs), so that you don’t have to think up some temp storage for them. Specifying key columns would then work the same way as in tap-csv or wherever: simply refer to the CSV columns that the XSL writes. There are all sorts of situations that could get complicated, like if you want to map one XML file to several CSVs, but I don’t see any fundamental problem here as long as we accept that it might not work in all situations.
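A rough sketch of the first step (convert, then load) might look like the following. Note that it swaps XSLT for a plain `xml.etree.ElementTree` walk, because the standard library has no XSLT processor; `xml_to_csv`, `row_tag`, and `columns` are hypothetical names, and it leans on the assumption above that no CSV with the same file name already exists in the folder.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List


def xml_to_csv(xml_path: Path, row_tag: str, columns: List[str]) -> Path:
    """Write one CSV row per `row_tag` element, next to the source XML file.

    Each column is read from the child element of the same name; a missing
    child becomes an empty string. Any existing CSV with the same stem is
    overwritten.
    """
    csv_path = xml_path.with_suffix(".csv")
    root = ET.parse(xml_path).getroot()
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(columns)  # header row, so key columns can be named downstream
        for row in root.iter(row_tag):
            writer.writerow([row.findtext(col, default="") for col in columns])
    return csv_path
```

The resulting CSV could then be picked up by an existing CSV tap, with the key columns taken from the header row written here.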
m
When you say utility, are there ways to integrate a Meltano pipeline with an external application? For example, could Meltano's tap-csv run a Python script on a given folder before moving on to the rest of the tap's work? That would be a simpler kludge than constructing a fresh tap, if I understand you correctly. I just need to be sure it integrates closely with Meltano's orchestration.
m
you might take a look at https://github.com/ets/tap-spreadsheets-anywhere - it doesn’t support XML but it does support local filepaths
as far as a sanity check, what you’ve described seems pretty reasonable to me
e
the only red flag to me would be pandas since it’s a rather hefty dependency, but in your case it might be the easiest way to process the xml doc, so it’s probably ok