# singer-tap-development
Hello all! We are working on developing a Singer tap based on the Meltano SDK to extract data from the Statistics Estonia Database through a REST API. Statistics Estonia publishes statistical datasets that are accessible through their API, and so far we have managed to get everything working, except for a challenge with the Singer catalog feature and its implementation in Meltano. Perhaps we are approaching it wrong, but the idea is the following:

1. We want to enable users to run `meltano select` to choose which datasets they want to extract from the source.
2. We use the `discover_streams` method in `tap.py` to get the catalog from the source, but the cataloging process itself takes quite a long time.
3. We do not want catalog discovery to take place each time the tap is run, just during installation, as the catalog rarely changes.

So in general, the idea is that the catalog itself is created dynamically during installation. The catalog itself might be quite large (a .json file around 15 MB) and catalog creation might take 20-30 minutes, so we really want to cache it. From what we have learned so far, there does not seem to be a good way to cache the catalog in Meltano. Any suggestions?
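For context, the selection in step 1 can also be pinned in `meltano.yml` via the extractor's `select` rules; the tap and stream names below are placeholders, not the real project's:

```yaml
# meltano.yml fragment; extractor and dataset names are hypothetical.
plugins:
  extractors:
    - name: tap-statistics-estonia
      select:
        - population_by_county.*    # all columns of one dataset
        - "!internal_test.*"        # rules starting with "!" exclude
```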
It looks like Statistics Estonia uses PxWeb for publishing statistics, I have created tap-pxwebapi for that purpose: https://hub.meltano.com/extractors/tap-pxwebapi. Let me know if it works for you.
In your custom catalog creation implementation, you could short-circuit to a cached catalog on disk in a known location with something like platformdirs (that way all installations of the tap reference the same cached catalog), otherwise you generate the catalog and save it to disk.
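A minimal sketch of that short-circuit, assuming a `generate_catalog()` callable that wraps the slow discovery; the app name `tap-statee` and the helper names are made up here. `platformdirs.user_cache_dir()` would be the more robust cross-platform choice; a stdlib fallback keeps this self-contained:

```python
# Hypothetical sketch: cache the generated Singer catalog on disk and reuse
# it across invocations instead of rediscovering every run.
import json
import os
import time
from pathlib import Path


def cache_dir() -> Path:
    # Stdlib stand-in for platformdirs.user_cache_dir("tap-statee");
    # respects XDG_CACHE_HOME on Linux, falls back to ~/.cache.
    base = os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
    path = Path(base) / "tap-statee"
    path.mkdir(parents=True, exist_ok=True)
    return path


def load_or_build_catalog(generate_catalog, max_age_seconds=7 * 24 * 3600) -> dict:
    """Short-circuit to a cached catalog on disk; regenerate only when stale."""
    cache_file = cache_dir() / "catalog.json"
    if cache_file.exists():
        age = time.time() - cache_file.stat().st_mtime
        if age < max_age_seconds:
            return json.loads(cache_file.read_text())
    catalog = generate_catalog()  # the slow 20-30 minute discovery
    cache_file.write_text(json.dumps(catalog))
    return catalog
```

Since the cache lives in a per-user location rather than inside the virtualenv, every installation of the tap references the same cached catalog, and re-installing does not invalidate it.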
> We do not want the catalog discovery to take place each time the tap is run, just during installation
There are tricks you could do with Poetry using a custom build script to generate the catalog at installation time, but it's probably not a good idea 😅
I have been thinking about this one a bit, and a couple of things occur to me. One thing we could do is to accept an optional "schema" config, which would basically be the output of the discover command. But I'm not sure that takes us where we want to be. The thing with this API is that schema discovery also answers whether there is new data that should be loaded, so removing the schema API call means we probably have to make a call for the actual data anyway. In the current implementation I haven't found how to stop the run from inside the schema method, so it ends up loading at least the latest batch of data; if we figure that one out, we would substantially reduce the number of API requests. Maybe @Edgar Ramírez (Arch.dev) has some ideas?

tl;dr: today, every invocation makes two calls: one to get the schema and find out if there is new data, and another to get the new data. If we can avoid fetching data when there is nothing new, that would be positive wrt the rate limiting.

Update: reading through the thread and thinking about select and schema discovery: yes, schema discovery in the sense of discovering which tables are available takes a looong time the way it has to be done. It hadn't really occurred to me to do it like that. So I was basically addressing a different and much smaller issue, regarding discovering the columns in a given dataset.