# singer-tap-development
j
I want to extract from a REST API that has an endpoint, say
GET /documents
that gives me a list of documents, and then another one like
GET /document/<doc_id>
that returns important information that's not part of the former. I've wrapped REST APIs before with the tap cookiecutter in the Meltano SDK, but how would this actually translate into a
RESTStream
in the Singer tap? Is it even a supported use case to have an "N+1" stream which requires another roundtrip per record? (I'm aware it's going to be slow, luckily it's not that many rows). Thanks!
a
I've used this pattern before as a
DocumentSummary
which hits the 'get documents' endpoint and then a
DocumentDetail
child stream that makes one request per parent record, assuming a 1:1 relationship. You then need to rejoin them downstream in dbt or similar.
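A rough plain-Python model of this pattern, showing the data flow only. In the Meltano SDK the equivalent wiring is a child `RESTStream` with `parent_stream_type` set, and a `get_child_context` method on the parent that returns the ID. The fetch functions, field names, and sample data below are hypothetical stand-ins for the real HTTP calls.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-in for the API's data; a real tap would issue HTTP requests.
FAKE_API = {
    "documents": [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}],
    "details": {1: {"id": 1, "body": "alpha"}, 2: {"id": 2, "body": "beta"}},
}


def fetch_document_summaries() -> Iterator[dict]:
    """Parent stream: a single call to GET /documents."""
    yield from FAKE_API["documents"]


def child_context(summary: dict) -> Dict[str, int]:
    """Pass on only what the child needs to build its URL."""
    return {"doc_id": summary["id"]}


def fetch_document_detail(context: Dict[str, int]) -> dict:
    """Child stream: one call to GET /document/<doc_id> per parent record."""
    return FAKE_API["details"][context["doc_id"]]


def sync() -> Tuple[List[dict], List[dict]]:
    """Emit two separate record sets, one per stream."""
    summaries, details = [], []
    for summary in fetch_document_summaries():
        summaries.append(summary)
        details.append(fetch_document_detail(child_context(summary)))
    return summaries, details


summaries, details = sync()
```

The two lists correspond to two tables in the target, which is why the join on `id` has to happen downstream.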
r
We do something like this in
tap-spotify
- get tracks and then make a (single) separate request to grab the audio features for each and merge them into the track record. It sort of violates the principle of ELT, as this is technically a transformation, but it is possible. https://github.com/Matatika/tap-spotify/blob/f944d7430f9003ef589acec21f67b16c14a09095/tap_spotify/streams.py#L41-L94
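The merge approach can be sketched roughly as follows, assuming a bulk endpoint that accepts several IDs per request. `fetch_features_bulk`, the batch size, and the field names are hypothetical; the linked tap-spotify code is the real implementation.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Split records into batches no larger than the bulk endpoint allows."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_bulk_details(
    records: List[dict],
    fetch_bulk: Callable[[List[int]], Dict[int, dict]],
    batch_size: int = 100,
) -> Iterator[dict]:
    """Make one extra request per batch and merge the result into each record."""
    for batch in chunked(records, batch_size):
        details_by_id = fetch_bulk([record["id"] for record in batch])
        for record in batch:
            # Records with no matching detail pass through unchanged.
            yield {**record, **details_by_id.get(record["id"], {})}


def fetch_features_bulk(ids: List[int]) -> Dict[int, dict]:
    """Hypothetical stand-in for a bulk endpoint; returns details keyed by ID."""
    return {i: {"tempo": 100 + i} for i in ids}


tracks = [{"id": i, "name": f"track-{i}"} for i in range(5)]
merged = list(merge_bulk_details(tracks, fetch_features_bulk, batch_size=2))
```

With 5 records and a batch size of 2 this makes 3 extra requests instead of 5, which is where the speed-up over one request per record comes from.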
j
Cool. I built it like @Reuben (Matatika) suggested (although it is a bit of a hack). Understandably it's pretty slow because it's basically an N+1 operation. I wonder, if I built it like @Andy Carter suggested, would I be able to use, say,
aiohttp
to extract more of these documents simultaneously? Suppose I could also do a
multiprocessing
+
requests
thing.
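A minimal sketch of what concurrent detail fetching could look like. It uses a thread pool from the standard library rather than `aiohttp` or `multiprocessing`, since threads are usually enough for I/O-bound HTTP calls. `fetch_detail` is a hypothetical stand-in for the real request, and whether this fits cleanly into the SDK's stream lifecycle is an open question here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fetch_detail(doc_id: int) -> dict:
    """Hypothetical stand-in; a real version would GET /document/<doc_id>."""
    return {"id": doc_id, "body": f"detail for {doc_id}"}


def fetch_details_concurrently(doc_ids: List[int], max_workers: int = 8) -> List[dict]:
    """Fetch several document details at once using worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as doc_ids.
        return list(pool.map(fetch_detail, doc_ids))


details = fetch_details_concurrently([1, 2, 3])
```

Keep `max_workers` modest so the API's rate limits are not exceeded.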
r
The only reason we implemented it that way was because Spotify offers an endpoint that lets you fetch audio features for a bunch of tracks in a single request, so it is more performant than child streams, but it is definitely still a hack to work with the API in an optimal way. Assuming your API doesn't have the same kind of bulk operation, then as Andy said, if the relationship is 1:1 and you don't mind a separate stream (i.e. a separate table with a database loader) for the document detail (
GET /document/{id}
), I would go with that approach using child streams. There's been a fair amount of talk around Meltano running streams in parallel, so if you did declare document detail as a child stream you may be able to benefit from that feature in the future. I think this is the primary issue: https://github.com/meltano/meltano/issues/2677