Meltano #singer-tap-development

Jens Christian Hillerup

11/27/2024, 2:23 PM

I want to extract from a REST API that has an endpoint, say `GET /documents`, that gives me a list of documents, and then another one like `GET /document/<doc_id>` that returns important information that's not part of the former. I've wrapped REST APIs before with the tap cookiecutter in the Meltano SDK, but how would this actually translate into a `RESTStream` in the Singer tap? Is it even a supported use case to have an "N+1" stream which requires another roundtrip per record? (I'm aware it's going to be slow, luckily it's not that many rows). Thanks!

Andy Carter

11/27/2024, 3:43 PM

I've used this pattern before as a `DocumentSummary` stream, which hits the 'get documents' endpoint, and then a `DocumentDetail` child stream that fetches the detail for each document, assuming a 1:1 relationship. Then you need to rejoin them downstream in dbt or similar.

❤️ 1

➕ 1

Reuben (Matatika)

11/27/2024, 5:19 PM

We do something like this in `tap-spotify`: get tracks and then make a (single) separate request to grab the audio features for each and merge into the track record. Sort of violates the principle of ELT as this is technically a transformation, but it is possible. https://github.com/Matatika/tap-spotify/blob/f944d7430f9003ef589acec21f67b16c14a09095/tap_spotify/streams.py#L41-L94

❤️ 1
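The merge-in-the-tap idea can be sketched in plain Python, independent of the SDK. This is not the `tap-spotify` code: `fetch_many` is a hypothetical callable standing in for a bulk endpoint that takes a list of ids and returns a dict of details keyed by id, and the `id` field name is an assumption.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_details(
    records: List[dict],
    fetch_many: Callable[[List[str]], Dict[str, dict]],  # hypothetical bulk fetch
    batch_size: int = 100,
) -> List[dict]:
    """Fetch details one batch at a time and merge them into each record."""
    merged = []
    for batch in chunked(records, batch_size):
        # One request per batch instead of one request per record.
        details = fetch_many([record["id"] for record in batch])
        for record in batch:
            # Records without a matching detail are passed through unchanged.
            merged.append({**record, **details.get(record["id"], {})})
    return merged
```

This only pays off when the API offers a bulk lookup; with a strictly per-id endpoint the number of requests stays the same as with a child stream.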

Jens Christian Hillerup

12/04/2024, 9:16 AM

Cool. I built it like @Reuben (Matatika) suggested (although it is a bit of a hack). Understandably it's pretty slow because it's basically an N+1 operation. I wonder, if I built it like @Andy Carter suggested, would I be able to use, say, `aiohttp` to extract more of these documents simultaneously? I suppose I could also do a `multiprocessing` + `requests` thing.
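For illustration, a minimal sketch of fetching several details at once with a thread pool from the standard library. This swaps in `concurrent.futures` threads rather than `aiohttp` or `multiprocessing`, since the work is I/O-bound. `fetch_one` is a hypothetical callable that performs a single `GET /document/<doc_id>` request and returns the parsed record; how this would fit into an SDK stream is not covered here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_details(
    doc_ids: List[str],
    fetch_one: Callable[[str], dict],  # hypothetical: one HTTP request per id
    max_workers: int = 8,
) -> List[dict]:
    """Run fetch_one for every id on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as doc_ids,
        # even if the requests finish out of order.
        return list(pool.map(fetch_one, doc_ids))
```

Keeping `max_workers` small is a reasonable default, since many APIs rate-limit concurrent requests.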

Reuben (Matatika)

12/04/2024, 9:42 AM

The only reason we implemented it that way was that Spotify offers an endpoint which lets you fetch audio features for a bunch of tracks in a single request, so it is more performant than child streams, but it is definitely still a hack to work with the API in an optimal way. I'm assuming your API doesn't have the same kind of bulk operation. As Andy said, if the relationship is 1:1 and you don't mind a separate stream (i.e. a separate table with a database loader) for the document detail (`GET /document/{id}`), I would go with that approach using child streams. There's been a fair amount of talk around Meltano running streams in parallel, so if you did declare document detail as a child stream, you may be able to benefit from that feature in the future. I think this is the primary issue: https://github.com/meltano/meltano/issues/2677
