# singer-tap-development
p
Hello, I looked at what you did with the SDK and great job! The API is quite lean 😃 The fact that the state management + network calls are almost completely hidden is a nice achievement 👍

Based on my recent experience (or "trauma" 😅) writing a tap for Hubspot, I ran a thought experiment on how I could have used the SDK, and I have some questions 🤔

Hubspot is a CRM offering a REST API. The streams are Contacts / Deals / Deal Pipeline / Deal Pipeline Stage / Owners (e.g. salespeople). The APIs are quite a mess, so it may not be a good example.

a. To extract some streams from Hubspot (e.g. contacts, from 0 to 200k records), you have to fetch 2 endpoints in sequence:

1. The first one fetches the Ids of what you want to extract. This one is paginated.
2. The second one fetches the detail of each record of the stream (the contact with its "custom properties"). It requires passing the list of Ids fetched in call 1 in the query string. (And it's not possible to consume this endpoint directly in a paginated manner, because it wouldn't be fun otherwise…)

I'm not sure how this could work with `RESTStream`. The closest thing I saw was https://gitlab.com/meltano/singer-sdk/-/blob/development/singer_sdk/samples/sample_tap_gitlab/gitlab_rest_streams.py#L160, where 2 streams are used and a sort of state (is it the same as the tap state?) is used to pass data between the 2 streams. In Hubspot, there are thousands or hundreds of thousands of Contact Ids to be passed between the 2 streams. Would the recommended solution be the same as the one implemented in the Gitlab tap?

b. Some streams in Hubspot are nested into a single API endpoint. E.g. Deal Pipeline and Deal Pipeline Stage are returned as 2 nested lists from a single endpoint. How would that be mapped with the SDK? Is it possible to create a Stream that consumes the results of another Stream's API calls without making API calls of its own?

c. What is the concurrency model of the SDK between Streams? Are all streams synced in sequence, or is there some sort of concurrency? My experience has shown that Hubspot's paginated APIs have high latency, so the "total sync time" can be quite long: we spend a lot of time waiting for the next page (and when people have 100k records, that's a lot of pages…). Syncing different paginated streams (e.g. Contacts and Deals) at the same time saves time in this case. How would the SDK have helped? 😃
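The two-endpoint pattern in question (a) can be illustrated without the SDK at all. This is a minimal, framework-free sketch with stubbed functions standing in for Hubspot's hypothetical "list Ids" and "batch details" endpoints (the names, page sizes, and record shapes are invented for illustration):

```python
from typing import Iterator, List, Tuple

# Stubbed "API" standing in for Hubspot's two endpoints (hypothetical data).
_FAKE_IDS = [str(n) for n in range(1, 251)]
_FAKE_DETAILS = {
    i: {"vid": i, "properties": {"email": f"user{i}@example.com"}} for i in _FAKE_IDS
}

def fetch_id_page(offset: int, page_size: int = 100) -> Tuple[List[str], bool]:
    """Endpoint 1: a paginated list of record Ids."""
    page = _FAKE_IDS[offset : offset + page_size]
    has_more = offset + page_size < len(_FAKE_IDS)
    return page, has_more

def fetch_details(ids: List[str]) -> List[dict]:
    """Endpoint 2: batch detail lookup, Ids passed in the query string."""
    return [_FAKE_DETAILS[i] for i in ids]

def sync_contacts() -> Iterator[dict]:
    """Walk endpoint 1 page by page, resolving each page via endpoint 2."""
    offset, has_more = 0, True
    while has_more:
        ids, has_more = fetch_id_page(offset)
        offset += len(ids)
        # One batched detail call per page of Ids, instead of one per record.
        yield from fetch_details(ids)

records = list(sync_contacts())
```

Batching one detail call per Id page keeps the Id list that flows between the two phases bounded by the page size, rather than holding 200k Ids in memory or in state at once.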
j
Hello @pierre_de_poulpiquet. I've been struggling with the same API. I'm using Hubspot's Python library instead of calling the endpoints directly. I don't think I can solve your problem, but I'm very keen to know if you publish your tap as open source.
a
@pierre_de_poulpiquet - First of all, thanks very much for your thoughtful and detailed post! This kind of feedback is super helpful. Now to your questions…

Re: your case of needing to make 2 subsequent calls for each record.

Option 1: Parent-child streams, with the parent keys serialized in state. As demonstrated in the example you linked to, a first "parent" stream populates into the state the parent keys that the child stream will later need access to. The child stream can then treat these keys as partitions. For example, if I have a parent stream of "issues" and a child stream of "issue comments", I can save the issue keys in state, which will then be used as partitions of the child "issue comments" stream. Note: this is only needed when you can't call the child stream directly, in which case you have to make as many calls as there are parent keys (plus pagination).

Option 2: Implement subsequent calls, one per record, in a new optional `RESTStream.append_extra_record_data()` method (or similar). This doesn't exist yet, but in theory it could default to doing nothing and then be called by the framework here before yielding each individual record.

Option 3: Implement subsequent request calls, in batch, during `RESTStream.parse_response()` here. This is similar to option 2 but faster if you can send multiple record keys to the REST endpoint at the same time.
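Option 3 might look something like the sketch below. Note that `parse_response_with_batches`, `chunked`, and `fetch_batch_details` are all hypothetical names (the SDK hook doesn't exist yet, per the discussion above); the point is the shape: after parsing a page, enrich its records in batches of keys rather than one call per record:

```python
from itertools import islice
from typing import Iterable, Iterator, List

def chunked(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive batches of record keys, one detail call per batch."""
    it = iter(keys)
    while batch := list(islice(it, size)):
        yield batch

def fetch_batch_details(keys: List[str]) -> List[dict]:
    """Hypothetical batch detail endpoint: one enriched dict per key."""
    return [{"id": k, "detail": f"detail-for-{k}"} for k in keys]

def parse_response_with_batches(raw_records: List[dict],
                                batch_size: int = 50) -> Iterator[dict]:
    """Option-3 style: after parsing a page, enrich its records in batches."""
    by_id = {r["id"]: r for r in raw_records}
    for batch in chunked(by_id, batch_size):
        for extra in fetch_batch_details(batch):
            # Merge the original record with the batch-fetched detail.
            yield {**by_id[extra["id"]], **extra}

page = [{"id": str(n), "name": f"rec{n}"} for n in range(120)]
enriched = list(parse_response_with_batches(page))
```

With a batch size of 50, a 120-record page costs 3 extra requests instead of 120, which is where the speedup over option 2 comes from.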
@pierre_de_poulpiquet - Based on your description, I think you would need option 2 or 3, but I'm not sure which. Also, we don't yet have dedicated helper functions or methods for interfacing with the `requests` library outside the main flow. We'd probably want to add those - perhaps a singleton call that assumes the same authenticator and HTTP headers, but takes a custom path, custom HTTP params, and a custom payload. Thoughts?
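To make the "singleton call" idea concrete, here is a stdlib-only sketch of what such a helper's interface might look like. `ExtraRequestBuilder` is entirely hypothetical (nothing like it exists in the SDK); it only builds the request dict, reusing a shared base URL and auth headers while accepting a one-off path, params, and payload:

```python
import json
import urllib.parse
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExtraRequestBuilder:
    """Hypothetical helper: reuse the stream's base URL and auth headers,
    but accept a custom path, custom params, and a custom payload."""
    base_url: str
    auth_headers: dict = field(default_factory=dict)

    def build(self, path: str, params: Optional[dict] = None,
              payload: Optional[dict] = None) -> dict:
        url = self.base_url.rstrip("/") + "/" + path.lstrip("/")
        if params:
            url += "?" + urllib.parse.urlencode(params)
        return {
            "url": url,
            "headers": {**self.auth_headers, "Content-Type": "application/json"},
            "body": json.dumps(payload) if payload is not None else None,
        }

builder = ExtraRequestBuilder("https://api.example.com/v1",
                              {"Authorization": "Bearer TOKEN"})
req = builder.build("contacts/batch", params={"ids": "1,2,3"})
```

The actual sending would then go through whatever session/authenticator the stream already holds; the helper only centralizes the parts that differ per call.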
> Are all streams synced in sequence, or is there some sort of concurrency?
Currently the streams are synced in sequence, but in theory this can be overridden, and in the future we might build in the capability to auto-parallelize up to a specific degree of parallelism. Another option I've seen people use is simply to have two or more groupings of streams for the same tap, running in parallel at the orchestrator level (e.g. Meltano).
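For illustration, overlapping the page waits of several streams is the kind of thing `asyncio` makes straightforward. This sketch is not SDK code; the stream names, page counts, and simulated latency are invented, and `asyncio.sleep` stands in for the HTTP round trip:

```python
import asyncio

async def sync_stream(name: str, pages: int, latency: float = 0.01) -> list:
    """Simulate one paginated stream: each page costs one round trip."""
    records = []
    for page in range(pages):
        await asyncio.sleep(latency)  # stand-in for the high-latency HTTP call
        records.extend(f"{name}-{page}-{i}" for i in range(3))
    return records

async def sync_all_concurrently() -> dict:
    """Sync several streams at once so their page waits overlap."""
    contacts, deals = await asyncio.gather(
        sync_stream("contacts", pages=5),
        sync_stream("deals", pages=5),
    )
    return {"contacts": contacts, "deals": deals}

output = asyncio.run(sync_all_concurrently())
```

With two streams of 5 pages each, the concurrent version waits roughly 5 latencies instead of 10, which is exactly the saving described for Contacts + Deals above.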
@pierre_de_poulpiquet and @juan_sebastian_suarez_valencia - If you have time tomorrow to join during #C01QS0RV78D, it would be a great discussion topic! 🙂
@edgar_ramirez_mondragon 👆 FYI This relates to your previous feedback:

> A feature that may be of interest to myself and other people developing RESTful taps may be the ability to query an endpoint for each record in a stream (e.g. `/content/{content_id}/views`).
e
Ah yeah, definitely. Lots of APIs have endpoints like that, and it's especially painful when there are many-to-many relationships between the entities but the mapping table isn't exposed. It may require something like what scrapy does to dedup links in pages. Concurrency for RESTful taps should be easy to implement with a single async session per tap invocation, I think.
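The scrapy-style dedup mentioned here is essentially a shared "seen" set checked before each per-record request. A minimal sketch (the URLs and helper name are illustrative, not SDK API):

```python
from typing import Iterable, Iterator, Set

def dedup_endpoints(candidates: Iterable[str], seen: Set[str]) -> Iterator[str]:
    """Skip per-record endpoint URLs that were already fetched, scrapy-style."""
    for url in candidates:
        if url in seen:
            continue
        seen.add(url)
        yield url

seen: Set[str] = set()
# Two records pointing at the same related entity (the m2m case above).
first_pass = list(dedup_endpoints(
    ["/content/1/views", "/content/2/views", "/content/1/views"], seen))
# A later page repeats one URL and introduces a new one.
second_pass = list(dedup_endpoints(
    ["/content/2/views", "/content/3/views"], seen))
```

Keeping the set scoped to one tap invocation pairs naturally with the single async session idea: both are per-run shared state.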