#singer-tap-development

pierre_de_poulpiquet

03/23/2021, 8:07 PM
Hello, I looked at what you did with the SDK and great job! The API is quite lean 😃 The fact that state management and network calls are almost completely hidden is a nice achievement 👍

Based on my recent experience (or “trauma” 😅) writing a tap for Hubspot, I thought about how I could have used the SDK, as a thought experiment, and I have some questions 🤔

Hubspot is a CRM with a REST API, so the tap is REST-based. The streams are Contacts / Deals / Deal Pipeline / Deal Pipeline Stage / Owners (e.g. salesperson). The APIs are quite a mess, so it may not be a good example.

a. To extract some streams from Hubspot (e.g. contacts, from 0 to 200k records), you have to call 2 endpoints in sequence:
1. The first one fetches the Ids of what you want to extract. This one is paginated.
2. The second one fetches the details of each record of the stream (contacts with their “custom properties”). It requires the list of Ids fetched in call 1 to be passed in the query string. (And it’s not possible to consume this endpoint directly in a paginated manner, because it wouldn’t be fun otherwise…)
I’m not sure how this could work with `RESTStream`. The closest thing I saw was: https://gitlab.com/meltano/singer-sdk/-/blob/development/singer_sdk/samples/sample_tap_gitlab/gitlab_rest_streams.py#L160 where 2 streams are used and a sort of state (is it the same as the tap state?) is used to pass data between the 2 streams. In Hubspot, there are thousands or hundreds of thousands of Contact Ids to pass between the 2 streams. Would the recommended solution be the same as the one implemented in the Gitlab tap?

b. Some streams in Hubspot are nested in a single API endpoint. For example, Deal Pipeline and Deal Pipeline Stage are returned as 2 nested lists from a single endpoint. How would that be mapped with the SDK? Is it possible to create a Stream that consumes the results of another Stream's API calls without making API calls of its own?

c. What is the concurrency model of the SDK between Streams? Are all streams synced in sequence, or is there some sort of concurrency? My experience has shown that Hubspot's paginated APIs have high latency, so the “total sync time” can be quite long: we spend a lot of time waiting for the next page (and when people have 100k records, that's a lot of pages…). Syncing different paginated streams (e.g. Contacts and Deals) at the same time saves time in this case. How would the SDK have helped? 😃
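To make question (a) concrete, here is a minimal sketch of the two-step pattern in plain Python, independent of the SDK. Everything in it is an assumption made for illustration: `fetch_id_page` and `fetch_details` are hypothetical callables standing in for the two Hubspot endpoints, the cursor-style pagination is a guess at the shape of the first endpoint, and the batch size of 100 is arbitrary. It is not how the SDK or the Gitlab sample does it; it only shows that Ids can be streamed in batches rather than held in state.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical stand-ins for the two endpoints (not real Hubspot or SDK calls).
# fetch_id_page(cursor) -> (ids on this page, next cursor or None when done)
FetchIdPage = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]
# fetch_details(ids) -> one detail record per requested id
FetchDetails = Callable[[List[str]], List[Dict]]


def iter_ids(fetch_id_page: FetchIdPage) -> Iterator[str]:
    """Walk the paginated Id endpoint until it returns no next cursor."""
    cursor: Optional[str] = None
    while True:
        ids, cursor = fetch_id_page(cursor)
        yield from ids
        if cursor is None:
            break


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group items into lists of at most `size` elements."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def iter_records(
    fetch_id_page: FetchIdPage,
    fetch_details: FetchDetails,
    batch_size: int = 100,  # arbitrary; the real limit depends on the API
) -> Iterator[Dict]:
    """Yield detail records, one detail call per batch of Ids."""
    for id_batch in batched(iter_ids(fetch_id_page), batch_size):
        yield from fetch_details(id_batch)


if __name__ == "__main__":
    # In-memory fakes in place of network calls.
    pages = {None: (["1", "2"], "p2"), "p2": (["3"], None)}
    for record in iter_records(
        lambda cursor: pages[cursor],
        lambda ids: [{"id": i} for i in ids],
        batch_size=2,
    ):
        print(record)
```

Because both steps are generators, only one batch of Ids is in memory at a time, which is the part that matters with hundreds of thousands of contacts.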