# singer-tap-development
s
I have a super-hierarchical api I’m working with. I need to get tasks, but they are several levels down.
teams
  |
  --spaces
    |
    --folders
      |
      --lists
        |
        --tasks
At each level, I need to send data to Snowflake, so each level is a valid stream. But I also need to remember all the ids from each level so that I can loop through them. It looks like partitions would help, but I’m not certain how they work. The gitlab example only deals with one level of a hierarchy, I think. Any tips? EDIT: changed bottom level of the hierarchy from spaces -> tasks
a
Hi, @stephen_lloyd. This is tricky because defining the partitions for `spaces` in your example basically requires one state entry and one call for each row from `lists`, which requires one call for each row from `folders`, and so on. Do you know what would be the approximate order of magnitude (hundreds, thousands, or closer to millions+) for the next-to-lowest grain? That would probably drive whether partitions are a feasible solution alone. Otherwise, the tap would likely need to make additional REST calls at runtime to loop through the parent structures.
s
I think we’re at most in the 100s range for the next to lowest grain. Probably in the low tens of thousands at the moment for the lowest grain. The hierarchy of calls is:
<https://api.clickup.com/api/v2/team>
<https://api.clickup.com/api/v2/team/{team_id}/space>
<https://api.clickup.com/api/v2/space/{space_id}/folder>
<https://api.clickup.com/api/v2/folder/{folder_id}/list>
<https://api.clickup.com/api/v2/list/{list_id}/task>
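A minimal sketch of looping through that chain of calls while remembering the parent ids at each level. The endpoint paths are the ones listed above; everything else is an assumption for illustration: the `fetch` callable (a stand-in for an authenticated HTTP GET that returns a list of rows), the `id` field on each row, and the `*_id` context key names.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

BASE = "https://api.clickup.com/api/v2"

# (stream name, URL template, context key that holds this level's id).
# The context key names are hypothetical, chosen to match the URL placeholders.
LEVELS = [
    ("teams", "/team", "team_id"),
    ("spaces", "/team/{team_id}/space", "space_id"),
    ("folders", "/space/{space_id}/folder", "folder_id"),
    ("lists", "/folder/{folder_id}/list", "list_id"),
    ("tasks", "/list/{list_id}/task", "task_id"),
]

# Assumed shape: fetch(url) returns a list of dicts, each with an "id" key.
Fetch = Callable[[str], List[dict]]


def walk(
    fetch: Fetch,
    depth: int = 0,
    context: Optional[Dict[str, str]] = None,
) -> Iterator[Tuple[str, dict]]:
    """Depth-first walk that yields (stream_name, record) for every level.

    Each record is the fetched row plus the ids of all its ancestors,
    so every level can be emitted as its own stream.
    """
    context = context or {}
    if depth == len(LEVELS):
        return
    stream, template, id_key = LEVELS[depth]
    for row in fetch(BASE + template.format(**context)):
        yield stream, {**row, **context}
        # Descend with this row's id added to the context.
        yield from walk(fetch, depth + 1, {**context, id_key: row["id"]})
```

Because `walk` is a generator, only the current branch of ids is held in memory, at the cost of one call per parent row at every level.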
k
I had to solve a similar problem when creating tap-tableau-wrangler 🤔 The way I did it was to introduce a `Service` class to traverse the hierarchy of my upstream source and present 'flat' lists of dicts for streams to consume. The implementation was actually one of the topics of Demo Day last week 😅 If you have large volumes of data to retrieve, this solution may be quite memory-intensive, but it works for us.
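A rough sketch of what such a service could look like. This is not the tap-tableau-wrangler implementation; `HierarchyService`, `fetch_children`, and the `parent_id` field are hypothetical names used only to illustrate the idea of traversing once and handing each stream a flat list of dicts.

```python
from typing import Callable, Dict, List, Optional

# Assumed shape: fetch_children(level, parent_row) returns the child rows of
# one parent at the given level. The root level receives an empty dict.
FetchChildren = Callable[[str, dict], List[dict]]


class HierarchyService:
    """Walks a parent/child hierarchy once and caches a flat list per level."""

    def __init__(self, fetch_children: FetchChildren, levels: List[str]) -> None:
        self._fetch_children = fetch_children
        self._levels = levels
        self._cache: Optional[Dict[str, List[dict]]] = None

    def _load(self) -> None:
        """Traverse level by level, tagging each row with its parent's id."""
        cache: Dict[str, List[dict]] = {}
        parents: List[dict] = [{}]  # synthetic root so the top level has one "parent"
        for level in self._levels:
            rows = [
                {**child, "parent_id": parent.get("id")}
                for parent in parents
                for child in self._fetch_children(level, parent)
            ]
            cache[level] = rows
            parents = rows
        self._cache = cache

    def records(self, level: str) -> List[dict]:
        """Return the flat list of rows for one level, loading on first use."""
        if self._cache is None:
            self._load()
        return self._cache[level]
```

Every level is held in memory at once, which is the memory cost mentioned above; in exchange, each stream reads from the cache and no endpoint is called twice.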
s
Thanks @aaronsteers and @ken_payne. I’ll review that solution.
a
> I think we’re at most in the 100s range for the next to lowest grain.
At this scale, partitions should still be scalable, since (1) you can traverse that number of items during job initialization with little realtime latency, and (2) the size of those items each having their own partition entry in `state` would not become a scaling challenge. That said, you could also take the service approach that @ken_payne used for Tableau instead of or in addition to the partitions approach.
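To make the two points concrete, here is a plain-Python sketch of one partition per parent list, each with its own entry in state. It does not use the SDK itself. The state layout (`bookmarks` / `partitions` / `context`), the helper names, and the `replication_key_value` bookmark field are assumptions for illustration; only the task URL comes from the thread.

```python
from typing import Dict, List

TASK_URL = "https://api.clickup.com/api/v2/list/{list_id}/task"


def build_partitions(list_ids: List[str]) -> List[Dict[str, str]]:
    """One partition context per parent list (hundreds of entries at this scale)."""
    return [{"list_id": list_id} for list_id in list_ids]


def get_partition_state(state: dict, stream: str, context: Dict[str, str]) -> dict:
    """Find or create the state entry for one partition of a stream.

    Assumed layout: state["bookmarks"][stream]["partitions"] is a list of
    {"context": {...}, ...bookmark fields...} dicts.
    """
    partitions = (
        state.setdefault("bookmarks", {})
        .setdefault(stream, {})
        .setdefault("partitions", [])
    )
    for entry in partitions:
        if entry["context"] == context:
            return entry
    entry = {"context": context}
    partitions.append(entry)
    return entry


def task_url(context: Dict[str, str]) -> str:
    """Build the request URL for one partition from its context."""
    return TASK_URL.format(**context)
```

With a few hundred lists, the `partitions` list in state stays at a few hundred small dicts, and the tap makes one task request sequence per partition.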
k
I haven’t come across partitions yet @aaronsteers - is there a write up of how they work somewhere? I was starting to think along the lines of ‘child streams’ as a neater way of handling these nested/tree source types, but maybe partitions are better 😅
a
@ken_payne Yeah, I’m adding formal docs on this topic in 0.1.2 - you can preview it in the development branch here and here. And if you have comments, please do add them here in the 0.1.2 MR
k
Epic, thanks for that! Looks really interesting, but I am not sure I see how it solves the hierarchical stream case 🤔 I have some ideas for how hierarchical streams could work, and I’ll type them up into an issue in the morning. Looks like at least two cases where they might be useful, and what I have in mind should hopefully work well as an optional feature/extension to the standard implementation (as partitioning does) 🤞
a
@ken_payne - I’ve seen partitions be helpful for hierarchical streams of one-layer depth or perhaps two layers. I think the jury is still out on whether there’s a viable pattern for 4-5 levels deep.
@stephen_lloyd - Would love to hear your thoughts on this issue which @ken_payne raised: add support for hierarchical streams (#97) · Issues · meltano / Singer SDK · GitLab
@ken_payne - Thanks for logging that. I added some comments regarding incremental and deselected parent streams for your review.
@ken_payne and @stephen_lloyd - By chance would either of you want to discuss these tomorrow in #office-hours? This is a part of the spec I do want to improve and it might be helpful to discuss and share these conversations also with the community.
s
@aaronsteers I’ll take a look at the Issue. When are the office hours? I only found links up to the one last week.
a
@stephen_lloyd Office hours timing today is 9AM Pacific / 16:00 UTC.
s
Thanks. Link?
a
s
thanks and apologies, I just realized I could click on the office-hours channel link. I’m feeling a little slow I guess.
a
No worries at all. I was moving quickly and didn’t really explain the link when I posted it. Glad you made it though! Really appreciated your input.