I have a super hierarchical api I m working with I need to g Meltano #singer-tap-development

stephen_lloyd

04/12/2021, 4:22 AM

I have a super-hierarchical api I’m working with. I need to get tasks, but they are several levels down.

teams
  |
  --spaces
    |
    --folders
      |
      --lists
        |
        --tasks

At each level, I need to send data to Snowflake, so each level is a valid stream. But I also need to remember all the ids from each level so that I can loop through them. It looks like partitions would help, but I’m not certain how they are working. The gitlab example only deals with one level of a hierarchy, I think. Any tips? EDIT: changed bottom level of the hierarchy from spaces -> tasks

aaronsteers

04/12/2021, 4:55 AM

Hi, @stephen_lloyd. This is tricky because defining the partitions for `spaces` in your example basically requires one state entry and one call for each row from `lists`, which requires one call for each row from `folders`, and so on. Do you know what would be the approximate order of magnitude (hundreds, thousands, or closer to millions+) for the next-to-lowest grain? That would probably drive whether partitions are a feasible solution alone. Otherwise, the tap would likely need to make additional REST calls at runtime to loop through the parent structures.

stephen_lloyd

04/12/2021, 5:12 AM

I think we’re at most in the 100s range for the next-to-lowest grain. Probably in the low tens of thousands at the moment for the lowest grain. The hierarchy of calls is:

https://api.clickup.com/api/v2/team
https://api.clickup.com/api/v2/team/{team_id}/space
https://api.clickup.com/api/v2/space/{space_id}/folder
https://api.clickup.com/api/v2/folder/{folder_id}/list
https://api.clickup.com/api/v2/list/{list_id}/task
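For illustration, the chain of calls above could be walked with a small recursive generator. This is only a sketch under stated assumptions: `fetch` is a hypothetical callable that performs the request and returns a plain list of dicts, each record is assumed to carry an `id` field, and the `{parent_id}` placeholder is a simplification of the per-level names in the URLs above. Authentication, pagination, and error handling are left out.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

BASE = "https://api.clickup.com/api/v2"

# (stream name, URL template); every template after the first takes the parent's id.
LEVELS: List[Tuple[str, str]] = [
    ("teams", BASE + "/team"),
    ("spaces", BASE + "/team/{parent_id}/space"),
    ("folders", BASE + "/space/{parent_id}/folder"),
    ("lists", BASE + "/folder/{parent_id}/list"),
    ("tasks", BASE + "/list/{parent_id}/task"),
]

# Hypothetical: takes a URL, returns the records found there as a list of dicts.
Fetch = Callable[[str], List[Dict]]


def walk(
    fetch: Fetch, depth: int = 0, parent_id: Optional[str] = None
) -> Iterator[Tuple[str, Dict]]:
    """Yield (stream_name, record) pairs depth-first, one request per parent row."""
    name, template = LEVELS[depth]
    url = template.format(parent_id=parent_id)
    for record in fetch(url):
        yield name, record
        if depth + 1 < len(LEVELS):
            # Assumes every record exposes its identifier under "id".
            yield from walk(fetch, depth + 1, record["id"])
```

Because every level is yielded along with its stream name, each level can still be emitted as its own stream while the ids needed for the next call are carried down the recursion.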

ken_payne

04/12/2021, 2:06 PM

I had to solve a similar problem when creating tap-tableau-wrangler 🤔 The way I did it was to introduce a `Service` class to traverse the hierarchy of my upstream source and present 'flat' lists of dicts for streams to consume. The implementation was actually one of the topics of Demo Day last week 😅 If you have large volumes of data to retrieve, this solution may be quite memory-intensive, but it works for us.
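A rough sketch of what such a service might look like for the ClickUp hierarchy follows. It is not the tap-tableau-wrangler implementation: the class name `HierarchyService`, the `fetch` callable, the `id` field, and the added parent-id keys (for example `list_id`) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


class HierarchyService:
    """Hypothetical helper: loads each level once and caches it as a flat list."""

    # child level -> (parent level, path template for the child endpoint)
    PARENTS = {
        "spaces": ("teams", "/team/{id}/space"),
        "folders": ("spaces", "/space/{id}/folder"),
        "lists": ("folders", "/folder/{id}/list"),
        "tasks": ("lists", "/list/{id}/task"),
    }

    def __init__(self, fetch: Callable[[str], List[Dict]]) -> None:
        # fetch is assumed to take a path and return a list of record dicts.
        self._fetch = fetch
        self._cache: Dict[str, List[Dict]] = {}

    def records(self, level: str) -> List[Dict]:
        """Return the flat list for a level, loading its parents first if needed."""
        if level not in self._cache:
            self._cache[level] = self._load(level)
        return self._cache[level]

    def _load(self, level: str) -> List[Dict]:
        if level == "teams":
            return self._fetch("/team")
        parent_level, template = self.PARENTS[level]
        parent_key = parent_level[:-1] + "_id"  # "lists" -> "list_id"
        rows: List[Dict] = []
        for parent in self.records(parent_level):
            for row in self._fetch(template.format(id=parent["id"])):
                # Tag each child with its parent's id so the flat list stays joinable.
                rows.append({**row, parent_key: parent["id"]})
        return rows
```

Each stream would then read `service.records("tasks")`, `service.records("lists")`, and so on from one shared instance. The cache is what makes this memory-intensive: every level is held in memory until the run ends.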

stephen_lloyd

04/12/2021, 3:32 PM

Thanks @aaronsteers and @ken_payne. I’ll review that solution.

aaronsteers

04/12/2021, 7:53 PM

I think we’re at most in the 100s range for the next to lowest grain.

At this scale, partitions should still be scalable, since (1) you can traverse that number of items during job initialization with little realtime latency, and (2) the size of those items each having their own partition entry in `state` would not become a scaling challenge. That said, you could also take the service approach that @ken_payne used for Tableau instead of or in addition to the partitions approach.
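To make the partition idea concrete, here is a minimal sketch in plain Python that deliberately does not use the SDK's own classes. The state layout (a `partitions` list of `context` and `bookmark` entries per stream) and the helper names are assumptions for illustration, not the SDK's documented format.

```python
from typing import Dict, List, Optional


def build_partitions(lists: List[Dict]) -> List[Dict]:
    """One partition (a small context dict) per next-to-lowest row, here per list."""
    return [{"list_id": row["id"]} for row in lists]


def get_bookmark(state: Dict, stream: str, partition: Dict) -> Optional[str]:
    """Look up the saved bookmark for one partition, or None if never synced."""
    for entry in state.get(stream, {}).get("partitions", []):
        if entry["context"] == partition:
            return entry.get("bookmark")
    return None


def set_bookmark(state: Dict, stream: str, partition: Dict, value: str) -> None:
    """Create or update the state entry that belongs to one partition."""
    entries = state.setdefault(stream, {}).setdefault("partitions", [])
    for entry in entries:
        if entry["context"] == partition:
            entry["bookmark"] = value
            return
    entries.append({"context": partition, "bookmark": value})
```

With lists numbering in the hundreds, the state holds a few hundred small entries, which is the scale described above as manageable.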

ken_payne

04/12/2021, 8:27 PM

I haven’t come across partitions yet @aaronsteers - is there a write up of how they work somewhere? I was starting to think along the lines of ‘child streams’ as a neater way of handling these nested/tree source types, but maybe partitions are better 😅

aaronsteers

04/12/2021, 8:40 PM

@ken_payne Yeah, I’m adding formal docs on this topic in 0.1.2 - you can preview it in the development branch here and here. And if you have comments, please do add them here in the 0.1.2 MR

ken_payne

04/12/2021, 9:23 PM

Epic, thanks for that! Looks really interesting, but I am not sure I see how it solves the hierarchical stream case 🤔 I have some ideas for how hierarchical streams could work, and I’ll type them up into an issue in the morning. Looks like at least two cases where they might be useful, and what I have in mind should hopefully work well as an optional feature/extension to the standard implementation (as partitioning does) 🤞

aaronsteers

04/12/2021, 9:24 PM

@ken_payne - I’ve seen partitions be helpful for hierarchical streams of one-layer depth or perhaps two layers. I think the jury is still out on whether there’s a viable pattern for 4-5 levels deep.

aaronsteers

04/13/2021, 8:07 PM

@stephen_lloyd - Would love to hear your thoughts on this issue which @ken_payne raised: add support for hierarchical streams (#97) · Issues · meltano / Singer SDK · GitLab

aaronsteers

04/13/2021, 8:08 PM

@ken_payne - Thanks for logging that. I added some comments regarding incremental and deselected parent streams for your review.

aaronsteers

04/13/2021, 8:09 PM

@ken_payne and @stephen_lloyd - By chance would either of you want to discuss these tomorrow in #C01QS0RV78D? This is a part of the spec I do want to improve and it might be helpful to discuss and share these conversations also with the community.

stephen_lloyd

04/14/2021, 3:48 AM

@aaronsteers I’ll take a look at the issue. When are the office hours? I only found links up to the one last week.

aaronsteers

04/14/2021, 2:13 PM

@stephen_lloyd Office hours timing today is 9AM Pacific / 16:00 UTC.

stephen_lloyd

04/14/2021, 3:53 PM

Thanks. Link?

aaronsteers

04/14/2021, 3:55 PM

https://gitlab.zoom.us/j/96641645026