# singer-tap-development
d
Following up on @visch’s post: I’m in the process of building a ClickUp tap, and his first item is where I’m stuck right now. What’s the best practice for handling APIs that don’t return pagination hints for a given endpoint? ClickUp’s Get Tasks operation doesn’t provide a total record count, nor does it provide pagination hints/props. It uses a 0-based `page` index for requests, and returns a maximum of 100 items. Unfortunately, if there are fewer than 100 items and you pass a `page` querystring param that should return 0 rows, it merrily returns the complete set of records. For example, this request (using ClickUp’s Apiary Mock endpoint, so you can run it as-is) returns a list of valid workspaces for the ClickUp Team with a `team_id` of `512`:
```shell
curl https://private-anon-1ab3e1ce0e-clickup20.apiary-mock.com/api/v2/team/512/space
```
Note that there are only two workspaces in the response. With a maximum of 100 records per page, adding a `page` querystring param should return zero records. However, if I add a `page` querystring param (per the API docs) to return the “next” page of records…
```shell
curl https://private-anon-1ab3e1ce0e-clickup20.apiary-mock.com/api/v2/team/512/space?page=1
```
I get back the same two entries. This is true for live API requests, also. Is there a recommended approach for dealing with APIs like this? That is, other than submitting a request to ClickUp to fix their API, which I have already done. 😄
a
@dustin_miller - Are there any hints at all that either (1) we're already on the last page or (2) that the returned "page 2" isn't really so?
I see this guidance on their docs page:
Unfortunately, this means 1% of the time you will have exactly 100 records and there also won't be a next page - in which case, ... (?)
I mean, if we just take them at their guidance, I'd say go ahead and count the records, then explicitly set `next_page_token` to `None` any time `num_records < records_per_page` - but it doesn't solve for when there are exactly 100 records on the last page - which should be true approximately 1% of the time.
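A minimal standalone sketch of that count heuristic (the function name mirrors the Singer SDK's `get_next_page_token` hook, but nothing here imports `singer_sdk`):

```python
PAGE_SIZE = 100  # ClickUp caps responses at 100 records per page

def get_next_page_token(records, previous_token):
    """Return the next 0-based page index, or None to stop paginating.

    A short page (fewer than PAGE_SIZE records) means we just saw the
    last page. Note this alone can't detect a *full* last page -- the
    ~1% case discussed above.
    """
    if len(records) < PAGE_SIZE:
        return None
    current_page = previous_token or 0  # first request carries no token
    return current_page + 1
```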
One "creative" option is to store in the `next_page_token` a hash of all the records received per request. Then, checking the last hash against the new hash will tell you whether the new 100 records are exactly the same as the prior 100, in which case you could then treat the result as 0 records.
Combining these two methods could in theory give you a solution - but it isn't exactly pretty 😬
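A sketch of that combined approach, hashing each page's payload into the token so a replayed page can be detected (standalone Python; the token shape and helper names are illustrative, not the actual SDK API):

```python
import hashlib
import json

PAGE_SIZE = 100

def page_hash(records):
    """Stable hash of a page of records, used only for equality checks."""
    return hashlib.sha256(
        json.dumps(records, sort_keys=True).encode()
    ).hexdigest()

def next_page_token(records, previous_token):
    """Stop on a short page, or when this page hashes identically to
    the previous one (i.e. the API replayed the same records).

    The token is a dict {"page": 0-based index, "hash": page hash};
    it is None for the first request and None again to end pagination.
    """
    if len(records) < PAGE_SIZE:
        return None
    new_hash = page_hash(records)
    if previous_token and previous_token["hash"] == new_hash:
        return None  # same 100 records as last time: treat as done
    next_page = previous_token["page"] + 1 if previous_token else 1
    return {"page": next_page, "hash": new_hash}
```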
d
That’s…FUN 😄 I was just wondering if storing the `id` of the first record returned would work, but that depends on the sort order being consistent from one request to another (not all endpoints support a `sort_by` param)
Yeah, not pretty. I think hashing the `id`s is probably the safest approach to see if a “page” is the same as the previous one.
I don’t think I want to hash the entire response object, just in case (since we’re splitting hairs) someone goes and changes a resource’s properties in between paginated requests.
a
I don't know the internals of `id` to confirm/question the approach, but yes, you can count the records and stop when n < 100 - and also apply any heuristic to keep from looping when the last page has 100 items.
d
But that’s only because I’m unsure ……… What’s the default behavior if an ostensibly unique `primary_keys` result is repeated within a given stream? Last one wins?
a
In case it's not clear, you can put any non-falsy thing you want inside `next_page_token` - and you'll have the prior token when you're evaluating the next one. So, for this use case, you'll probably want to make it a dict so you can keep more detailed info from one page to the next.
d
oh that’s a good point, I don’t have to pass that token into the request, after all.
a
Yeah - the test of whether to continue is `if next_page_token` - so as long as it's not empty or None or similar, anything else will keep the flow going - and then you can smartly compare whatever vars are needed.
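A toy sketch of that control flow (the `fetch_page` callable is hypothetical scaffolding; in a real tap the SDK drives this loop for you):

```python
def sync_stream(fetch_page, first_token=None):
    """Keep requesting pages while the token stays truthy.

    fetch_page(token) -> (records, next_token). next_token can be any
    non-falsy value (e.g. a dict); any falsy value ends the loop.
    """
    all_records = []
    token = first_token
    while True:
        records, token = fetch_page(token)
        all_records.extend(records)
        if not token:  # None, {}, "", 0 -> stop paginating
            break
    return all_records
```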
d
🛎️
Perfect. I’ll do that. In anger. While I wait for ClickUp to fix their API with proper pagination support. 🤣
a
> What’s the default behavior if an ostensibly unique `primary_keys` result is repeated within a given stream? Last one wins?

Yes - but I'd be slightly worried about deduping before the merge. On some targets they create a batch and merge upsert and/or insert the result to the target table. Many also dedupe before the merge upsert, but that could vary. ("I'm not sure" is the shorter answer.)
d
> Yes - but I’d be slightly worried about deduping before the merge. [snip]
Fair point. If I want this to be usable for any `target-` I’d need to make sure I handle that myself in the tap. Would that best be wrapped into a `stream_map`, or would it make sense to put some basic “last one wins” logic (or whatever seems correct) into `post_process` in `client.py`?
a
If you get pagination workarounds of some sort, hopefully you won't end up sending duplicate records in the same sync operation. That said, if all else fails, you could keep a hash (or other tracker) of each record you've seen, and in `post_process()` you can simply return `None` to skip a record entirely if it has already been sent.
Stream maps aren't a great candidate because they operate on each record in isolation.
Python lookup tables are pretty performant, so the bigger scalability issue I see in keeping dedupe logic in the tap is how many record hashes (or keys) you need to keep in memory at any point in time. Hundreds of thousands would likely be doable; billions maybe less so. 😄
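A standalone sketch of that idea (the `post_process` name matches the SDK hook, but the class and its key-tracking set are hypothetical scaffolding, not `singer_sdk` code):

```python
class DedupingStream:
    """Skip records whose primary key has already been emitted.

    Memory grows with the number of unique keys seen -- the
    scalability caveat mentioned above.
    """

    primary_keys = ["id"]

    def __init__(self):
        self._seen = set()

    def post_process(self, row, context=None):
        key = tuple(row[k] for k in self.primary_keys)
        if key in self._seen:
            return None  # returning None drops the duplicate record
        self._seen.add(key)
        return row
```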
v
@dustin_miller Already made a ClickUp tap: https://github.com/AutoIDM/tap-clickup - feel free to give it a spin! We're getting close to publishing it to MeltanoHub / Stitch
https://meltano.slack.com/archives/C01PKLU5D1R/p1633131234227000?thread_ts=1633130032.222700&cid=C01PKLU5D1R - for ClickUp specifically, you can sort the stream, which helps. You can of course still hit the issue if something changes mid-run. Right now, with `is_sorted = True` set, it'll just fail out if you somehow hit that case, although it seems very unlikely, as it'd have to happen right between your requests
In regards to your comment starting this thread: I think the behavior you're seeing is from pagination not existing on those endpoints. I haven't seen that many spaces, which is probably why they haven't implemented pagination for them 🤷. My gut is that they would return all of the spaces if you have >100 - we could test it out
I can tell you for Tasks the behavior makes a lot more sense
> Unfortunately, if there are less than 100 items, and you pass a `page` querystring param that *should* return 0 rows, it merrily returns the complete set of records.
I don't agree completely. The initial page is page 0. Page 1 I believe acts appropriately in the case where you have <100 tasks in a folder / folderless portion
It'd be great to see you submit anything extra you need as a PR to that tap we have going!
d
Sweet, thanks @visch!
Will eval on our setup soon