# singer-tap-development
d
Following up on @visch’s post: I’m in the process of building a ClickUp tap, and his first item is where I’m stuck right now. What’s the best practice for handling APIs that don’t return pagination hints for a given endpoint? ClickUp’s Get Tasks operation doesn’t provide a total record count, nor does it provide pagination hints/props. It uses a 0-based `page` index for requests, and returns a maximum of 100 items. Unfortunately, if there are fewer than 100 items and you pass a `page` querystring param that should return 0 rows, it merrily returns the complete set of records. For example, this request (using ClickUp’s Apiary Mock endpoint, so you can run it as-is) returns a list of valid workspaces for the ClickUp Team with a `team_id` of `512`:
```shell
curl https://private-anon-1ab3e1ce0e-clickup20.apiary-mock.com/api/v2/team/512/space
```
Note that there are only two workspaces in the response. With a maximum of 100 records per page, adding a `page` querystring param should return zero records. However, if I add a `page` querystring param (per the API docs) to return the “next” page of records…
```shell
curl https://private-anon-1ab3e1ce0e-clickup20.apiary-mock.com/api/v2/team/512/space?page=1
```
I get back the same two entries. This is true for live API requests, also. Is there a recommended approach for dealing with APIs like this? That is, other than submitting a request to ClickUp to fix their API, which I have already done. 😄
a
@dustin_miller - Are there any hints at all that either (1) we're already on the last page or (2) that the returned "page 2" isn't really so?
I see this guidance on their docs page:
Unfortunately, this means 1% of the time you will have exactly 100 records and there also won't be a next page - in which case, ... (?)
I mean, if we just take them at their guidance, I'd say go ahead and count the records, then explicitly set `next_page_token` to `None` any time `num_records < records_per_page` - but it doesn't solve for when there are exactly 100 records on the last page - which should be true approximately 1% of the time.
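A minimal standalone sketch of that count heuristic (the function name mirrors the Singer SDK's `get_next_page_token` hook, but nothing here imports `singer_sdk`):

```python
PAGE_SIZE = 100  # ClickUp caps responses at 100 records per page

def get_next_page_token(records, previous_token):
    """Return the next 0-based page index, or None to stop paginating.

    A short page (fewer than PAGE_SIZE records) means we just saw the
    last page. Note this alone can't detect a *full* last page -- the
    ~1% case discussed above.
    """
    if len(records) < PAGE_SIZE:
        return None
    current_page = previous_token or 0  # first request carries no token
    return current_page + 1
```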
One "creative" option is to store in the `next_page_token` a hash of all the records received per request. Then, checking the last hash against the new hash will tell you whether the new 100 records are exactly the same as the prior 100, in which case you could then treat the result as 0 records.
Combining these two methods could in theory give you a solution - but it isn't exactly pretty 😬
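A sketch of that combined approach, hashing each page's payload into the token so a replayed page can be detected (standalone Python; the token shape and helper names are illustrative, not the actual SDK API):

```python
import hashlib
import json

PAGE_SIZE = 100

def page_hash(records):
    """Stable hash of a page of records, used only for equality checks."""
    return hashlib.sha256(
        json.dumps(records, sort_keys=True).encode()
    ).hexdigest()

def next_page_token(records, previous_token):
    """Stop on a short page, or when this page hashes identically to
    the previous one (i.e. the API replayed the same records).

    The token is a dict {"page": 0-based index, "hash": page hash};
    it is None for the first request and None again to end pagination.
    """
    if len(records) < PAGE_SIZE:
        return None
    new_hash = page_hash(records)
    if previous_token and previous_token["hash"] == new_hash:
        return None  # same 100 records as last time: treat as done
    next_page = previous_token["page"] + 1 if previous_token else 1
    return {"page": next_page, "hash": new_hash}
```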
d
That’s…FUN 😄 I was just wondering if storing the `id` of the first record returned would work, but that depends on the sort order being consistent from one request to another (not all endpoints support a `sort_by` param)
Yeah, not pretty. I think hashing the `id`s is probably the safest approach to see if a “page” is the same as the previous one.
I don’t think I want to hash the entire response object, just in case (since we’re splitting hairs) someone goes and changes a resource’s properties in between paginated requests.
a
I don't know the internals of `id` to confirm/question the approach, but yes, you can count the records and stop when n < 100 - and also apply any heuristic to keep from looping when the last page has 100 items.
d
But that’s only because I’m unsure ……… What’s the default behavior if an ostensibly unique `primary_keys` result is repeated within a given stream? Last one wins?
a
In case it's not clear, you can put any non-falsy thing you want inside `next_page_token` - and you'll have the prior token when you're evaluating the next one. So, for this use case, you'll probably want to make it a dict so you can keep more detailed info from one page to the next.
d
oh that’s a good point, I don’t have to pass that token into the request, after all.
a
Yeah - the test of whether to continue is `if next_page_token` - so as long as it's not empty or None or similar, anything else will keep the flow going - and then you can smartly compare whatever vars are needed.
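A toy sketch of that control flow (the `fetch_page` callable is hypothetical scaffolding; in a real tap the SDK drives this loop for you):

```python
def sync_stream(fetch_page, first_token=None):
    """Keep requesting pages while the token stays truthy.

    fetch_page(token) -> (records, next_token). next_token can be any
    non-falsy value (e.g. a dict); any falsy value ends the loop.
    """
    all_records = []
    token = first_token
    while True:
        records, token = fetch_page(token)
        all_records.extend(records)
        if not token:  # None, {}, "", 0 -> stop paginating
            break
    return all_records
```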
d
🛎️
Perfect. I’ll do that. In anger. While I wait for ClickUp to fix their API with proper pagination support. 🤣
a
> What’s the default behavior if an ostensibly unique `primary_keys` result is repeated within a given stream? Last one wins?

Yes - but I'd be slightly worried about deduping before the merge. On some targets they create a batch and merge upsert and/or insert the result to the target table. Many also dedupe before the merge upsert, but that could vary. ("I'm not sure" is the shorter answer.)
d
> Yes - but I’d be slightly worried about deduping before the merge. [snip]
Fair point. If I want this to be usable for any `target-` I’d need to make sure I handle that myself in the tap. Would that best be wrapped into a `stream_map`, or would it make sense to put some basic “last one wins” logic (or whatever seems correct) into `post_process` in `client.py`?
a
If you get pagination workarounds of some sort, hopefully you won't end up sending duplicate records in the same sync operation. That said, if all else fails, you could keep a hash (or other tracker) of each record you've seen, and in `post_process()` you can simply return `None` to skip a record entirely if it has already been sent.
Stream maps aren't a great candidate because they operate on each record in isolation.
Python lookup tables are pretty performant, so the bigger scalability issue I see in keeping dedupe logic in the tap is how many record hashes (or keys) you need to keep in memory at any point in time. Hundreds of thousands would likely be doable; billions maybe less so. 😄
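A standalone sketch of that idea (the `post_process` name matches the SDK hook, but the class and its key-tracking set are hypothetical scaffolding, not `singer_sdk` code):

```python
class DedupingStream:
    """Skip records whose primary key has already been emitted.

    Memory grows with the number of unique keys seen -- the
    scalability caveat mentioned above.
    """

    primary_keys = ["id"]

    def __init__(self):
        self._seen = set()

    def post_process(self, row, context=None):
        key = tuple(row[k] for k in self.primary_keys)
        if key in self._seen:
            return None  # returning None drops the duplicate record
        self._seen.add(key)
        return row
```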
v
@dustin_miller Already made a ClickUp tap: https://github.com/AutoIDM/tap-clickup - feel free to give it a spin! We're getting close to publishing it to MeltanoHub / Stitch
https://meltano.slack.com/archives/C01PKLU5D1R/p1633131234227000?thread_ts=1633130032.222700&cid=C01PKLU5D1R - for ClickUp specifically, you can sort the stream, which helps. You can of course still hit the issue if something changes mid-run. Right now, with `is_sorted = True` set, it'll just fail out if you somehow hit that case, although it seems very unlikely, as it'd have to happen right between your requests
In regards to your comment starting this thread: I think the behavior you're seeing is from pagination not existing on those endpoints. I haven't seen that many spaces, which is probably why they haven't implemented pagination for them 🤷. My gut is that they would return all of the spaces if you have >100 - we could test it out
I can tell you for Tasks the behavior makes a lot more sense
> Unfortunately, if there are less than 100 items, and you pass a `page` querystring param that *should* return 0 rows, it merrily returns the complete set of records.
I don't agree completely. The initial page is page 0. Page 1 I believe acts appropriately in the case where you have <100 tasks in a folder / folderless portion
It'd be great to see you submit anything extra you need as a PR to that tap we have going!
d
Sweet, thanks @visch!
Will eval on our setup soon