I'm creating a new tap at <https://github.com/ilkk...
# singer-tap-development
i
I'm creating a new tap at https://github.com/ilkkapeltola/tap-sirene/ What this tap does is query the public French company database and fetch all the company data they have published. But I have challenges using state and resuming from where I left off. I get these kinds of info statements in my output
2022-06-01T11:41:35.176643Z [info     ] time=2022-06-01 11:41:35 name=target_snowflake level=INFO message=Emitting state {"bookmarks": {"siren": {"replication_key_signpost": "2022-06-01T11:12:43.534679+00:00", "starting_replication_value": "2000-01-01T00:00:01", "progress_markers": {"Note": "Progress is not resumable if interrupted.", "replication_key": "dateDernierTraitementUniteLegale", "replication_key_value": "2008-09-20T04:50:47"}}}} cmd_type=loader job_id=sirene-prod-1 name=target-sf-transferwise run_id=29bdbab7-15b6-4335-9d47-3c5f170904ce stdio=stderr
So it is storing the progress marker, but why does it say as a note "Progress is not resumable if interrupted"? What do I need to change for it to be able to resume? The API itself is a little quirky, I'll describe what is happening in my get_url_params in thread, since I believe it could be something to do with that.
The API accepts a query parameter q that can be passed e.g. dateDernierTraitementUniteLegale:[2000-01-01T00:00:01 to 3000-01-01T00:00:01] which instructs the API to return all companies that were updated in that time period. The API also accepts a tri: dateDernierTraitementUniteLegale, so the results will actually be ordered by that field. And the API accepts debut, an integer describing the starting point (0 = first result, 100 = 100th result), and nombre = results per page, max 1000.
My first iteration kept the q parameter the same throughout, and I kept changing the debut to traverse the pagination, but the API will not accept a debut larger than 10000. So, I have to keep changing the q parameter instead. What I'm doing now is, I'm checking the last record in my query result, taking the update time from that and using that in the new query dateDernierTraitementUniteLegale:[2000-01-01T00:00:01 to 3000-01-01T00:00:01]
I keep debut at zero and nombre at 1000, and just keep updating the first date in the above query value.
And now, I'm not sure if this somehow doesn't sit well with the state concept. I'm not sure how to get meltano to continue where it left off. Someone got some advice for me?
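For reference, the windowing approach described above could be sketched roughly like this. This is not the code from the tap: the function names build_params and next_start_value are hypothetical, and the sketch assumes the lower bound of the range is inclusive, so records sharing the boundary timestamp may be fetched again on the next request.

```python
from typing import Any, Dict, List, Optional

# Field and parameter names are taken from the description above;
# the function names are hypothetical and chosen only for illustration.
REPLICATION_KEY = "dateDernierTraitementUniteLegale"
FAR_FUTURE = "3000-01-01T00:00:01"


def build_params(start_value: str, page_size: int = 1000) -> Dict[str, Any]:
    """Build query parameters for one request window.

    debut stays at 0; the window moves by changing the lower date in q.
    """
    return {
        "q": f"{REPLICATION_KEY}:[{start_value} to {FAR_FUTURE}]",
        "tri": REPLICATION_KEY,  # sort results by the replication key
        "debut": 0,
        "nombre": page_size,
    }


def next_start_value(
    records: List[Dict[str, Any]], current_start: str
) -> Optional[str]:
    """Return the lower bound for the next window, or None to stop."""
    if not records:
        return None
    last_value = records[-1][REPLICATION_KEY]
    # If the window did not advance, stop rather than loop forever.
    if last_value == current_start:
        return None
    return last_value
```

One caveat with this scheme: if more records than one page share the same timestamp, the window cannot advance, which is why the sketch stops when the last value equals the current start.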
e
In your case, I think you have to explicitly set is_sorted = True
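A minimal sketch of where that setting would live, assuming the stream is declared as a class with attributes. The class name SirenStreamSketch is hypothetical; in the real tap the class would subclass the SDK's stream base class, and a plain class is used here only so the snippet runs without the SDK installed.

```python
class SirenStreamSketch:
    """Stand-in for the tap's stream class (hypothetical name).

    In the actual tap this would inherit from the SDK stream base class.
    """

    name = "siren"
    replication_key = "dateDernierTraitementUniteLegale"

    # Declares that records arrive ordered by the replication key,
    # which matches the tri parameter used when querying the API.
    is_sorted = True
```

This is only reasonable because the API is asked to sort by the same field via tri; if records could arrive out of order, declaring the stream sorted would be wrong.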
i
It's that simple? oh my...
okay
e
or change dateDernierTraitementUniteLegale to a datetime
and not a timestamp
i
I think I will get a schema violation in that case
e
Why not update the schema? It does look like it's a datetime, no?
i
No I didn't.
I thought I would get a schema violation 🙂 but I didn't. I'm not sure why I thought I would.
Should I not set is_sorted = True then at all?
I mean, are there side-effects to that
e
Then you don't need to, but you can 😉
i
right, well, I'm going to omit that first and see if this thing works.
e
If you need more complex ways to paginate, you can have a look at some of our work in https://github.com/MeltanoLabs/tap-github/blob/main/tap_github/client.py#L57
i
Cool thanks!
Boo-yah! It works! I ended up adding the is_sorted = True
Thank you Eric!
e
De rien 🙂
i
Okay, so this didn't quite solve it. Now, Meltano is saying that the bookmark is being stored correctly.
2022-06-01T13:02:32.755026Z [info     ] time=2022-06-01 13:02:32 name=target_snowflake level=INFO message=Emitting state {"bookmarks": {"siren": {"replication_key": "dateDernierTraitementUniteLegale", "replication_key_value": "2006-06-02T17:27:34"}, "siret": {"replication_key": "dateDernierTraitementEtablissement", "replication_key_value": "2006-06-02T17:27:34"}}} cmd_type=loader job_id=sirene-prod-1 name=target-sf-transferwise run_id=ab82934e-cf07-48db-a1c9-ac682b555544 stdio=stderr
However, when I re-run meltano, it starts from the beginning. I've tried running the tap alone with poetry, injecting a state file, and that works correctly. For some reason though, Meltano doesn't use the state.
a
try passing --job-id
i
Thank you. I am, it's also in the output above
job_id=sirene-prod-1
😕