# singer-tap-development
a
Hi everyone! Need advice: I'm working on a tap that extracts data from an API, and this API has a quirk:
• The API can only be queried with a granularity of one day, i.e. I can ask for events starting from 2023-02-24 but not from a specific time.
What I need to do:
• Express in the stream schema that the stream is incremental
• Set incrementality to a synthetic date field
• Emit the bookmark manually once all the records for a specific date are emitted
• Iterate by date in the REST API calls
Questions:
• Is `RESTStream` a good base for such a tap?
• How do I provide the date information? Do I inject a synthetic field like `event_date` into each event and rely on built-in functionality?
• How do I ensure that the bookmark is emitted at the end of the batch?
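For the first two bullets, a minimal sketch of what the stream declaration could look like. This is a plain-Python stand-in rather than an actual `singer_sdk.RESTStream` subclass, so only the shape (a `replication_key` pointing at a synthetic field that also appears in the schema) is meant to carry over:

```python
class EventsStream:
    """Stand-in for a RESTStream subclass; the attribute names mirror
    singer-sdk conventions, but this class itself is just a sketch."""

    name = "events"
    # Declaring a replication_key is what makes a stream incremental
    # in the SDK.
    replication_key = "event_date"
    # The synthetic field must also appear in the stream schema, even
    # though the API response itself never contains it.
    schema = {
        "type": "object",
        "properties": {
            "event": {"type": "string"},
            "event_date": {"type": "string", "format": "date"},  # synthetic
        },
    }
```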
My current issue is that I do not understand how to correctly inject `event_date` if my stream is based on `RESTStream`. My intuition is that I should do it in `post_process`, but I do not have a reliable `event_date` inside the response; I would like to explicitly put in the value that I provided in `get_url_params`. Currently I do not understand how to pass state from `get_url_params` to `post_process`, so I ended up redefining `request_records` with my custom logic.
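One way to get a value from `get_url_params` into `post_process` without redefining `request_records` is to stash it on the stream instance between the two hooks. A minimal stand-in (not a real `RESTStream`; the hook signatures follow the SDK, everything else is hypothetical, and it assumes requests are handled one at a time):

```python
class DailyStream:
    """Sketch: pass the queried date from get_url_params to post_process
    via an instance attribute."""

    def __init__(self):
        # Shared scratch space between the two hooks.
        self._current_query_date = None

    def get_url_params(self, context, next_page_token):
        # Pick the day to query (in a real tap, derived from the stored
        # bookmark) and remember it so post_process can see it.
        self._current_query_date = "2023-02-24"
        return {"date": self._current_query_date}

    def post_process(self, row, context=None):
        # The API response carries no reliable date, so inject the one
        # we actually queried with.
        row["event_date"] = self._current_query_date
        return row
```

Because `get_url_params` runs before that request's response is parsed, the attribute holds the right value when `post_process` fires; as noted later in the thread, this only holds while requests are not parallelized.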
Had to dig up

```python
self.finalize_state_progress_markers()
self._write_state_message()
```

to send proper state change messages after each chunk.
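The shape of that override, sketched as a plain-Python stand-in. The two state methods above are real SDK internals; here they are stubbed out so only the control flow is shown: flush a bookmark only after a whole day's records are out.

```python
class ChunkedStream:
    """Sketch: yield one day's records at a time, then flush state.
    All method bodies are stubs standing in for SDK behavior."""

    def __init__(self, data_by_day):
        self.data_by_day = data_by_day   # {"2023-02-22": [records], ...}
        self.state_messages = []         # stands in for STATE output
        self._progress = None

    def finalize_state_progress_markers(self):
        # Stub: promote the in-flight marker to a finalized bookmark.
        self._finalized = {"replication_key_value": self._progress}

    def _write_state_message(self):
        # Stub: emit a STATE message downstream.
        self.state_messages.append(dict(self._finalized))

    def request_records(self, context=None):
        for day in sorted(self.data_by_day):
            for record in self.data_by_day[day]:
                yield record
            # Every record for `day` is out, so the bookmark may advance.
            self._progress = day
            self.finalize_state_progress_markers()
            self._write_state_message()
```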
v
https://github.com/AutoIDM/tap-indeed/blob/main/tap_indeedsponsoredjobs/streams.py#L61 I'm not super proud of the implementation but maybe you can pull some ideas from here?
This implementation isn't great for a few reasons, the biggest of which is here: https://github.com/AutoIDM/tap-indeed/blob/main/tap_indeedsponsoredjobs/streams.py#L152. But maybe you can improve on it; I'll take any ideas 🙂. It's at the point that it works right now, so I haven't thought too much more about it 😄
a
I see
I thought about using context as a store for global vars
@visch thanks! I'll take a closer look, it seems close enough to my task
v
We did hit some issues with using the context for storing global vars. I think using a class property would be better, because we're hitting

```
State file contains duplicate entries for partition: {state_partition_context}. cmd_type=extractor name=tap-indeed-retractedd run_id=e08ff2e0-9117-4b20-a115-a4c9820840e4 state_id=extract-tap-indeed-retractedd stdio=stderr
24 February 2023,12:09:37 MST,prefect.extract/load: indeed master account c,INFO,2023-02-24T07:09:37.048874Z [info] Matching state values were: [{'context': {'_sdc_employer_id': 'retracted'}, 'replication_key': '_sdc_start_date', 'replication_key_value': '2023-02-22'}, {'context': {'_sdc_employer_id': 'retracted'}, 'replication_key': '_sdc_start_date', 'replication_key_value': '2023-02-23'}] cmd_type=extractor name=tap-indeed-retractedd run_id=e08ff2e0-9117-4b20-a115-a4c9820840e4 state_id=extract-tap-indeed-retractedd
```
Other than that it seems to work pretty good 🤷
Not certain exactly what's causing this one yet, going to look soon!
a
So far, I ended up overriding `request_records` and found a field in the data that correlates with my queries: https://github.com/epoch8/tap-appmetrica/blob/master/tap_appmetrica/client.py#L66
a
Hi, @andrey_tatarinov. Hoping to better understand the desired behavior:
• Assuming a successful sync, do you want to bookmark the timestamp of the latest record, and then in the next sync ignore/drop records already sent?
• Are the records sorted by timestamp when returned from the API?
• Do I understand correctly that the API will only paginate through a single date at a time?
a
• Assuming a successful sync, I want to bookmark the last date I provided to the API to query records.
• Records do not necessarily have this timestamp (it is an event tracking system; records have a "timestamp it happened on the user device" but do not have a "timestamp the server recorded this event", which is what we use to query data by date).
• One or several dates, yes. Granularity is one day.
a
@andrey_tatarinov - thanks for this additional info. This helps a lot. I'm still not sure on one point: whether you would need the date used in pagination (server-side receipt date, I presume) also within the records - and if yes, whether that necessitates pulling just a single date at a time (chunk_days=1), or whether there's perhaps another way you're getting that date if not from the request context you're sending in `get_url_params()`.
Even from your comments above, I can call out a few hard things here that the SDK doesn't plan for or expect as of now:
1. The incremental sync method expects that `replication_key` exists on the records themselves. I am curious if this is throwing an error for you, or if you have perhaps worked around that successfully.
2. The `context` param dict could in theory have something like "server_capture_window_end_date" injected into it (which could then be added to records via `post_process()`), but modifying context is not a previously tested pattern and might have implications I'm not thinking of.
3. The combination of simultaneously tracking finalized and non-finalized state markers in the same stream is another pattern not tested within the SDK. It might work totally fine, but I cannot say for sure without diving deeper into the code.
None of this is to say it can't or wouldn't work, just wanted to make sure we're setting healthy expectations on what challenges might be presented with this API pattern.
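For what point 2 would look like in practice, a plain sketch (explicitly an untested pattern per the caveat above; `server_capture_window_end_date` is the hypothetical key from that message, and `fetch` is a placeholder for the actual request):

```python
def records_with_capture_date(fetch, context):
    """Sketch: inject the capture-window date into the context dict,
    then copy it onto each record, post_process-style."""
    context = dict(context or {})
    context["server_capture_window_end_date"] = "2023-02-24"
    for row in fetch(context):
        # Each record gets the date from context, not from the response.
        row["event_date"] = context["server_capture_window_end_date"]
        yield row
```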
v
https://github.com/AutoIDM/tap-indeed/pull/25 fixed some of the drawbacks this had before - no more using `context` to inject things, which breaks how `state` is handled today in some ways.
a
^ yeah, I thought about this approach as well - basically you're using the Stream as a container for several global vars 🙂 Didn't know if it was safe to do so, apparently it is 🙂
v
Yeah as long as we're not parallel I think it's alright 😅