# singer-tap-development
a
Hi everyone! Need advice: I'm working on a tap that extracts data from an API, and this API has a quirk:
• The API can only be queried with a granularity of one day, i.e. I can ask for events starting from 2023-02-24 but not from a specific time.
What I need to do:
• Express in the stream schema that the stream is incremental
• Set incrementality to a synthetic date field
• Emit the bookmark manually once all the records for a specific date are emitted
• Iterate by date in the REST API calls
Questions:
• Is `RESTStream` a good base for such a tap?
• How do I provide the date information? Do I inject a synthetic field like `event_date` into each event and rely on built-in functionality?
• How do I ensure that the bookmark is emitted at the end of the batch?
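For the first two bullets, a minimal sketch of what the stream declaration could look like. This is a plain-Python stand-in rather than an actual `singer_sdk.RESTStream` subclass, so only the shape (a `replication_key` pointing at a synthetic field that also appears in the schema) is meant to carry over:

```python
class EventsStream:
    """Stand-in for a RESTStream subclass; the attribute names mirror
    singer-sdk conventions, but this class itself is just a sketch."""

    name = "events"
    # Declaring a replication_key is what makes a stream incremental
    # in the SDK.
    replication_key = "event_date"
    # The synthetic field must also appear in the stream schema, even
    # though the API response itself never contains it.
    schema = {
        "type": "object",
        "properties": {
            "event": {"type": "string"},
            "event_date": {"type": "string", "format": "date"},  # synthetic
        },
    }
```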
My current issue is that I do not understand how to correctly inject `event_date` if my stream is based on `RESTStream`. My intuition is that I should do it in `post_process`, but I do not have a reliable `event_date` inside the response; I would like to explicitly put in the value that I provided in `get_url_params`. Currently I do not understand how to pass state from `get_url_params` to `post_process`, so I ended up redefining `request_records` with my custom logic.
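One way to get a value from `get_url_params` into `post_process` without redefining `request_records` is to stash it on the stream instance between the two hooks. A minimal stand-in (not a real `RESTStream`; the hook signatures follow the SDK, everything else is hypothetical, and it assumes requests are handled one at a time):

```python
class DailyStream:
    """Sketch: pass the queried date from get_url_params to post_process
    via an instance attribute."""

    def __init__(self):
        # Shared scratch space between the two hooks.
        self._current_query_date = None

    def get_url_params(self, context, next_page_token):
        # Pick the day to query (in a real tap, derived from the stored
        # bookmark) and remember it so post_process can see it.
        self._current_query_date = "2023-02-24"
        return {"date": self._current_query_date}

    def post_process(self, row, context=None):
        # The API response carries no reliable date, so inject the one
        # we actually queried with.
        row["event_date"] = self._current_query_date
        return row
```

Because `get_url_params` runs before that request's response is parsed, the attribute holds the right value when `post_process` fires; as noted later in the thread, this only holds while requests are not parallelized.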
Had to dig up

```python
self.finalize_state_progress_markers()
self._write_state_message()
```

to send proper state change messages after each chunk.
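The shape of that override, sketched as a plain-Python stand-in. The two state methods above are real SDK internals; here they are stubbed out so only the control flow is shown: flush a bookmark only after a whole day's records are out.

```python
class ChunkedStream:
    """Sketch: yield one day's records at a time, then flush state.
    All method bodies are stubs standing in for SDK behavior."""

    def __init__(self, data_by_day):
        self.data_by_day = data_by_day   # {"2023-02-22": [records], ...}
        self.state_messages = []         # stands in for STATE output
        self._progress = None

    def finalize_state_progress_markers(self):
        # Stub: promote the in-flight marker to a finalized bookmark.
        self._finalized = {"replication_key_value": self._progress}

    def _write_state_message(self):
        # Stub: emit a STATE message downstream.
        self.state_messages.append(dict(self._finalized))

    def request_records(self, context=None):
        for day in sorted(self.data_by_day):
            for record in self.data_by_day[day]:
                yield record
            # Every record for `day` is out, so the bookmark may advance.
            self._progress = day
            self.finalize_state_progress_markers()
            self._write_state_message()
```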
v
https://github.com/AutoIDM/tap-indeed/blob/main/tap_indeedsponsoredjobs/streams.py#L61 I'm not super proud of the implementation but maybe you can pull some ideas from here?
This implementation isn't great for a few reasons, the biggest of which is here: https://github.com/AutoIDM/tap-indeed/blob/main/tap_indeedsponsoredjobs/streams.py#L152. But maybe you can improve on it; I'll take any ideas 🙂. It's at the point that it works right now, so I haven't thought too much more about it 😄
a
I see
I thought about using context as a store for global vars
@visch thanks! I'll take a closer look, it seems close enough to my task
v
We did hit some issues with using the context for storing global vars. I think using a class property would be better, because we're hitting

```
State file contains duplicate entries for partition: {state_partition_context}. cmd_type=extractor name=tap-indeed-retractedd run_id=e08ff2e0-9117-4b20-a115-a4c9820840e4 state_id=extract-tap-indeed-retractedd stdio=stderr
24 February 2023,12:09:37 MST,prefect.extract/load: indeed master account c,INFO,2023-02-24T07:09:37.048874Z [info] Matching state values were: [{'context': {'_sdc_employer_id': 'retracted'}, 'replication_key': '_sdc_start_date', 'replication_key_value': '2023-02-22'}, {'context': {'_sdc_employer_id': 'retracted'}, 'replication_key': '_sdc_start_date', 'replication_key_value': '2023-02-23'}] cmd_type=extractor name=tap-indeed-retractedd run_id=e08ff2e0-9117-4b20-a115-a4c9820840e4 state_id=extract-tap-indeed-retractedd
```
Other than that it seems to work pretty good 🤷
Not certain exactly what's causing this one yet, going to look soon!
a
So far, I ended up overriding `request_records` and found a field in the data that correlates with my queries: https://github.com/epoch8/tap-appmetrica/blob/master/tap_appmetrica/client.py#L66
a
Hi, @andrey_tatarinov. Hoping to better understand the desired behavior:
• Assuming a successful sync, do you want to bookmark the timestamp of the latest record, and then in the next sync ignore/drop records already sent?
• Are the records sorted by timestamp when returned from the API?
• Do I understand correctly that the API will only paginate through a single date at a time?
a
• Assuming a successful sync, I want to bookmark the last date I provided to the API to query records.
• Records do not necessarily have this timestamp (it is an event tracking system; records have a "timestamp it happened on the user device" but do not have a "timestamp the server recorded this event", which is what we use to query data by date).
• One or several dates, yes. Granularity is one day.
a
@andrey_tatarinov - thanks for this additional info. This helps a lot. I'm still not sure on one point: whether you would need the date used in pagination (server-side receipt date, I presume) also within the records - and if yes, whether that necessitates pulling just a single date at a time (chunk_days=1), or whether there's perhaps another way you're getting that date if not from the request context you're sending in `get_url_params()`.
Even from your comments above, I can call out a few hard things here that the SDK doesn't plan for or expect as of now:
1. The incremental sync method expects that `replication_key` exists on the records themselves. I am curious if this is throwing an error for you, or if you have perhaps worked around that successfully.
2. The `context` param dict could in theory have something like "server_capture_window_end_date" injected into it (which could then be added to records via `post_process()`), but modifying context is not a previously tested pattern and might have implications I'm not thinking of.
3. The combination of simultaneously tracking finalized and non-finalized state markers in the same stream is another pattern not tested within the SDK. It might work totally fine, but I cannot say for sure without diving deeper into the code.
None of this is to say it can't or wouldn't work, just wanted to make sure we're setting healthy expectations on what challenges might be presented with this API pattern.
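For what point 2 would look like in practice, a plain sketch (explicitly an untested pattern per the caveat above; `server_capture_window_end_date` is the hypothetical key from that message, and `fetch` is a placeholder for the actual request):

```python
def records_with_capture_date(fetch, context):
    """Sketch: inject the capture-window date into the context dict,
    then copy it onto each record, post_process-style."""
    context = dict(context or {})
    context["server_capture_window_end_date"] = "2023-02-24"
    for row in fetch(context):
        # Each record gets the date from context, not from the response.
        row["event_date"] = context["server_capture_window_end_date"]
        yield row
```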
v
https://github.com/AutoIDM/tap-indeed/pull/25 fixed some of the drawbacks this had before - no more using `context` to inject things, which breaks how `state` is handled today in some ways.
a
^ yeah, I thought about this approach as well - basically you're using the Stream as a container for several global vars 🙂 Didn't know if it was safe to do so, apparently it is 🙂
v
Yeah as long as we're not parallel I think it's alright 😅