michael_cooper
05/20/2021, 10:11 PM
`replication_key`, and this is partially in regard to the Singer spec in general as well. Is a stream's `replication_key` a key within an individual record that is used as a means of bookmarking, or is it a key you use to bookmark where a stream left off?
For example, we have a hypothetical endpoint `GET /api/orders?created_after='2020-01-13'` with a response of
{
  "orders": [
    {
      "order_id": 1,
      "customer_id": 2
    },
    {
      "order_id": 2,
      "customer_id": 55
    }
  ]
}
Now we want to track where we left off since the last sync by tracking the last time we successfully queried, so we do not have to do a full sync of all orders every time. Since there is no key within an individual record, does that mean there is technically no `replication_key` for this stream? Or is `replication_key` arbitrary and just a way to bookmark where this individual stream left off?
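To make the scenario concrete, here is a minimal plain-Python sketch of bookmarking by "last time we successfully queried" when the records themselves carry no timestamp. It does not use the SDK. `fetch_orders`, `sync_orders`, and the `orders_last_synced_at` state key are hypothetical names invented for this illustration, and the fetch function only returns the example payload above.

```python
from datetime import datetime, timezone


def fetch_orders(created_after):
    """Hypothetical stand-in for GET /api/orders?created_after=...

    Returns records shaped like the example response (no timestamp fields).
    """
    return [
        {"order_id": 1, "customer_id": 2},
        {"order_id": 2, "customer_id": 55},
    ]


def sync_orders(state, fetch=fetch_orders, now=None):
    """Run one sync and return (records, new_state).

    The bookmark is the time the query started, not a value read from a
    record, because the records have no timestamp of their own.
    """
    started_at = (now or datetime.now(timezone.utc)).isoformat()
    created_after = state.get("orders_last_synced_at")  # None on the first run
    records = list(fetch(created_after))
    # The bookmark only advances if the fetch above finished without raising.
    new_state = {**state, "orders_last_synced_at": started_at}
    return records, new_state
```

The bookmark is taken from the start of the query rather than the end, so records created while the request is in flight are picked up again on the next run instead of being skipped.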
For the SDK section, does the SDK utilize a stream's `replication_key` in any functional way, or does it only use it for metadata?
aaronsteers
05/20/2021, 10:19 PM
`replication_key` can optionally be set to the name of any property in the stream’s records which can be used as a bookmark to resume and get just the incrementally new/updated records. And yes, the SDK will try to use that smartly, for instance by enabling ‘INCREMENTAL’ replication automatically and tracking state bookmarks internally when it detects that property is set.
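As a rough illustration of the idea (not the SDK's actual implementation), the plain-Python sketch below treats one record property as the replication key, skips records at or before the previous bookmark, and reports the highest value seen as the new bookmark. The function name `incremental_sync` and the `updated_at` field are assumptions made for this example.

```python
def incremental_sync(records, replication_key, bookmark=None):
    """Return (new_records, new_bookmark) for one sync run.

    Records whose replication key value is at or before the previous
    bookmark are treated as already synced and skipped.
    """
    emitted = []
    new_bookmark = bookmark
    for record in records:
        value = record[replication_key]
        if bookmark is not None and value <= bookmark:
            continue  # already synced in an earlier run
        emitted.append(record)
        if new_bookmark is None or value > new_bookmark:
            new_bookmark = value
    return emitted, new_bookmark


# Example: ISO date strings compare correctly as plain strings.
rows = [
    {"id": 1, "updated_at": "2021-05-01"},
    {"id": 2, "updated_at": "2021-05-10"},
]
first_run, bookmark = incremental_sync(rows, "updated_at")
```

This only works when every record actually contains the chosen property, which is the crux of the question in this thread.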
Depending on your implementation, you would also likely use that prior bookmarked value when you request records from the source, so only new or updated records have to be read.
aaronsteers
05/20/2021, 10:20 PM
michael_cooper
05/20/2021, 10:21 PM
`replication_key`?
aaronsteers
05/20/2021, 10:23 PM
michael_cooper
05/20/2021, 10:27 PM
`created_at` field within an individual record.
If I don't have a bookmark somewhere to track `created_after`, then I will effectively be doing a full table sync every time, which is not ideal.
aaronsteers
05/20/2021, 10:31 PM
`get_records()` or in the `post_process()`.
Would these work for your implementation? If not, can you tell me a little more about the API? Is this REST?
michael_cooper
05/20/2021, 10:34 PM
michael_cooper
05/20/2021, 10:40 PM
`replication_key` are technically `FULL_TABLE` syncs, despite the fact that the stream does not sync all records every sync, even if the API supports querying only certain records.
aaronsteers
05/20/2021, 10:48 PM
aaronsteers
05/20/2021, 10:49 PM
aaronsteers
05/20/2021, 11:12 PM
`created`, `responsed`, and `lastemailed`. From the docs, it looks like you have `responded` as a timestamp (epoch format) and you would just use that as the `since_time` input and also for your replication key. (Let me know if I’m reading the docs wrong.)
aaronsteers
05/20/2021, 11:13 PM
aaronsteers
05/20/2021, 11:17 PM
`get_context_state(context)`, which prior to 0.2.0 was called `get_stream_or_partition_state(partition)`. This gives you a readable and writeable state dictionary which you can use to store anything you want. 🙂
https://gitlab.com/meltano/singer-sdk/-/blob/main/singer_sdk/streams/core.py#L386
aaronsteers
05/20/2021, 11:19 PM
michael_cooper
05/25/2021, 3:50 PM
pnadolny
05/25/2021, 6:47 PM
`since` parameter that takes a date, but the payload I receive back doesn't contain any dates. I think @aaronsteers' recommendation might work for me: if I add an `extracted_ts` of the current time to each record using the `post_process` method, then I can use that as a replication key. My concern is that a failed stream could bookmark my current time before the entire stream is done, causing missing data the next time it runs. Is this a valid concern? Or is there a mechanism to prevent this?
pnadolny
05/25/2021, 6:50 PM
`get_url_params` you can do a day diff between the last bookmark and the current time; that would then be your `days` parameter.
michael_cooper
05/25/2021, 7:02 PM
`get_url_params` work in this case? This API doesn't use query parameters but instead encodes the "params" as a URL path.
pnadolny
05/25/2021, 7:16 PM
`get_url` method that gets called before each request is made, so maybe you could do it there.
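For illustration only, the two ideas from the last few messages (a day diff from the last bookmark, and parameters encoded in the URL path rather than the query string) could be sketched in plain Python as below. `days_since`, `build_url`, the `api.example.com` host, and the path template are all hypothetical. This is not the SDK's `get_url` or `get_url_params`; it only shows the kind of logic that might go inside such a hook.

```python
import math
from datetime import datetime, timezone
from urllib.parse import quote


def days_since(bookmark_iso, now=None):
    """Days between a timezone-aware ISO bookmark and now, rounded up.

    Rounding up (with a minimum of 1) makes consecutive windows overlap
    slightly instead of leaving a gap between syncs.
    """
    now = now or datetime.now(timezone.utc)
    last = datetime.fromisoformat(bookmark_iso)
    elapsed_days = (now - last).total_seconds() / 86400
    return max(1, math.ceil(elapsed_days))


def build_url(base_url, path_template, **path_params):
    """Fill bookmark-derived values into a path-style URL."""
    encoded = {k: quote(str(v), safe="") for k, v in path_params.items()}
    return base_url.rstrip("/") + path_template.format(**encoded)


# Hypothetical usage: an epoch bookmark encoded as a path segment.
url = build_url(
    "https://api.example.com",
    "/responses/since/{since_time}",
    since_time=1621548000,
)
```

An overlapping window means some records may be read twice, so this approach assumes the target can de-duplicate on a primary key.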