Can we configure meltano to save replication state...
# troubleshooting
s
Can we configure meltano to save replication states during a load? It’s kind of frustrating to encounter an error during a long running initial load or full refresh and have to start over
e
Hi @steven_wang. It's normally the responsibility of the loader to emit partial states, and Meltano will automatically save them as they're generated. What target are you using that's causing state to be lost after a failure?
s
We’re using the snowflake target. I see state get emitted in the logs but seems like it’s not getting written to the state db
e
What command are you using to run your pipeline? And can you confirm you're seeing messages like `Incremental state has been updated at ...`?
s
We’re using meltano run
Seeing stuff like this in the db/logs: `{"singer_state": {"bookmarks": {"campaigns": {"replication_key": "updated_at", "replication_key_value": "2023-11-17T21:30:47.139973+00:00"}, "events": {"replication_key_signpost": "2023-11-17T22:37:16.683686+00:00", "starting_replication_value": null, "progress_markers": {"Note": "Progress is not resumable if interrupted.", "replication_key": "datetime", "replication_key_value": "2023-11-17 22:37:42+00:00"}}}}}` 2023-11-18 00:19:45.954301
e
I see. So yeah, if the data comes from the source unsorted, it isn't safe to resume from the bookmark after an unexpected interruption, because the bookmark could be more recent than records that haven't been synced yet. The Singer SDK computes a signpost value for this, but I don't think it's well exposed to tap developers. What tap is this?
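To make the unsorted case concrete, here's a toy simulation in plain Python. It's not Meltano or SDK code: the records and the `sync` helper are made up for illustration, and it assumes the resume step naively skips everything at or before the highest timestamp seen before the crash.

```python
from datetime import datetime

# Hypothetical records arriving unsorted by "updated_at".
records = [
    {"id": 1, "updated_at": datetime(2023, 11, 17, 10)},
    {"id": 2, "updated_at": datetime(2023, 11, 17, 22)},  # newest record arrives early
    {"id": 3, "updated_at": datetime(2023, 11, 17, 12)},
    {"id": 4, "updated_at": datetime(2023, 11, 17, 15)},
]


def sync(records, start=None, fail_after=None):
    """Return (loaded ids, highest bookmark seen); stop after `fail_after` records."""
    loaded, bookmark = [], start
    for i, rec in enumerate(records):
        if fail_after is not None and i >= fail_after:
            break  # simulated crash mid-run
        if start is not None and rec["updated_at"] <= start:
            continue  # skipped as "already synced"
        loaded.append(rec["id"])
        if bookmark is None or rec["updated_at"] > bookmark:
            bookmark = rec["updated_at"]
    return loaded, bookmark


# First run crashes after two records; the second run resumes from the bookmark.
first, bookmark = sync(records, fail_after=2)
second, _ = sync(records, start=bookmark)

missed = {r["id"] for r in records} - set(first) - set(second)
print("never loaded:", sorted(missed))
```

Records 3 and 4 are older than the bookmark left by record 2, so the resumed run skips them even though they were never loaded.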
s
the sorting makes sense, I'm not sure if shopify's graphql api returns data sorted, so I need to test that out more. How would I configure meltano/singer to assume the data is sorted?
e
How would I configure meltano/singer to assume the data is sorted?
It's hardcoded in the tap implementation. The tap developer has to explicitly set the `is_sorted` property of the stream class to `True`: https://sdk.meltano.com/en/latest/classes/singer_sdk.Stream.html#singer_sdk.Stream.is_sorted. The tap developer can decide on a case-by-case basis whether the stream's records are sorted by the replication key, or determine that dynamically. For example, there was a recent change in the Singer SDK that benefits all derived SQL taps: https://github.com/meltano/sdk/pull/1951/files#diff-61b20de8a12c7c91dfc4de62be5a230d2b4b1b2dd688628705554dc63a9490b6
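Roughly, here's what that flag changes in the saved state. This is a simplified toy model in plain Python, not the SDK's actual code: the function names are hypothetical, and the state shape is only modeled on the `progress_markers` entry in the log above.

```python
NOTE = "Progress is not resumable if interrupted."


def increment_state(state, record, replication_key, is_sorted):
    """Record progress after one record (toy model, not the SDK implementation)."""
    value = record[replication_key]
    if is_sorted:
        # Sorted stream: every record seen so far is at or before this value,
        # so it is safe to store it as the real, resumable bookmark.
        state["replication_key"] = replication_key
        state["replication_key_value"] = value
    else:
        # Unsorted stream: only track the max seen so far as a progress marker.
        markers = state.setdefault("progress_markers", {"Note": NOTE})
        markers["replication_key"] = replication_key
        prev = markers.get("replication_key_value")
        if prev is None or value > prev:
            markers["replication_key_value"] = value


def finalize_state(state):
    """After a successful full run, promote progress markers to a real bookmark."""
    markers = state.pop("progress_markers", None)
    if markers:
        state["replication_key"] = markers["replication_key"]
        state["replication_key_value"] = markers["replication_key_value"]
    return state


def resume_value(state):
    """Value the next run would start from; None means start over."""
    return state.get("replication_key_value")


rows = [{"updated_at": "2023-11-17T10:00:00"}, {"updated_at": "2023-11-17T12:00:00"}]
sorted_state, unsorted_state = {}, {}
for row in rows:
    increment_state(sorted_state, row, "updated_at", is_sorted=True)
    increment_state(unsorted_state, row, "updated_at", is_sorted=False)

# If the run is interrupted here, only the sorted stream can resume.
after_crash_sorted = resume_value(sorted_state)
after_crash_unsorted = resume_value(unsorted_state)

# If the run finishes instead, the unsorted stream gets its bookmark too.
after_finish_unsorted = resume_value(finalize_state(unsorted_state))
```

In this model, an interrupted run of the unsorted stream leaves only `progress_markers` behind, which matches the `events` bookmark in the log above and explains why the next run starts over.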
So do open an issue with the tap's maintainer if you know which streams should be treated as sorted, or just ask, since they may know too. In particular, that tap doesn't seem to set `sortKey` after it determines the replication key: https://github.com/sehnem/tap-shopify/blob/4254766699fc53f5522ce28d6028e50918f380ee/tap_shopify/tap.py#L172