quinn_batten
02/14/2023, 5:07 AM
stream_maps and stream_map_config.
Based on my read of that part of the docs, I’d expect that, with a tap that has those config options, I should be able to:
• make a new column col_a via stream_maps,
• override the incremental key of that stream to be col_a, by setting "__key_properties__": ['col_a'], and then
• load that table incrementally (on col_a, since it's overridden to be the incremental key).
Is that correct? Or am I misunderstanding the SDK's features?
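
(For concreteness, a minimal sketch of the config quinn is describing, written as the Python/JSON config dict an SDK-based tap would accept. The stream name, source columns, and the col_a expression are hypothetical, and whether __key_properties__ affects incremental behavior is exactly the open question in this thread.)

```python
# Hypothetical SDK tap config for the approach described above.
config = {
    "stream_maps": {
        "my_stream": {  # hypothetical stream name
            # New derived column, built per record by the SDK's mapper
            # (hypothetical expression over existing fields):
            "col_a": "updated_at or created_at",
            # Override the stream's key properties to the derived column.
            # Per the docs linked below, this overrides the *primary key*;
            # whether it can drive incremental replication is the question.
            "__key_properties__": ["col_a"],
        }
    }
}
```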

alexander_butler
02/14/2023, 4:10 PM

quinn_batten
02/14/2023, 4:11 PM

aaronsteers
02/14/2023, 4:59 PM
replication_key was overridable in stream maps. I checked the docs and I don't see any mention of the capability.
https://sdk.meltano.com/en/latest/stream_maps.html#unset-or-modify-the-stream-s-primary-key-behavior
As I think more on this, there may be an order-of-operations challenge here: stream maps are applied after records are generated, so they are not high enough in the stack to override how the tap actually deals with incremental bookmarks.
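
(A rough sketch of that order-of-operations concern. This is illustrative pseudo-flow, not the SDK's actual internals, and the method names are made up; the point is that the bookmark is advanced from the raw record before any stream map runs.)

```python
# Illustrative pseudo-flow: why a mapped column can't drive the bookmark.
def sync_records(stream, mapper):
    for raw_record in stream.get_records():
        # 1. The tap advances its incremental bookmark from the RAW record,
        #    using the stream's declared replication_key...
        stream.increment_state(raw_record[stream.replication_key])
        # 2. ...and only then applies stream maps to shape the OUTPUT record.
        #    A column created here (e.g. col_a) never feeds back into step 1.
        yield mapper.transform(raw_record)
```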

quinn_batten
02/14/2023, 5:02 PM

quinn_batten
02/14/2023, 5:06 PM

aaronsteers
02/14/2023, 8:58 PM

alexander_butler
02/14/2023, 9:00 PM

alexander_butler
02/14/2023, 9:04 PM
> But I guess we still can't load on any special logic whatsoever, even something as simple as a concat of two columns.
It helps if you are specific with the tap you have in mind. I cannot imagine a concat being used for replication vs. a timestamp-like field. But the SDK handles non-timestamp incremental stuff too, if for example a job id, page number, offset, or string id drives how to get new data from the source. But let's take a step back... It almost feels like what you are talking about is deduplication, i.e. "I pull data incrementally, but I want to make sure a hash of two fields is unique in the target." Which is a significantly different problem, one that most of us solve with dbt, but some targets can use key_properties for merges.

quinn_batten
02/14/2023, 9:05 PM

alexander_butler
02/14/2023, 9:30 PM
[The SDK gets the] Column object by key from the sqlalchemy table. So it would need to be hacked at to instead use sqlalchemy.func.coalesce or something with multiple columns. I am not sure if passing a list into the catalog replication key would fly though.
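
(To make the "hacked at" idea concrete: a sketch of swapping the single-column lookup for a COALESCE expression when building the incremental query. This is not the SDK's actual code; the function and column names are illustrative.)

```python
import sqlalchemy as sa

# Today (roughly): replication_key_col = table.columns[self.replication_key]
# Hypothetical hack: use a multi-column expression as the effective key.
def build_incremental_query(table: sa.Table, start_value):
    # COALESCE(updated_at, created_at) as the effective replication key
    # (column names are hypothetical):
    effective_key = sa.func.coalesce(table.c.updated_at, table.c.created_at)
    return (
        sa.select(table)
        .where(effective_key >= start_value)
        .order_by(effective_key)
    )
```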

quinn_batten
02/14/2023, 9:34 PM

aaronsteers
02/14/2023, 10:33 PM
[Overriding the incremental key (aka] replication_key) is not supported in the SDK, and probably not in other implementations either. 🙁
@quinn_batten - While I was thinking of really clever ideas, I had to pause to make sure I first ask the boring one:
Since you mentioned in the issue that one of your use cases is Postgres, do you have the ability to use Postgres log-based replication instead of column-based replication? In that case, you wouldn't need to deal with the functional gap where neither updated-at nor created-at gives you the desired increment.
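
(For the "boring" option, the usual Singer convention is to set the stream's replication-method metadata to LOG_BASED in the catalog; whether it is honored depends on the specific tap, e.g. Postgres variants that read the WAL. A sketch of the relevant metadata entry:)

```python
# Singer catalog metadata sketch: request log-based replication for a stream.
# Support for LOG_BASED depends on the tap implementation.
stream_metadata = {
    "breadcrumb": [],  # top-level (stream-wide) metadata entry
    "metadata": {
        "selected": True,
        "replication-method": "LOG_BASED",  # vs. INCREMENTAL / FULL_TABLE
    },
}
```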

quinn_batten
02/14/2023, 10:38 PM

aaronsteers
02/14/2023, 10:40 PM
Any updated_at column you pick for a replication key will sometimes have a case where an admin or someone changed/fixed rows on the backend without incrementing the timestamp column. And log-based has the advantage of working even when there is no updated_at-like column on the table.

quinn_batten
02/14/2023, 10:55 PM

aaronsteers
02/14/2023, 10:56 PM