# troubleshooting
m
I have a question about the `INCREMENTAL` select in tap-postgres (meltanolabs). Am I right that it takes all the records that are `>=` the last value from the state? If that's correct, may I ask if that was made on purpose? I would assume to see just `>` there. Thank you!
h
This is on purpose, yes. Many source systems don't have perfectly increasing high-water marks, so it was thought of as a pragmatic precaution. Since most target systems do some form of deduplication (typically a merge into), it is usually not a problem. But it is definitely a trade-off.
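(For illustration, here is a toy sketch of the kind of key-based dedup that a merge-style load effectively performs, which is why the re-sent boundary row is usually harmless. The function, table contents, and field names are invented for the example, not target-snowflake's actual logic.)

```python
# Toy model of "merge into": later copies of the same primary key overwrite earlier ones.
def merge_by_primary_key(existing, incoming, key="id"):
    merged = {row[key]: row for row in existing}
    for row in incoming:
        merged[row[key]] = row  # duplicate keys collapse instead of appending
    return list(merged.values())

existing = [{"id": 1, "updated": "2023-05-06", "status": "old"}]
incoming = [
    {"id": 1, "updated": "2023-05-06", "status": "old"},  # boundary row re-sent because of ">="
    {"id": 2, "updated": "2023-05-07", "status": "new"},
]

print(merge_by_primary_key(existing, incoming))
# The re-sent id=1 row collapses into the existing one; only id=2 is genuinely new.
```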
m
@Henning Holgersen thank you for the answer! In my case I want to keep changes in a history SCD table, and I just append new data using the target-snowflake loader. Am I right that if I want to keep it this way I should fork the repository and maintain it further? Just want to confirm there is no easier way to do it, right?
h
It might be that someone else has a better answer, but for your specific use case my best suggestion is exactly that: fork the repo, change the one condition, and call it a day.
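(To make "the one condition" concrete: the incremental filter in an SDK-based SQL tap is roughly of the shape below. This is a simplified SQLAlchemy sketch with a hypothetical table, assuming SQLAlchemy 1.4+, not the literal tap-postgres source.)

```python
import sqlalchemy as sa

metadata = sa.MetaData()
# Hypothetical table standing in for whatever stream is being replicated.
orders = sa.Table(
    "orders",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("updated_at", sa.TIMESTAMP),
)

def build_incremental_query(bookmark):
    """Build the SELECT used for an INCREMENTAL sync from a stored bookmark."""
    query = sa.select(orders).order_by(orders.c.updated_at)
    if bookmark is not None:
        # Stock behaviour: ">=" deliberately re-reads the boundary record.
        # A fork that prefers strict ">" would change only this comparison.
        query = query.where(orders.c.updated_at >= bookmark)
    return query

print(build_incremental_query("2023-05-06 00:00:00"))
```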
p
@mykola_zavada check out these docs: https://sdk.meltano.com/en/latest/implementation/at_least_once.html. This is a convention that Singer uses; if there are ever ties, you would miss them if you used only `>`. For example, if your replication key is a date, a sync in the morning would replicate data and store the latest date `2023-05-06` in the state file. If you then get additional records later in the day with that same date `2023-05-06`, tomorrow's run of `> 2023-05-06` would miss those records added after the sync on `2023-05-06`.
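(A minimal Python illustration of that tie scenario; the dates and IDs are invented for the example.)

```python
from datetime import date

# Rows that existed at the morning sync, plus one that arrived later the same day.
morning = [{"id": 1, "updated": date(2023, 5, 6)}]
late_arrivals = [{"id": 2, "updated": date(2023, 5, 6)}]
all_records = morning + late_arrivals

bookmark = date(2023, 5, 6)  # state written by the morning sync

strictly_greater = [r for r in all_records if r["updated"] > bookmark]
greater_or_equal = [r for r in all_records if r["updated"] >= bookmark]

print(strictly_greater)   # [] -> the late arrival (id=2) is silently missed
print(greater_or_equal)   # both rows -> id=1 is a duplicate, but id=2 is recovered
```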
m
@pat_nadolny thank you for sharing! As I understand it, you were looking at the same issue? How did you resolve it eventually? I'm thinking about forking the postgres tap just to replace `>=` with `>`, since I have a timestamp column as the replication key. I'm OK with taking the risk of possibly losing some data.
u
@mykola_zavada for my use case I'd still send duplicates if I ran my pipeline twice in the same day, which I'm not doing right now. So I've avoided the problem rather than resolved it, really. Definitely add your thoughts to those issues so we can hopefully get closer to a generalized solution, but yeah, if you need that change now it would be easiest to use your own fork.