# troubleshooting
m
I have a question about the `INCREMENTAL` select in tap-postgres (meltanolabs). Am I right that it takes all the records that are `>=` the last value from the state? If that's correct, may I ask if that was made on purpose? I would assume to see just `>` there. Thank you!
h
This is on purpose, yes. Many source systems don't have perfectly increasing high-water marks, so it was thought of as a pragmatic precaution. Since most target systems do some form of deduplication (typically a merge into), it is usually not a problem. But it is definitely a trade-off.
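(For illustration, here is a toy sketch of the kind of key-based dedup that a merge-style load effectively performs, which is why the re-sent boundary row is usually harmless. The function, table contents, and field names are invented for the example, not target-snowflake's actual logic.)

```python
# Toy model of "merge into": later copies of the same primary key overwrite earlier ones.
def merge_by_primary_key(existing, incoming, key="id"):
    merged = {row[key]: row for row in existing}
    for row in incoming:
        merged[row[key]] = row  # duplicate keys collapse instead of appending
    return list(merged.values())

existing = [{"id": 1, "updated": "2023-05-06", "status": "old"}]
incoming = [
    {"id": 1, "updated": "2023-05-06", "status": "old"},  # boundary row re-sent because of ">="
    {"id": 2, "updated": "2023-05-07", "status": "new"},
]

print(merge_by_primary_key(existing, incoming))
# The re-sent id=1 row collapses into the existing one; only id=2 is genuinely new.
```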
m
@Henning Holgersen thank you for the answer! In my case I want to keep changes in a history SCD table, and I just append new data using the target-snowflake loader. Am I right that if I want to keep it this way I should fork the repository and maintain it further? Just want to confirm there is no easier way to do it, right?
h
It might be that someone else has a better answer, but for your specific use case my best suggestion is exactly that: fork the repo, change the one condition, and call it a day.
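(To make "the one condition" concrete: the incremental filter in an SDK-based SQL tap is roughly of the shape below. This is a simplified SQLAlchemy sketch with a hypothetical table, assuming SQLAlchemy 1.4+, not the literal tap-postgres source.)

```python
import sqlalchemy as sa

metadata = sa.MetaData()
# Hypothetical table standing in for whatever stream is being replicated.
orders = sa.Table(
    "orders",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("updated_at", sa.TIMESTAMP),
)

def build_incremental_query(bookmark):
    """Build the SELECT used for an INCREMENTAL sync from a stored bookmark."""
    query = sa.select(orders).order_by(orders.c.updated_at)
    if bookmark is not None:
        # Stock behaviour: ">=" deliberately re-reads the boundary record.
        # A fork that prefers strict ">" would change only this comparison.
        query = query.where(orders.c.updated_at >= bookmark)
    return query

print(build_incremental_query("2023-05-06 00:00:00"))
```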
p
@mykola_zavada check out these docs: https://sdk.meltano.com/en/latest/implementation/at_least_once.html. This is a convention that Singer uses; if there are ever ties, you would miss them if you used only `>`. For example, if your replication key is a date, a sync in the morning would replicate data and store the latest date `2023-05-06` in the state file. If you then get additional records later in the day with that same date `2023-05-06`, tomorrow's run of `> 2023-05-06` would miss those records added after the sync on `2023-05-06`.
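(A minimal Python illustration of that tie scenario; the dates and IDs are invented for the example.)

```python
from datetime import date

# Rows that existed at the morning sync, plus one that arrived later the same day.
morning = [{"id": 1, "updated": date(2023, 5, 6)}]
late_arrivals = [{"id": 2, "updated": date(2023, 5, 6)}]
all_records = morning + late_arrivals

bookmark = date(2023, 5, 6)  # state written by the morning sync

strictly_greater = [r for r in all_records if r["updated"] > bookmark]
greater_or_equal = [r for r in all_records if r["updated"] >= bookmark]

print(strictly_greater)   # [] -> the late arrival (id=2) is silently missed
print(greater_or_equal)   # both rows -> id=1 is a duplicate, but id=2 is recovered
```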
m
@pat_nadolny thank you for sharing! As I understand it, you were looking at the same issue? How did you resolve it eventually? I'm thinking about forking the postgres tap just to replace `>=` with `>`, since I have a timestamp column as the replication key. I'm OK with taking the risk of possibly losing some data.
u
@mykola_zavada for my use case I'd still send duplicates if I ran my pipeline twice in the same day, which I'm not doing right now. So I've avoided the problem rather than resolved it, really. Definitely add your thoughts to those issues so we can hopefully get closer to a generalized solution, but yeah, if you need that change now it would be easiest to use your own fork.