# troubleshooting
a
Sorry, another one from me today. Using `target-postgres` I am getting an error like:
```
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint
Key (id, record_date)=(18013124639396310/insights/comments/lifetime, 2024-08-01) already exists
```
I've seen this before, and generally it's when I'm trying to upsert a batch and the batch ends up with two or more rows with the same primary key. But when I run this tap to `target-jsonl` I can only see one JSON value with the offending PK combination. Any reason why target-postgres might be receiving two identical rows, which aren't present if I run locally to `target-jsonl`?
đź‘€ 1
v
Silly question, but is the row already in your DB?
c
so there are a few ways this can happen in traditional extract/load scenarios. A common one: if your bookmark field is a timestamp, you have to start the next job at that exact timestamp, because other rows could have been added at that same timestamp (this is not a Meltano/Singer-specific situation, and I am assuming you are not using log-based replication here). As a result, you re-capture the tail end of the row(s) at that timestamp from the extract/load run that came before. The way I get around this in basically every EL system I design is to allow duplicate “primary keys” in my load tables (that is, no unique index at all on those key fields), then eliminate the duplicates later (a window SQL function using row_number, most of the time via dbt) for the final downstream products the analysts use.
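roughly, the downstream dedup step looks like this (a minimal sketch, not your model: the table name and the `_sdc_batched_at` metadata column that some Singer targets add are assumptions):
```python
# Minimal sketch of the "land duplicates, dedupe downstream" pattern.
# Assumes a landing table raw.media_insights with NO unique index; the
# newest copy of each (id, record_date) pair wins. In practice this
# SELECT would live in a dbt model rather than a script.
import psycopg2

DEDUPE_SQL = """
    SELECT id, record_date, value
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY id, record_date   -- the logical "primary key"
                   ORDER BY _sdc_batched_at DESC  -- newest row wins
               ) AS rn
        FROM raw.media_insights
    ) ranked
    WHERE rn = 1
"""

conn = psycopg2.connect("dbname=warehouse")  # assumed DSN
with conn, conn.cursor() as cur:
    cur.execute(DEDUPE_SQL)
    deduped_rows = cur.fetchall()
conn.close()
```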
if you have an autoincrementing integer primary key that you can use as a bookmark field instead, you never have to worry about overlap like you do with timestamps, and this duplication issue wouldn't occur
in meltano terms the bookmark field concept is called the `replication-key`, used when the `replication-method` is set to `INCREMENTAL`
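in an SDK-based tap, the same concept is the `replication_key` class attribute; setting it is what makes the stream's `replication_method` report `INCREMENTAL`. A bare-bones sketch (the stream name and schema are made up):
```python
# Bare-bones SDK stream using an autoincrementing integer id as the
# bookmark. Because id is strictly increasing, "id > bookmark" never
# re-reads a tail row the way an inclusive timestamp filter must.
from singer_sdk import typing as th
from singer_sdk.streams import Stream


class WidgetsStream(Stream):
    name = "widgets"                    # illustrative name
    primary_keys = ["id"]
    replication_key = "id"              # => replication_method == INCREMENTAL
    schema = th.PropertiesList(
        th.Property("id", th.IntegerType),
    ).to_dict()

    def get_records(self, context):
        # A real implementation would fetch rows with id > the stored bookmark.
        yield from []
```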
a
Thanks, will look into this a bit more. It's a basic REST tap, so no replication log. I don't think it's an issue with one record from my tap matching one already in the target; `target-postgres` seems to handle upserts fine. It's just a hunch, but I actually think it's my tap emitting two records with the same PK into the same batch that's causing the issue. But yes, the 'no primary key and dedupe everything' approach would be the ideal solution.
Of course, sod's law, I'll run my tap in prod now and everything will be fine.
And I can't clear my target schema for the tap in my DWH, as there is some historical data in there that I wouldn't be able to retrieve from the tap API again.
c
if it's two records with the same PK, then the target, if it's handling upserts, would just insert the first then update based on the second, no?
(I only handle upserts via dbt incremental models, I haven’t considered upserts in target-postgres myself)
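one thing worth knowing: within a single `INSERT ... ON CONFLICT` statement, Postgres refuses to touch the same row twice rather than doing insert-then-update. Your error is a plain `UniqueViolation`, so target-postgres may be hitting a different code path, but either way two same-key rows in one batch would be the trigger. A small demo against a scratch database (DSN assumed):
```python
# Demonstrates that Postgres will not insert-then-update duplicate keys
# arriving in one INSERT ... ON CONFLICT statement: it raises
# CardinalityViolation instead of applying the rows in order.
import psycopg2

conn = psycopg2.connect("dbname=scratch")  # assumed local scratch DB
cur = conn.cursor()
cur.execute("""
    CREATE TEMP TABLE t (
        id text,
        record_date date,
        value int,
        PRIMARY KEY (id, record_date)
    )
""")
try:
    cur.execute("""
        INSERT INTO t VALUES
            ('18013124639396310/insights/comments/lifetime', '2024-08-01', 1),
            ('18013124639396310/insights/comments/lifetime', '2024-08-01', 2)
        ON CONFLICT (id, record_date) DO UPDATE SET value = EXCLUDED.value
    """)
except psycopg2.errors.CardinalityViolation as exc:
    # "ON CONFLICT DO UPDATE command cannot affect row a second time"
    print(exc)
    conn.rollback()
conn.close()
```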
a
Yes that would be my expectation, but it's not how it's working out in reality.
c
can you temporarily just point the target at a fresh table with no unique key defined, and see whether that new table contains the offending row twice after execution, to validate your theory?
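once that run lands, something like this would confirm it (table name assumed):
```python
# Count how many copies of each (id, record_date) pair landed in the
# constraint-free staging table; any count > 1 validates the
# duplicate-batch theory. Table name is illustrative.
import psycopg2

conn = psycopg2.connect("dbname=warehouse")  # assumed DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT id, record_date, COUNT(*) AS copies
        FROM staging.media_insights_nodedup
        GROUP BY id, record_date
        HAVING COUNT(*) > 1
    """)
    for id_, record_date, copies in cur.fetchall():
        print(f"{id_} / {record_date}: {copies} copies")
conn.close()
```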
a
Good idea, I will try tomorrow
I can run with a local postgres instance instead and see what happens.
c
if you do this, beware that it will update your state if successful, so you'll need to append the rows to the destination yourself, or roll back the state (something I have never done)
a
I notice this is a child stream, `MediaInsights`, with `replication_key=None` in the stream definition, but the parent `Media` stream has `replication_key=timestamp`. I actually define a key for the `MediaInsights` stream in my `meltano.yml`, as I add a `record_date` field using an inline mapping.
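for reference, the shape is roughly this (a sketch only: the URL, paths, and schemas are placeholders, and the `record_date` key addition lives in `meltano.yml`, not in the tap):
```python
# Sketch of the parent/child stream shape described above. Media is
# incremental on "timestamp"; MediaInsights has no replication key of
# its own and runs per parent record, with record_date added later by
# an inline mapping in meltano.yml (not shown).
from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class MediaStream(RESTStream):
    name = "media"
    path = "/media"
    url_base = "https://api.example.com"  # placeholder
    primary_keys = ["id"]
    replication_key = "timestamp"
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("timestamp", th.DateTimeType),
    ).to_dict()

    def get_child_context(self, record, context):
        # Hand each media id to the child stream.
        return {"media_id": record["id"]}


class MediaInsightsStream(RESTStream):
    name = "media_insights"
    path = "/media/{media_id}/insights"   # media_id comes from the parent context
    url_base = "https://api.example.com"  # placeholder
    parent_stream_type = MediaStream
    primary_keys = ["id"]                 # record_date is appended via the mapping
    replication_key = None                # full refresh per parent record
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("value", th.IntegerType),
    ).to_dict()
```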
But thanks for the tips, will investigate tomorrow
I backed up the table to another db, then truncated the table, and the tap seems to be running fine now. I have absolutely no idea what the issue was 🤷‍♂️