# troubleshooting
a
Sorry, another one from me today. Using `target-postgres` I am getting an error like:
```
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint
Key (id, record_date)=(18013124639396310/insights/comments/lifetime, 2024-08-01) already exists
```
I've seen this before, and generally it's when I'm trying to upsert a batch and the batch ends up with two or more rows with the same primary key. But when I run this tap to `target-jsonl` I can only see one JSON value with the offending PK combination. Any reason why target-postgres might be receiving two identical rows, which aren't present if I run locally to `target-jsonl`?
đź‘€ 1
v
Silly question, but is the row already in your DB?
c
so there are a few ways this can happen in traditional extract/load scenarios. A common one: if your bookmark field is a timestamp, you have to start the next job at that exact timestamp, because other rows could have been added at that same timestamp (this is not a Meltano/Singer-specific situation, and I am assuming you are not using log-based replication here). As a result, you re-capture the tail end of the row(s) at that timestamp from the extract/load run that came before. The way I get around this in basically every EL system I design is to allow duplicate “primary keys” in my load tables (that is, no unique index at all on those key fields), then eliminate the duplicates later (a window SQL function using row_number, most of the time via dbt) for the final downstream products the analysts use.
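roughly, the downstream dedup step looks like this (a minimal sketch, not your model: the table name and the `_sdc_batched_at` metadata column that some Singer targets add are assumptions):
```python
# Minimal sketch of the "land duplicates, dedupe downstream" pattern.
# Assumes a landing table raw.media_insights with NO unique index; the
# newest copy of each (id, record_date) pair wins. In practice this
# SELECT would live in a dbt model rather than a script.
import psycopg2

DEDUPE_SQL = """
    SELECT id, record_date, value
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY id, record_date   -- the logical "primary key"
                   ORDER BY _sdc_batched_at DESC  -- newest row wins
               ) AS rn
        FROM raw.media_insights
    ) ranked
    WHERE rn = 1
"""

conn = psycopg2.connect("dbname=warehouse")  # assumed DSN
with conn, conn.cursor() as cur:
    cur.execute(DEDUPE_SQL)
    deduped_rows = cur.fetchall()
conn.close()
```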
if you have an autoincrementing integer primary key that you can use as a bookmark field instead, you never have to worry about overlap like you do with timestamps, and this duplication issue wouldn't occur
in meltano terms the bookmark field concept is called the `replication-key`, used when the `replication-method` is set to `INCREMENTAL`
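in an SDK-based tap, the same concept is the `replication_key` class attribute; setting it is what makes the stream's `replication_method` report `INCREMENTAL`. A bare-bones sketch (the stream name and schema are made up):
```python
# Bare-bones SDK stream using an autoincrementing integer id as the
# bookmark. Because id is strictly increasing, "id > bookmark" never
# re-reads a tail row the way an inclusive timestamp filter must.
from singer_sdk import typing as th
from singer_sdk.streams import Stream


class WidgetsStream(Stream):
    name = "widgets"                    # illustrative name
    primary_keys = ["id"]
    replication_key = "id"              # => replication_method == INCREMENTAL
    schema = th.PropertiesList(
        th.Property("id", th.IntegerType),
    ).to_dict()

    def get_records(self, context):
        # A real implementation would fetch rows with id > the stored bookmark.
        yield from []
```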
a
Thanks, will look into this a bit more. It's a basic REST tap, so no replication log. I don't think it's an issue with one record from my tap matching one already in the target; `target-postgres` seems to handle upserts fine. It's just a hunch, but I actually think it's my tap emitting two records with the same PK into the same batch that's causing the issue. But yes, the 'no primary key and dedupe everything' approach would be the ideal solution.
Of course, sod's law, I'll run my tap in prod now and everything will be fine.
And I can't clear my target schema for the tap in my DWH, as there is some historical data in there that I wouldn't be able to retrieve from the tap API again.
c
if it's two records with the same PK, then the target, if it's handling upserts, would just insert the first then update based on the second, no?
(I only handle upserts via dbt incremental models, I haven’t considered upserts in target-postgres myself)
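one thing worth knowing: within a single `INSERT ... ON CONFLICT` statement, Postgres refuses to touch the same row twice rather than doing insert-then-update. Your error is a plain `UniqueViolation`, so target-postgres may be hitting a different code path, but either way two same-key rows in one batch would be the trigger. A small demo against a scratch database (DSN assumed):
```python
# Demonstrates that Postgres will not insert-then-update duplicate keys
# arriving in one INSERT ... ON CONFLICT statement: it raises
# CardinalityViolation instead of applying the rows in order.
import psycopg2

conn = psycopg2.connect("dbname=scratch")  # assumed local scratch DB
cur = conn.cursor()
cur.execute("""
    CREATE TEMP TABLE t (
        id text,
        record_date date,
        value int,
        PRIMARY KEY (id, record_date)
    )
""")
try:
    cur.execute("""
        INSERT INTO t VALUES
            ('18013124639396310/insights/comments/lifetime', '2024-08-01', 1),
            ('18013124639396310/insights/comments/lifetime', '2024-08-01', 2)
        ON CONFLICT (id, record_date) DO UPDATE SET value = EXCLUDED.value
    """)
except psycopg2.errors.CardinalityViolation as exc:
    # "ON CONFLICT DO UPDATE command cannot affect row a second time"
    print(exc)
    conn.rollback()
conn.close()
```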
a
Yes that would be my expectation, but it's not how it's working out in reality.
c
can you temporarily just point the target at a fresh table with no unique key defined, and see whether that new table contains the offending row twice after execution, to validate your theory?
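once that run lands, something like this would confirm it (table name assumed):
```python
# Count how many copies of each (id, record_date) pair landed in the
# constraint-free staging table; any count > 1 validates the
# duplicate-batch theory. Table name is illustrative.
import psycopg2

conn = psycopg2.connect("dbname=warehouse")  # assumed DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT id, record_date, COUNT(*) AS copies
        FROM staging.media_insights_nodedup
        GROUP BY id, record_date
        HAVING COUNT(*) > 1
    """)
    for id_, record_date, copies in cur.fetchall():
        print(f"{id_} / {record_date}: {copies} copies")
conn.close()
```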
a
Good idea, I will try tomorrow
I can run with a local postgres instance instead and see what happens.
c
if you do this, beware that it will update your state if successful, so you'll need to append the rows to the destination yourself, or roll back the state (something I have never done)
a
I notice this is a child stream, `MediaInsights`, with `replication_key=None` in the stream definition, but the parent `Media` stream has `replication_key=timestamp`. I actually define a key for the `MediaInsights` stream in my `meltano.yml`, as I add a `record_date` field using an inline mapping.
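for reference, the shape is roughly this (a sketch only: the URL, paths, and schemas are placeholders, and the `record_date` key addition lives in `meltano.yml`, not in the tap):
```python
# Sketch of the parent/child stream shape described above. Media is
# incremental on "timestamp"; MediaInsights has no replication key of
# its own and runs per parent record, with record_date added later by
# an inline mapping in meltano.yml (not shown).
from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class MediaStream(RESTStream):
    name = "media"
    path = "/media"
    url_base = "https://api.example.com"  # placeholder
    primary_keys = ["id"]
    replication_key = "timestamp"
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("timestamp", th.DateTimeType),
    ).to_dict()

    def get_child_context(self, record, context):
        # Hand each media id to the child stream.
        return {"media_id": record["id"]}


class MediaInsightsStream(RESTStream):
    name = "media_insights"
    path = "/media/{media_id}/insights"   # media_id comes from the parent context
    url_base = "https://api.example.com"  # placeholder
    parent_stream_type = MediaStream
    primary_keys = ["id"]                 # record_date is appended via the mapping
    replication_key = None                # full refresh per parent record
    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("value", th.IntegerType),
    ).to_dict()
```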
But thanks for the tips, will investigate tomorrow
I backed up the table to another db, then truncated the table, and the tap seems to be running fine now. I have absolutely no idea what the issue was 🤷‍♂️