# best-practices
j
anyone have a good pattern for using meltano to handle not only soft deletes, but soft updates? basically, tracking the previous state of modified records (within reason, obviously)
a
Do you have an `updated_at` field? Can you add that to your list of primary keys, assuming you are using upsert logic in your target?
`key-properties: ['id', 'updated_at']`?
👌 1
j
that is a simple and great idea, thank you
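for anyone finding this later, a sketch of what that override could look like in `meltano.yml` — the extractor name `tap-example` and stream id `users` are placeholders here, and the exact metadata key your tap honors may vary (`table-key-properties` is the standard Singer catalog field):

```yaml
plugins:
  extractors:
    - name: tap-example        # placeholder; use your actual tap
      metadata:
        users:                 # placeholder stream id from your tap's catalog
          replication-key: updated_at
          table-key-properties: [id, updated_at]
```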
h
another option might be to specify that there are no `key-properties`. the target would then resort to appending instead of upserting rows. might be a bit faster than an upsert involving a compound key.
a
Good shout, I have assumed you are using some sort of state marker so you would only get updated rows in your new API call.
TMTOWTDI!
h
yeah, that was my assumption as well.
james, you should note that the `replication-key` and `key-properties` stream metadata fields act independently. `replication-key` is stored by meltano in the state and is used for incremental extracts. `key-properties` is typically used in a couple of places: (1) to deduplicate records before loading, and (2) to upsert data from the staging table to the target table. if `replication-key` is absent, meltano falls back to a full extraction strategy (vs incremental); if `key-properties` is absent, meltano falls back to an `append-only` loading strategy (vs upsert)
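to make that concrete, here is a toy in-memory sketch (not meltano's actual loader code, just an illustration of the two strategies) of how the presence or absence of key properties flips the loading behavior, and why including `updated_at` in the compound key preserves previous record states:

```python
def load(records, table, key_properties=None):
    """Append records if no key properties are given; otherwise upsert on them."""
    if key_properties is None:
        # append-only: every record becomes a new row, history accumulates
        table.extend(records)
        return
    # upsert: replace rows whose compound key matches, insert the rest
    index = {tuple(r[k] for k in key_properties): i for i, r in enumerate(table)}
    for r in records:
        key = tuple(r[k] for k in key_properties)
        if key in index:
            table[index[key]] = r
        else:
            index[key] = len(table)
            table.append(r)

table = []
load([{"id": 1, "updated_at": "2024-01-01", "name": "a"}], table,
     key_properties=["id", "updated_at"])
# same id, newer updated_at -> different compound key -> new row, old state kept
load([{"id": 1, "updated_at": "2024-01-02", "name": "b"}], table,
     key_properties=["id", "updated_at"])
print(len(table))  # 2
```

with `key_properties=["id"]` instead, the second call would overwrite the first row, which is exactly the previous-state loss the original question was about.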
a
@haleemur_ali in your experience is upsert in a db like postgres really much slower than insert? I have a full db sync to set up with meltano, mulling over upsert or using `activate_version`
h
It's not that upsert is much slower; postgres is quite well optimized. However, you would be asking the database to do some additional work, and the greatest optimization trick for high performance is doing the least amount of work. Any potential performance impact might not even be noticeable in the real world, but this statement is founded on assumptions. Hyper-simplified example in pseudo code, where `stg` & `final` refer to the staging and final target tables, and `UniquenessViolation` represents the pseudo code generated from the `ON CONFLICT` clause: append:
for row in stg:
    insert(row, final)
upsert/merge:
for row in stg:
    if UniquenessViolation(row, final):
        update(row, final)
    else:
        insert(row, final)
postgres performs updates by adding a new tuple and marking the existing tuple as stale; stale tuples are cleaned up later by the vacuum process. note that insertion incurs no vacuum cost, but updates, upserts & deletes do, and the cost is proportional to the number of tuples that have to be marked stale. since this particular use case results in new rows being appended to the table, there should be close to 0 vacuum cost. the additional cost of executing the upsert over the insert statement is the branching. modern processors have branch prediction, so it will likely be close to 0, as `UniquenessViolation(row, final)` should be mostly false since each check would utilize the timestamp (again, this assumes the generated machine code enables branch prediction)
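the branching above maps directly onto `INSERT ... ON CONFLICT DO UPDATE`. a small runnable sketch using sqlite3 (which shares this upsert syntax with postgres; table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE final (id INTEGER, updated_at TEXT, name TEXT, "
    "PRIMARY KEY (id, updated_at))"
)

def upsert(rows):
    # ON CONFLICT DO UPDATE: update when the compound key already exists,
    # plain insert otherwise -- the database-side version of the pseudo code
    con.executemany(
        "INSERT INTO final (id, updated_at, name) VALUES (?, ?, ?) "
        "ON CONFLICT (id, updated_at) DO UPDATE SET name = excluded.name",
        rows,
    )

upsert([(1, "2024-01-01", "a")])
upsert([(1, "2024-01-02", "b")])   # same id, new timestamp -> new row
upsert([(1, "2024-01-02", "b2")])  # same compound key -> in-place update
count = con.execute("SELECT COUNT(*) FROM final").fetchone()[0]
print(count)  # 2
```

because `updated_at` is part of the key, the conflict branch only fires on a true re-delivery of the same record version, which is why the check is "mostly false" in practice.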