# best-practices
j
anyone have a good pattern for using meltano to handle not only soft deletes, but soft updates? basically, tracking the previous state of modified records (within reason, obviously)
a
Do you have an `updated_at` field? Can you add that to your list of primary keys, assuming you are using upsert logic in your target?
`key-properties: ['id', 'updated_at']`?
👌 1
j
that is a simple and great idea, thank you
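for anyone finding this later, a sketch of what that override could look like in `meltano.yml` — the extractor name `tap-example` and stream id `users` are placeholders here, and the exact metadata key your tap honors may vary (`table-key-properties` is the standard Singer catalog field):

```yaml
plugins:
  extractors:
    - name: tap-example        # placeholder; use your actual tap
      metadata:
        users:                 # placeholder stream id from your tap's catalog
          replication-key: updated_at
          table-key-properties: [id, updated_at]
```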
h
another option might be to specify that there are no `key-properties`. the target would then resort to appending instead of upserting rows. might be a bit faster than an upsert involving a compound key.
a
Good shout, I have assumed you are using some sort of state marker so you would only get updated rows in your new API call.
TMTOWTDI!
h
yeah, that was my assumption as well.
james, you should note that the `replication-key` and `key-properties` stream metadata fields act independently. `replication-key` is stored by meltano in the state and is used for incremental extracts. `key-properties` is typically used in a couple of places: (1) to deduplicate records before loading, and (2) to upsert data from the staging table to the target table. if `replication-key` is absent, meltano falls back to a full extraction strategy (vs incremental); if `key-properties` is absent, meltano falls back to an `append-only` loading strategy (vs upsert)
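to make that concrete, here is a toy in-memory sketch (not meltano's actual loader code, just an illustration of the two strategies) of how the presence or absence of key properties flips the loading behavior, and why including `updated_at` in the compound key preserves previous record states:

```python
def load(records, table, key_properties=None):
    """Append records if no key properties are given; otherwise upsert on them."""
    if key_properties is None:
        # append-only: every record becomes a new row, history accumulates
        table.extend(records)
        return
    # upsert: replace rows whose compound key matches, insert the rest
    index = {tuple(r[k] for k in key_properties): i for i, r in enumerate(table)}
    for r in records:
        key = tuple(r[k] for k in key_properties)
        if key in index:
            table[index[key]] = r
        else:
            index[key] = len(table)
            table.append(r)

table = []
load([{"id": 1, "updated_at": "2024-01-01", "name": "a"}], table,
     key_properties=["id", "updated_at"])
# same id, newer updated_at -> different compound key -> new row, old state kept
load([{"id": 1, "updated_at": "2024-01-02", "name": "b"}], table,
     key_properties=["id", "updated_at"])
print(len(table))  # 2
```

with `key_properties=["id"]` instead, the second call would overwrite the first row, which is exactly the previous-state loss the original question was about.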
a
@haleemur_ali in your experience is upsert in a db like postgres really much slower than insert? I have a full db sync to set up with meltano, mulling over upsert or using `activate_version`
h
It's not that upsert is much slower; postgres is quite well optimized. However, you would be asking the database to do some additional work, and the greatest optimization trick for high performance is doing the least amount of work. Any potential performance impact might not even be noticeable in the real world, but this statement is founded on assumptions. Hyper-simplified example in pseudo code, where `stg` & `final` refer to the staging and final target tables, and `UniquenessViolation` represents the pseudo code generated from the `ON CONFLICT` clause: append:
for row in stg:
    insert(row, final)
upsert/merge:
for row in stg:
    if UniquenessViolation(row, final):
        update(row, final)
    else:
        insert(row, final)
postgres performs updates by adding a new tuple and marking the existing tuple as stale; stale tuples are cleaned up later by the vacuum process. note that insertion incurs no vacuum cost, but updates, upserts & deletes do, and the cost is proportional to the number of tuples that have to be marked stale. since this particular use case results in new rows being appended to the table, there should be close to 0 vacuum cost. the additional cost of executing the upsert over the insert statement is the branching. modern processors have branch prediction, so it will likely be close to 0, as `UniquenessViolation(row, final)` should be mostly false since each check would utilize the timestamp (again, this assumes the generated machine code enables branch prediction)
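the branching above maps directly onto `INSERT ... ON CONFLICT DO UPDATE`. a small runnable sketch using sqlite3 (which shares this upsert syntax with postgres; table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE final (id INTEGER, updated_at TEXT, name TEXT, "
    "PRIMARY KEY (id, updated_at))"
)

def upsert(rows):
    # ON CONFLICT DO UPDATE: update when the compound key already exists,
    # plain insert otherwise -- the database-side version of the pseudo code
    con.executemany(
        "INSERT INTO final (id, updated_at, name) VALUES (?, ?, ?) "
        "ON CONFLICT (id, updated_at) DO UPDATE SET name = excluded.name",
        rows,
    )

upsert([(1, "2024-01-01", "a")])
upsert([(1, "2024-01-02", "b")])   # same id, new timestamp -> new row
upsert([(1, "2024-01-02", "b2")])  # same compound key -> in-place update
count = con.execute("SELECT COUNT(*) FROM final").fetchone()[0]
print(count)  # 2
```

because `updated_at` is part of the key, the conflict branch only fires on a true re-delivery of the same record version, which is why the check is "mostly false" in practice.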