Still in the shallows with meltano just wondered if someone c Meltano #getting-started

Andy Carter

02/22/2023, 9:37 AM

Still in the shallows with meltano, just wondered if someone can confirm the understanding I have arrived at regarding how replication works with REST APIs as taps? In general, taps already know how to apply incremental replication if it is available for certain streams, so you only need to set `INCREMENTAL` and the tap handles it gracefully, including managing meltano state. The flip side is, if the API does not support a `since` timestamp param or similar, you can only get full table updates. The `replication-key` stuff is more for the targets, as in, how to handle upserting new rows into a target database / sink. Again, in general, taps already know the appropriate keys to apply for most streams, but it's up to the target to know what to do when handed the pk. If you're using a target like `jsonl` then no upsert method is supported, and new rows from the tap get appended to the existing file. If you were to use a target that supports upsert syntax, then the replication key would be used for the `on conflict (pk)`, for example if you are on postgres. Is that mostly right?

visch

02/22/2023, 1:32 PM

Helping someone grok Singer / Meltano 😄 I like it. We still need better ways of getting folks up to speed on this stuff so if you have better ideas please share them! Generally the hub https://hub.meltano.com/singer/spec is a good spot to read if you want to understand it all.

> In general, taps already know how to apply incremental replication if it is available for certain streams, so you only need to set `INCREMENTAL` and the tap handles it gracefully, including managing meltano state. The flip side is, if the API does not support a `since` timestamp param or similar, you can only get full table updates.

Generally and ideally yes. `since` is pretty specific to an API but I think I get your point. Just know that a lot of this is tap dependent: whether a stream supports incremental or not is up to the tap. It's not uncommon for a lot of endpoints to be full table even if they can technically support incremental, but that really depends on the tap. "It comes down to the tap" is the general answer you'll hit at some point.
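To make the `since` point concrete, here is a minimal sketch of how a tap might choose between an incremental and a full-table request. The `build_params` helper, the `page_size` parameter, and the `since` parameter name are all hypothetical; real taps and APIs differ.

```python
from typing import Optional


def build_params(replication_method: str, bookmark: Optional[str]) -> dict:
    """Build query params for a hypothetical REST endpoint.

    `bookmark` is the last replication-key value saved in state,
    or None on the very first run.
    """
    params = {"page_size": 100}  # hypothetical paging parameter
    # Only narrow the request when the stream is incremental AND a
    # previous run left a bookmark behind. Otherwise fetch everything.
    if replication_method == "INCREMENTAL" and bookmark is not None:
        params["since"] = bookmark  # hypothetical API filter parameter
    return params
```

If the API has no such filter, the tap has nothing to put the bookmark into, which is why those streams end up full table.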

> The `replication-key` stuff is more for the targets, as in, how to handle upserting new rows into a target database / sink. Again, in general, taps already know the appropriate keys to apply for most streams, but it's up to the target to know what to do when handed the pk. If you're using a target like `jsonl` then no upsert method is supported, and new rows from the tap get appended to the existing file. If you were to use a target that supports upsert syntax, then the replication key would be used for the `on conflict (pk)`, for example if you are on postgres.

You're almost there. `replication-key` is actually just for taps, not targets. https://hub.meltano.com/singer/spec#:~:text=the%20replication%20type.-,replication%2Dkey,-All has a bit more info. In your previous paragraph the `since` timestamp would be updated via the `replication-key`, which comes from `STATE`. State is passed into taps when they are called via `tap-name --config config.json --state state.json --catalog catalog.json`; these 3 are the 3 to generally understand. `state` is the thing that is provided so the tap can do something unique based on "where" the last sync left off.

`key-properties` fit with your upsert explanation. https://hub.meltano.com/singer/spec#schemas goes over `key_properties` a bit more. Targets use `key-properties` to know what to upsert based on.
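As a rough illustration of how these pieces fit together, the sketch below is a toy tap, not a real one: it reads a bookmark out of the incoming state, skips rows at or before it, and emits Singer-style `SCHEMA`, `RECORD`, and `STATE` messages. The `users` stream, the `ROWS` data, and the `bookmarks` layout inside the state are assumptions for illustration; the shape of the state value is up to each tap.

```python
import json
from typing import Iterator

# Hypothetical rows standing in for an API response.
ROWS = [
    {"id": 1, "name": "a", "modified_at": "2023-02-20T10:00:00Z"},
    {"id": 2, "name": "b", "modified_at": "2023-02-21T10:00:00Z"},
    {"id": 3, "name": "c", "modified_at": "2023-02-22T10:00:00Z"},
]


def sync(rows: list, state: dict, stream: str = "users",
         replication_key: str = "modified_at") -> Iterator[dict]:
    """Yield Singer-style messages for rows newer than the bookmark."""
    # Assumed state layout: {"bookmarks": {stream: {replication_key: value}}}
    bookmark = state.get("bookmarks", {}).get(stream, {}).get(replication_key)

    # key_properties is what a target would upsert on.
    yield {
        "type": "SCHEMA",
        "stream": stream,
        "schema": {"type": "object", "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
            "modified_at": {"type": "string", "format": "date-time"},
        }},
        "key_properties": ["id"],
        "bookmark_properties": [replication_key],
    }

    latest = bookmark
    # Timestamps share one ISO 8601 format, so string comparison orders them.
    for row in sorted(rows, key=lambda r: r[replication_key]):
        # Strict comparison keeps the sketch simple; some taps re-fetch
        # rows equal to the bookmark instead.
        if bookmark is not None and row[replication_key] <= bookmark:
            continue
        yield {"type": "RECORD", "stream": stream, "record": row}
        latest = row[replication_key]

    # The runner saves this and passes it back on the next run via --state.
    yield {"type": "STATE",
           "value": {"bookmarks": {stream: {replication_key: latest}}}}


if __name__ == "__main__":
    for message in sync(ROWS, {}):
        print(json.dumps(message))
```

Feeding the final `STATE` value back in as `state` on the next call is what makes the second run skip everything it has already seen.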

> If you were to use a target that supports upsert syntax, then the replication key would be used for the `on conflict (pk)`, for example if you are on postgres.

Generally that's the method that should be used for upsert, but of course it depends on the target's implementation.
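For a feel of what a target might do with `key_properties`, here is a minimal upsert sketch. It uses Python's built-in `sqlite3` rather than postgres so it runs without a server; SQLite's `ON CONFLICT ... DO UPDATE` clause is modeled on the PostgreSQL one. The `upsert` helper and the `users` table in the usage note are hypothetical, and a real target's implementation may look quite different.

```python
import sqlite3


def upsert(conn: sqlite3.Connection, table: str,
           key_properties: list, record: dict) -> None:
    """Insert `record`, or update the existing row with the same key."""
    columns = list(record)
    placeholders = ", ".join("?" for _ in columns)
    # Non-key columns are overwritten with the incoming ("excluded") values.
    updates = ", ".join(
        f"{col} = excluded.{col}" for col in columns if col not in key_properties
    )
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    # Identifiers are interpolated directly, so this assumes trusted
    # table and column names; values go through bound parameters.
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_properties)}) {action}"
    )
    conn.execute(sql, [record[col] for col in columns])
```

With a table such as `CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT)`, calling `upsert(conn, "users", ["id"], record)` twice with the same `id` leaves one row holding the latest values. Note that the conflict target is the key properties, not the replication key.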

Andy Carter

02/22/2023, 9:59 PM

Appreciate it, and thanks for the link to the Singer doc, that's a good place to start. I think there are quite a few short, interchangeable nouns that make it confusing at first: stream, state, config, catalog, schema etc 🙂

visch

02/22/2023, 10:38 PM

Very true, it's a bunch of new made-up words

Andy Carter

02/23/2023, 8:19 AM

Had another think on it, and I get the distinction between replication key and key properties / primary key. The `replication_key` is what the tap uses to determine what rows are new/updated, and the `primary_key` is used by the target to add or replace the data as appropriate with the upsert. Sometimes RK and PK are the same if you have an incrementing ID on a SQL index and the table row never gets modified, but more often than not, in a standard CRUD app, it will be something like `modified_at` for your replication key and then `id` for your primary key for handling deduplication.
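A small sketch of that last point, using made-up data: the second sync re-sends row `id` 2 because its `modified_at` moved forward, and whether you end up with a duplicate depends on how the target loads it. `append_load` and `upsert_load` are hypothetical stand-ins for an append-only target (like a jsonl file) and a target that upserts on the primary key.

```python
def append_load(sink: list, records: list) -> None:
    """Append-only target: every incoming record becomes a new row."""
    sink.extend(records)


def upsert_load(sink: dict, records: list, primary_key: str = "id") -> None:
    """Upserting target: a record replaces any row with the same key."""
    for record in records:
        sink[record[primary_key]] = record


# First sync returns both rows. The second returns only the row whose
# replication key (modified_at) is newer than the saved bookmark.
first_sync = [
    {"id": 1, "name": "a", "modified_at": "2023-02-20T10:00:00Z"},
    {"id": 2, "name": "b", "modified_at": "2023-02-21T10:00:00Z"},
]
second_sync = [
    {"id": 2, "name": "b (edited)", "modified_at": "2023-02-23T08:00:00Z"},
]


def run() -> tuple:
    appended, upserted = [], {}
    for batch in (first_sync, second_sync):
        append_load(appended, batch)
        upsert_load(upserted, batch)
    return appended, upserted
```

After `run()`, the appended sink holds two rows for `id` 2 (old and edited), while the upserted sink holds one row per `id` with the edited version.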
