Still in the shallows with Meltano, just wondered i...
# getting-started
a
Still in the shallows with Meltano, just wondered if someone can confirm the understanding I have arrived at regarding replication with REST APIs as taps? In general, taps already know how to apply incremental replication if it is available for certain streams, so you only need to set `INCREMENTAL` and the tap handles it gracefully, including managing Meltano state. The flip side is, if the API does not support a `since` timestamp param or similar, you can only get full table updates. The `replication-key` stuff is more for the targets, as in, how to handle upserting new rows into a target database / sink. Again, in general, taps already know the appropriate keys to apply for most streams, but it's up to the target to know what to do when handed the pk. If you're using a target like `jsonl` then no upsert method is supported, and new rows from the tap get appended to the existing file. If you were to use a target that supports upsert syntax, then the replication key would be used for the `on conflict (pk)`, for example if you are on postgres. Is that mostly right?
v
Helping someone grok Singer / Meltano 😄 I like it. We still need better ways of getting folks up to speed on this stuff so if you have better ideas please share them! Generally the hub https://hub.meltano.com/singer/spec is a good spot to read if you want to understand it all.
> In general, taps already know how to apply incremental replication if it is available for certain streams, so you only need to set `INCREMENTAL` and the tap handles it gracefully, including managing Meltano state. The flip side is, if the API does not support a `since` timestamp param or similar, you can only get full table updates.
Generally and ideally, yes. `since` is pretty specific to an API, but I think I get your point. Just know that a lot of this is tap dependent: whether incremental is supported for a given stream is up to the tap. It's not uncommon for a lot of endpoints to be full table even if they could technically support incremental, but that really depends on the tap. "It comes down to the tap" is the general answer you'll hit at some point.
> The `replication-key` stuff is more for the targets, as in, how to handle upserting new rows into a target database / sink. Again, in general, taps already know the appropriate keys to apply for most streams, but it's up to the target to know what to do when handed the pk. If you're using a target like `jsonl` then no upsert method is supported, and new rows from the tap get appended to the existing file. If you were to use a target that supports upsert syntax, then the replication key would be used for the `on conflict (pk)`, for example if you are on postgres.
You're almost there. `replication-key` is actually just for taps, not targets. https://hub.meltano.com/singer/spec#:~:text=the%20replication%20type.-,replication%2Dkey,-All has a bit more info. In your previous paragraph, the `since` timestamp would be updated via the `replication-key` value, which comes from `STATE`. State is passed into taps when they are called via `tap-name --config config.json --state state.json --catalog catalog.json`; these three (config, state, catalog) are the ones to generally understand. `state` is the thing that is provided so the tap can do something unique based on where the last sync left off.
`key-properties` fit with your upsert explanation. https://hub.meltano.com/singer/spec#schemas goes over `key_properties` a bit more. Targets use `key-properties` to know what to upsert based on.
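To make the tap side concrete, here is a minimal Python sketch of how a tap might use the incoming state to skip rows it has already synced and then emit an updated `STATE` message. The stream name `users`, the `modified_at` replication key, the `sync_stream` helper, and the `bookmarks` layout inside the state are all assumptions for illustration; real taps decide their own state layout and comparison rules.

```python
# Hypothetical source rows; a real tap would fetch these from an API.
ROWS = [
    {"id": 1, "modified_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "modified_at": "2024-01-03T00:00:00Z"},
    {"id": 3, "modified_at": "2024-01-05T00:00:00Z"},
]


def sync_stream(rows, state, stream="users", replication_key="modified_at"):
    """Return Singer-style messages for rows newer than the bookmark in state."""
    # Assumed state layout: {"bookmarks": {<stream>: {<replication_key>: <value>}}}
    bookmark = state.get("bookmarks", {}).get(stream, {}).get(replication_key)
    messages = []
    for row in sorted(rows, key=lambda r: r[replication_key]):
        # ISO-8601 UTC strings in one consistent format compare correctly as text.
        if bookmark is not None and row[replication_key] <= bookmark:
            continue  # already synced in an earlier run
        messages.append({"type": "RECORD", "stream": stream, "record": row})
        bookmark = row[replication_key]
    # The final STATE message records where this sync left off.
    new_state = {"bookmarks": {stream: {replication_key: bookmark}}}
    messages.append({"type": "STATE", "value": new_state})
    return messages
```

Calling `sync_stream(ROWS, {})` behaves like a first full run, while passing the `value` of the last `STATE` message back in on the next run yields only rows whose `modified_at` is newer than the bookmark.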
> If you were to use a target that supports upsert syntax, then the replication key would be used for the `on conflict (pk)`, for example if you are on postgres.
Generally that's the method that should be used for upsert (keyed on the key properties rather than the replication key), but of course it depends on the target's implementation.
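As a rough illustration of the target side, the sketch below builds an `ON CONFLICT` upsert from a stream's key properties. It uses Python's built-in `sqlite3` as a stand-in for Postgres (the upsert clause is similar in both, and needs SQLite 3.24 or newer); the `upsert` helper, the `users` table, and its columns are hypothetical, and a real target's implementation may differ.

```python
import sqlite3


def upsert(conn, table, key_properties, record):
    """Insert a record, or update the existing row that shares its key properties."""
    cols = list(record)
    placeholders = ", ".join("?" for _ in cols)
    # Every non-key column is overwritten with the incoming value on conflict.
    # Assumes at least one non-key column and trusted table/column names.
    updates = ", ".join(f"{c} = excluded.{c}" for c in cols if c not in key_properties)
    sql = (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_properties)}) DO UPDATE SET {updates}"
    )
    conn.execute(sql, [record[c] for c in cols])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT)")

# Two records with the same key: the second replaces the first instead of duplicating it.
upsert(conn, "users", ["id"], {"id": 1, "name": "Ada", "modified_at": "2024-01-01T00:00:00Z"})
upsert(conn, "users", ["id"], {"id": 1, "name": "Ada L.", "modified_at": "2024-01-02T00:00:00Z"})

rows = conn.execute("SELECT id, name, modified_at FROM users").fetchall()
```

Note that the conflict clause is built only from the key properties; `modified_at` is carried along as ordinary data here.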
a
Appreciate it, and thanks for the link to the Singer doc, that's a good place to start. I think there are quite a few short, interchangeable nouns that make it confusing at first: stream, state, config, catalog, schema, etc. 🙂
v
Very true, it's a bunch of new made-up words.
a
Had another think on it, and I now get the distinction between replication key and key properties / primary key. The `replication_key` is what the tap uses to determine which rows are new or updated, and the `primary_key` is used by the target to add or replace the data as appropriate with the upsert. Sometimes RK and PK are the same, if you have an incrementing ID on a SQL index and the table row never gets modified. More often than not, though, in a standard CRUD app it will be something like `modified_at` for your replication key and then `id` for your primary key for handling deduplication.
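A small end-to-end sketch of that distinction, assuming `modified_at` as the replication key and `id` as the primary key. The `extract` and `load` helpers and the sample rows are made up for illustration; a plain dict stands in for an upserting target and a list for an append-only sink such as a JSONL file.

```python
def extract(rows, bookmark):
    """Tap side: keep rows whose replication key is newer than the bookmark."""
    new = [r for r in rows if bookmark is None or r["modified_at"] > bookmark]
    new.sort(key=lambda r: r["modified_at"])
    new_bookmark = new[-1]["modified_at"] if new else bookmark
    return new, new_bookmark


def load(table, appended, rows):
    """Target side: upsert by primary key, and append for comparison."""
    for r in rows:
        table[r["id"]] = r   # upsert: same id replaces the earlier version
        appended.append(r)   # append-only: every version is kept


source = [
    {"id": 1, "name": "Ada", "modified_at": "2024-01-01"},
    {"id": 2, "name": "Bob", "modified_at": "2024-01-02"},
]
table, appended = {}, []

# First sync: no bookmark yet, so everything is extracted.
batch, bookmark = extract(source, None)
load(table, appended, batch)

# Row 1 is edited in the source: same id, newer modified_at.
source[0] = {"id": 1, "name": "Ada L.", "modified_at": "2024-01-03"}

# Second sync: only the edited row passes the replication-key filter.
batch, bookmark = extract(source, bookmark)
load(table, appended, batch)
```

After the second sync, the upserting target still holds two rows with the edited name on `id` 1, while the append-only sink holds three records, including both versions of that row.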