# best-practices
e
Does anyone have suggestions for how to best perform a one-off sync of historical data? I have a table that we sync with incremental updates, using a timestamp column as the replication key. The table has "new" records that are not synced to our destination because their timestamp value is before the current replication key value.
r
Can you share your `meltano.yml` config? Assume you have a `metadata` section for your tap that defines this incremental behaviour.
a
@Ellis Valentiner can you edit / delete your state to trigger a full refresh?
e
Here is a redacted version of our `meltano.yml` with only the one table in question.
The issue with a full refresh is that I don't want to move more data than necessary. The problem is not necessarily the egress/ingress costs, but I don't want to create a really large load on the source database or have a very long-running job that would block other tasks.
I could accomplish this fairly easily with our orchestrator if I could configure both `start_date` and `end_date` (since the data happens to correspond to a certain time period this time). Or if there is some way to define a more complex filter.
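For context, a config sketch of what that would look like. `start_date` is a common tap setting, but whether an `end_date` is honoured is tap-specific, so the key below is an assumption used purely to illustrate the idea (check `meltano config <tap> list` for what your tap actually accepts):

```yaml
# Hypothetical sketch only; plugin name and settings are placeholders.
plugins:
  extractors:
    - name: tap-postgres--my-schema
      config:
        start_date: "2024-05-01T00:00:00Z"
        end_date: "2024-06-01T00:00:00Z"  # assumption: many taps ignore or lack this setting
```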
r
Pretty sure you would be able to do something like
```shell
TAP_POSTGRES__MY_SCHEMA__METADATA___REPLICATION_METHOD=FULL_TABLE meltano run tap-postgres--my-schema ...
```
I'm fuzzy on the exact environment variable, but you would be able to check with
```shell
meltano config tap-postgres--my-schema list
```
e
Yes but I don't want to perform a full table replication. It would take days and most of the data is unchanged.
r
Ah right, sorry I misunderstood. I think you would have to do something like create a custom state entry and reference it with `--state-id-suffix` then? Or else invoke the tap directly, passing `--state state.json`.
e
Interesting. What would the `state.json` actually look like in this case?
As a very simple example, suppose I want, as a one-off task, to replicate the data from only last month: not this month, nor the prior months?
I think the solution I'm going to try is:
1. Create a temporary view or an unlogged table in my Postgres source that contains the subset of records I want to replicate
2. Create a one-off `meltano.yml` that selects ONLY that stream AND configures the `__alias__`
3. Run full table replication on that stream
👍 1
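Roughly, that one-off project file might look like the sketch below. All names are placeholders, and the `stream_maps` / `__alias__` mechanism assumes an SDK-based tap (or a mapper plugin) that supports stream maps:

```yaml
# Hypothetical one-off meltano.yml: select only the backfill view,
# force full-table replication, and alias the stream back to the
# original table name. View and stream names are assumptions.
plugins:
  extractors:
    - name: tap-postgres--my-schema
      select:
        - my_schema-my_backfill_view.*
      metadata:
        my_schema-my_backfill_view:
          replication-method: FULL_TABLE
      config:
        stream_maps:
          my_schema-my_backfill_view:
            __alias__: my_schema-my_table
```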
r
Probably very similar to your existing state, just with a tweak to the `replication_key_value`:
```shell
# meltano state list
meltano state get <state id> | jq .singer_state > state.json
```
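As a sketch of the edit itself: Singer-style state typically nests per-stream bookmarks under `bookmarks`, each with a `replication_key_value`. The stream name and field values below are assumptions; only the general shape is standard:

```python
import json

# Hypothetical state as dumped by `meltano state get ... | jq .singer_state`.
# The stream name "my_schema-my_table" and timestamps are placeholders.
state = {
    "bookmarks": {
        "my_schema-my_table": {
            "replication_key": "updated_at",
            "replication_key_value": "2024-06-01T00:00:00+00:00",
        }
    }
}

# Rewind the bookmark so the next incremental run starts from an earlier point.
state["bookmarks"]["my_schema-my_table"]["replication_key_value"] = (
    "2024-05-01T00:00:00+00:00"
)

# Write the edited state back out for use with --state state.json.
with open("state.json", "w") as f:
    json.dump(state, f, indent=2)
```

Note this only moves the left bound; the tap will still sync everything from that point forward.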
e
I understand how to modify the `replication_key_value` to override where Meltano starts replicating from, but I don't understand how to stop it from replicating data past a certain end point. As far as I know, Meltano only supports replication that is entirely unbounded (full table) or left-bounded (incremental from a starting point), but doesn't support any form of right-bounded range.
👍 1
I would very much like to be wrong about that.
r
Or actually, you could just edit your existing state, given that it will be updated at the end of the run anyway.
I think you are correct, or at least it depends on the tap.