# best-practices
e
Does anyone have suggestions for how to best perform a one-off sync of historical data? I have a table that we sync with incremental updates, using a timestamp column as the replication key. The table has "new" records that are not synced to our destination because their timestamp value is before the current replication key value.
r
Can you share your `meltano.yml` config? Assume you have a `metadata` section for your tap that defines this incremental behaviour.
a
@Ellis Valentiner can you edit / delete your state to trigger a full refresh?
e
Here is a redacted version of our `meltano.yml` with only the one table in question.
The issue with a full refresh is that I don't want to move more data than necessary. The problem is not necessarily the egress/ingress costs, but I don't want to create a really large load on the source database or have a very long-running job that would block other tasks.
I could accomplish this fairly easily with our orchestrator if I could configure both `start_date` and `end_date` (since the data happens to correspond to a certain time period this time). Or if there is some way to define a more complex filter.
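For context, a config sketch of what that would look like. `start_date` is a common tap setting, but whether an `end_date` is honoured is tap-specific, so the key below is an assumption used purely to illustrate the idea (check `meltano config <tap> list` for what your tap actually accepts):

```yaml
# Hypothetical sketch only; plugin name and settings are placeholders.
plugins:
  extractors:
    - name: tap-postgres--my-schema
      config:
        start_date: "2024-05-01T00:00:00Z"
        end_date: "2024-06-01T00:00:00Z"  # assumption: many taps ignore or lack this setting
```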
r
Pretty sure you would be able to do something like
```shell
TAP_POSTGRES__MY_SCHEMA__METADATA___REPLICATION_METHOD=FULL_TABLE meltano run tap-postgres--my-schema ...
```
I'm fuzzy on the exact environment variable, but you would be able to check with
```shell
meltano config tap-postgres--my-schema list
```
e
Yes but I don't want to perform a full table replication. It would take days and most of the data is unchanged.
r
Ah right, sorry I misunderstood. I think you would have to do something like create a custom state entry and reference it with `--state-id-suffix` then? Or else invoke the tap directly, passing `--state state.json`.
e
Interesting. What would the `state.json` actually look like in this case?
As a very simple example, suppose I want, as a one-off task, to replicate the data from only last month: not this month, nor the prior months?
I think the solution I'm going to try is:
1. Create a temporary view or an unlogged table in my Postgres source that contains the subset of records I want to replicate
2. Create a one-off `meltano.yml` that selects ONLY that stream AND configures the `__alias__`
3. Run full table replication on that stream
👍 1
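Roughly, that one-off project file might look like the sketch below. All names are placeholders, and the `stream_maps` / `__alias__` mechanism assumes an SDK-based tap (or a mapper plugin) that supports stream maps:

```yaml
# Hypothetical one-off meltano.yml: select only the backfill view,
# force full-table replication, and alias the stream back to the
# original table name. View and stream names are assumptions.
plugins:
  extractors:
    - name: tap-postgres--my-schema
      select:
        - my_schema-my_backfill_view.*
      metadata:
        my_schema-my_backfill_view:
          replication-method: FULL_TABLE
      config:
        stream_maps:
          my_schema-my_backfill_view:
            __alias__: my_schema-my_table
```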
r
Probably very similar to your existing state, just with a tweak to the `replication_key_value`:
```shell
# meltano state list
meltano state get <state id> | jq .singer_state > state.json
```
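As a sketch of the edit itself: Singer-style state typically nests per-stream bookmarks under `bookmarks`, each with a `replication_key_value`. The stream name and field values below are assumptions; only the general shape is standard:

```python
import json

# Hypothetical state as dumped by `meltano state get ... | jq .singer_state`.
# The stream name "my_schema-my_table" and timestamps are placeholders.
state = {
    "bookmarks": {
        "my_schema-my_table": {
            "replication_key": "updated_at",
            "replication_key_value": "2024-06-01T00:00:00+00:00",
        }
    }
}

# Rewind the bookmark so the next incremental run starts from an earlier point.
state["bookmarks"]["my_schema-my_table"]["replication_key_value"] = (
    "2024-05-01T00:00:00+00:00"
)

# Write the edited state back out for use with --state state.json.
with open("state.json", "w") as f:
    json.dump(state, f, indent=2)
```

Note this only moves the left bound; the tap will still sync everything from that point forward.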
e
I understand how to modify the `replication_key_value` to override where Meltano starts replicating from, but I don't understand how to stop it from replicating data past a certain end point. As far as I know, Meltano only supports replication that is entirely unbounded (full table) or left-bounded (incremental from a starting point), but doesn't support any form of right-bounded range.
👍 1
I would very much like to be wrong about that.
r
Or actually, you could just edit your existing state, given that it will be updated at the end of the run anyway.
I think you are correct, or at least it depends on the tap.