How can I replicate a table with some offset? I ha...
# singer-taps
p
How can I replicate a table with some offset? I have a very large table and for test purposes would like to only the recent records. I use tap-oracle. I think that replication_key_value variable is for this. For test purposes, I created table PERSONS, and with three records PERSON_ID with values 1,2,3. I wanted to set offset in such a way, that only the record 3 would be replicated. I first tried to set up replication-key-value in meltano.yml, but the variable kept being ignored by meltano. I then tried another method: I saw that I could pass a STATE object to meltano invoke command. So I captured the state (json) from the previous meltano run in a file state.log:
{
"type":"STATE",
"value":{
"bookmarks":{
"PLASZPA-PERSONS":{
"last_replication_method":"INCREMENTAL",
"replication_key":"PERSON_ID",
"replication_key_value":3
}
}
}
}
and then I invoked tap-oracle with this state:
meltano invoke tap-oracle -s state.log
But again, the replication-key-value was ignored - all three records with PERSON_ID values 1,2, and 3 got replicated. Is this because this variable is not supported by tap-oracle?
👀 1
e
I first tried to set up replication-key-value in meltano.yml
what does that
meltano.yml
look like?
p
Here is the content of meltano.yml, with some sensitive data replaced by *: version: 1 default_environment: dev project_id: e3cb5953-cc6c-4546-8311-a4d841f8d5b8 environments: - name: dev - name: staging - name: prod plugins: extractors: - name: tap-github variant: meltanolabs pip_url: git+https://github.com/MeltanoLabs/tap-github.git - name: tap-oracle variant: s7clarke10 pip_url: git+https://github.com/s7clarke10/pipelinewise-tap-oracle.git config: host: * port: 1597 common_service_name: * user: PLASZPA service_name: * # default_replication_method: FULL_TABLE filter_schemas: PLASZPA filter_tables: - PLASZPA-PERSONS metadata: PLASZPA-PERSONS: replication-method: INCREMENTAL replication-key: KW_UPDATE_DATE
e
I can't see anything wrong with this config or with the tap code. Maybe @Steve Clarke knows.
p
Thank you Edgar. Meanwhile, can you at least confirm that this functionality and these variables (replication-key, replication-key-value) are supported by tap-oracle extractor, and that the expected result should be what I think? (I want to note that in the log I saw an ambiguous warning, which was not an error, and read something like "method possibly not supported by plugin".
s
Hi @Pawel Plaszczak , like @edgar_ramirez_mondragon mentioned I am unsure why the incremental method would not work. You can see by this line https://github.com/s7clarke10/pipelinewise-tap-oracle/blob/53bb75ed27d7796d2f492e74cec87f10f1bce4d4/tap_oracle/sync_strategies/incremental.py#L62 that it supports bookmarks against the stored the last replication key value. Is it a table or a view? Sometimes it doesn’t like it if it is a view. I have seen this with tap sybase. I have sometimes had to override and provide config to say the object is a table to get the incremental working (even though it is a view). I have mentioned this about a year or two ago in posts, a search on the correct config for that might be worth it. Sorry answering this on my phone with limited access to a device to provide a more specific answer.
It may be this setting, "is-view": false
p
Hello, Steve, Edgar and all, and thank you so much for the help so far. I have a nice update. It works. I seriously don't know why, because meanwhile I changed my entire configuration. Now I am copying data from remote Oracle to Postgres, and now that I revisited this issue, it works. However, I still have question: With meltano invoke, I can predefine the required state in json file (like shown earlier in my message) and then this works correctly:
$ meltano invoke tap-oracle -s mystate.json
However, if I want to run "meltano run", there is no -s option, so I cannot pass the state. I was only able to ignore the state altogether, like this, and this works:
$ meltano run tap-oracle target-postgres --no-state-update
So, how could I define the required state for the entire etl? Also, alternatively, how does meltano/singer store the state and how could I reset it? I found the file ./.meltano/run/tap-oracle/state.json that looked like the state storage, but deleting this file did not help - obviously the state is stored elsewhere. I also tried various combinations of following two variables in the meltano.yml, like you suggested,: replication_key_value and is_view, but both variables seem to be ignored, no matter what I put there:
metadata:
PLASZPA-PERSONS2:
replication-method: INCREMENTAL
replication-key: KW_UPDATE_DATE
replication_key_value: 2022-12-30T00:00:00.00+00:00
is-view: false
So what would be the correct and recommended way to define the state for the oracle-postgres ETL, so that I can replicate, for instance, only the last five thousand rows from a table of 10 million rows?
Update: I found the solution. This command-line option is indeed missing in run command, but it exists in invoke command. Hence, I succeeded with the expected funtionality using this command syntax:
Copy code
$ meltano invoke tap-oracle -s mystate.json|meltano invoke target-postgres
The only caveat is that target-postgres, if invoked like this with invoke, will write to schema melty while, if invoked with "run", it will write to schema tap-oracle. To enforce consistent target schema across both commands, I had to add this variable to meltano.yml:
default_target_schema: tap_oracle
And then finally everything works as expected: I can define the requested start state in mystate.json, and the target table, in the requested schema tap_oracle, gets populated with the data younger than the state. @Steve Clarke and @Edgar RamĂ­rez (Arch.dev), thank you for your support, I am moving on with the project!
e
I'm glad you figured it out @Pawel Plaszczak! The fact that
-s <state file>
works but
meltano run
doesn't makes me think the plugin in
meltano.yml
is missing
capabilities: [state]
🤔
👍 1
s
Glad to hear you are on the right track 😊 .