hey hello! I am evaluating Meltano for our ETL pro...
# singer-taps
p
hey hello! I am evaluating Meltano for our ETL project. Quite excited about what I read in the docs, but then the reality hit 😉 First question: what is the currently recommended way of connecting Meltano to Oracle database as source? I am quite confused here. My platform is Ubuntu, connecting to remote Oracle source. I followed instructions to install extractor tap-oracle. This resulted in series of warnings and errors. I resolved some of those by installing dependencies, but ended up with cryptic "plugin configuration is invalid / Exception: Unrecognized replication_method None". I cannot go further. I also saw that in the command I can turn on debug-type logging:
Copy code
meltano config tap-oracle test
but its output is cryptic. So for now, I abandonned this route, because I found another: I spotted that the errors included warnings that said that cx-oracle was outdated, and recommended python-oracle driver instead. So I thought maybe this would be a better route. I went to the python-oracle page: https://python-oracledb.readthedocs.io/en/latest/user_guide/installation.html and installed python-oracle and tested that it connects to my database. So I have partial success but what next? Now - okay, python-oracle works, but now how to come back to Meltano and make this working driver part of the Meltano configuration? How to make Meltano upload Oracle table using this working python-pracle config? Shall I still somehow use the tap-oracle extractor (how)? I also saw that there are alternative implementation of Oracle connector on this page: https://hub.meltano.com/extractors/tap-oracle/ but since I've already been spending quite some time, I prefer to ask before trying. In short, what's the recommended and the most standard way to connect Meltano to Oracle database as source of data? Oracle (source) is remote, while the local data destination (on the same machine as Meltano install) can be whatever - initially, for the prototyping phase, I'd say PostgreSQL
v
Unrecognized replication_method None
See https://docs.meltano.com/guide/integration/#replication-methods New taps tend to be written with the singer sdk and they default to full_table replication so folks don't hit the kind of thing
What's the recommended and the most standard way to connect Meltano to Oracle database as source of data?
Most objective answer there is https://hub.meltano.com/extractors/tap-oracle Steve Clark's fork of pipeline wise seems to be the most used.
I went to the python-oracle page: https://python-oracledb.readthedocs.io/en/latest/user_guide/installation.html and installed python-oracle and tested that it connects to my database. So I have partial success but what next? Now - okay, python-oracle works, but now how to come back to Meltano and make this working driver part of the Meltano configuration
You could write a tap from scratch here https://sdk.meltano.com/en/v0.43.1/ to oracle, using that driver if it's supported by sqlalchemy (looks like it is) Generally your questions point to needing to understand Singer a bit better ( https://hub.meltano.com/singer/spec is a good resource that can you get up to speed pretty quickly) , specifically catalogs. Unfortunately tap-oracle isn't used by a ton of folks (I use it everyday in production) so it hasn't been "modernized" to make it easier for new folks to hop in and play around. You'll get there, it's worth the time here.
Oracle (source) is remote, while the local data destination (on the same machine as Meltano install) can be whatever - initially, for the prototyping phase, I'd say PostgreSQL
Sure that's fine, you could also start with jsonl / stdout to get the hang of things ie
meltano invoke tap-oracle > out
, just depends on how deep you want to go.
p
You answered my questions in great detail and also the next one I was going to ask. I understand now that learning Singer is prerequisite to Meltano, which I now intend to do right after year-end holidays. Your answer is much appreciated, thank you I will continue soon.
dancingpenguin 1
np 1
v
No problem, welcome!
p
Thank you again. Some success here, I read the docs and made the tap-oracle work. It was quite helpful. The full table replication works. I have some problem setting up the incremental replication (I cannot do log-based replication because I have not enough privileges on the source database). Regarding the incremental replication on tap-oracle, I get the "no replication key selected for key-based incremental replication" error. I understand it but cannot find the syntax of meltano.yml to set it up. I am trying this but still getting an error. Any hints? default_replication_method: INCREMENTAL replication_key: PERSON_ID replication_key_value: 1 I also tried with hypen (replication-key) with no effect. I suppose I need some form of nested config file (so that I can define separate replication key for each replicated table). I'd lik to see an example
👀 1
v
Where'd you come up with
replication_key: PERSON_ID
?
p
messages crossed, I just edited mine while you edited yours. I tried replication-key also, no effect. I am sure I just have wrong syntax of meltano.yml. I think this needs to be defined separately per table (not globally where I did put it) but I fail to find an example of such meltano.yml
v
Share your meltano.yml file
p
This is the file. How should I set it up, so that for the table MYSCH.PERSONS the replication key is set to column PERSON_ID (and for other tables, added later, that would be a different column)? config: host: * port: * common_service_name: * user: * service_name: * default_replication_method: INCREMENTAL replication-key: PERSON_ID replication_key_value: 1 # default_replication_method: LOG_BASED filter_schemas: MYSCH filter_tables: - MYSCH-PERSONS
v
The entire file, including your metadata overrdie above please
p
version: 1 default_environment: dev project_id: * environments: - name: dev - name: staging - name: prod plugins: extractors: - name: tap-github variant: meltanolabs pip_url: git+https://github.com/MeltanoLabs/tap-github.git - name: tap-oracle variant: s7clarke10 pip_url: git+https://github.com/s7clarke10/pipelinewise-tap-oracle.git config: host: * port: * common_service_name: * user: MYSCH service_name: * default_replication_method: INCREMENTAL default-replication-key: PERSON_ID replication_key_value: 1 filter_schemas: MYSCH filter_tables: - MYSCH-PERSONS
yes I did read this page (you did metion this already). I understand the concept, I just need an example config
is "some-entity-id" a table name?
e
is "some-entity-id" a table name?
Yup. For DB extractors it usually includes the schema/db name, e.g.
public-users
for a postgres db.
p
success. that was it!
How can I replicate a table with some offset? I have a very large table and for test purposes would like to only the recent records. I think that replication_key_value variable is for this. For test purposes, I created table PERSONS, and with three records PERSON_ID with values 1,2,3. I wanted to set offset in such a way, that only the record 3 would be replicated. I first tried to set up replication-key-value in meltano.yml, but the variable kept being ignored by meltano. I then tried another method: I saw that I could pass a STATE object to meltano invoke command. So I captured the state (json) from the previous meltano run in a file state.log:
{
"type":"STATE",
"value":{
"bookmarks":{
"PLASZPA-PERSONS":{
"last_replication_method":"INCREMENTAL",
"replication_key":"PERSON_ID",
"replication_key_value":3
}
}
}
}
and then I invoked tap-oracle with this state:
meltano invoke tap-oracle -s state.log
But again, the replication-key-value was ignored - all three records with PERSON_ID values 1,2, and 3 got replicated. Is this because this variable is not supported by tap-oracle?