Hi, I'm running `tap-oracle` and I'm having troubl...
# troubleshooting
a
Hi, I'm running
tap-oracle
and I'm having trouble with numeric data types. If I set
use_singer_decimal = true
, then my Number types get loaded into the target as strings, which I don't want. But if I set
use_singer_decimal = false
, then I get JSON errors with Number datatypes such as:
jsonschema.exceptions.ValidationError: '16' is not of type 'number'
but
16
is a number, and I don't think it's being wrapped in a string or anything, because I don't get the same error when working with integers, only numbers. any ideas how I can work around this?
v
I used
tap-oracle
and push numerics all the time (almost 2 years in production now. I use the https://hub.meltano.com/extractors/tap-oracle--transferwise variant, maybe try that?
You could also put an issue in with the variant you're using. For debugging I'd try to do something like
meltano invoke tap-oracle > out
and try to isolate a record causing your issue. Then you can provide that to the tap maintainer to take a look at
a
thank you! I just tried using the
invoke
functionality and it looks like my number records are coming out as strings
I can try the transferwise variant! a follow-up question on that: how would I filter tables? I don't see a table filtering option
v
Looking at clark's variant which is the one I assume you're using, it's a fork of transfer wise anyway. I think we should take a few steps way back, could you share your
meltano.yml
file?
And also the command you're using to run thisright now that has data being sent with the wrong format
a
Copy code
version: 1
default_environment: dev
project_id: 61b1c699-0054-4908-97e5-268c32aea971
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-oracle
    variant: s7clarke10
    pip_url: git+<https://github.com/s7clarke10/pipelinewise-tap-oracle.git>
    config:
      filter_schemas: {schema}
      use_singer_decimal: false
      filter_tables:
      - {table}
      default_replication_method: FULL_TABLE
      port: 1521
      host: {host}
      user: {user}
      service_name: {service}
  loaders:
  - name: target-s3-parquet
    variant: gupy-io
    pip_url: git+<https://github.com/gupy-io/target-s3-parquet.git>
    config:
      s3_path: {path}
      athena_database: {db}
  - name: target-csv
    variant: hotgluexyz
    pip_url: git+<https://github.com/hotgluexyz/target-csv.git>
    config:
      destination_path: output/
elt:
  buffer_size: 52428800
I redacted host info etc., I hope that's alright
the command I've been running is
meltano run tap-oracle target-s3-parquet
and when I do
meltano invoke tap-oracle > out
I see the numbers wrapped in quotes
v
use_singer_decimal: false
probably doesn't matter so much but I'd think
true
would be better. If the number is wrapped in quotes we've got a different issue. I see you don't have a select statement which is interesting. If you run
meltano invoke --dump=catalog tap-oracle > catalog.json
and find the field_name / related table_name does it have the
type
you'd expect? My assumption here is the
type
isn't corret in the catalog.json but we should verify that. Then you need to go look at the
schema
in oracle to verify it's set right, then depending on what's going on you may have an old cached version of the catalog? Or maybe the oracle data type isn't a numeric as you think it is?
A simpler way to get to this is to look at the
out
file you made and look at the related
schema
message, does that have the type you expect?
I can try the transferwise variant! a follow-up question on that: how would I filter tables
Use the
select
statement in meltano. https://docs.meltano.com/reference/command-line-interface#select
a
this is what the entry looks like for the column in the catalog:
Copy code
{
            "multipleOf": 1e-38,
            "type": [
              "number"
            ]
          },
the type is also Number in oracle
in my output it is a number but it's wrapped in quotes
v
I'll look up an example on my end, could you share the
outfile
with schema and record message
a
unfortunately it's sensitive data but I can share a redacted version
Copy code
{"type":"SCHEMA","stream":"table","schema":{"properties":{"column":{"multipleOf":1e-38,"type":["number"]}}}}
{"type":"STATE","value":{"bookmarks":{"table":{"last_replication_method":"FULL_TABLE","version":1675455842886}},"currently_syncing":"table"}}
{"type":"ACTIVATE_VERSION","stream":"table","version":1675455842886}
{"type":"RECORD","stream":"table","record":{"column":"16"}}
so it says
"16"
for the column value but I'd expect
16
without quotes, if I'm understanding Json schemas correctly
that was with singer decimal false. when I set
use_singer_decimal: true
, the column becomes a string in the schema:
Copy code
{
  "format": "singer.decimal",
  "type": [
    "string"
  ]
}
and then it ends up in my s3 parquet file as a string
v
Got it. That
user_singer_decimal
I haven't messed with. Trying to look at an example on my side just hitting some unrelated things
Got mine working finally ha, yes number's are numbers not wrapping in quotes.
Schema
Copy code
"EARNEDCRHRS": {"multipleOf": 1e-38, "type": ["null", "number"]}
Record
Copy code
"EARNEDCRHRS": 1,
Take some digging to find one with an actual decimal set in the record but they are there
Try the other variant I guess would be my first thing. Start with the output file. Odd that clark's would change that but it's possible!
a
yeah I looked at the code and I think it's missing a statement to handle numbers, but I could be wrong
is there a way to use the transferwise variant with a
service_name
instead of an
sid
in the connection config?
v
no idea, but you could probably tweak it
Someday we can get a oracle tap written with the sdk!
Not certain if numerics work out of the box with the sdk though I haven't played with that
It looks like from the docs you might just be able to provide service_name as the
sid
and it might just work. But 🤷 I have to get back to my other stuff feel free to post what you find
a
thank you! I actually may open up a PR on Clark's variant
I'll let you know if I find anything interesting. thanks for taking a look with me! this was really valuable
just in case anyone is interested... I stuck with Clark's
tap-oracle
variant and actually what I'm doing is opening a PR on the
tap-s3-parquet
I'm using to properly handle
singer.decimal
and
datetime
data types
s
Just chiming in here. With an Oracle database numeric and decimal data is quite problematic. The main singer oracle tap (which pipelinewise forked from) moved away from trying emit numeric data and went straight to
singer.decimal
notation without giving you an option some time ago. Reason numeric data was not always extracted correctly. Large numbers, numbers with a large number of decimal places, and floats were not replicated correctly (partially because of internal python handling of these numbers). You have made the right decision sticking with
singer.decimal
because testing has shown that emitting numeric data can lead to rounding issues. It is much safer to switch on
singer.decimal
and use a post process like dbt to cast the data the way you would need it. Here is another problem, if a column is defined as just
number
in Oracle as a datatype. You don't know the scale (number of digits) to set on the target. If your data contains decimals a modern database will require you to define the precision and scale. It is much safer to emit the data with
singer.decimal
set to true.
m
The tap-oracle variant built by my colleague @steve_clarke depends upon my fork of singer-python. I have just released a new version which better handles
decimal.Decimal
types instead of using
str()
, the new version will output a number using float or int depending upon the content of the value. Here's the change: https://github.com/mjsqu/pipelinewise-singer-python/commit/e31c16895c8900dad0bd96b98c43f9a5c6137905
@avery_dagostino - if you reinstall tap-oracle then you should get the following output for your
NUMBER
datatype when `use_singer_decimal=False`:
Copy code
{"type":"SCHEMA","stream":"DSSADM-TAP_TEST_MJ","schema":{"properties":{"COLUMN_NAME":{"multipleOf":1e-38,"type":["null","number"]}},"type":"object"},"key_properties":[]}
{"type":"RECORD","stream":"DSSADM-TAP_TEST_MJ","record":{"COLUMN_NAME":16},"version":1679528872508,"time_extracted":"2023-03-22T23:47:52.508937Z"}
s
Thanks very much @mark_johnston. We moved tap-oracle to using
orjson.dumps
rather than
jsonsimple.dumps
because it has significant speed benefit when it comes to serializing the record messages to json https://github.com/ijl/orjson#performance . This leads to an overall performance gain for the tap. The gotcha comes with handling the decimal datatypes if you don't use singer decimal. This fix should now output data in a more compliant jsonschema notation. I would still highly recommend that anyone using tap-oracle that they enable
"use_singer_decimal": true
. Key Reasons: 1. Oracle can use a data type of just NUMBER without any scale or precision. There is no way of guessing how big the whole number component is and how big the decimal component is. It is much safer out as a string and to cast the string to the data type you want in the target. 2. Both Python number precision and Floats are unreliable it can lead to rounding issues. It is better to output the actual integer or decimal as a string and cast it in the target. We have put some hints in tap-oracle and our variant of target-snowflake to automatically cast the data in Snowflake back to a number if there was a Scale and Precision specified in the Oracle Datatype.