Hi I m running `tap oracle` and I m having trouble with nume Meltano #troubleshooting

Hi, I'm running `tap-oracle` and I'm having troubl...

avery_dagostino

02/03/2023, 8:18 PM

Hi, I'm running

tap-oracle

and I'm having trouble with numeric data types. If I set

use_singer_decimal = true

, then my Number types get loaded into the target as strings, which I don't want. But if I set

use_singer_decimal = false

, then I get JSON errors with Number datatypes such as:

jsonschema.exceptions.ValidationError: '16' is not of type 'number'

but

is a number, and I don't think it's being wrapped in a string or anything, because I don't get the same error when working with integers, only numbers. any ideas how I can work around this?

visch

02/03/2023, 8:21 PM

I used

tap-oracle

and push numerics all the time (almost 2 years in production now. I use the https://hub.meltano.com/extractors/tap-oracle--transferwise variant, maybe try that?

visch

02/03/2023, 8:23 PM

You could also put an issue in with the variant you're using. For debugging I'd try to do something like

meltano invoke tap-oracle > out

and try to isolate a record causing your issue. Then you can provide that to the tap maintainer to take a look at

avery_dagostino

02/03/2023, 8:25 PM

thank you! I just tried using the

invoke

functionality and it looks like my number records are coming out as strings

avery_dagostino

02/03/2023, 8:26 PM

I can try the transferwise variant! a follow-up question on that: how would I filter tables? I don't see a table filtering option

visch

02/03/2023, 8:27 PM

Looking at clark's variant which is the one I assume you're using, it's a fork of transfer wise anyway. I think we should take a few steps way back, could you share your

meltano.yml

file?

visch

02/03/2023, 8:28 PM

And also the command you're using to run thisright now that has data being sent with the wrong format

avery_dagostino

02/03/2023, 8:30 PM

Copy code

version: 1
default_environment: dev
project_id: 61b1c699-0054-4908-97e5-268c32aea971
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-oracle
    variant: s7clarke10
    pip_url: git+<https://github.com/s7clarke10/pipelinewise-tap-oracle.git>
    config:
      filter_schemas: {schema}
      use_singer_decimal: false
      filter_tables:
      - {table}
      default_replication_method: FULL_TABLE
      port: 1521
      host: {host}
      user: {user}
      service_name: {service}
  loaders:
  - name: target-s3-parquet
    variant: gupy-io
    pip_url: git+<https://github.com/gupy-io/target-s3-parquet.git>
    config:
      s3_path: {path}
      athena_database: {db}
  - name: target-csv
    variant: hotgluexyz
    pip_url: git+<https://github.com/hotgluexyz/target-csv.git>
    config:
      destination_path: output/
elt:
  buffer_size: 52428800

avery_dagostino

02/03/2023, 8:30 PM

I redacted host info etc., I hope that's alright

avery_dagostino

02/03/2023, 8:30 PM

the command I've been running is

meltano run tap-oracle target-s3-parquet

avery_dagostino

02/03/2023, 8:31 PM

and when I do

meltano invoke tap-oracle > out

I see the numbers wrapped in quotes

visch

02/03/2023, 8:34 PM

use_singer_decimal: false

probably doesn't matter so much but I'd think

true

would be better. If the number is wrapped in quotes we've got a different issue. I see you don't have a select statement which is interesting. If you run

meltano invoke --dump=catalog tap-oracle > catalog.json

and find the field_name / related table_name does it have the

type

you'd expect? My assumption here is the

type

isn't corret in the catalog.json but we should verify that. Then you need to go look at the

schema

in oracle to verify it's set right, then depending on what's going on you may have an old cached version of the catalog? Or maybe the oracle data type isn't a numeric as you think it is?

visch

02/03/2023, 8:35 PM

A simpler way to get to this is to look at the

out

file you made and look at the related

schema

message, does that have the type you expect?

visch

02/03/2023, 8:36 PM

I can try the transferwise variant! a follow-up question on that: how would I filter tables

Use the

select

statement in meltano. https://docs.meltano.com/reference/command-line-interface#select

avery_dagostino

02/03/2023, 8:38 PM

this is what the entry looks like for the column in the catalog:

Copy code

{
            "multipleOf": 1e-38,
            "type": [
              "number"
            ]
          },

avery_dagostino

02/03/2023, 8:39 PM

the type is also Number in oracle

avery_dagostino

02/03/2023, 8:39 PM

in my output it is a number but it's wrapped in quotes

visch

02/03/2023, 8:42 PM

I'll look up an example on my end, could you share the

outfile

with schema and record message

avery_dagostino

02/03/2023, 8:47 PM

unfortunately it's sensitive data but I can share a redacted version

Copy code

{"type":"SCHEMA","stream":"table","schema":{"properties":{"column":{"multipleOf":1e-38,"type":["number"]}}}}
{"type":"STATE","value":{"bookmarks":{"table":{"last_replication_method":"FULL_TABLE","version":1675455842886}},"currently_syncing":"table"}}
{"type":"ACTIVATE_VERSION","stream":"table","version":1675455842886}
{"type":"RECORD","stream":"table","record":{"column":"16"}}

avery_dagostino

02/03/2023, 8:48 PM

so it says

"16"

for the column value but I'd expect

without quotes, if I'm understanding Json schemas correctly

avery_dagostino

02/03/2023, 8:49 PM

that was with singer decimal false. when I set

use_singer_decimal: true

, the column becomes a string in the schema:

Copy code

{
  "format": "singer.decimal",
  "type": [
    "string"
  ]
}

avery_dagostino

02/03/2023, 8:49 PM

and then it ends up in my s3 parquet file as a string

visch

02/03/2023, 8:50 PM

Got it. That

user_singer_decimal

I haven't messed with. Trying to look at an example on my side just hitting some unrelated things

visch

02/03/2023, 9:04 PM

Got mine working finally ha, yes number's are numbers not wrapping in quotes.

visch

02/03/2023, 9:05 PM

Schema

Copy code

"EARNEDCRHRS": {"multipleOf": 1e-38, "type": ["null", "number"]}

Record

Copy code

"EARNEDCRHRS": 1,

visch

02/03/2023, 9:05 PM

Take some digging to find one with an actual decimal set in the record but they are there

visch

02/03/2023, 9:06 PM

Try the other variant I guess would be my first thing. Start with the output file. Odd that clark's would change that but it's possible!

avery_dagostino

02/03/2023, 9:07 PM

yeah I looked at the code and I think it's missing a statement to handle numbers, but I could be wrong

avery_dagostino

02/03/2023, 9:08 PM

is there a way to use the transferwise variant with a

service_name

instead of an

sid

in the connection config?

visch

02/03/2023, 9:09 PM

no idea, but you could probably tweak it

visch

02/03/2023, 9:10 PM

https://github.com/transferwise/pipelinewise-tap-oracle/blob/8ad92967878b808c685839d57376eae272c8e06d/tap_oracle/db.py#L10

visch

02/03/2023, 9:10 PM

https://cx-oracle.readthedocs.io/en/latest/user_guide/connection_handling.html#:~:text=dsn%20%3D%20cx_Oracle.mak[…]%20service_name%3D%22orclpdb1%22) Pretty easy tweak for a test 🤷

visch

02/03/2023, 9:11 PM

Someday we can get a oracle tap written with the sdk!

visch

02/03/2023, 9:11 PM

Not certain if numerics work out of the box with the sdk though I haven't played with that

visch

02/03/2023, 9:12 PM

It looks like from the docs you might just be able to provide service_name as the

sid

and it might just work. But 🤷 I have to get back to my other stuff feel free to post what you find

avery_dagostino

02/03/2023, 9:15 PM

thank you! I actually may open up a PR on Clark's variant

avery_dagostino

02/03/2023, 9:15 PM

I'll let you know if I find anything interesting. thanks for taking a look with me! this was really valuable

avery_dagostino

02/06/2023, 4:05 PM

just in case anyone is interested... I stuck with Clark's

tap-oracle

variant and actually what I'm doing is opening a PR on the

tap-s3-parquet

I'm using to properly handle

singer.decimal

and

datetime

data types

steve_clarke

02/13/2023, 7:46 PM

Just chiming in here. With an Oracle database numeric and decimal data is quite problematic. The main singer oracle tap (which pipelinewise forked from) moved away from trying emit numeric data and went straight to

singer.decimal

notation without giving you an option some time ago. Reason numeric data was not always extracted correctly. Large numbers, numbers with a large number of decimal places, and floats were not replicated correctly (partially because of internal python handling of these numbers). You have made the right decision sticking with

singer.decimal

because testing has shown that emitting numeric data can lead to rounding issues. It is much safer to switch on

singer.decimal

and use a post process like dbt to cast the data the way you would need it. Here is another problem, if a column is defined as just

number

in Oracle as a datatype. You don't know the scale (number of digits) to set on the target. If your data contains decimals a modern database will require you to define the precision and scale. It is much safer to emit the data with

singer.decimal

set to true.

mark_johnston

03/23/2023, 12:00 AM

The tap-oracle variant built by my colleague @steve_clarke depends upon my fork of singer-python. I have just released a new version which better handles

decimal.Decimal

types instead of using

str()

, the new version will output a number using float or int depending upon the content of the value. Here's the change: https://github.com/mjsqu/pipelinewise-singer-python/commit/e31c16895c8900dad0bd96b98c43f9a5c6137905

mark_johnston

03/23/2023, 12:02 AM

@avery_dagostino - if you reinstall tap-oracle then you should get the following output for your

NUMBER

datatype when `use_singer_decimal=False`:

Copy code

{"type":"SCHEMA","stream":"DSSADM-TAP_TEST_MJ","schema":{"properties":{"COLUMN_NAME":{"multipleOf":1e-38,"type":["null","number"]}},"type":"object"},"key_properties":[]}
{"type":"RECORD","stream":"DSSADM-TAP_TEST_MJ","record":{"COLUMN_NAME":16},"version":1679528872508,"time_extracted":"2023-03-22T23:47:52.508937Z"}

steve_clarke

03/23/2023, 12:46 AM

Thanks very much @mark_johnston. We moved tap-oracle to using

orjson.dumps

rather than

jsonsimple.dumps

because it has significant speed benefit when it comes to serializing the record messages to json https://github.com/ijl/orjson#performance . This leads to an overall performance gain for the tap. The gotcha comes with handling the decimal datatypes if you don't use singer decimal. This fix should now output data in a more compliant jsonschema notation. I would still highly recommend that anyone using tap-oracle that they enable

"use_singer_decimal": true

. Key Reasons: 1. Oracle can use a data type of just NUMBER without any scale or precision. There is no way of guessing how big the whole number component is and how big the decimal component is. It is much safer out as a string and to cast the string to the data type you want in the target. 2. Both Python number precision and Floats are unreliable it can lead to rounding issues. It is better to output the actual integer or decimal as a string and cast it in the target. We have put some hints in tap-oracle and our variant of target-snowflake to automatically cast the data in Snowflake back to a number if there was a Scale and Precision specified in the Oracle Datatype.

3 Views

Open in Slack

Previous Next