avery_dagostino
02/03/2023, 8:18 PMtap-oracle
and I'm having trouble with numeric data types. If I set use_singer_decimal = true
, then my Number types get loaded into the target as strings, which I don't want. But if I set use_singer_decimal = false
, then I get JSON errors with Number datatypes such as:
jsonschema.exceptions.ValidationError: '16' is not of type 'number'
but 16
is a number, and I don't think it's being wrapped in a string or anything, because I don't get the same error when working with integers, only numbers.
any ideas how I can work around this?visch
02/03/2023, 8:21 PMtap-oracle
and push numerics all the time (almost 2 years in production now. I use the https://hub.meltano.com/extractors/tap-oracle--transferwise variant, maybe try that?visch
02/03/2023, 8:23 PMmeltano invoke tap-oracle > out
and try to isolate a record causing your issue. Then you can provide that to the tap maintainer to take a look atavery_dagostino
02/03/2023, 8:25 PMinvoke
functionality and it looks like my number records are coming out as stringsavery_dagostino
02/03/2023, 8:26 PMvisch
02/03/2023, 8:27 PMmeltano.yml
file?visch
02/03/2023, 8:28 PMavery_dagostino
02/03/2023, 8:30 PMversion: 1
default_environment: dev
project_id: 61b1c699-0054-4908-97e5-268c32aea971
environments:
- name: dev
- name: staging
- name: prod
plugins:
extractors:
- name: tap-oracle
variant: s7clarke10
pip_url: git+<https://github.com/s7clarke10/pipelinewise-tap-oracle.git>
config:
filter_schemas: {schema}
use_singer_decimal: false
filter_tables:
- {table}
default_replication_method: FULL_TABLE
port: 1521
host: {host}
user: {user}
service_name: {service}
loaders:
- name: target-s3-parquet
variant: gupy-io
pip_url: git+<https://github.com/gupy-io/target-s3-parquet.git>
config:
s3_path: {path}
athena_database: {db}
- name: target-csv
variant: hotgluexyz
pip_url: git+<https://github.com/hotgluexyz/target-csv.git>
config:
destination_path: output/
elt:
buffer_size: 52428800
avery_dagostino
02/03/2023, 8:30 PMavery_dagostino
02/03/2023, 8:30 PMmeltano run tap-oracle target-s3-parquet
avery_dagostino
02/03/2023, 8:31 PMmeltano invoke tap-oracle > out
I see the numbers wrapped in quotesvisch
02/03/2023, 8:34 PMuse_singer_decimal: false
probably doesn't matter so much but I'd think true
would be better.
If the number is wrapped in quotes we've got a different issue. I see you don't have a select statement which is interesting. If you run meltano invoke --dump=catalog tap-oracle > catalog.json
and find the field_name / related table_name does it have the type
you'd expect?
My assumption here is the type
isn't corret in the catalog.json but we should verify that. Then you need to go look at the schema
in oracle to verify it's set right, then depending on what's going on you may have an old cached version of the catalog? Or maybe the oracle data type isn't a numeric as you think it is?visch
02/03/2023, 8:35 PMout
file you made and look at the related schema
message, does that have the type you expect?visch
02/03/2023, 8:36 PMI can try the transferwise variant! a follow-up question on that: how would I filter tablesUse the
select
statement in meltano. https://docs.meltano.com/reference/command-line-interface#selectavery_dagostino
02/03/2023, 8:38 PM{
"multipleOf": 1e-38,
"type": [
"number"
]
},
avery_dagostino
02/03/2023, 8:39 PMavery_dagostino
02/03/2023, 8:39 PMvisch
02/03/2023, 8:42 PMoutfile
with schema and record messageavery_dagostino
02/03/2023, 8:47 PM{"type":"SCHEMA","stream":"table","schema":{"properties":{"column":{"multipleOf":1e-38,"type":["number"]}}}}
{"type":"STATE","value":{"bookmarks":{"table":{"last_replication_method":"FULL_TABLE","version":1675455842886}},"currently_syncing":"table"}}
{"type":"ACTIVATE_VERSION","stream":"table","version":1675455842886}
{"type":"RECORD","stream":"table","record":{"column":"16"}}
avery_dagostino
02/03/2023, 8:48 PM"16"
for the column value but I'd expect 16
without quotes, if I'm understanding Json schemas correctlyavery_dagostino
02/03/2023, 8:49 PMuse_singer_decimal: true
, the column becomes a string in the schema:
{
"format": "singer.decimal",
"type": [
"string"
]
}
avery_dagostino
02/03/2023, 8:49 PMvisch
02/03/2023, 8:50 PMuser_singer_decimal
I haven't messed with. Trying to look at an example on my side just hitting some unrelated thingsvisch
02/03/2023, 9:04 PMvisch
02/03/2023, 9:05 PM"EARNEDCRHRS": {"multipleOf": 1e-38, "type": ["null", "number"]}
Record
"EARNEDCRHRS": 1,
visch
02/03/2023, 9:05 PMvisch
02/03/2023, 9:06 PMavery_dagostino
02/03/2023, 9:07 PMavery_dagostino
02/03/2023, 9:08 PMservice_name
instead of an sid
in the connection config?visch
02/03/2023, 9:09 PMvisch
02/03/2023, 9:10 PMvisch
02/03/2023, 9:11 PMvisch
02/03/2023, 9:11 PMvisch
02/03/2023, 9:12 PMsid
and it might just work. But 🤷 I have to get back to my other stuff feel free to post what you findavery_dagostino
02/03/2023, 9:15 PMavery_dagostino
02/03/2023, 9:15 PMavery_dagostino
02/06/2023, 4:05 PMtap-oracle
variant and actually what I'm doing is opening a PR on the tap-s3-parquet
I'm using to properly handle singer.decimal
and datetime
data typessteve_clarke
02/13/2023, 7:46 PMsinger.decimal
notation without giving you an option some time ago. Reason numeric data was not always extracted correctly. Large numbers, numbers with a large number of decimal places, and floats were not replicated correctly (partially because of internal python handling of these numbers).
You have made the right decision sticking with singer.decimal
because testing has shown that emitting numeric data can lead to rounding issues. It is much safer to switch on singer.decimal
and use a post process like dbt to cast the data the way you would need it.
Here is another problem, if a column is defined as just number
in Oracle as a datatype. You don't know the scale (number of digits) to set on the target. If your data contains decimals a modern database will require you to define the precision and scale. It is much safer to emit the data with singer.decimal
set to true.mark_johnston
03/23/2023, 12:00 AMdecimal.Decimal
types instead of using str()
, the new version will output a number using float or int depending upon the content of the value. Here's the change:
https://github.com/mjsqu/pipelinewise-singer-python/commit/e31c16895c8900dad0bd96b98c43f9a5c6137905mark_johnston
03/23/2023, 12:02 AMNUMBER
datatype when `use_singer_decimal=False`:
{"type":"SCHEMA","stream":"DSSADM-TAP_TEST_MJ","schema":{"properties":{"COLUMN_NAME":{"multipleOf":1e-38,"type":["null","number"]}},"type":"object"},"key_properties":[]}
{"type":"RECORD","stream":"DSSADM-TAP_TEST_MJ","record":{"COLUMN_NAME":16},"version":1679528872508,"time_extracted":"2023-03-22T23:47:52.508937Z"}
steve_clarke
03/23/2023, 12:46 AMorjson.dumps
rather than jsonsimple.dumps
because it has significant speed benefit when it comes to serializing the record messages to json https://github.com/ijl/orjson#performance . This leads to an overall performance gain for the tap. The gotcha comes with handling the decimal datatypes if you don't use singer decimal. This fix should now output data in a more compliant jsonschema notation.
I would still highly recommend that anyone using tap-oracle that they enable "use_singer_decimal": true
.
Key Reasons:
1. Oracle can use a data type of just NUMBER without any scale or precision. There is no way of guessing how big the whole number component is and how big the decimal component is. It is much safer out as a string and to cast the string to the data type you want in the target.
2. Both Python number precision and Floats are unreliable it can lead to rounding issues. It is better to output the actual integer or decimal as a string and cast it in the target. We have put some hints in tap-oracle and our variant of target-snowflake to automatically cast the data in Snowflake back to a number if there was a Scale and Precision specified in the Oracle Datatype.