avery_dagostino
02/03/2023, 8:18 PMtap-oracle and I'm having trouble with numeric data types. If I set use_singer_decimal = true, then my Number types get loaded into the target as strings, which I don't want. But if I set use_singer_decimal = false, then I get JSON errors with Number datatypes such as:
jsonschema.exceptions.ValidationError: '16' is not of type 'number'
but 16 is a number, and I don't think it's being wrapped in a string or anything, because I don't get the same error when working with integers, only numbers.
any ideas how I can work around this?visch
02/03/2023, 8:21 PMtap-oracle and push numerics all the time (almost 2 years in production now. I use the https://hub.meltano.com/extractors/tap-oracle--transferwise variant, maybe try that?visch
02/03/2023, 8:23 PMmeltano invoke tap-oracle > out and try to isolate a record causing your issue. Then you can provide that to the tap maintainer to take a look atavery_dagostino
02/03/2023, 8:25 PMinvoke functionality and it looks like my number records are coming out as stringsavery_dagostino
02/03/2023, 8:26 PMvisch
02/03/2023, 8:27 PMmeltano.yml file?visch
02/03/2023, 8:28 PMavery_dagostino
02/03/2023, 8:30 PMversion: 1
default_environment: dev
project_id: 61b1c699-0054-4908-97e5-268c32aea971
environments:
- name: dev
- name: staging
- name: prod
plugins:
extractors:
- name: tap-oracle
variant: s7clarke10
pip_url: git+<https://github.com/s7clarke10/pipelinewise-tap-oracle.git>
config:
filter_schemas: {schema}
use_singer_decimal: false
filter_tables:
- {table}
default_replication_method: FULL_TABLE
port: 1521
host: {host}
user: {user}
service_name: {service}
loaders:
- name: target-s3-parquet
variant: gupy-io
pip_url: git+<https://github.com/gupy-io/target-s3-parquet.git>
config:
s3_path: {path}
athena_database: {db}
- name: target-csv
variant: hotgluexyz
pip_url: git+<https://github.com/hotgluexyz/target-csv.git>
config:
destination_path: output/
elt:
buffer_size: 52428800avery_dagostino
02/03/2023, 8:30 PMavery_dagostino
02/03/2023, 8:30 PMmeltano run tap-oracle target-s3-parquetavery_dagostino
02/03/2023, 8:31 PMmeltano invoke tap-oracle > out I see the numbers wrapped in quotesvisch
02/03/2023, 8:34 PMuse_singer_decimal: false probably doesn't matter so much but I'd think true would be better.
If the number is wrapped in quotes we've got a different issue. I see you don't have a select statement which is interesting. If you run meltano invoke --dump=catalog tap-oracle > catalog.json and find the field_name / related table_name does it have the type you'd expect?
My assumption here is the type isn't corret in the catalog.json but we should verify that. Then you need to go look at the schema in oracle to verify it's set right, then depending on what's going on you may have an old cached version of the catalog? Or maybe the oracle data type isn't a numeric as you think it is?visch
02/03/2023, 8:35 PMout file you made and look at the related schema message, does that have the type you expect?visch
02/03/2023, 8:36 PMI can try the transferwise variant! a follow-up question on that: how would I filter tablesUse the
select statement in meltano. https://docs.meltano.com/reference/command-line-interface#selectavery_dagostino
02/03/2023, 8:38 PM{
"multipleOf": 1e-38,
"type": [
"number"
]
},avery_dagostino
02/03/2023, 8:39 PMavery_dagostino
02/03/2023, 8:39 PMvisch
02/03/2023, 8:42 PMoutfile with schema and record messageavery_dagostino
02/03/2023, 8:47 PM{"type":"SCHEMA","stream":"table","schema":{"properties":{"column":{"multipleOf":1e-38,"type":["number"]}}}}
{"type":"STATE","value":{"bookmarks":{"table":{"last_replication_method":"FULL_TABLE","version":1675455842886}},"currently_syncing":"table"}}
{"type":"ACTIVATE_VERSION","stream":"table","version":1675455842886}
{"type":"RECORD","stream":"table","record":{"column":"16"}}avery_dagostino
02/03/2023, 8:48 PM"16" for the column value but I'd expect 16 without quotes, if I'm understanding Json schemas correctlyavery_dagostino
02/03/2023, 8:49 PMuse_singer_decimal: true , the column becomes a string in the schema:
{
"format": "singer.decimal",
"type": [
"string"
]
}avery_dagostino
02/03/2023, 8:49 PMvisch
02/03/2023, 8:50 PMuser_singer_decimal I haven't messed with. Trying to look at an example on my side just hitting some unrelated thingsvisch
02/03/2023, 9:04 PMvisch
02/03/2023, 9:05 PM"EARNEDCRHRS": {"multipleOf": 1e-38, "type": ["null", "number"]}
Record
"EARNEDCRHRS": 1,visch
02/03/2023, 9:05 PMvisch
02/03/2023, 9:06 PMavery_dagostino
02/03/2023, 9:07 PMavery_dagostino
02/03/2023, 9:08 PMservice_name instead of an sid in the connection config?visch
02/03/2023, 9:09 PMvisch
02/03/2023, 9:10 PMvisch
02/03/2023, 9:11 PMvisch
02/03/2023, 9:11 PMvisch
02/03/2023, 9:12 PMsid and it might just work. But 🤷 I have to get back to my other stuff feel free to post what you findavery_dagostino
02/03/2023, 9:15 PMavery_dagostino
02/03/2023, 9:15 PMavery_dagostino
02/06/2023, 4:05 PMtap-oracle variant and actually what I'm doing is opening a PR on the tap-s3-parquet I'm using to properly handle singer.decimal and datetime data typessteve_clarke
02/13/2023, 7:46 PMsinger.decimal notation without giving you an option some time ago. Reason numeric data was not always extracted correctly. Large numbers, numbers with a large number of decimal places, and floats were not replicated correctly (partially because of internal python handling of these numbers).
You have made the right decision sticking with singer.decimal because testing has shown that emitting numeric data can lead to rounding issues. It is much safer to switch on singer.decimal and use a post process like dbt to cast the data the way you would need it.
Here is another problem, if a column is defined as just number in Oracle as a datatype. You don't know the scale (number of digits) to set on the target. If your data contains decimals a modern database will require you to define the precision and scale. It is much safer to emit the data with singer.decimal set to true.mark_johnston
03/23/2023, 12:00 AMdecimal.Decimal types instead of using str(), the new version will output a number using float or int depending upon the content of the value. Here's the change:
https://github.com/mjsqu/pipelinewise-singer-python/commit/e31c16895c8900dad0bd96b98c43f9a5c6137905mark_johnston
03/23/2023, 12:02 AMNUMBER datatype when `use_singer_decimal=False`:
{"type":"SCHEMA","stream":"DSSADM-TAP_TEST_MJ","schema":{"properties":{"COLUMN_NAME":{"multipleOf":1e-38,"type":["null","number"]}},"type":"object"},"key_properties":[]}
{"type":"RECORD","stream":"DSSADM-TAP_TEST_MJ","record":{"COLUMN_NAME":16},"version":1679528872508,"time_extracted":"2023-03-22T23:47:52.508937Z"}steve_clarke
03/23/2023, 12:46 AMorjson.dumps rather than jsonsimple.dumps because it has significant speed benefit when it comes to serializing the record messages to json https://github.com/ijl/orjson#performance . This leads to an overall performance gain for the tap. The gotcha comes with handling the decimal datatypes if you don't use singer decimal. This fix should now output data in a more compliant jsonschema notation.
I would still highly recommend that anyone using tap-oracle that they enable "use_singer_decimal": true .
Key Reasons:
1. Oracle can use a data type of just NUMBER without any scale or precision. There is no way of guessing how big the whole number component is and how big the decimal component is. It is much safer out as a string and to cast the string to the data type you want in the target.
2. Both Python number precision and Floats are unreliable it can lead to rounding issues. It is better to output the actual integer or decimal as a string and cast it in the target. We have put some hints in tap-oracle and our variant of target-snowflake to automatically cast the data in Snowflake back to a number if there was a Scale and Precision specified in the Oracle Datatype.