# plugins-general
p
Hey folks, just wanted to share some WIP I put forth on tap-mysql for representing the precision/scale of a DECIMAL field: https://github.com/transferwise/pipelinewise-tap-mysql/pull/108. This follows up on target-bigquery putting in some work to infer DECIMAL scale from the `multipleOf` field: https://github.com/adswerve/target-bigquery/pull/23. However, we noted that it's not possible to correctly define a field with precision using just `multipleOf`; you also need to be able to compute the precision. I'm not familiar with how other plugins tackle this; it seems like it should probably go at the `singer` library level to be honest, as I think every tap should really implement logic like this. Any pointers to prior art here that might be useful to take into account?
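For concreteness, here's roughly how a DECIMAL column could be described with `multipleOf` plus `maximum`/`minimum` bounds. This is only an illustrative sketch; the exact property choices in the PRs above may differ.

```python
# Sketch of a JSON Schema fragment for a MySQL DECIMAL(12, 2) column.
# `multipleOf` encodes the scale; `maximum`/`minimum` bound the integer part,
# which together give the precision.
decimal_12_2 = {
    "type": ["number", "null"],
    "multipleOf": 0.01,          # 2 fractional digits (scale)
    "maximum": 9999999999.99,    # 10 integer digits + 2 fractional digits = precision 12
    "minimum": -9999999999.99,
}
```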
I'm not sure I follow you on this point:
> However we noted that it's not possible to correctly define a field with precision using just `multipleOf`; you also need to be able to compute the precision.

Can you say more about this?
p
Sure, let's say you have a `DECIMAL(40,2)`. If you do `tap-mysql | target-bigquery`, `multipleOf` can get you `DECIMAL(???, 2)`, i.e. the scale, but not the precision. In JSON Schema, the only way I found to represent the precision was `maximum`/`minimum`. (It seems you guys are thinking along the same lines in the meltano target-postgres you linked.) The target can take a guess here if the precision is not specified, but then in the worst case you'll have an invalid schema that looks correct when you run your first pipeline, yet produces a runtime loader error once you encounter a row bigger than the target schema supports (for example, if you pick BigQuery NUMERIC but the source table actually requires BIGNUMERIC to hold every possible row). This should really fail at "compile time", i.e. the target should reject the initial schema message if it can't represent it correctly, or at the very least produce a warning for the operator to resolve.
f
I've just noticed that all our fixed-precision NUMERIC/DECIMAL types in PostgreSQL have resulted in FLOAT columns in our Snowflake target. This seems like a simpler case than distinguishing NUMERIC from BIGNUMERIC in BigQuery, since Snowflake effectively has a single NUMBER type for all fixed-precision use cases. Am I effectively running into the same issue of precision/scale being lost in translation, or something different?
m
I'm working on a tap-db2 -> target-snowflake implementation and have gone for the following approach to setting up different data types in Snowflake without losing precision in the RECORD messages:
• DECIMALs from DB2 are given the format property "singer.decimal" in the schema, with type property "number". The numeric precision & scale are stored as `schema[additionalProperties][scale_precision]`, e.g. `"additionalProperties": {"scale_precision": "(15,9)"}`
• When the tap writes the RECORD message, "singer.decimal" tells it to output the data value into the message as a string
• The Snowflake target obtains the precision/scale from the schema message to create a column of type NUMBER(precision, scale)
The idea for the singer.decimal format came from tap-oracle.
Forks: tap-db2 (feature/singer_decimal), target-snowflake (feature/snowflake_number_type)
Pip_url(s) for meltano.yml if you would like to test (there's a rough sketch of the tap/target flow after the snippet below):
```yaml
pip_url: git+https://github.com/mjsqu/tap-db2.git@feature/singer_decimal
pip_url: git+https://github.com/mjsqu/pipelinewise-target-snowflake@feature/snowflake_number_type
```
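If it's useful, here's a rough sketch of how that handshake could look on both sides, assuming the schema shape described above (the `"singer.decimal"` format plus the `scale_precision` string under `additionalProperties`). The function names are hypothetical and not taken from the forked repos.

```python
import re
from decimal import Decimal

# Tap side: for "singer.decimal" fields, write the value into the RECORD
# message as a string so no precision is lost in JSON serialization.
def serialize_decimal(value: Decimal) -> str:
    return str(value)

# Target side: read "(precision,scale)" back out of the SCHEMA message and
# build the Snowflake column type from it.
def snowflake_column_type(column_schema: dict) -> str:
    hint = column_schema.get("additionalProperties", {}).get("scale_precision", "")
    match = re.match(r"\((\d+)\s*,\s*(\d+)\)", hint)
    if not match:
        return "FLOAT"  # fall back if the hint is missing
    precision, scale = match.groups()
    return f"NUMBER({precision}, {scale})"

# Example: a DECIMAL(15,9) column from DB2 becomes NUMBER(15, 9) in Snowflake.
schema = {
    "type": "number",
    "format": "singer.decimal",
    "additionalProperties": {"scale_precision": "(15,9)"},
}
assert snowflake_column_type(schema) == "NUMBER(15, 9)"
```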