Q.  How do we get a decimal type into postgres? F...
# singer-tap-development
a
Q.  How do we get a decimal type into postgres? First some background - Consider the orders source in tap-shopify, where there are two properties that need types for simple aggregations. created_at - timestamp total_price - monetary amount / decimal I think the follow principles should apply to all taps: • data should arrive unchanged from the tap -> target • data should arrive with an appropriate type in the target To explain further - With no type information, the orders table will have strings as follows. created_at(varchar) | total_price(varchar) 2022-02-25T083132 | 198.00 2022-02-24T163754 | 99.00 2022-02-23T191802 | 99.00 2022-02-23T191727 | 594.00 2022-02-22T123629 | 198.00 2022-02-22T120013 | 99.00 To perform a simple aggregation by day would require the following and is particularly problematic considering support for multiple database types (postgres, snowflake, big query, etc):
Copy code
SELECT
    to_timestamp(created_at, 'YYYY-MM-DDTHH:MI:SS')::date "orders.created_at"
    , SUM(total_price::DECIMAL) "orders.sum_total_price"
  FROM orders
  WHERE to_timestamp(created_at, 'YYYY-MM-DDTHH:MI:SS')::date >= current_date - interval '30 day'
  GROUP BY "orders.created_at"
  ORDER BY "orders.created_at" DESC
I see the following possible options: • tap transforms the types, before serialising the json output  (e.g. 5.99, five dollars and ninety nine cents becomes a number type in the json payload, the risk is that some targets do not use a type with enough bits to store the value correctly) • target transforms the types, with the “format” hint from the tap the target can use the correct type in the database • transform in a data warehouse prepare stage, with dbt staging models Taking ‘created_at’ as an example ‘format’ hint is the best option -
Copy code
"created_at": {
      "format": "date-time",
      "type": "string"
    },
with this implemented the SQL aggregation is simply
Copy code
SELECT
    created_at::date "orders.created_at"
    , SUM(total_price::DECIMAL) "orders.sum_total_price"
  FROM orders
  WHERE created_at::date >= current_date - interval '30 day'
  GROUP BY created_at::date
  ORDER BY created_at::date DESC
My understanding is that a similar hint to ‘total_price’ should yield a decimal type in postgres from that target.
Copy code
"total_price": {
      "format": "singer.decimal",
      "type": "string"
    },
And a query of
Copy code
SELECT
    created_at::date "orders.created_at"
    , SUM(total_price) "orders.sum_total_price"
  FROM orders
  WHERE created_at::date >= current_date - interval '30 day'
  GROUP BY created_at::date
  ORDER BY created_at::date DESC
However, ‘singer.decimal’ in the transferwise variant doesn’t seem to be implemented.
v
https://meltano.slack.com/archives/C01PKLU5D1R/p1645648599666349 this is almost the same conversation as here right? Just more info
• What is decimal?: in this context • Who is defining what decimal is? (json schema doesn't' do this by default assumedly on purpose) • Instead of How do we get a decimal type into postgres?, can we instead say "I have data in Shopify that is a price, I'd like to get that represented in postgres, here's what I"m thinking" . Is this accurate? The source matters in this question. Database tap's are imo more interesting with these kind of questions as I think the answer for Shopify is the data that represents price is represented as a string in json by shopify's api, therefore it's a string.
tap transforms the types, before serialising the json output  (e.g. 5.99, five dollars and ninety nine cents becomes a number type in the json payload, the risk is that some targets do not use a type with enough bits to store the value correctly)
Who defines the type? Support more types to map to? Trade off is that we add subjectivity to typing, but is that valuable enough?
target transforms the types, with the “format” hint from the tap the target can use the correct type in the database
Same problem, who define the type? Just implemented a bit different
transform in a data warehouse prepare stage, with dbt staging models
The answer most everyone goes with right now that I know of
Last comment i swear. In the tap-shopify case, dbt makes sense. In the case where a source defines the datatype for you (like a database) I think we get into different territory. re https://gitlab.com/autoidm/tap-oracle/-/blob/main/tap_oracle/__init__.py#L104-132 (Different set of challenges, but similar idea, key here is that the definition are already in the source for what this data is which removes most of the subjectivity) I'm sharing most of this as I'm super curious to hear others opinions on how this should behave as I've also spent a bunch of time thinking about this!
a
Right, same question as the other thread generally, but a lot more specific to target-postgres. I think the problem is this: • a target transforms the types, with the “format” hint from the tap. It is up to the target can use the correct type in the database As an example, tap-quickbooks should have good handling of MonetaryAmounts. So this line here, in the tap https://github.com/singer-io/tap-quickbooks/blob/master/tap_quickbooks/schemas/accounts.json#L4 provides enough information to the target to turn the string value from the tap into a good database type for a Decimal. Whatever that type might be on that target database. The problem is that transferwise postgres variant has no understanding of these json schema 'format' hints for 'singer.decimal'. It does understand 'date-time' as per my example. Turning an ISO8601 timestamp from string into the postgres 'timestamp' column type
a
If I'm following this correctly, the tap stores the numeric (decimal) value as a string, with a format hint. But it's there a strong reason not to use a "number" type? https://json-schema.org/understanding-json-schema/reference/numeric.html
I've seen "number" combined with "multipleOf" to inform decimal precision, for example multiple of 0.001 means three significant digits after the decimal.
a
Yes, tap fetches a string value from the api endpoint. I have two concerns to using a number and multiple of: 1. How do I cleanly convert the string to number with the SDK. I saw a post processor or maybe a mapper? 2. Does multipleOf actually work? https://stackoverflow.com/questions/40886252/defining-a-json-schema-currency-type
Results of some more testing. 1. Implemented a post processor as follows:
Copy code
def post_process(self, row: dict, context: Optional[dict] = None) -> Optional[dict]:
        super().post_process(row, context)
        row["subtotal_price"] = Decimal(row["subtotal_price"])
        row["total_price"] = Decimal(row["total_price"])
        return row
2. 'multipleOf'
Copy code
"total_price": {
      "type": "number",
      "minimum": 0,
      "multipleOf": 0.01
    },
    "subtotal_price": {
      "type": "number"
    },
The validation of 0.01 multiple seems to work ok. However, the column type in postgres is float8. This needs to be a decimal or numeric type to accurately store monetary values. You may see issues with a single cent lost here and there when the amounts are small. You'll see the issue a lot more often in Indonesian Rupiah where 1 Rupiah equals 0.000070 United States Dollars This leads my to conclude that there needs to be better handling of decimal types in the default postgres loader (transferwise variant). When the default loader was the meltano variant, all number types were stored as numeric. Thoughts? • just patch the transferwise variant to default to numeric for number types perhaps? • implement handling for 'singer.decimal' • implement all loaders with good tests in the SDK?
v
You may see issues with a single cent lost here and there when the amounts are small. You'll see the issue a lot more often in Indonesian Rupiah where 1 Rupiah equals 0.000070 United States Dollars
Can you make a more concrete example? Do you mean that the tap would send the wrong information due to precision issues? Like if the number is 0.000070
Copy code
value: "0.000070"
you're saying this wouldn't work?
Or wouldn't pass validation if multipleOf was set to
.00001
My favorite way to set these kinds of things up with Singer is to just make a file with the SCHEMA and RECORD row, and then show the issue with the target. Is it validation? is it type conversation
a
Hi @visch the issue with Indonesian Rupiah (any weak currency) is that you blow the 15 digits decimal precision (for float8) very quickly. A Big Mac cost 50,470 Rupiah. A numeric or decimal type in postgres has up to 131072 digits before the decimal point; up to 16383 digits after the decimal point. This issue that needs to be addressed (in my opinion), is that the float type is the wrong choice in this case and the tap should be able to hint to the target what a sensible type should be. 1. The SDK would be better with standard types for decimals / monetary amounts (we've created an issue and MR for that) 2. There is a way to pass the hints - i.e. though the json schema 'format'. This particular target doesn't have any support for a decimal 'format' or any other formats except 'date-time'. Without Singer standard types defined, the targets have no standard to conform to so all target are going to have different behaviours.
v
I understand floats, I'm trying to understand your issue. I think you're saying the issue is with the target data type, not with the tap data. Is that right? You're worried about target-postgres pushing data into a field with the type of float8. I think you're focusing on the singer representation of the data in-between, my suggestion is we start at the target data type we want. In this case in target-postgres you'd like the data type to be
NUMERIC
correct? I want to start at some solid ground instead of pointing at the json schema for singer. To me solid ground is the data from the tap, and how the data is passed into the target. Focusing on the inbetween layer comes after we define what we want imo
a
Good plan. How's this? As a data analyst, I need NUMERIC total_price for orders from shopify in my postgres database. My environment: • Meltano SDK tap https://github.com/Matatika/tap-shopify • Default Meltano target-postgres https://github.com/transferwise/pipelinewise-target-postgres Example:
Copy code
meltano run tap-shopify target-postgres
Copy code
SELECT
    created_at::date "orders.created_at"
    , SUM(total_price) "orders.sum_total_price"
  FROM orders
  WHERE created_at::date >= current_date - interval '30 day'
  GROUP BY created_at::date
  ORDER BY created_at::date DESC
v
Awesome helps me ground myself here! So target-postgres should use numeric for your use case https://www.postgresql.org/docs/9.0/datatype-numeric.html . Looks like default precision will work. Next thing to look at , doesthe default variant of target-postgres currently implement numeric in a way that's helpful? https://github.com/transferwise/pipelinewise-target-postgres/blob/5e8891d7a839c35dbc4e3d3e5dd021db483afac0/target_postgres/db_sync.py#L68 nope. Does another variant do this? https://github.com/meltano/target-postgres/blob/master/target_postgres/__init__.py#L49 looks to be our winner. Uses multipleOf, and minimum and maximum. Cool so now we know we need to use https://github.com/meltano/target-postgres/blob/master/target_postgres/__init__.py postgres variant (hopefully that works for your use case), and we know we need to provide a
number
with
multipleOf
,
minimum
and
maximum
) That also answers what the singer representation is for this target (We'll hold aside the argument about what should/shouldn't be there as this target forces our hand) --- tap-shopify now needs to provide this data in the correct format. I don't know much about tap-shopify, took a quick peak at docs here https://shopify.dev/api/admin-rest/2022-01/resources/order#resource-object. Found price json example here
Copy code
"current_total_price": "10.00",
  "current_total_price_set": {
    "current_total_price_set": {
      "shop_money": {
        "amount": "30.00",
        "currency_code": "CAD"
      },
      "presentment_money": {
        "amount": "20.00",
        "currency_code": "EUR"
      }
    }
  }
In this example the price is given as a string. This makes all of the target-postgres stuff we did ahead of this moot as the API provides the data as a string. Now we could argue that we should convert from a string to currency in the tap (I think that's a stretch, as dbt is the normal spot for this but I'll go with it), we'd need to add some subjectivity here and pick the fields with _price and then create the corresponding
number
with
multipleOf
,
maximum
,
minimum
values for that.
a
Thanks @visch, really helpful. I'll update here on any wisdom from the office hours.
a
Thanks @aaron_phethean for raising this important topic and bringing to office hours. As discussed, here's the logged issue from our conversation: https://gitlab.com/meltano/sdk/-/issues/342
a
Thanks @aaronsteers I added a comment to the issue with all the useful parts from this thread. Thanks @visch for all the excellent input!
v
Fun chatting @aaron_phethean good to hear all the view points during office hours. I lean towards string for tap-shopfiy but at a minimum the implications here for other taps are very interesting (and it sounds like a decent number of folks think converting at the tap is the right spot so maybe I need to update the way I think about this!). Database taps is where I think about this stuff a lot