BuzzCutNorman
01/24/2023, 4:17 PMsql-datatype
value to the to_sql_type
function of a loader/target. I have poked around the SDK and found a shema
variable/parameter that contain's the stream/table schema properties but I haven't found where it is initially generated so I could add the sql-datatype
into it. The other option was to pull it from the schema message again but I haven't come across a clear path to accomplish that either. I am betting its one of those times the answer is staring at me and laughing as I look past it over and over again.visch
01/24/2023, 9:11 PMSCHEMA
messages, not catalogs.
It's possible to extend SCHEMA
message to include the actual sql-datatype
and use that in the target. That would be an extension of the spec but maybe not horrible.
I've done this via the description field for a BLOB but it is hacky.
What target are you trying to use? I'm guessing it's a DB target? What column type are you having issues with? We maybe able to get to a better answer if we're lucky and start thereBuzzCutNorman
01/24/2023, 9:17 PMmeltano invoke tap-mssql --discover
returns which I thought should match up with the schema message passed to the target by the tap.
{
"tap_stream_id": "dbo-tags",
"table_name": "tags",
"replication_method": "",
"key_properties": [
"Id"
],
"schema": {
"properties": {
"Id": {
"type": [
"integer"
]
},
"TagName": {
"type": [
"string",
"null"
]
},
"Count": {
"type": [
"integer",
"null"
]
},
"ExcerptPostId": {
"type": [
"integer",
"null"
]
},
"WikiPostId": {
"type": [
"integer",
"null"
]
}
},
"type": "object",
"required": [
"Id"
]
},
"is_view": false,
"stream": "dbo-tags",
"metadata": [
{
"breadcrumb": [
"properties",
"Id"
],
"metadata": {
"inclusion": "automatic",
"sql-datatype": "INTEGER()"
}
},
{
"breadcrumb": [
"properties",
"TagName"
],
"metadata": {
"inclusion": "available",
"sql-datatype": "VARCHAR()"
}
},
{
"breadcrumb": [
"properties",
"Count"
],
"metadata": {
"inclusion": "available",
"sql-datatype": "INTEGER()"
}
},
{
"breadcrumb": [
"properties",
"ExcerptPostId"
],
"metadata": {
"inclusion": "available",
"sql-datatype": "INTEGER()"
}
},
{
"breadcrumb": [
"properties",
"WikiPostId"
],
"metadata": {
"inclusion": "available",
"sql-datatype": "INTEGER()"
}
},
{
"breadcrumb": [],
"metadata": {
"inclusion": "available",
"table-key-properties": [
"Id"
],
"forced-replication-method": "",
"schema-name": "dbo"
}
}
]
},
visch
01/24/2023, 9:29 PMSCHEMA
message should work fine. What's the issue you're having with the normal singer spec here?visch
01/24/2023, 9:30 PMBuzzCutNorman
01/24/2023, 9:33 PMBuzzCutNorman
01/24/2023, 9:41 PMsql-datatype
built into the spec I think to address this so I don't have an issue with the spec just was trying to implement this piece of it.visch
01/24/2023, 9:54 PMscale and precision of decimals and numericsThis is actually handled pretty good with
number
using multipleOf
Here's an implementation https://gitlab.com/autoidm/autoidm-target-mssql/-/blob/main/target_mssql/streams.py#L200-210
the byte size of varchars and nvarchars
maxLength
handles this https://gitlab.com/autoidm/autoidm-target-mssql/-/blob/main/target_mssql/streams.py#L195 for String types
if the string was a varchar or nvarchar on the source side.Is a bit deeper. Source side: I don't think this is written down but generally UTF-8 is already what JSON is encoded in. re https://www.rfc-editor.org/rfc/rfc8259
JSON text exchanged between systems that are not part of a closed
ecosystem MUST be encoded using UTF-8 [RFC3629].
I haven't seen folks go too deep on this but that's the general "rule" from the source side.
On the target side (target-mssql):
Maybe we should just go with NVARCHAR
as the string type and call it a day?
Singer has a theSinger defines this in thebuilt into the spec I think to address this so I don't have an issue with the spec just was trying to implement this piece of it.sql-datatype
catalog
, not in the schema
. `schema`'s are the only thing that's sent to the target to use for type informationvisch
01/24/2023, 9:55 PMBLOB
and others and I'm not certain of a great solutionBuzzCutNorman
01/24/2023, 10:10 PMsql-datatype
. Which from what you are saying I would have add sql-datatype
to the schema message sent by the tap since the metadata is not natively included. The metadata is only used on the tap side before sending a schema message to the target? The schema message spec doesn't normally allow for this from the sounds of it ?
https://github.com/BuzzCutNorman/tap-mssql/blob/328d375bcc95f8bb9a8abf86fd8bbd597e5f3990/tap_mssql/client.py#L79-L202visch
01/24/2023, 10:52 PMThe metadata is only used on the tap side before sending a schema message to the target?Yes
The schema message spec doesn't normally allow for this from the sounds of it ?Well It's a JSON Schema so you can do a lot with it (Json Schema Draft 4) , there's nothing saying you couldn't use the
description
field but I think a custom
type would be better (I think that's possible with draft4)
The issue isn't so much the custom types as it is with every target should work with every tap. Therefore custom types from a tap
are really a no go. But maybe there could be a "hints" mechanismBuzzCutNorman
01/24/2023, 11:12 PMvisch
01/24/2023, 11:30 PMmanifest.json
and implement the same thing.
I don't love that eitherBuzzCutNorman
01/24/2023, 11:33 PMBuzzCutNorman
01/24/2023, 11:38 PMaaronsteers
01/25/2023, 12:24 AMaaronsteers
01/25/2023, 12:27 AMsql-datatype
is treated as an 'annotation' to document upstream type, and not necessarily as a dictate of downstream storage type.aaronsteers
01/25/2023, 12:28 AMaaronsteers
01/25/2023, 12:29 AMsql-datatype
as a creation hint (although some might), in part because different systems will size the same type differently, and INTEGER
isn't going to be the same size on every source and target system.aaronsteers
01/25/2023, 12:31 AMmultipleOf
hint to provide transferrable precision constraints, such as multipleOf: 0.01
being used for something decimal-like with two precision points.aaronsteers
01/25/2023, 12:32 AMsql-datatype
in the target to give a high-fidelity copy of data types as well as the data itself.aaronsteers
01/25/2023, 12:34 AMHere is whatA good test before moving forward would be to double check that when you run a sync likereturnsmeltano invoke tap-mssql --discover
meltano invoke tap-mssql > outfile.jsonl
that you also see your sql-datatype
annotations in the SCHEMA
messages - since those are what get passed to the target.aaronsteers
01/25/2023, 12:40 AMsql-datatype
annotations are reaching the target, you could use those in the target layer to guide type selection.
If other users are going to leverage the same target, this would need to be an opt-in config decision from the user, because the default behavior should not assume the tightest possible column fit, and also would need the users' blessing to know the types of the upstream type are compatible with the target's type system. And related to the above multipleOf
discussion, this would presumably override those native JSON Schema hints in favor of trying to honor whereever possible the received sql-datatype
annotation in the schema.aaronsteers
01/25/2023, 12:43 AMvisch
01/25/2023, 12:56 AMsql-datatype
annotations in the schema
I think. I didn't dive into the SDK to look maybe you know off hand?aaronsteers
01/25/2023, 1:16 AMsql-datatype
is getting into the SCHEMA
message but @BuzzCutNorman would need to confirm that.BuzzCutNorman
01/25/2023, 1:20 AMmeltano invoke tap-mssql > outfile.jsonl
but will this evening sometime and let you know what I get.BuzzCutNorman
01/25/2023, 3:14 AMsql-datatype
is not present in the SCHEMA
message sent by my tap-mssql.
{
"type": "SCHEMA",
"stream": "dbo-tags",
"schema": {
"properties": {
"Id": {
"type": [
"integer"
]
},
"TagName": {
"type": [
"string",
"null"
]
},
"Count": {
"type": [
"integer",
"null"
]
},
"ExcerptPostId": {
"type": [
"integer",
"null"
]
},
"WikiPostId": {
"type": [
"integer",
"null"
]
}
},
"type": "object",
"required": [
"Id"
]
},
"key_properties": [
"Id"
]
}
aaronsteers
01/25/2023, 4:10 AMaaronsteers
01/25/2023, 4:12 AMBuzzCutNorman
01/25/2023, 4:20 PM