# singer-target-development
b
I was wondering if someone could point me in the proper direction. I have a SQL extractor/tap that places a column's `sql-datatype` into the metadata section of the catalog generated during database discovery. I am happy I have this information in the catalog, but now that it is there I can't figure out how to get the `sql-datatype` value to the `to_sql_type` function of a loader/target. I have poked around the SDK and found a `schema` variable/parameter that contains the stream/table schema properties, but I haven't found where it is initially generated so I could add the `sql-datatype` into it. The other option was to pull it from the schema message again, but I haven't come across a clear path to accomplish that either. I am betting it's one of those times the answer is staring at me and laughing as I look past it over and over again.
v
We communicate types via `SCHEMA` messages, not catalogs. It's possible to extend the `SCHEMA` message to include the actual `sql-datatype` and use that in the target. That would be an extension of the spec, but maybe not horrible. I've done this via the description field for a BLOB, but it is hacky. What target are you trying to use? I'm guessing it's a DB target? What column type are you having issues with? We may be able to get to a better answer if we're lucky and start there.
b
I am using my target-mssql at the moment. Right now I am doing simple column types like INTEGER() and VARCHAR(). Here is what `meltano invoke tap-mssql --discover` returns, which I thought should match up with the schema message passed to the target by the tap.
{
      "tap_stream_id": "dbo-tags",
      "table_name": "tags",
      "replication_method": "",
      "key_properties": [
        "Id"
      ],
      "schema": {
        "properties": {
          "Id": {
            "type": [
              "integer"
            ]
          },
          "TagName": {
            "type": [
              "string",
              "null"
            ]
          },
          "Count": {
            "type": [
              "integer",
              "null"
            ]
          },
          "ExcerptPostId": {
            "type": [
              "integer",
              "null"
            ]
          },
          "WikiPostId": {
            "type": [
              "integer",
              "null"
            ]
          }
        },
        "type": "object",
        "required": [
          "Id"
        ]
      },
      "is_view": false,
      "stream": "dbo-tags",
      "metadata": [
        {
          "breadcrumb": [
            "properties",
            "Id"
          ],
          "metadata": {
            "inclusion": "automatic",
            "sql-datatype": "INTEGER()"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "TagName"
          ],
          "metadata": {
            "inclusion": "available",
            "sql-datatype": "VARCHAR()"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "Count"
          ],
          "metadata": {
            "inclusion": "available",
            "sql-datatype": "INTEGER()"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "ExcerptPostId"
          ],
          "metadata": {
            "inclusion": "available",
            "sql-datatype": "INTEGER()"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "WikiPostId"
          ],
          "metadata": {
            "inclusion": "available",
            "sql-datatype": "INTEGER()"
          }
        },
        {
          "breadcrumb": [],
          "metadata": {
            "inclusion": "available",
            "table-key-properties": [
              "Id"
            ],
            "forced-replication-method": "",
            "schema-name": "dbo"
          }
        }
      ]
    },
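A hedged sketch of reading the per-column `sql-datatype` values back out of a catalog entry like the one above; the helper name is my own invention, not an SDK function. Column-level metadata lives under a `["properties", "<column>"]` breadcrumb, so extracting it is a small walk over the metadata array:

```python
def sql_datatypes_from_catalog_entry(entry: dict) -> dict:
    """Map column name -> sql-datatype from a catalog entry's metadata.

    Keeps metadata items whose breadcrumb is ["properties", "<column>"]
    and that carry a "sql-datatype" key; the stream-level item (empty
    breadcrumb) is skipped.
    """
    result = {}
    for item in entry.get("metadata", []):
        breadcrumb = item.get("breadcrumb", [])
        meta = item.get("metadata", {})
        if len(breadcrumb) == 2 and breadcrumb[0] == "properties":
            if "sql-datatype" in meta:
                result[breadcrumb[1]] = meta["sql-datatype"]
    return result


# Trimmed-down version of the catalog entry shown above.
entry = {
    "metadata": [
        {"breadcrumb": ["properties", "Id"],
         "metadata": {"inclusion": "automatic", "sql-datatype": "INTEGER()"}},
        {"breadcrumb": ["properties", "TagName"],
         "metadata": {"inclusion": "available", "sql-datatype": "VARCHAR()"}},
        {"breadcrumb": [], "metadata": {"schema-name": "dbo"}},
    ]
}
print(sql_datatypes_from_catalog_entry(entry))
# {'Id': 'INTEGER()', 'TagName': 'VARCHAR()'}
```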
v
For INTEGER and VARCHAR, those are handled very well today in the JSON spec, so the `SCHEMA` message should work fine. What's the issue you're having with the normal Singer spec here?
I guess specifically, what's the issue you're having with JSON Schema types?
b
I am trying to get a more accurate mapping between SQL database tables. I want to be able to keep the scale and precision of decimals and numerics, the byte size of varchars and nvarchars, and also whether the string was a varchar or nvarchar on the source side.
Singer has `sql-datatype` built into the spec, I think to address this, so I don't have an issue with the spec; I was just trying to implement this piece of it.
v
> scale and precision of decimals and numerics

This is actually handled pretty well with `number` using `multipleOf`. Here's an implementation: https://gitlab.com/autoidm/autoidm-target-mssql/-/blob/main/target_mssql/streams.py#L200-210

> the byte size of varchars and nvarchars

`maxLength` handles this for String types: https://gitlab.com/autoidm/autoidm-target-mssql/-/blob/main/target_mssql/streams.py#L195
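As a sketch of how far the stock JSON Schema keywords get you, here is a minimal `to_sql_type`-style mapping that derives a DECIMAL scale from `multipleOf` and a VARCHAR length from `maxLength`. The function name and the fallback types are my assumptions, not the linked implementation verbatim:

```python
from decimal import Decimal


def jsonschema_to_mssql_type(prop: dict) -> str:
    """Rough JSON Schema property -> MSSQL type, using multipleOf/maxLength."""
    types = prop.get("type", [])
    if isinstance(types, str):
        types = [types]
    if "number" in types and "multipleOf" in prop:
        # multipleOf 0.01 -> exponent -2 -> two decimal places -> DECIMAL(38, 2)
        scale = -Decimal(str(prop["multipleOf"])).as_tuple().exponent
        return f"DECIMAL(38, {scale})"
    if "string" in types and "maxLength" in prop:
        return f"VARCHAR({prop['maxLength']})"
    if "integer" in types:
        return "BIGINT"
    return "NVARCHAR(MAX)"  # generic catch-all


print(jsonschema_to_mssql_type({"type": ["number"], "multipleOf": 0.01}))
# DECIMAL(38, 2)
print(jsonschema_to_mssql_type({"type": ["string", "null"], "maxLength": 50}))
# VARCHAR(50)
```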
> if the string was a varchar or nvarchar on the source side.

This one is a bit deeper. Source side: I don't think this is written down, but generally UTF-8 is already what JSON is encoded in; per https://www.rfc-editor.org/rfc/rfc8259:

> JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

I haven't seen folks go too deep on this, but that's the general "rule" from the source side. On the target side (target-mssql): maybe we should just go with `NVARCHAR` as the string type and call it a day?
> Singer has `sql-datatype` built into the spec, I think to address this, so I don't have an issue with the spec; I was just trying to implement this piece of it.

Singer defines this in the `catalog`, not in the `schema`. `schema`s are the only thing that's sent to the target to use for type information.
Where I do see a need for this is custom types like `BLOB` and others, and I'm not certain of a great solution.
b
I get what you are saying about JSON, and it can take you a long way. Below is what I got done for tap-mssql before I started down the path of trying to utilize `sql-datatype`. From what you are saying, I would have to add `sql-datatype` to the schema message sent by the tap, since the metadata is not natively included. The metadata is only used on the tap side before sending a schema message to the target? The schema message spec doesn't normally allow for this, from the sounds of it? https://github.com/BuzzCutNorman/tap-mssql/blob/328d375bcc95f8bb9a8abf86fd8bbd597e5f3990/tap_mssql/client.py#L79-L202
v
> The metadata is only used on the tap side before sending a schema message to the target?

Yes.

> The schema message spec doesn't normally allow for this, from the sounds of it?

Well, it's a JSON Schema, so you can do a lot with it (JSON Schema Draft 4); there's nothing saying you couldn't use the `description` field, but I think a custom type would be better (I think that's possible with Draft 4). The issue isn't so much the custom types as it is that every target should work with every tap. Therefore, custom types from a tap are really a no-go. But maybe there could be a "hints" mechanism.
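One shape such a "hints" mechanism could take on the tap side is folding the catalog metadata into the `SCHEMA` message's properties under a custom annotation key. JSON Schema validators ignore unknown keywords, so a key like `x-sql-datatype` (my invention, not part of any spec or SDK) would pass through harmlessly to targets that don't understand it. A hedged sketch:

```python
def annotate_schema_with_sql_datatypes(schema: dict, metadata: list) -> dict:
    """Copy each column's sql-datatype from catalog metadata into the
    matching JSON Schema property as an "x-sql-datatype" annotation.

    Returns a new schema dict; the input schema is left untouched.
    Unknown keywords are ignored by JSON Schema validators, so targets
    unaware of the hint are unaffected.
    """
    annotated = {
        **schema,
        "properties": {k: dict(v) for k, v in schema.get("properties", {}).items()},
    }
    for item in metadata:
        breadcrumb = item.get("breadcrumb", [])
        if len(breadcrumb) == 2 and breadcrumb[0] == "properties":
            sql_type = item.get("metadata", {}).get("sql-datatype")
            column = breadcrumb[1]
            if sql_type and column in annotated["properties"]:
                annotated["properties"][column]["x-sql-datatype"] = sql_type
    return annotated


schema = {"type": "object", "properties": {"Id": {"type": ["integer"]}}}
metadata = [{"breadcrumb": ["properties", "Id"],
             "metadata": {"inclusion": "automatic", "sql-datatype": "INTEGER()"}}]
print(annotate_schema_with_sql_datatypes(schema, metadata))
```

The same idea works for a "hint" of any granularity (charset, nullability, collation), as long as both sides agree on the key.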
b
Ah, I see. Thanks for the explanation. I feel a little better knowing I wasn't crazy and didn't miss anything on the target side. The bummer is I am kind of back where I was a while back; at least I am armed with more info this time. I'm starting to wonder if effort put toward coming up with common/standardized JSON Schema fingerprints for ANSI SQL types would be better in the long run.
v
I wonder if we could make sure whatever types we chose in JSON Schema could be lossless. Then if you really wanted exact types, you could ingest the `manifest.json` and implement the same thing. I don't love that either.
b
I know; each time I look at this and think I have a great idea, it falls apart on me. I am starting to feel like Charlie Brown getting the football pulled away by Lucy.
Like you said, I want a way to communicate exact SQL types that works with the normal tap/target `SCHEMA` message communication method and is JSON Schema compliant, without the use of custom types.
a
@BuzzCutNorman - What are the source and destination, out of curiosity? Is this SQL Server to SQL Server?
As you are finding, the `sql-datatype` is treated as an 'annotation' documenting the upstream type, and not necessarily as a dictate of the downstream storage type.
There are good reasons, even when high-fidelity information is available, for targets to ignore it and still build using generic, widely-defined types.
I don't know of any targets which use the `sql-datatype` as a creation hint (although some might), in part because different systems will size the same type differently, and `INTEGER` isn't going to be the same size on every source and target system.
I have seen some systems use the `multipleOf` hint to provide transferable precision constraints, such as `multipleOf: 0.01` being used for something decimal-like with two digits of precision.
All that said, the reason I ask whether your source is the same type as the target (both MS SQL Server) is because you could, at least in theory, use the `sql-datatype` in the target to get a high-fidelity copy of the data types as well as the data itself.
> Here is what `meltano invoke tap-mssql --discover` returns

A good test before moving forward would be to double-check that when you run a sync like `meltano invoke tap-mssql > outfile.jsonl`, you also see your `sql-datatype` annotations in the `SCHEMA` messages, since those are what get passed to the target.
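A quick way to run that check over the sync output, assuming `outfile.jsonl` holds one Singer message per line. The function name and the annotation keys it looks for (`sql-datatype` / `x-sql-datatype` inside each property schema) are my assumptions about where a tap might put the hint:

```python
import json


def schema_sql_datatype_report(lines):
    """For each SCHEMA message, list which properties carry a
    sql-datatype-like annotation inside the property schema."""
    report = {}
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") != "SCHEMA":
            continue
        props = msg.get("schema", {}).get("properties", {})
        report[msg["stream"]] = sorted(
            name for name, prop in props.items()
            if "sql-datatype" in prop or "x-sql-datatype" in prop
        )
    return report


# Stand-in for open("outfile.jsonl") -- a single SCHEMA message.
sample = [
    '{"type": "SCHEMA", "stream": "dbo-tags", "schema": {"properties": '
    '{"Id": {"type": ["integer"]}}}, "key_properties": ["Id"]}',
]
print(schema_sql_datatype_report(sample))
# {'dbo-tags': []}  -> no annotations made it into this SCHEMA message
```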
Assuming the `sql-datatype` annotations are reaching the target, you could use them in the target layer to guide type selection. If other users are going to leverage the same target, this would need to be an opt-in config decision, because the default behavior should not assume the tightest possible column fit, and it would also need the user's blessing to confirm the upstream types are compatible with the target's type system. And related to the `multipleOf` discussion above, this would presumably override those native JSON Schema hints in favor of honoring, wherever possible, the received `sql-datatype` annotation in the schema.
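On the target side, that opt-in behavior could look roughly like this. The config flag name (`use_sql_datatype_hints`) and the annotation key (`x-sql-datatype`) are my assumptions, not an existing target's API:

```python
def choose_column_type(prop: dict, config: dict) -> str:
    """Pick a column type: honor an upstream sql-datatype annotation only
    when the user opted in; otherwise fall back to generic types derived
    from the JSON Schema property."""
    if config.get("use_sql_datatype_hints") and "x-sql-datatype" in prop:
        return prop["x-sql-datatype"]  # trust the upstream hint verbatim
    types = prop.get("type", [])
    if isinstance(types, str):
        types = [types]
    if "integer" in types:
        return "BIGINT"
    if "string" in types:
        n = prop.get("maxLength")
        return f"NVARCHAR({n})" if n else "NVARCHAR(MAX)"
    return "NVARCHAR(MAX)"  # generic catch-all


prop = {"type": ["string", "null"], "maxLength": 35,
        "x-sql-datatype": "VARCHAR(35)"}
print(choose_column_type(prop, {"use_sql_datatype_hints": True}))  # VARCHAR(35)
print(choose_column_type(prop, {}))                                # NVARCHAR(35)
```

The default path keeps today's behavior, so taps that never send the hint, and users who never enable the flag, see no change.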
Let me know if this is helpful! I can't answer for you if it's worth the wrangling to get this done, but in theory it is possible. 🤷
v
@aaronsteers he was asking how to get `sql-datatype` annotations into the `schema`, I think. I didn't dive into the SDK to look; maybe you know offhand?
a
It might be that `sql-datatype` is getting into the `SCHEMA` message, but @BuzzCutNorman would need to confirm that.
b
@aaronsteers Thanks for all the information and a glimmer of hope this might be possible. The source and destination right now are SQL Server to SQL Server, to keep things simple. I completely get that it might be best to use a generic type, and yes, there are variations between RDBMS implementations of datatypes. I have not tried `meltano invoke tap-mssql > outfile.jsonl` but will this evening sometime and let you know what I get.
@aaronsteers Unfortunately, `sql-datatype` is not present in the `SCHEMA` message sent by my tap-mssql.
{
  "type": "SCHEMA",
  "stream": "dbo-tags",
  "schema": {
    "properties": {
      "Id": {
        "type": [
          "integer"
        ]
      },
      "TagName": {
        "type": [
          "string",
          "null"
        ]
      },
      "Count": {
        "type": [
          "integer",
          "null"
        ]
      },
      "ExcerptPostId": {
        "type": [
          "integer",
          "null"
        ]
      },
      "WikiPostId": {
        "type": [
          "integer",
          "null"
        ]
      }
    },
    "type": "object",
    "required": [
      "Id"
    ]
  },
  "key_properties": [
    "Id"
  ]
}
a
Thanks, @BuzzCutNorman, for confirming. I think this validates what @visch calls out here: https://meltano.slack.com/archives/C01RKUVUG4S/p1674600771626969?thread_ts=1674577030.714159&cid=C01RKUVUG4S
The options, then, would be whether to send `sql-datatype` as an annotation/hint in the JSON Schema (which the target still wouldn't do anything with by default) or to use JSON Schema native validation rules to get to something similar, again, pretty much exactly as @visch notes above. 🙃
b
@aaronsteers thanks to you and @visch for the education on how catalog metadata is utilized and the clear summary of the directions I can take.