# singer-target-development
b
I was wondering if someone could point me in the proper direction. I have a SQL extractor/tap that places a column's `sql-datatype` into the metadata section of the catalog generated during database discovery. I am happy I have this information in the catalog, but now that it is there I can't figure out how to get the `sql-datatype` value to the `to_sql_type` function of a loader/target. I have poked around the SDK and found a `schema` variable/parameter that contains the stream/table schema properties, but I haven't found where it is initially generated so I could add the `sql-datatype` into it. The other option was to pull it from the schema message again, but I haven't come across a clear path to accomplish that either. I am betting it's one of those times the answer is staring at me and laughing as I look past it over and over again.
v
We communicate types via `SCHEMA` messages, not catalogs. It's possible to extend the `SCHEMA` message to include the actual `sql-datatype` and use that in the target. That would be an extension of the spec, but maybe not horrible. I've done this via the description field for a BLOB, but it is hacky. What target are you trying to use? I'm guessing it's a DB target? What column type are you having issues with? We may be able to get to a better answer if we're lucky and start there.
b
I am using my target-mssql at the moment. Right now I am doing simple column types like INTEGER() and VARCHAR(). Here is what `meltano invoke tap-mssql --discover` returns, which I thought should match up with the schema message passed to the target by the tap.
{
      "tap_stream_id": "dbo-tags",
      "table_name": "tags",
      "replication_method": "",
      "key_properties": [
        "Id"
      ],
      "schema": {
        "properties": {
          "Id": {
            "type": [
              "integer"
            ]
          },
          "TagName": {
            "type": [
              "string",
              "null"
            ]
          },
          "Count": {
            "type": [
              "integer",
              "null"
            ]
          },
          "ExcerptPostId": {
            "type": [
              "integer",
              "null"
            ]
          },
          "WikiPostId": {
            "type": [
              "integer",
              "null"
            ]
          }
        },
        "type": "object",
        "required": [
          "Id"
        ]
      },
      "is_view": false,
      "stream": "dbo-tags",
      "metadata": [
        {
          "breadcrumb": [
            "properties",
            "Id"
          ],
          "metadata": {
            "inclusion": "automatic",
            "sql-datatype": "INTEGER()"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "TagName"
          ],
          "metadata": {
            "inclusion": "available",
            "sql-datatype": "VARCHAR()"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "Count"
          ],
          "metadata": {
            "inclusion": "available",
            "sql-datatype": "INTEGER()"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "ExcerptPostId"
          ],
          "metadata": {
            "inclusion": "available",
            "sql-datatype": "INTEGER()"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "WikiPostId"
          ],
          "metadata": {
            "inclusion": "available",
            "sql-datatype": "INTEGER()"
          }
        },
        {
          "breadcrumb": [],
          "metadata": {
            "inclusion": "available",
            "table-key-properties": [
              "Id"
            ],
            "forced-replication-method": "",
            "schema-name": "dbo"
          }
        }
      ]
    },
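A hedged sketch of reading the per-column `sql-datatype` values back out of a catalog entry like the one above; the helper name is my own invention, not an SDK function. Column-level metadata lives under a `["properties", "<column>"]` breadcrumb, so extracting it is a small walk over the metadata array:

```python
def sql_datatypes_from_catalog_entry(entry: dict) -> dict:
    """Map column name -> sql-datatype from a catalog entry's metadata.

    Keeps metadata items whose breadcrumb is ["properties", "<column>"]
    and that carry a "sql-datatype" key; the stream-level item (empty
    breadcrumb) is skipped.
    """
    result = {}
    for item in entry.get("metadata", []):
        breadcrumb = item.get("breadcrumb", [])
        meta = item.get("metadata", {})
        if len(breadcrumb) == 2 and breadcrumb[0] == "properties":
            if "sql-datatype" in meta:
                result[breadcrumb[1]] = meta["sql-datatype"]
    return result


# Trimmed-down version of the catalog entry shown above.
entry = {
    "metadata": [
        {"breadcrumb": ["properties", "Id"],
         "metadata": {"inclusion": "automatic", "sql-datatype": "INTEGER()"}},
        {"breadcrumb": ["properties", "TagName"],
         "metadata": {"inclusion": "available", "sql-datatype": "VARCHAR()"}},
        {"breadcrumb": [], "metadata": {"schema-name": "dbo"}},
    ]
}
print(sql_datatypes_from_catalog_entry(entry))
# {'Id': 'INTEGER()', 'TagName': 'VARCHAR()'}
```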
v
For INTEGER and VARCHAR, those are handled very well today in the JSON spec, so the `SCHEMA` message should work fine. What's the issue you're having with the normal Singer spec here?
I guess specifically, what's the issue you're having with JSON Schema types?
b
I am trying to get a more accurate mapping between SQL database tables. I want to be able to keep the scale and precision of decimals and numerics, the byte size of varchars and nvarchars, and also whether the string was a varchar or nvarchar on the source side.
Singer has `sql-datatype` built into the spec, I think to address this, so I don't have an issue with the spec; I was just trying to implement this piece of it.
v
> scale and precision of decimals and numerics

This is actually handled pretty well with `number` using `multipleOf`. Here's an implementation: https://gitlab.com/autoidm/autoidm-target-mssql/-/blob/main/target_mssql/streams.py#L200-210

> the byte size of varchars and nvarchars

`maxLength` handles this for String types: https://gitlab.com/autoidm/autoidm-target-mssql/-/blob/main/target_mssql/streams.py#L195
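As a sketch of how far the stock JSON Schema keywords get you, here is a minimal `to_sql_type`-style mapping that derives a DECIMAL scale from `multipleOf` and a VARCHAR length from `maxLength`. The function name and the fallback types are my assumptions, not the linked implementation verbatim:

```python
from decimal import Decimal


def jsonschema_to_mssql_type(prop: dict) -> str:
    """Rough JSON Schema property -> MSSQL type, using multipleOf/maxLength."""
    types = prop.get("type", [])
    if isinstance(types, str):
        types = [types]
    if "number" in types and "multipleOf" in prop:
        # multipleOf 0.01 -> exponent -2 -> two decimal places -> DECIMAL(38, 2)
        scale = -Decimal(str(prop["multipleOf"])).as_tuple().exponent
        return f"DECIMAL(38, {scale})"
    if "string" in types and "maxLength" in prop:
        return f"VARCHAR({prop['maxLength']})"
    if "integer" in types:
        return "BIGINT"
    return "NVARCHAR(MAX)"  # generic catch-all


print(jsonschema_to_mssql_type({"type": ["number"], "multipleOf": 0.01}))
# DECIMAL(38, 2)
print(jsonschema_to_mssql_type({"type": ["string", "null"], "maxLength": 50}))
# VARCHAR(50)
```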
> if the string was a varchar or nvarchar on the source side.

This one is a bit deeper. Source side: I don't think this is written down, but generally UTF-8 is already what JSON is encoded in; per https://www.rfc-editor.org/rfc/rfc8259:

> JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

I haven't seen folks go too deep on this, but that's the general "rule" from the source side. On the target side (target-mssql): maybe we should just go with `NVARCHAR` as the string type and call it a day?
> Singer has `sql-datatype` built into the spec, I think to address this, so I don't have an issue with the spec; I was just trying to implement this piece of it.

Singer defines this in the `catalog`, not in the `schema`. `schema`s are the only thing that's sent to the target to use for type information.
Where I do see a need for this is custom types like `BLOB` and others, and I'm not certain of a great solution.
b
I get what you are saying about JSON, and it can take you a long way. Below is what I got done for tap-mssql before I started down the path of trying to utilize `sql-datatype`. From what you are saying, I would have to add `sql-datatype` to the schema message sent by the tap, since the metadata is not natively included. The metadata is only used on the tap side before sending a schema message to the target? The schema message spec doesn't normally allow for this, from the sounds of it? https://github.com/BuzzCutNorman/tap-mssql/blob/328d375bcc95f8bb9a8abf86fd8bbd597e5f3990/tap_mssql/client.py#L79-L202
v
> The metadata is only used on the tap side before sending a schema message to the target?

Yes.

> The schema message spec doesn't normally allow for this, from the sounds of it?

Well, it's a JSON Schema, so you can do a lot with it (JSON Schema Draft 4); there's nothing saying you couldn't use the `description` field, but I think a custom type would be better (I think that's possible with Draft 4). The issue isn't so much the custom types as it is that every target should work with every tap. Therefore, custom types from a tap are really a no-go. But maybe there could be a "hints" mechanism.
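One shape such a "hints" mechanism could take on the tap side is folding the catalog metadata into the `SCHEMA` message's properties under a custom annotation key. JSON Schema validators ignore unknown keywords, so a key like `x-sql-datatype` (my invention, not part of any spec or SDK) would pass through harmlessly to targets that don't understand it. A hedged sketch:

```python
def annotate_schema_with_sql_datatypes(schema: dict, metadata: list) -> dict:
    """Copy each column's sql-datatype from catalog metadata into the
    matching JSON Schema property as an "x-sql-datatype" annotation.

    Returns a new schema dict; the input schema is left untouched.
    Unknown keywords are ignored by JSON Schema validators, so targets
    unaware of the hint are unaffected.
    """
    annotated = {
        **schema,
        "properties": {k: dict(v) for k, v in schema.get("properties", {}).items()},
    }
    for item in metadata:
        breadcrumb = item.get("breadcrumb", [])
        if len(breadcrumb) == 2 and breadcrumb[0] == "properties":
            sql_type = item.get("metadata", {}).get("sql-datatype")
            column = breadcrumb[1]
            if sql_type and column in annotated["properties"]:
                annotated["properties"][column]["x-sql-datatype"] = sql_type
    return annotated


schema = {"type": "object", "properties": {"Id": {"type": ["integer"]}}}
metadata = [{"breadcrumb": ["properties", "Id"],
             "metadata": {"inclusion": "automatic", "sql-datatype": "INTEGER()"}}]
print(annotate_schema_with_sql_datatypes(schema, metadata))
```

The same idea works for a "hint" of any granularity (charset, nullability, collation), as long as both sides agree on the key.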
b
Ah, I see. Thanks for the explanation. I feel a little better knowing I wasn't crazy and didn't miss anything on the target side. The bummer is I am kind of back where I was a while back; at least I am armed with more info this time. I'm starting to wonder if effort put toward coming up with common/standardized JSON Schema fingerprints for ANSI SQL types would be better in the long run.
v
I wonder if we could make sure whatever types we chose in JSON Schema could be lossless. Then if you really wanted exact types, you could ingest the `manifest.json` and implement the same thing. I don't love that either.
b
I know; each time I look at this and think I have a great idea, it falls apart on me. I am starting to feel like Charlie Brown getting the football pulled away by Lucy.
Like you said, I want a way to communicate exact SQL types that works with the normal tap/target `SCHEMA` message communication method and is JSON Schema compliant, without the use of custom types.
a
@BuzzCutNorman - What are the source and destination, out of curiosity? Is this SQL Server to SQL Server?
As you are finding, the `sql-datatype` is treated as an 'annotation' documenting the upstream type, and not necessarily as a dictate of the downstream storage type.
There are good reasons, even when high-fidelity information is available, for targets to ignore it and still build using generic, widely-defined types.
I don't know of any targets which use the `sql-datatype` as a creation hint (although some might), in part because different systems will size the same type differently, and `INTEGER` isn't going to be the same size on every source and target system.
I have seen some systems use the `multipleOf` hint to provide transferable precision constraints, such as `multipleOf: 0.01` being used for something decimal-like with two digits of precision.
All that said, the reason I ask whether your source is the same type as the target (both MS SQL Server) is because you could, at least in theory, use the `sql-datatype` in the target to get a high-fidelity copy of the data types as well as the data itself.
> Here is what `meltano invoke tap-mssql --discover` returns

A good test before moving forward would be to double-check that when you run a sync like `meltano invoke tap-mssql > outfile.jsonl`, you also see your `sql-datatype` annotations in the `SCHEMA` messages, since those are what get passed to the target.
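A quick way to run that check over the sync output, assuming `outfile.jsonl` holds one Singer message per line. The function name and the annotation keys it looks for (`sql-datatype` / `x-sql-datatype` inside each property schema) are my assumptions about where a tap might put the hint:

```python
import json


def schema_sql_datatype_report(lines):
    """For each SCHEMA message, list which properties carry a
    sql-datatype-like annotation inside the property schema."""
    report = {}
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") != "SCHEMA":
            continue
        props = msg.get("schema", {}).get("properties", {})
        report[msg["stream"]] = sorted(
            name for name, prop in props.items()
            if "sql-datatype" in prop or "x-sql-datatype" in prop
        )
    return report


# Stand-in for open("outfile.jsonl") -- a single SCHEMA message.
sample = [
    '{"type": "SCHEMA", "stream": "dbo-tags", "schema": {"properties": '
    '{"Id": {"type": ["integer"]}}}, "key_properties": ["Id"]}',
]
print(schema_sql_datatype_report(sample))
# {'dbo-tags': []}  -> no annotations made it into this SCHEMA message
```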
Assuming the `sql-datatype` annotations are reaching the target, you could use them in the target layer to guide type selection. If other users are going to leverage the same target, this would need to be an opt-in config decision, because the default behavior should not assume the tightest possible column fit, and it would also need the user's blessing to confirm the upstream types are compatible with the target's type system. And related to the `multipleOf` discussion above, this would presumably override those native JSON Schema hints in favor of honoring, wherever possible, the received `sql-datatype` annotation in the schema.
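On the target side, that opt-in behavior could look roughly like this. The config flag name (`use_sql_datatype_hints`) and the annotation key (`x-sql-datatype`) are my assumptions, not an existing target's API:

```python
def choose_column_type(prop: dict, config: dict) -> str:
    """Pick a column type: honor an upstream sql-datatype annotation only
    when the user opted in; otherwise fall back to generic types derived
    from the JSON Schema property."""
    if config.get("use_sql_datatype_hints") and "x-sql-datatype" in prop:
        return prop["x-sql-datatype"]  # trust the upstream hint verbatim
    types = prop.get("type", [])
    if isinstance(types, str):
        types = [types]
    if "integer" in types:
        return "BIGINT"
    if "string" in types:
        n = prop.get("maxLength")
        return f"NVARCHAR({n})" if n else "NVARCHAR(MAX)"
    return "NVARCHAR(MAX)"  # generic catch-all


prop = {"type": ["string", "null"], "maxLength": 35,
        "x-sql-datatype": "VARCHAR(35)"}
print(choose_column_type(prop, {"use_sql_datatype_hints": True}))  # VARCHAR(35)
print(choose_column_type(prop, {}))                                # NVARCHAR(35)
```

The default path keeps today's behavior, so taps that never send the hint, and users who never enable the flag, see no change.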
Let me know if this is helpful! I can't answer for you if it's worth the wrangling to get this done, but in theory it is possible. 🤷
v
@aaronsteers he was asking how to get `sql-datatype` annotations into the `schema`, I think. I didn't dive into the SDK to look; maybe you know offhand?
a
It might be that `sql-datatype` is getting into the `SCHEMA` message, but @BuzzCutNorman would need to confirm that.
b
@aaronsteers Thanks for all the information and a glimmer of hope this might be possible. The source and destination right now are SQL Server to SQL Server, to keep things simple. I completely get that it might be best to use a generic type, and yes, there are variations between RDBMS implementations of datatypes. I have not tried `meltano invoke tap-mssql > outfile.jsonl` but will this evening sometime and let you know what I get.
@aaronsteers Unfortunately, `sql-datatype` is not present in the `SCHEMA` message sent by my tap-mssql.
{
  "type": "SCHEMA",
  "stream": "dbo-tags",
  "schema": {
    "properties": {
      "Id": {
        "type": [
          "integer"
        ]
      },
      "TagName": {
        "type": [
          "string",
          "null"
        ]
      },
      "Count": {
        "type": [
          "integer",
          "null"
        ]
      },
      "ExcerptPostId": {
        "type": [
          "integer",
          "null"
        ]
      },
      "WikiPostId": {
        "type": [
          "integer",
          "null"
        ]
      }
    },
    "type": "object",
    "required": [
      "Id"
    ]
  },
  "key_properties": [
    "Id"
  ]
}
a
Thanks, @BuzzCutNorman, for confirming. I think this validates what @visch calls out here: https://meltano.slack.com/archives/C01RKUVUG4S/p1674600771626969?thread_ts=1674577030.714159&cid=C01RKUVUG4S
The options, then, would be whether to send `sql-datatype` as an annotation/hint in the JSON Schema (which the target still wouldn't do anything with by default) or to use JSON Schema native validation rules to get to something similar, again, pretty much exactly as @visch notes above. 🙃
b
@aaronsteers thanks to you and @visch for the education on how catalog metadata is utilized and the clear summary of the directions I can take.