# singer-taps
d
Another question re: Singer catalogs. I am getting this error:
"-1 is less than the minimum of 0\n\nFailed validating 'minimum' in schema['properties']['properties']['additionalProperties']['properties']['maxLength']:\n    {'minimum': 0, 'type': 'integer'}\n\nOn instance['properties']['project_desc']['maxLength']:\n    -1")
Here is a snippet from my `catalog.json` file:
```json
{
    "streams": [
      {
        "selected": true,
        "tap_stream_id": "dbo-vwProjectRepl",
        "table_name": "vwProjectRepl",
        "schema": {
          "properties": {
            ...
            "project_desc": {
              "inclusion": "available",
              "maxLength": "8000",
              "type": [
                "null",
                "string"
              ]
            }
            ...
          },
          "type": "object"
        },
        "stream": "vwProjectRepl",
        "metadata": [
          {
            "breadcrumb": [],
            "metadata": {
              "selected": true,
              "replication-method": "INCREMENTAL",
              "replication-key": "updated_at",
              "selected-by-default": false,
              "database-name": "xxx",
              "is-view": true,
              "table-key-properties": []
            }
          },
          ...
          {
            "breadcrumb": [
              "properties",
              "project_desc"
            ],
            "metadata": {
              "selected-by-default": true,
              "sql-datatype": "nvarchar"
            }
          },
          ...
        ]
      }
    ]
  }
```
Why is it not respecting my settings?
v
Going to need more info here to help. What's the target? At first glance it looks like you believe the validation should be `maxLength: 8000`, but it's not; it's currently `{'minimum': 0, 'type': 'integer'}`. Meaning either that's not the catalog.json file, or that's not the right place in the catalog.json file that correlates to your field? Lots of possibilities. It would be slick if there was an easy debug method for this, not sure what that would look like exactly.
When I debug these kinds of things I try to find the stream and a record that's causing the problem, put those into a file, then pipe that file into my target, e.g. a testdata file like:
```
SCHEMA: Blah
RECORD: Blah
cat testdata | meltano invoke target
```
It becomes a lot easier to debug, IMO. I haven't been clever enough to think through what debugging features Meltano could add to make this easier, but there are lots of options, I think.
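To make that concrete, a minimal testdata file could be generated with something like this (just a sketch; the stream name, schema, and record values are made up):
```python
# Sketch: write minimal Singer SCHEMA and RECORD messages to a file so the
# failure can be reproduced by piping that file straight into the target.
# Stream name, schema, and record values here are illustrative only.
import json

schema_message = {
    "type": "SCHEMA",
    "stream": "vwProjectRepl",
    "schema": {
        "type": "object",
        "properties": {
            "project_desc": {"type": ["null", "string"], "maxLength": -1},
        },
    },
    "key_properties": [],
}

record_message = {
    "type": "RECORD",
    "stream": "vwProjectRepl",
    "record": {"project_desc": "example value"},
}

with open("testdata", "w") as f:
    for message in (schema_message, record_message):
        f.write(json.dumps(message) + "\n")
```
Then `cat testdata | meltano invoke <your-target>` reproduces just the failing part of the pipeline.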
Meltano team might have better suggestions 😄
d
Hi @visch The target is `target-postgres`. It looks to me like the error is telling me that the value of `maxLength` violates a rule that says it cannot be lower than 0, and that at runtime it is seeing `maxLength: -1`, even though I have my catalog configured to set it to 8000. Certainly possible that my catalog is not configured correctly... The source field is `nvarchar(max)`. When I ran `meltano invoke tap-mssql --discover > catalog.json`, that particular field did originally have `maxLength: -1` (which kind of makes sense, as nvarchar(max) doesn't really have a practical upper limit). I am quite confident that it is using my catalog file, as the source database has plenty of tables but I am only selecting three, and Meltano is respecting that selection. It is only importing the three tables that I expect.
It looks like the problem is the schema object that is being sent to `target-postgres`, and not what is being read from `tap-mssql`:
```
target-postgres | CRITICAL
`schema` is an invalid JSON Schema instance: 
    {
        "type": "SCHEMA",
        "stream": "xxx-vwProjectRepl",
        "schema": {
            "properties": {
                ...
                "project_desc": {
                    "inclusion": "available",
                    "maxLength": -1,
                    "type": [
                        "null",
                        "string"
                    ]
                },
                ...
            },
            "type": "object"
        },
        "key_properties": []
    }
    
-1 is less than the minimum of 0

Failed validating 'minimum' in schema['properties']['properties']['additionalProperties']['properties']['maxLength']:
    {
        "minimum": 0,
        "type": "integer"
    }

On instance['properties']['project_desc']['maxLength']: -1
```
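For what it's worth, the failing check can be reproduced outside the target with the jsonschema package. The meta-schema below is a stripped-down stand-in for whatever target-postgres actually validates against, reduced to the one rule named in the traceback:
```python
# Sketch: reproduce the validation error in isolation with jsonschema.
# The meta-schema keeps only the rule from the traceback: every property's
# maxLength must be an integer >= 0.
from jsonschema import Draft4Validator

meta_schema = {
    "properties": {
        "properties": {
            "additionalProperties": {
                "properties": {
                    "maxLength": {"type": "integer", "minimum": 0},
                },
            },
        },
    },
}

stream_schema = {
    "type": "object",
    "properties": {
        "project_desc": {"type": ["null", "string"], "maxLength": -1},
    },
}

# Raises ValidationError: "-1 is less than the minimum of 0"
Draft4Validator(meta_schema).validate(stream_schema)
```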
I am not sure how to influence this schema definition. I've tried setting it in a catalog file, and with extras in `meltano.yml` while relying on the discovered catalog instead, e.g.:

meltano.yml
```yaml
extractors:
  - name: tap-mssql
    namespace: tap_mssql
    pip_url: -e extract/pipelinewise-tap-mssql
    executable: tap-mssql
    capabilities:
    - catalog
    - discover
    - state
    settings:
    - name: host
    - name: port
      kind: integer
    - name: user
    - name: password
      kind: password
    config:
      host: xxx.database.windows.net
      port: 1433
      user: xxx@xxx
      password: $TAP_MSSQL_PASSWORD
      database: xxx
    select:
    - xxx-vwProjectRepl.*
    schema:
      xxx-vwProjectRepl:
        project_desc:
          type:
          - 'null'
          - string
          maxLength: 8000
    metadata:
      xxx-vwProjectRepl:
        replication-method: INCREMENTAL
        replication-key: updated_at
        updated_at:
          is-replication-key: true
loaders:
  - name: target-postgres
    variant: datamill-co
    pip_url: singer-target-postgres
    config:
      logging_level: DEBUG
      disable_collection: true
      postgres_host: xxx
      postgres_database: postgres
      postgres_username: postgres
      postgres_schema: $MELTANO_EXTRACT__LOAD_SCHEMA
      invalid_records_detect: false
      persist_empty_tables: true
```
(I'm aware that `metadata` and `schema` are not used if specifying a catalog file. I have tried it both ways, with no luck from either method.)
I figured it out... This particular tap does not respect any properties coming from the catalog file. The discovered schema always overrides it. I have resolved the issue by forking the tap and updating the data_type schema logic to account for edge cases such as nvarchar(max) and a few other cases (e.g. reals were getting converted to doubles but truncated at 6 decimal places). I'll submit PRs for these changes back to the original repository.
Is that kind of behaviour expected? Maybe I am mistaken to think that updating the schema in a catalog would affect the schema that is passed to the target?
v
It really depends. I'd think that would be a safe-ish assumption, but it leads to interesting questions: if the catalog says the string should be 50 characters long but the data is 5000, are you saying the tap should truncate it? Implementing that for every scenario sounds a lot harder than "select column from table". I guess you could validate the JSON on the tap side, but again the conversion is hard, plus validating the data would be expensive.
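Just to make that cost concrete, per-record validation on the tap side would look something like this sketch (using the jsonschema package; the schema and record are made up), and doing it for every row adds up:
```python
# Sketch: validating each record against the declared schema before emitting it.
# Running this per record is where the cost (and the policy question) comes in.
from jsonschema import Draft4Validator

stream_schema = {
    "type": "object",
    "properties": {"project_desc": {"type": ["null", "string"], "maxLength": 50}},
}
validator = Draft4Validator(stream_schema)

record = {"project_desc": "x" * 5000}  # much longer than the declared maxLength

errors = list(validator.iter_errors(record))
if errors:
    # The tap would have to decide: truncate, drop, or fail the sync.
    print(errors[0].message)
```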
Awesome that you figured it out. Where my brain goes is: how in the heck can the orchestrator make your life easier here? It's hard to find the offending record sometimes; that would be a nice start, I think.
One other point regarding the catalog. I think of the catalog as a great place for authoritative information about your tap and what to pull or not pull (selected, replication type, etc.). Not certain this is the right mental model, but it's served me pretty well. The catalog is not a place to do transformations on your data.
On another side note, way off topic: I've been thinking about where clauses for database-style taps in the catalog, like `where status = active` instead of pulling all of the data. Way off topic, but curious to think about, for use cases where you don't have an updated_at column and running quickly matters.
d
I suppose in this case it was a weird mix of worlds. The principle that the catalog should not be responsible for data transformation is solid. In this instance I had two main cases: an `nvarchar(max)` column and a `real` column. This particular tap leverages the system catalog for determining the data types, and it got a few things wrong:
1. `nvarchar(max)` columns have a `character_maximum_length` attribute of -1 (i.e. no limit). The tap takes this value verbatim and uses it to set the `maxLength` attribute in the catalog. However, the JSON Schema validation being applied by Singer says that this attribute must be a non-negative integer. My workaround was to detect the -1 for string types and just not set a maximum, so the tap now passes the string through with no `maxLength` validation for `nvarchar(max)` and `varchar(max)` columns.
2. `real` was incorrectly categorised as an `integer` type. I moved it into the list of `FLOAT_TYPES`.
3. Even after treating `real` as a float type, there was another issue. A `real` is a single, but the catalog converts them to `number` in the JSON Schema, which results in a `double` when the schema is created at the Postgres end. The tap was also misconfigured around the `numeric_scale` attribute: when it was NULL, the tap defaulted the number of decimal places (`multipleOf`) to 6. I had some lat/long values in decimal degrees, and with all the conversion malarkey they ended up with many more than 6 decimal places. I used the same workaround: if `numeric_scale` is null, omit the `multipleOf` attribute.
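To illustrate the shape of the change (this is only a sketch of the idea, not the actual pipelinewise-tap-mssql code; the `Column` structure and names are made up):
```python
# Sketch of the column -> JSON Schema mapping behaviour described above.
# The Column structure and names are illustrative, not the tap's real code.
from typing import NamedTuple, Optional

FLOAT_TYPES = {"float", "real"}  # treat SQL Server real as a float, not an integer
STRING_TYPES = {"char", "varchar", "nchar", "nvarchar", "text", "ntext"}

class Column(NamedTuple):
    sql_type: str
    character_maximum_length: Optional[int]  # -1 means (max) in SQL Server
    numeric_scale: Optional[int]

def to_json_schema(col: Column) -> dict:
    if col.sql_type in STRING_TYPES:
        schema = {"type": ["null", "string"]}
        # nvarchar(max)/varchar(max) report -1: omit maxLength rather than
        # emitting a negative value that fails the target's validation.
        if col.character_maximum_length and col.character_maximum_length > 0:
            schema["maxLength"] = col.character_maximum_length
        return schema
    if col.sql_type in FLOAT_TYPES:
        schema = {"type": ["null", "number"]}
        # Only constrain precision when the system catalog reports a scale;
        # defaulting multipleOf to 6 decimal places silently truncates values.
        if col.numeric_scale is not None:
            schema["multipleOf"] = 10 ** (-col.numeric_scale)
        return schema
    return {"type": ["null", "string"]}  # everything else falls through in this sketch

print(to_json_schema(Column("nvarchar", -1, None)))  # {'type': ['null', 'string']}
print(to_json_schema(Column("real", None, None)))    # {'type': ['null', 'number']}
```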
This was a case of the tap making incorrect assumptions about the catalog, and we are unable to override those default validations.
v
Curious... I had to do a few similar things at first with tap-oracle: I removed multipleOf, and I had some default type issues as well. I then went full circle and reinstated multipleOf once I started caring about types, but I also started doing some custom handling in target-mssql. As much as Singer targets are supposed to be tap-independent, depending on how accurate you need data types to be, they may not be. I.e. I made this big mapping table https://gitlab.com/autoidm/autoidm-target-mssql/-/issues/38 for Oracle -> MSSQL to be sure the data type mappings were correct. For tap/target development around database-style taps/targets, having some kind of matrix to "standard" data types might be very helpful: FLOAT, NUMERIC, TIMESTAMP, dates, etc.
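Roughly this shape, just far more complete (the mappings below are illustrative examples, not copied from that issue):
```python
# Sketch: a type-mapping matrix from Oracle types through "standard"
# JSON Schema types to MSSQL types. Entries are illustrative only.
ORACLE_TO_STANDARD = {
    "VARCHAR2": {"type": ["null", "string"]},
    "NUMBER": {"type": ["null", "number"]},
    "DATE": {"type": ["null", "string"], "format": "date-time"},
    "TIMESTAMP": {"type": ["null", "string"], "format": "date-time"},
}

STANDARD_TO_MSSQL = {
    ("string", None): "nvarchar(max)",
    ("string", "date-time"): "datetime2",
    ("number", None): "float",
    ("integer", None): "bigint",
}

def mssql_type(json_schema: dict) -> str:
    json_type = next(t for t in json_schema["type"] if t != "null")
    return STANDARD_TO_MSSQL[(json_type, json_schema.get("format"))]

print(mssql_type(ORACLE_TO_STANDARD["DATE"]))  # datetime2
```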
A test framework standard for taps/targets could probably fill 80% of the gap here. What do you think about this with the hub, @aaronsteers?
I'm still not clear on what would be most helpful for getting people more productive.