# singer-taps
d
Another question re: Singer catalogs. I am getting this error:
"-1 is less than the minimum of 0\n\nFailed validating 'minimum' in schema['properties']['properties']['additionalProperties']['properties']['maxLength']:\n    {'minimum': 0, 'type': 'integer'}\n\nOn instance['properties']['project_desc']['maxLength']:\n    -1")
Here is a snippet from my `catalog.json` file:
```json
{
    "streams": [
      {
        "selected": true,
        "tap_stream_id": "dbo-vwProjectRepl",
        "table_name": "vwProjectRepl",
        "schema": {
          "properties": {
            ...
            "project_desc": {
              "inclusion": "available",
              "maxLength": "8000",
              "type": [
                "null",
                "string"
              ]
            }
            ...
          },
          "type": "object"
        },
        "stream": "vwProjectRepl",
        "metadata": [
          {
            "breadcrumb": [],
            "metadata": {
              "selected": true,
              "replication-method": "INCREMENTAL",
              "replication-key": "updated_at",
              "selected-by-default": false,
              "database-name": "xxx",
              "is-view": true,
              "table-key-properties": []
            }
          },
          ...
          {
            "breadcrumb": [
              "properties",
              "project_desc"
            ],
            "metadata": {
              "selected-by-default": true,
              "sql-datatype": "nvarchar"
            }
          },
          ...
        ]
      }
    ]
  }
```
Why is it not respecting my settings?
v
Going to need more info here to help. What's the target? At first glance it looks like you believe the validation should be `maxLength: 8000`, but it's not; it's currently `{'minimum': 0, 'type': 'integer'}`. Meaning either that's not the catalog.json file, or that's not the right place in the catalog.json file that correlates to your field? Lots of possibilities. It would be slick if there was an easy debug method for this, not sure what that would look like exactly.
When I debug these kinds of things I try to find the stream and a record that's causing the problem, put those into a file, then pipe that file into my target, e.g. a testdata file like:
```
SCHEMA: Blah
RECORD: Blah
cat testdata | meltano invoke target
```
It becomes a lot easier to debug, IMO. I haven't been clever enough to think through what debugging features Meltano could add to make this easier, but there are lots of options, I think.
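To make that concrete, a minimal testdata file could be generated with something like this (just a sketch; the stream name, schema, and record values are made up):
```python
# Sketch: write minimal Singer SCHEMA and RECORD messages to a file so the
# failure can be reproduced by piping that file straight into the target.
# Stream name, schema, and record values here are illustrative only.
import json

schema_message = {
    "type": "SCHEMA",
    "stream": "vwProjectRepl",
    "schema": {
        "type": "object",
        "properties": {
            "project_desc": {"type": ["null", "string"], "maxLength": -1},
        },
    },
    "key_properties": [],
}

record_message = {
    "type": "RECORD",
    "stream": "vwProjectRepl",
    "record": {"project_desc": "example value"},
}

with open("testdata", "w") as f:
    for message in (schema_message, record_message):
        f.write(json.dumps(message) + "\n")
```
Then `cat testdata | meltano invoke <your-target>` reproduces just the failing part of the pipeline.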
Meltano team might have better suggestions 😄
d
Hi @visch The target is `target-postgres`. It looks to me like the error is telling me that the value of `maxLength` violates a rule that says it cannot be lower than 0, and that at runtime it is seeing `maxLength: -1`, even though I have my catalog configured to set it to 8000. Certainly possible that my catalog is not configured correctly... The source field is `nvarchar(max)`. When I ran `meltano invoke tap-mssql --discover > catalog.json`, that particular field did originally have `maxLength: -1` (which kind of makes sense, as nvarchar(max) doesn't really have a practical upper limit). I am quite confident that it is using my catalog file, as the source database has plenty of tables but I am only selecting three, and Meltano is respecting that selection. It is only importing the three tables that I expect.
It looks like the problem is the schema object that is being sent to `target-postgres`, and not what is being read from `tap-mssql`:
```
target-postgres | CRITICAL
`schema` is an invalid JSON Schema instance: 
    {
        "type": "SCHEMA",
        "stream": "xxx-vwProjectRepl",
        "schema": {
            "properties": {
                ...
                "project_desc": {
                    "inclusion": "available",
                    "maxLength": -1,
                    "type": [
                        "null",
                        "string"
                    ]
                },
                ...
            },
            "type": "object"
        },
        "key_properties": []
    }
    
-1 is less than the minimum of 0

Failed validating 'minimum' in schema['properties']['properties']['additionalProperties']['properties']['maxLength']:
    {
        "minimum": 0,
        "type": "integer"
    }

On instance['properties']['project_desc']['maxLength']: -1
```
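For what it's worth, the failing check can be reproduced outside the target with the jsonschema package. The meta-schema below is a stripped-down stand-in for whatever target-postgres actually validates against, reduced to the one rule named in the traceback:
```python
# Sketch: reproduce the validation error in isolation with jsonschema.
# The meta-schema keeps only the rule from the traceback: every property's
# maxLength must be an integer >= 0.
from jsonschema import Draft4Validator

meta_schema = {
    "properties": {
        "properties": {
            "additionalProperties": {
                "properties": {
                    "maxLength": {"type": "integer", "minimum": 0},
                },
            },
        },
    },
}

stream_schema = {
    "type": "object",
    "properties": {
        "project_desc": {"type": ["null", "string"], "maxLength": -1},
    },
}

# Raises ValidationError: "-1 is less than the minimum of 0"
Draft4Validator(meta_schema).validate(stream_schema)
```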
I am not sure how to influence this schema definition. I've tried setting it in a catalog file, and with extras in `meltano.yml` while relying on the discovered catalog instead, e.g.:

meltano.yml
```yaml
extractors:
  - name: tap-mssql
    namespace: tap_mssql
    pip_url: -e extract/pipelinewise-tap-mssql
    executable: tap-mssql
    capabilities:
    - catalog
    - discover
    - state
    settings:
    - name: host
    - name: port
      kind: integer
    - name: user
    - name: password
      kind: password
    config:
      host: xxx.database.windows.net
      port: 1433
      user: xxx@xxx
      password: $TAP_MSSQL_PASSWORD
      database: xxx
    select:
    - xxx-vwProjectRepl.*
    schema:
      xxx-vwProjectRepl:
        project_desc:
          type:
          - 'null'
          - string
          maxLength: 8000
    metadata:
      xxx-vwProjectRepl:
        replication-method: INCREMENTAL
        replication-key: updated_at
        updated_at:
          is-replication-key: true
loaders:
  - name: target-postgres
    variant: datamill-co
    pip_url: singer-target-postgres
    config:
      logging_level: DEBUG
      disable_collection: true
      postgres_host: xxx
      postgres_database: postgres
      postgres_username: postgres
      postgres_schema: $MELTANO_EXTRACT__LOAD_SCHEMA
      invalid_records_detect: false
      persist_empty_tables: true
```
(I'm aware that `metadata` and `schema` are not used if specifying a catalog file. I have tried it both ways, with no luck from either method.)
I figured it out... This particular tap does not respect any properties coming from the catalog file. The discovered schema always overrides it. I have resolved the issue by forking the tap and updating the data_type schema logic to account for edge cases such as nvarchar(max) and a few other cases (e.g. reals were getting converted to doubles but truncated at 6 decimal places). I'll submit PRs for these changes back to the original repository.
Is that kind of behaviour expected? Maybe I am mistaken to think that updating the schema in a catalog would affect the schema that is passed to the target?
v
It really depends. I'd think that would be a safe-ish assumption, but it leads to interesting questions: if the catalog says the string should be 50 characters long but the data is 5000, are you saying the tap should truncate it? Implementing that for every scenario sounds a lot harder than "select column from table". I guess you could validate the JSON on the tap side, but again the conversion is hard, plus validating the data would be expensive.
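Just to make that cost concrete, per-record validation on the tap side would look something like this sketch (using the jsonschema package; the schema and record are made up), and doing it for every row adds up:
```python
# Sketch: validating each record against the declared schema before emitting it.
# Running this per record is where the cost (and the policy question) comes in.
from jsonschema import Draft4Validator

stream_schema = {
    "type": "object",
    "properties": {"project_desc": {"type": ["null", "string"], "maxLength": 50}},
}
validator = Draft4Validator(stream_schema)

record = {"project_desc": "x" * 5000}  # much longer than the declared maxLength

errors = list(validator.iter_errors(record))
if errors:
    # The tap would have to decide: truncate, drop, or fail the sync.
    print(errors[0].message)
```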
Awesome that you figured it out. Where my brain goes is: how in the heck can the orchestrator make your life easier here? It's hard to find the offending record sometimes; that would be a nice start, I think.
One other point regarding the catalog. I think of the catalog as a great place for authoritative information about your tap and what to pull or not pull (selected, replication type, etc.). Not certain this is the right mental model, but it's served me pretty well. The catalog is not a place to do transformations on your data.
On another side note, way off topic: I've been thinking about where clauses for database-style taps in the catalog, like `where status = active` instead of pulling all of the data. Way off topic, but curious to think about, for use cases where you don't have an updated_at column and running quickly matters.
d
I suppose in this case it was a weird mix of worlds. The principle that the catalog should not be responsible for data transformation is solid. In this instance I had two main cases: an `nvarchar(max)` column and a `real` column. This particular tap leverages the system catalog for determining the data types, and it got a few things wrong:
1. `nvarchar(max)` columns have a `character_maximum_length` attribute of -1 (i.e. no limit). The tap takes this value verbatim and uses it to set the `maxLength` attribute in the catalog. However, the JSON Schema validation being applied by Singer says that this attribute must be a non-negative integer. My workaround was to detect the -1 for string types and just not set a maximum, so the tap now passes the string through with no `maxLength` validation for `nvarchar(max)` and `varchar(max)` columns.
2. `real` was incorrectly categorised as an `integer` type. I moved it into the list of `FLOAT_TYPES`.
3. Even after treating `real` as a float type, there was another issue. A `real` is a single, but the catalog converts them to `number` in the JSON Schema, which results in a `double` when the schema is created at the Postgres end. The tap was also misconfigured around the `numeric_scale` attribute: when it was NULL, the tap defaulted the number of decimal places (`multipleOf`) to 6. I had some lat/long values in decimal degrees, and with all the conversion malarkey they ended up with many more than 6 decimal places. I used the same workaround: if `numeric_scale` is null, omit the `multipleOf` attribute.
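To illustrate the shape of the change (this is only a sketch of the idea, not the actual pipelinewise-tap-mssql code; the `Column` structure and names are made up):
```python
# Sketch of the column -> JSON Schema mapping behaviour described above.
# The Column structure and names are illustrative, not the tap's real code.
from typing import NamedTuple, Optional

FLOAT_TYPES = {"float", "real"}  # treat SQL Server real as a float, not an integer
STRING_TYPES = {"char", "varchar", "nchar", "nvarchar", "text", "ntext"}

class Column(NamedTuple):
    sql_type: str
    character_maximum_length: Optional[int]  # -1 means (max) in SQL Server
    numeric_scale: Optional[int]

def to_json_schema(col: Column) -> dict:
    if col.sql_type in STRING_TYPES:
        schema = {"type": ["null", "string"]}
        # nvarchar(max)/varchar(max) report -1: omit maxLength rather than
        # emitting a negative value that fails the target's validation.
        if col.character_maximum_length and col.character_maximum_length > 0:
            schema["maxLength"] = col.character_maximum_length
        return schema
    if col.sql_type in FLOAT_TYPES:
        schema = {"type": ["null", "number"]}
        # Only constrain precision when the system catalog reports a scale;
        # defaulting multipleOf to 6 decimal places silently truncates values.
        if col.numeric_scale is not None:
            schema["multipleOf"] = 10 ** (-col.numeric_scale)
        return schema
    return {"type": ["null", "string"]}  # everything else falls through in this sketch

print(to_json_schema(Column("nvarchar", -1, None)))  # {'type': ['null', 'string']}
print(to_json_schema(Column("real", None, None)))    # {'type': ['null', 'number']}
```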
This was a case of the tap making incorrect assumptions about the catalog, and we are unable to override those default validations.
v
Curious... I had to do a few similar things at first with tap-oracle: I removed multipleOf, and I had some default type issues as well. I then went full circle and reinstated multipleOf once I started caring about types, but I also started doing some custom handling in target-mssql. As much as Singer targets are supposed to be tap-independent, depending on how accurate you need data types to be, they may not be. I.e. I made this big mapping table https://gitlab.com/autoidm/autoidm-target-mssql/-/issues/38 for Oracle -> MSSQL to be sure the data type mappings were correct. For tap/target development around database-style taps/targets, having some kind of matrix to "standard" data types might be very helpful: FLOAT, NUMERIC, TIMESTAMP, dates, etc.
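Roughly this shape, just far more complete (the mappings below are illustrative examples, not copied from that issue):
```python
# Sketch: a type-mapping matrix from Oracle types through "standard"
# JSON Schema types to MSSQL types. Entries are illustrative only.
ORACLE_TO_STANDARD = {
    "VARCHAR2": {"type": ["null", "string"]},
    "NUMBER": {"type": ["null", "number"]},
    "DATE": {"type": ["null", "string"], "format": "date-time"},
    "TIMESTAMP": {"type": ["null", "string"], "format": "date-time"},
}

STANDARD_TO_MSSQL = {
    ("string", None): "nvarchar(max)",
    ("string", "date-time"): "datetime2",
    ("number", None): "float",
    ("integer", None): "bigint",
}

def mssql_type(json_schema: dict) -> str:
    json_type = next(t for t in json_schema["type"] if t != "null")
    return STANDARD_TO_MSSQL[(json_type, json_schema.get("format"))]

print(mssql_type(ORACLE_TO_STANDARD["DATE"]))  # datetime2
```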
A test framework standard for taps/targets could probably fill 80% of the gap here. What do you think about this with the hub, @aaronsteers?
I'm still not clear on what would be most helpful for getting people more productive.