# troubleshooting
t
Hey guys, we are currently experimenting with Meltano and ran into our first issues when using `target-bigquery` with different taps. My specific example is `tap-stripe` (the singer-io variant, `git+https://github.com/singer-io/tap-stripe.git@v1.4.8`). When running

```
TARGET_BIGQUERY_DATASET_ID=meltano_stripe meltano elt tap-stripe target-bigquery
```

it raises the following error:

```
target-bigquery | CRITICAL 'RECORD'
target-bigquery | CRITICAL ['Traceback (most recent call last):\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/__init__.py", line 103, in main\n    for state in state_iterator:\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/process.py", line 54, in process\n    for s in handler.handle_record_message(msg):\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/processhandler.py", line 177, in handle_record_message\n    nr = format_record_to_schema(nr, self.bq_schema_dicts[stream])\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/schema.py", line 389, in format_record_to_schema\n    record[k] = conversion_dict[bq_schema[k]["type"]](v)\n', "KeyError: 'RECORD'\n"]
meltano         | Loading failed (2): CRITICAL ['Traceback (most recent call last):\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/__init__.py", line 103, in main\n    for state in state_iterator:\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/process.py", line 54, in process\n    for s in handler.handle_record_message(msg):\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/processhandler.py", line 177, in handle_record_message\n    nr = format_record_to_schema(nr, self.bq_schema_dicts[stream])\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/schema.py", line 389, in format_record_to_schema\n    record[k] = conversion_dict[bq_schema[k]["type"]](v)\n', "KeyError: 'RECORD'\n"]
meltano         | ELT could not be completed: Loader failed
ELT could not be completed: Loader failed
```
We figured out that this seems to have to do with how the schema is defined in the catalog: e.g. sometimes fields are of type `object` and have no properties defined. We managed to patch this by parsing the catalog.json and replacing parts of the schema with a script. However, some fields still cause issues: for example, when extracting only `customer.*` with

```
sh extract/tap-stripe-patch/patch-tap-stripe-catalog.sh && TARGET_BIGQUERY_DATASET_ID=meltano_stripe meltano elt tap-stripe target-bigquery --catalog=extract/tap-stripe-patch/tap-stripe-catalog.json
```

we get the same error. We figured out that the `customers.sources` field is one of the causes, but even when we exclude it, it still seems to produce those errors. When are those fields excluded? Shouldn't the tap already take care of that? Maybe you have some tips and tricks for us. This is currently a big blocker, because of the taps we need, the only one we managed to get running with BigQuery is `tap-gitlab`.
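For reference, the patch script boils down to something like the following. This is only a minimal sketch: the string fallback is just the choice we made locally, and `anyOf` variants (like the `customers.sources` field above) are not handled here.

```
import json

CATALOG = "extract/tap-stripe-patch/tap-stripe-catalog.json"

def patch_schema(node):
    """Give 'object' fields that define no properties a concrete fallback type."""
    if not isinstance(node, dict):
        return
    if "object" in node.get("type", []) and not node.get("properties"):
        # No nested properties defined: coerce the whole field to a string.
        node["type"] = ["null", "string"]
    # Recurse into nested properties and array items.
    for child in node.get("properties", {}).values():
        patch_schema(child)
    if isinstance(node.get("items"), dict):
        patch_schema(node["items"])

with open(CATALOG) as f:
    catalog = json.load(f)

for stream in catalog["streams"]:
    patch_schema(stream["schema"])

with open(CATALOG, "w") as f:
    json.dump(catalog, f, indent=2)
```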
a
> We figured out that this seems to have to do with how the schema is defined in the catalog: e.g. sometimes fields are of type `object` and have no properties defined.
There's a discussion going on here regarding variant object types in the SDK. While technically speaking, JSON Schema allows variant objects (objects with no defined properties), I am not sure which targets (if any) support this - either at the top level of the stream (unlikely) or in nested nodes (more feasible but also unsure).
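To make that concrete: by a variant object I mean a schema node like this, which is valid JSON Schema but gives the target nothing to derive a column structure from:

```
{
  "type": ["null", "object"]
}
```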
> KeyError: 'RECORD'
I could be wrong but this specific key error seems indicative of something else besides a missing property definition - I'd expect a key error on something like 'cust_id' or 'address' or something in the subnodes of the record. It looks like it is having a problem parsing the RECORD message itself - which could indicate a bug in the tap or just confusing logging in the target.
Have you tried with Meltano's `--log-level=debug` flag, by chance? This might produce additional hints.
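Something like this, mirroring the command above:

```
TARGET_BIGQUERY_DATASET_ID=meltano_stripe meltano --log-level=debug elt tap-stripe target-bigquery
```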
t
Hey AJ, thanks for the hint, you actually brought me a bit further: in `target-bigquery` there is a line

```
record[k] = conversion_dict[bq_schema[k]["type"]](v)
```

which leads to the error. When I put a breakpoint there, we have the following situation:

```
k = 'sources'
bq_schema[k] = {'type': 'RECORD', 'mode': 'NULLABLE', 'fields': []}
bq_schema[k]["type"] = 'RECORD'
conversion_dict = {
    'BYTES': <class 'bytes'>, 
    'STRING': <class 'str'>,
    'TIME': <class 'str'>, 
    'TIMESTAMP': <class 'str'>, 
    'DATE': <class 'str'>, 
    'DATETIME': <class 'str'>, 
    'FLOAT': <class 'float'>, 
    'NUMERIC': <class 'float'>, 
    'BIGNUMERIC': <class 'float'>, 
    'INTEGER': <class 'int'>, 
    'BOOLEAN': <class 'bool'>, 
    'GEOGRAPHY': <class 'str'>, 
    'DECIMAL': <class 'str'>, 
    'BIGDECIMAL': <class 'str'>
}
```
So the `conversion_dict` has no conversion for the type `RECORD`. The function docstring says:

> RECORD is not included into conversion_dict - it is done on purpose. RECORD is handled recursively.
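The failing line can be reproduced in isolation with the schema value from the breakpoint (the record value here is made up):

```
# Abridged conversion_dict from the breakpoint; 'RECORD' is deliberately absent.
conversion_dict = {"STRING": str, "INTEGER": int}

# The schema entry observed at the breakpoint: a RECORD with no sub-fields.
bq_schema = {"sources": {"type": "RECORD", "mode": "NULLABLE", "fields": []}}
record = {"sources": {"id": "src_123"}}  # illustrative record value

k, v = "sources", record["sources"]
# RECORD is meant to be handled recursively, so it has no entry in
# conversion_dict and this lookup raises KeyError: 'RECORD'.
record[k] = conversion_dict[bq_schema[k]["type"]](v)
```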
Looking into the catalog, the `sources` field is defined as follows:

```
"sources": {
    "anyOf": [
      {
        "type": [
          "null",
          "array"
        ],
        "items": {
          "type": [
            "null",
            "object"
          ],
          "properties": {...}
        }
      },
      {
        "type": [
          "null",
          "object"
        ],
        "properties": {...}
      }
    ]
},
```
So it seems that somewhere in the conversion of the schema something goes wrong. However, I still wonder why unselecting the field does not have any effect. Shouldn't the tap then also exclude this field from the schema and the record data? Sorry, I am pretty new to the Singer spec.
Ok, I double-checked with the tap. It seems the tap does filter the record's fields based on the selection. The schema, however, is not filtered. Is this intended Singer behavior?
a
Hi, @thomas_schmidt. The spec is not very specific in this regard, and we've actually found that it's very common for taps to filter the record messages without altering/filtering their corresponding schema message. (The SDK does both automatically, specifically because we wanted this to be consistent and easier for developers of future taps.)
The recursive behavior sounds like the driving issue here. You may be able to send a custom catalog without that issue, or else (if you have the ability on your source system) you could try casting the data types in the source to be a more compatible type, or perhaps creating a "view" on top of the raw table, with the view having a simplified schema.
t
Awesome, thanks for the info. I was actually thinking about contributing to the tap to additionally filter the schema. I saw that Meltano also forked `tap-stripe`, but it hasn't been updated for quite a while. Do you by chance know what the best approach for a contribution would be here? I already have some schema filtering running locally.
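The local filtering is roughly this idea (a sketch of the approach, not the exact code):

```
def filter_schema_by_selection(schema, metadata):
    """Drop unselected top-level properties from a stream's SCHEMA message.

    `metadata` is the stream's catalog metadata list; breadcrumbs like
    ["properties", "sources"] carry the per-field selection flags.
    """
    selected = {
        tuple(entry["breadcrumb"]): entry["metadata"].get("selected", True)
        for entry in metadata
    }
    filtered = dict(schema)
    filtered["properties"] = {
        name: prop
        for name, prop in schema.get("properties", {}).items()
        if selected.get(("properties", name), True)
    }
    return filtered
```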
@aaronsteers I actually created an MR in the original singer-io tap: https://github.com/singer-io/tap-stripe/pull/96. Let me know if it is ok that I reused and adjusted some of the code from the Meltano SDK (I want to make sure that everything is fine there). I would also be keen to implement this in the Meltano fork and update it properly.
a
Nice, @thomas_schmidt! Yeah, you are free to reuse anything in the SDK codebase - we have it licensed with the permissive Apache 2.0 for that reason.
I was just discussing with @edgar_ramirez_mondragon that we may in the future invest in stable and importable helper functions for use cases like this one - where you want to leverage an SDK capability without fully refactoring the tap.
t
Nice I think this would be awesome. Maybe even a small helper library one could install in other projects!
e
100%. And also fwiw @prratek_ramchandani started building an SDK-based tap-stripe a while ago: https://github.com/prratek/tap-stripe
p
and IIRC i tested ^that with the same bigquery target
t
Wow cool. @prratek_ramchandani is it in a usable state already? Maybe we could just adapt that one instead and try to push it a bit in case you need support
p
umm…i think so? lol sorry i haven’t touched it in a while, but we’ll probably start using it in production soon so if you run into any trouble and open an issue i’d be happy to take a look
d
Stumbled on this thread while trying to fix a similar issue one of the users of `tap-hellobaton` is having with `target-bigquery`. The error message is identical to the one listed here. From what I'm gathering, the target doesn't handle nullable nested types very well. They aren't using a catalog at all, so I'm wondering if anyone here might be able to talk me through the fix to `tap-stripe`, or at least point me in the right direction.
p
To my knowledge the fix is to ensure that you have a fully specified schema. The target doesn't know how to handle cases where you have a field of type `object` but don't specify its nested properties or their types, so you'd want to make sure any `object` type field has `properties` where each property also specifies a `type`, as in the example below.
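Every `object` node should look something like this (the field names are placeholders):

```
"metadata": {
  "type": ["null", "object"],
  "properties": {
    "id": {"type": ["null", "string"]},
    "amount": {"type": ["null", "integer"]}
  }
}
```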
d
The payload sends a different set of keys for each record. Any suggestions for where to go or how to set the schema dynamically per record?
p
Oh yeah, that's a little trickier. I haven't used this before, but you could try the `force-fields` config option for target-bigquery to coerce the field to a string, and then downstream use BigQuery's JSON functions to parse out the nested fields you care about.
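Once the field lands as a JSON string, the downstream parsing would look something like this (table and field names are made up):

```
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.customer.id') AS customer_id
FROM `my-project.my_dataset.my_table`
```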
d
Wanted to update the thread for posterity and say that the `force-fields` option does in fact work in practice. Thank you @prratek_ramchandani!