# troubleshooting
t
Hey guys, we are currently experimenting with Meltano and ran into our first issues when using `target-bigquery` with different taps. My specific example is `tap-stripe` (the singer-io variant, `git+https://github.com/singer-io/tap-stripe.git@v1.4.8`). When running

```
TARGET_BIGQUERY_DATASET_ID=meltano_stripe meltano elt tap-stripe target-bigquery
```

it raises the following error:

```
target-bigquery | CRITICAL 'RECORD'
target-bigquery | CRITICAL ['Traceback (most recent call last):\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/__init__.py", line 103, in main\n    for state in state_iterator:\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/process.py", line 54, in process\n    for s in handler.handle_record_message(msg):\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/processhandler.py", line 177, in handle_record_message\n    nr = format_record_to_schema(nr, self.bq_schema_dicts[stream])\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/schema.py", line 389, in format_record_to_schema\n    record[k] = conversion_dict[bq_schema[k]["type"]](v)\n', "KeyError: 'RECORD'\n"]
meltano         | Loading failed (2): CRITICAL ['Traceback (most recent call last):\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/__init__.py", line 103, in main\n    for state in state_iterator:\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/process.py", line 54, in process\n    for s in handler.handle_record_message(msg):\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/processhandler.py", line 177, in handle_record_message\n    nr = format_record_to_schema(nr, self.bq_schema_dicts[stream])\n', '  File "/Users/thomas/Agrando/agr-meltano/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/schema.py", line 389, in format_record_to_schema\n    record[k] = conversion_dict[bq_schema[k]["type"]](v)\n', "KeyError: 'RECORD'\n"]
meltano         | ELT could not be completed: Loader failed
ELT could not be completed: Loader failed
```
We figured out that this seems to have to do with how the schema is defined in the catalog: e.g. sometimes fields are of type `object` and have no properties defined. We managed to patch this by parsing the catalog.json and replacing parts of the schema with a script. However, some fields still cause issues: for example, when extracting only `customer.*` with

```
sh extract/tap-stripe-patch/patch-tap-stripe-catalog.sh && TARGET_BIGQUERY_DATASET_ID=meltano_stripe meltano elt tap-stripe target-bigquery --catalog=extract/tap-stripe-patch/tap-stripe-catalog.json
```

we get the same error. We figured out that the `customers.sources` field is one of the causes, but even when we exclude it, it still seems to produce those errors. When are those fields excluded? Shouldn't the tap already take care of that? Maybe you have some tips and tricks for us. This is currently a big blocker, because of the taps we need, the only one we managed to get running with BigQuery is `tap-gitlab`.
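For reference, the patch script boils down to something like the following. This is only a minimal sketch: the string fallback is just the choice we made locally, and `anyOf` variants (like the `customers.sources` field above) are not handled here.

```
import json

CATALOG = "extract/tap-stripe-patch/tap-stripe-catalog.json"

def patch_schema(node):
    """Give 'object' fields that define no properties a concrete fallback type."""
    if not isinstance(node, dict):
        return
    if "object" in node.get("type", []) and not node.get("properties"):
        # No nested properties defined: coerce the whole field to a string.
        node["type"] = ["null", "string"]
    # Recurse into nested properties and array items.
    for child in node.get("properties", {}).values():
        patch_schema(child)
    if isinstance(node.get("items"), dict):
        patch_schema(node["items"])

with open(CATALOG) as f:
    catalog = json.load(f)

for stream in catalog["streams"]:
    patch_schema(stream["schema"])

with open(CATALOG, "w") as f:
    json.dump(catalog, f, indent=2)
```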
a
> We figured out that this seems to have to do with how the schema is defined in the catalog: e.g. sometimes fields are of type `object` and have no properties defined.
There's a discussion going on here regarding variant object types in the SDK. While technically speaking, JSON Schema allows variant objects (objects with no defined properties), I am not sure which targets (if any) support this - either at the top level of the stream (unlikely) or in nested nodes (more feasible but also unsure).
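To make that concrete: by a variant object I mean a schema node like this, which is valid JSON Schema but gives the target nothing to derive a column structure from:

```
{
  "type": ["null", "object"]
}
```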
> KeyError: 'RECORD'
I could be wrong but this specific key error seems indicative of something else besides a missing property definition - I'd expect a key error on something like 'cust_id' or 'address' or something in the subnodes of the record. It looks like it is having a problem parsing the RECORD message itself - which could indicate a bug in the tap or just confusing logging in the target.
Have you tried with Meltano's `--log-level=debug` flag, by chance? This might produce additional hints.
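Something like this, mirroring the command above:

```
TARGET_BIGQUERY_DATASET_ID=meltano_stripe meltano --log-level=debug elt tap-stripe target-bigquery
```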
t
Hey AJ, thanks for the hint, you actually brought me a bit further: in `target-bigquery` there is a line

```
record[k] = conversion_dict[bq_schema[k]["type"]](v)
```

which leads to the error. When I put a breakpoint there, we have the following situation:

```
k = 'sources'
bq_schema[k] = {'type': 'RECORD', 'mode': 'NULLABLE', 'fields': []}
bq_schema[k]["type"] = 'RECORD'
conversion_dict = {
    'BYTES': <class 'bytes'>, 
    'STRING': <class 'str'>,
    'TIME': <class 'str'>, 
    'TIMESTAMP': <class 'str'>, 
    'DATE': <class 'str'>, 
    'DATETIME': <class 'str'>, 
    'FLOAT': <class 'float'>, 
    'NUMERIC': <class 'float'>, 
    'BIGNUMERIC': <class 'float'>, 
    'INTEGER': <class 'int'>, 
    'BOOLEAN': <class 'bool'>, 
    'GEOGRAPHY': <class 'str'>, 
    'DECIMAL': <class 'str'>, 
    'BIGDECIMAL': <class 'str'>
}
```
So the `conversion_dict` has no conversion for the type `RECORD`. The function docstring says:

> RECORD is not included into conversion_dict - it is done on purpose. RECORD is handled recursively.
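The failing line can be reproduced in isolation with the schema value from the breakpoint (the record value here is made up):

```
# Abridged conversion_dict from the breakpoint; 'RECORD' is deliberately absent.
conversion_dict = {"STRING": str, "INTEGER": int}

# The schema entry observed at the breakpoint: a RECORD with no sub-fields.
bq_schema = {"sources": {"type": "RECORD", "mode": "NULLABLE", "fields": []}}
record = {"sources": {"id": "src_123"}}  # illustrative record value

k, v = "sources", record["sources"]
# RECORD is meant to be handled recursively, so it has no entry in
# conversion_dict and this lookup raises KeyError: 'RECORD'.
record[k] = conversion_dict[bq_schema[k]["type"]](v)
```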
Looking into the catalog, the `sources` field is defined as follows:

```
"sources": {
    "anyOf": [
      {
        "type": [
          "null",
          "array"
        ],
        "items": {
          "type": [
            "null",
            "object"
          ],
          "properties": {...}
        }
      },
      {
        "type": [
          "null",
          "object"
        ],
        "properties": {...}
      }
    ]
},
```
So it seems that somewhere in the conversion of the schema something goes wrong. However, I still wonder why unselecting the field does not have any effect. Shouldn't the tap then also exclude this field from the schema and the record data? Sorry, I am pretty new to the Singer spec.
Ok, I double-checked with the tap. It seems the tap does filter the record's fields based on the selection. The schema, however, is not filtered. Is this intended Singer behavior?
a
Hi, @thomas_schmidt. The spec is not very specific in this regard, and we've actually found that it's very common for taps to filter the record messages without altering/filtering their corresponding schema message. (The SDK does both automatically, specifically because we wanted this to be consistent and easier for developers of future taps.)
The recursive behavior sounds like the driving issue here. You may be able to send a custom catalog without that issue, or else (if you have the ability on your source system) you could try casting the data types in the source to be a more compatible type, or perhaps creating a "view" on top of the raw table, with the view having a simplified schema.
t
Awesome, thanks for the info. I was actually thinking about contributing to the tap to additionally filter the schema. I saw that Meltano also forked `tap-stripe`, but it hasn't been updated for quite a while. Do you by chance know what the best approach for a contribution would be here? I already have some schema filtering running locally.
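The local filtering is roughly this idea (a sketch of the approach, not the exact code):

```
def filter_schema_by_selection(schema, metadata):
    """Drop unselected top-level properties from a stream's SCHEMA message.

    `metadata` is the stream's catalog metadata list; breadcrumbs like
    ["properties", "sources"] carry the per-field selection flags.
    """
    selected = {
        tuple(entry["breadcrumb"]): entry["metadata"].get("selected", True)
        for entry in metadata
    }
    filtered = dict(schema)
    filtered["properties"] = {
        name: prop
        for name, prop in schema.get("properties", {}).items()
        if selected.get(("properties", name), True)
    }
    return filtered
```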
@aaronsteers I actually created an MR in the original singer-io tap: https://github.com/singer-io/tap-stripe/pull/96. Let me know if it is ok that I reused and adjusted some of the code from the Meltano SDK (I want to make sure that everything is fine there). I would also be keen to implement this in the Meltano fork and update it properly.
a
Nice, @thomas_schmidt! Yeah, you are free to reuse anything in the SDK codebase - we have it licensed with the permissive Apache 2.0 for that reason.
I was just discussing with @edgar_ramirez_mondragon that we may in the future invest in stable and importable helper functions for use cases like this one - where you want to leverage an SDK capability without fully refactoring the tap.
t
Nice I think this would be awesome. Maybe even a small helper library one could install in other projects!
e
100%. And also fwiw @prratek_ramchandani started building an SDK-based tap-stripe a while ago: https://github.com/prratek/tap-stripe
p
and IIRC i tested ^that with the same bigquery target
t
Wow cool. @prratek_ramchandani is it in a usable state already? Maybe we could just adapt that one instead and try to push it a bit in case you need support
p
umm…i think so? lol sorry i haven’t touched it in a while, but we’ll probably start using it in production soon so if you run into any trouble and open an issue i’d be happy to take a look
d
Stumbled on this thread while trying to fix a similar issue one of the users of `tap-hellobaton` is having with `target-bigquery`. The error message is identical to the one listed here. From what I'm gathering, the target doesn't handle nullable nested types very well. They aren't using a catalog at all, so I'm wondering if anyone here might be able to talk me through the fix to `tap-stripe`, or at least point me in the right direction.
p
To my knowledge the fix is to ensure that you have a fully specified schema. The target doesn't know how to handle cases where you have a field of type `object` but don't specify its nested properties or their types, so you'd want to make sure any `object` type field has `properties` where each property also specifies a `type`, as in the example below.
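Every `object` node should look something like this (the field names are placeholders):

```
"metadata": {
  "type": ["null", "object"],
  "properties": {
    "id": {"type": ["null", "string"]},
    "amount": {"type": ["null", "integer"]}
  }
}
```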
d
The payload sends a different set of keys for each record. Any suggestions for where to go or how to set the schema dynamically per record?
p
Oh yeah, that's a little trickier. I haven't used this before, but you could try the `force-fields` config option for target-bigquery to coerce the field to a string, and then downstream use BigQuery's JSON functions to parse out the nested fields you care about.
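Once the field lands as a JSON string, the downstream parsing would look something like this (table and field names are made up):

```
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.customer.id') AS customer_id
FROM `my-project.my_dataset.my_table`
```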
d
Wanted to update the thread for posterity and say that the `force-fields` option does in fact work in practice. Thank you @prratek_ramchandani!