# singer-tap-development
s
Got some arbitrary json data. Right now it’s going to BigQuery but I’d also like to support loading into a Snowflake variant column. The `ObjectType` seems to require that I know the full schema. How can I support arbitrary object schemas?
e
hey @sterling_paramore! would an instantiation with no args, e.g. `ObjectType()`, work? Or I guess, what's the raw JSON schema you expect? You could also use a `CustomType` with the exact JSON schema if none of the available typing classes is useful.
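
A minimal sketch of both suggestions, using the Singer SDK's typing helpers (the property names here are placeholders, not anything from the tap):

```python
# Sketch of both options discussed above (hypothetical property names).
from singer_sdk import typing as th

schema = th.PropertiesList(
    # Option 1: ObjectType with no args, which emits the empty-properties
    # object schema seen in the log below, i.e. {'properties': {}, 'type': ...}
    th.Property("record_owner", th.ObjectType()),
    # Option 2: CustomType takes the exact JSON schema to emit, e.g. a
    # free-form object with no "properties" key at all:
    th.Property("raw_blob", th.CustomType({"type": ["object", "null"]})),
).to_dict()
```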
s
Using `ObjectType()` works when the target is `target-jsonl`, but it fails when it’s `target-bigquery`. With BigQuery, I get this warning in the log:
```
2022-05-20T14:49:21.437306Z [info     ] WARNING the pipeline might fail because of undefined fields: an empty object/dictionary indicated as {} cmd_type=loader job_id=testbq name=target-bigquery run_id=983b61ea-9766-476f-85ae-93a610fbdd60 stdio=stderr
```
This warning happens right after it prints the schema to the log, which I’ve truncated here for readability (the table is `wo_tags` and the field causing problems is `record_owner`, which is a json field in the source):
```
2022-05-20T14:49:21.431483Z [info     ] INFO wo_tags schema: {'properties': {'wo_tag': {'type': ['string', 'null']}, 'related_part_number': {'type': ['string', 'null']}, ... 'record_owner': {'properties': {}, 'type': ['object', 'null']}, ...}, 'type': 'object'} cmd_type=loader job_id=testbq name=target-bigquery run_id=983b61ea-9766-476f-85ae-93a610fbdd60 stdio=stderr
2022-05-20T14:49:21.437306Z [info     ] WARNING the pipeline might fail because of undefined fields: an empty object/dictionary indicated as {} cmd_type=loader job_id=testbq name=target-bigquery run_id=983b61ea-9766-476f-85ae-93a610fbdd60 stdio=stderr
```
Note the `'properties': {}` bit. The error happens after some of the records are output to the log. The actual error seems to come from the BigQuery target:
```
2022-05-20T14:49:25.122330Z [error    ] Loading failed                 code=2 job_id=testbq message=CRITICAL ['Traceback (most recent call last):\n', '  File "/Users/sterling.paramore/code/meltano-projects/tap-quickbase-json/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/__init__.py", line 129, in main\n    for state in state_iterator:\n', '  File "/Users/sterling.paramore/code/meltano-projects/tap-quickbase-json/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/process.py", line 54, in process\n    for s in handler.handle_record_message(msg):\n', '  File "/Users/sterling.paramore/code/meltano-projects/tap-quickbase-json/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/processhandler.py", line 179, in handle_record_message\n    nr = format_record_to_schema(nr, self.bq_schema_dicts[stream])\n', '  File "/Users/sterling.paramore/code/meltano-projects/tap-quickbase-json/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/schema.py", line 359, in format_record_to_schema\n    record[k] = conversion_dict[bq_schema[k]["type"]](v)\n', "KeyError: 'RECORD'\n"] name=meltano run_id=983b61ea-9766-476f-85ae-93a610fbdd60
```
Hrmmm…. looks like this is maybe a known issue.
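
For what it’s worth, the traceback points at a plain dict lookup: `format_record_to_schema` dispatches on the BigQuery column type via `conversion_dict`, and `'RECORD'` has no entry on that code path, so a field mapped to a RECORD column (like the empty object) raises. A toy reproduction of the pattern, not target-bigquery’s actual code:

```python
# Toy reproduction of the lookup in the traceback above; the real conversion
# table differs, but the failure mode is the same: no "RECORD" key.
conversion_dict = {"STRING": str, "INTEGER": int, "FLOAT": float}
bq_schema = {"record_owner": {"type": "RECORD"}}
record = {"record_owner": {}}

for k, v in record.items():
    record[k] = conversion_dict[bq_schema[k]["type"]](v)  # KeyError: 'RECORD'
```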
e
s
I don’t really like that reasoning. Some json data is schemaless. I know BigQuery doesn’t support that kind of data well, but other targets do. The whole point of the Singer spec is that I don’t have to think much about my targets when I’m developing a tap.
n
In the past, as a workaround for BQ (when the schema isn’t well known or properly defined), I’ve dumped the data as key|value, i.e. two columns in BQ: a key, plus the complete payload dumped as-is into a string column that gets parsed later … but that really just moves the need to parse the structure upstream a bit more 🙄
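
A rough sketch of that key|value workaround (the field names and helper here are hypothetical; the same idea works in any tap):

```python
# Collapse an arbitrary record into two STRING columns: a key and the raw
# payload. BigQuery then never sees a RECORD, and parsing the structure is
# deferred to downstream SQL.
import json

from singer_sdk import typing as th

schema = th.PropertiesList(
    th.Property("key", th.StringType),      # e.g. the record's natural key
    th.Property("payload", th.StringType),  # full record as a JSON string
).to_dict()

def to_key_value(record: dict, key_field: str) -> dict:
    """Reshape an arbitrary record into the two-column schema above."""
    return {"key": str(record.get(key_field)), "payload": json.dumps(record)}

# Example:
# to_key_value({"wo_tag": "T-1", "record_owner": {"id": 7}}, key_field="wo_tag")
# -> {"key": "T-1", "payload": '{"wo_tag": "T-1", "record_owner": {"id": 7}}'}
```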