# singer-tap-development
s
Got some arbitrary json data. Right now it’s going to BigQuery but I’d also like to support loading into a Snowflake variant column. The `ObjectType` seems to require that I know the full schema. How can I support arbitrary object schemas?
e
hey @sterling_paramore! would an instantiation with no args, e.g. `ObjectType()`, work? Or I guess, what's the raw JSON schema you expect? You could also use a `CustomType` with the exact JSON schema if none of the available typing classes is useful.
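
A minimal sketch of both suggestions, using the Singer SDK's typing helpers (the property names here are placeholders, not anything from the tap):

```python
# Sketch of both options discussed above (hypothetical property names).
from singer_sdk import typing as th

schema = th.PropertiesList(
    # Option 1: ObjectType with no args, which emits the empty-properties
    # object schema seen in the log below, i.e. {'properties': {}, 'type': ...}
    th.Property("record_owner", th.ObjectType()),
    # Option 2: CustomType takes the exact JSON schema to emit, e.g. a
    # free-form object with no "properties" key at all:
    th.Property("raw_blob", th.CustomType({"type": ["object", "null"]})),
).to_dict()
```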
s
Using `ObjectType()` works when the target is `target-jsonl`, but it fails when it’s `target-bigquery`. With BigQuery, I get this warning in the log:
```
2022-05-20T14:49:21.437306Z [info     ] WARNING the pipeline might fail because of undefined fields: an empty object/dictionary indicated as {} cmd_type=loader job_id=testbq name=target-bigquery run_id=983b61ea-9766-476f-85ae-93a610fbdd60 stdio=stderr
```
This warning happens right after it prints the schema to the log, which I’ve truncated here for readability (the table is `wo_tags` and the field causing problems is `record_owner`, which is a json field in the source):
```
2022-05-20T14:49:21.431483Z [info     ] INFO wo_tags schema: {'properties': {'wo_tag': {'type': ['string', 'null']}, 'related_part_number': {'type': ['string', 'null']}, ... 'record_owner': {'properties': {}, 'type': ['object', 'null']}, ...}, 'type': 'object'} cmd_type=loader job_id=testbq name=target-bigquery run_id=983b61ea-9766-476f-85ae-93a610fbdd60 stdio=stderr
2022-05-20T14:49:21.437306Z [info     ] WARNING the pipeline might fail because of undefined fields: an empty object/dictionary indicated as {} cmd_type=loader job_id=testbq name=target-bigquery run_id=983b61ea-9766-476f-85ae-93a610fbdd60 stdio=stderr
```
Note the `'properties': {}` bit. The error happens after some of the records are output to the log. The actual error seems to come from the BigQuery target:
```
2022-05-20T14:49:25.122330Z [error    ] Loading failed                 code=2 job_id=testbq message=CRITICAL ['Traceback (most recent call last):\n', '  File "/Users/sterling.paramore/code/meltano-projects/tap-quickbase-json/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/__init__.py", line 129, in main\n    for state in state_iterator:\n', '  File "/Users/sterling.paramore/code/meltano-projects/tap-quickbase-json/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/process.py", line 54, in process\n    for s in handler.handle_record_message(msg):\n', '  File "/Users/sterling.paramore/code/meltano-projects/tap-quickbase-json/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/processhandler.py", line 179, in handle_record_message\n    nr = format_record_to_schema(nr, self.bq_schema_dicts[stream])\n', '  File "/Users/sterling.paramore/code/meltano-projects/tap-quickbase-json/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/schema.py", line 359, in format_record_to_schema\n    record[k] = conversion_dict[bq_schema[k]["type"]](v)\n', "KeyError: 'RECORD'\n"] name=meltano run_id=983b61ea-9766-476f-85ae-93a610fbdd60
```
Hrmmm…. looks like this is maybe a known issue.
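
For what it’s worth, the traceback points at a plain dict lookup: `format_record_to_schema` dispatches on the BigQuery column type via `conversion_dict`, and `'RECORD'` has no entry on that code path, so a field mapped to a RECORD column (like the empty object) raises. A toy reproduction of the pattern, not target-bigquery’s actual code:

```python
# Toy reproduction of the lookup in the traceback above; the real conversion
# table differs, but the failure mode is the same: no "RECORD" key.
conversion_dict = {"STRING": str, "INTEGER": int, "FLOAT": float}
bq_schema = {"record_owner": {"type": "RECORD"}}
record = {"record_owner": {}}

for k, v in record.items():
    record[k] = conversion_dict[bq_schema[k]["type"]](v)  # KeyError: 'RECORD'
```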
e
s
I don’t really like that reasoning. Some json data is schemaless. I know BigQuery doesn’t support that kind of data well, but other targets do. The whole point of the Singer spec is that I don’t have to think much about my targets when I’m developing a tap.
n
In the past, as a workaround for BQ (when the schema isn’t well known or properly defined), I’ve dumped the data as key|value, i.e. two columns in BQ: a key, plus the complete payload dumped as-is into a string column that gets parsed later … but that really just moves the need to parse the structure upstream a bit more 🙄
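
A rough sketch of that key|value workaround (the field names and helper here are hypothetical; the same idea works in any tap):

```python
# Collapse an arbitrary record into two STRING columns: a key and the raw
# payload. BigQuery then never sees a RECORD, and parsing the structure is
# deferred to downstream SQL.
import json

from singer_sdk import typing as th

schema = th.PropertiesList(
    th.Property("key", th.StringType),      # e.g. the record's natural key
    th.Property("payload", th.StringType),  # full record as a JSON string
).to_dict()

def to_key_value(record: dict, key_field: str) -> dict:
    """Reshape an arbitrary record into the two-column schema above."""
    return {"key": str(record.get(key_field)), "payload": json.dumps(record)}

# Example:
# to_key_value({"wo_tag": "T-1", "record_owner": {"id": 7}}, key_field="wo_tag")
# -> {"key": "T-1", "payload": '{"wo_tag": "T-1", "record_owner": {"id": 7}}'}
```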