Denis I.
03/29/2023, 7:49 PMtap-spreadsheets-anywhere
?
I have a .json file with the following structure:
[
{
"key1": "value1",
"key2": {
"key3": "value2",
"key4": "value3"
}
},
{
"key1": "value4",
"key2": {
"key3": "value5",
"key4": "value6"
}
}
]
And the tap outputs object as serialized dict string which is not directly parseable as JSON by PostgreSQL:
{'key3':'value2','key4':'value3'}
The tap outputs only string
, integer
, number
and date-time
types, so no option to use object
. The best solution that comes to my mind is to update tap and output all dicts as strings returned by json.dumps()
. Am I missing something here and changes can broke something for someone using the tap?peter_s
03/30/2023, 12:57 AMobject
as the type during discovery, and output nested JSON during sync.Denis I.
03/30/2023, 1:38 PMobject
type would be detected by default, it would cause troubles for everyone who parses/uses dict-formatted strings downstream.
I didn’t find any discussions related to nested objects output as a string by the tap, so I’m not sure if there is anyone who actually use dict-formatted strings 🙂
So there is two options:
1. Auto-detecting objects by default
2. Explicitly set object
type by using schema
or schema_overrides
@Sven Balnojan maybe you heard something from community related to the topic in this thread?Sven Balnojan
03/30/2023, 2:06 PMvisch
03/30/2023, 4:10 PMaaronsteers
03/30/2023, 5:39 PMWhen I think spreadsheet I think csv, excel file, etc, not nested jsonAgreed - JSONL is very dissimilar to the traditional "spreadsheets" paradigm. I don't see a problem with returning objects except that many targets don't do a good job what I refer to as "variant" object structures, meaning objects without subnodes robustly defined. If the tap can fully describe the nested objects structure, definitely an 'object' data type would make sense. And if the user provides us the declared object definitions, I definitely think we can just apply those and then send the data as an object type. Assuming that this is pretty difficult to accomplish though for some use cases, and there are performance implications with the scanning of robust schema info, I think returning as jsonified string is also totally acceptable - especially since the focus of this tap is 'spreadsheets' use cases. In that case, though, the data type is 'string', not 'object', and the parsing as json would have to occur by consumers of the target table, using the target systems json parsing libraries. To the other point proposal, yes, flattening is also a very good option here. I don't see any concerns with that approach if it solves for this use case. PS - Just my two cents. Probably lots of good ways to solve. And also, my sincere apologies for the length. As the saying goes:
If I had more time, I would have written a shorter letter. 😉
Denis I.
03/31/2023, 11:06 AMIf the tap can fully describe the nested objects structure, definitely an ‘object’ data type would make senseSo that means that it’s a bad idea to output
object
type element with a dictionary itself without any schema for nested elements?
I tested the following model: json with nested elements
→ tap-spreadsheets-anywhere
→ transferwise.target-postgres
Original JSON-file:
[
{ "top_element": { "nested_element1": 1, "nested_element2": 2 },
{ "top_element": { "nested_element1": 3, "nested_element2": 3 },
]
The tap generated schema:
"top_element": {
"type": [
"null",
"object"
]
},
The target inferred table schema:
"top_element" jsonb,
And finally stored json value in DB:
{
"nested_element1": 1,
"nested_element2": 2
}
However, the flattening doesn’t work if the tap doesn’t define schema of nested elements.
schema:
my_json_stream:
top_element:
type: ["null", "object"]
properties:
nested_element1:
type: ["null", "integer"]
nested_element2:
type: ["null", "integer"]
After I explicitly configured schema in the tap’s config and enabled flattening in the target config everything worked well and generated two table fields: top_element__nested_element1
and top_element__nested_element2
.peter_s
04/11/2023, 10:10 PMobject
type or as a nested structure, and during sync emitted them as nested objects? That would allow them to be properly modeled by targets such pipelinewise-target-snowflake and pipelinewise-target-postgres which do know how to handle these objects.
It’s true that JSON / JSONL are dissimilar to the traditional “spreadsheets” paradigm, but this tap has already expanded its scope to partially support these formats, so an option like this would allow it to support them more fully.
And I think it’s quite useful for there to be a single multi-storage-system tap which supports a variety of file formats.Denis I.
04/11/2023, 11:28 PM