Have anybody else tried to get nested json objects...
# troubleshooting
d
Have anybody else tried to get nested json objects from
tap-spreadsheets-anywhere
? I have a .json file with the following structure:
Copy code
[
    {
        "key1": "value1",
        "key2": {
            "key3": "value2",
            "key4": "value3"
        }
    },
    {
        "key1": "value4",
        "key2": {
            "key3": "value5",
            "key4": "value6"
        }
    }
]
And the tap outputs object as serialized dict string which is not directly parseable as JSON by PostgreSQL:
Copy code
{'key3':'value2','key4':'value3'}
The tap outputs only
string
,
integer
,
number
and
date-time
types, so no option to use
object
. The best solution that comes to my mind is to update tap and output all dicts as strings returned by
json.dumps()
. Am I missing something here and changes can broke something for someone using the tap?
p
I’ve had a similar experience, though my target is Snowflake. I’d like to see the tap output
object
as the type during discovery, and output nested JSON during sync.
d
I’ve made changes for the tap locally, everything works like a charm, even target flattening feature! The only problem I see is the discovery. If
object
type would be detected by default, it would cause troubles for everyone who parses/uses dict-formatted strings downstream. I didn’t find any discussions related to nested objects output as a string by the tap, so I’m not sure if there is anyone who actually use dict-formatted strings 🙂 So there is two options: 1. Auto-detecting objects by default 2. Explicitly set
object
type by using
schema
or
schema_overrides
@Sven Balnojan maybe you heard something from community related to the topic in this thread?
s
@Denis I. I thought I saw a discussion on this topic in slack sometime ago (involving mappers as well I think). I don't think I have anything to add here 😄 I'm going to tag the super busy @aaronsteers (and @visch), maybe he does know whether this is already an ongoing discussion somewhere on the singer side of things.
d
I’ve opened an issue and committed implementation of the 1st option here. The 2nd option could be implemented by one line change here:
desired_type in (None, 'object')
→`desired_type in ('object')`
v
When I think spreadsheet I think csv, excel file, etc, not nested json. My first thought is tap-spreadsheets doesn't make a ton of sense for nested json 🤷
a
When I think spreadsheet I think csv, excel file, etc, not nested json
Agreed - JSONL is very dissimilar to the traditional "spreadsheets" paradigm. I don't see a problem with returning objects except that many targets don't do a good job what I refer to as "variant" object structures, meaning objects without subnodes robustly defined. If the tap can fully describe the nested objects structure, definitely an 'object' data type would make sense. And if the user provides us the declared object definitions, I definitely think we can just apply those and then send the data as an object type. Assuming that this is pretty difficult to accomplish though for some use cases, and there are performance implications with the scanning of robust schema info, I think returning as jsonified string is also totally acceptable - especially since the focus of this tap is 'spreadsheets' use cases. In that case, though, the data type is 'string', not 'object', and the parsing as json would have to occur by consumers of the target table, using the target systems json parsing libraries. To the other point proposal, yes, flattening is also a very good option here. I don't see any concerns with that approach if it solves for this use case. PS - Just my two cents. Probably lots of good ways to solve. And also, my sincere apologies for the length. As the saying goes:
If I had more time, I would have written a shorter letter. 😉
d
If the tap can fully describe the nested objects structure, definitely an ‘object’ data type would make sense
So that means that it’s a bad idea to output
object
type element with a dictionary itself without any schema for nested elements? I tested the following model:
json with nested elements
tap-spreadsheets-anywhere
transferwise.target-postgres
Original JSON-file:
Copy code
[
{ "top_element": { "nested_element1": 1, "nested_element2": 2 },
{ "top_element": { "nested_element1": 3, "nested_element2": 3 },
]
The tap generated schema:
Copy code
"top_element": {
  "type": [
    "null",
    "object"
  ]
},
The target inferred table schema:
Copy code
"top_element" jsonb,
And finally stored json value in DB:
Copy code
{
  "nested_element1": 1,
  "nested_element2": 2
}
However, the flattening doesn’t work if the tap doesn’t define schema of nested elements.
Copy code
schema:
  my_json_stream:
    top_element:
      type: ["null", "object"]
      properties:
        nested_element1:
          type: ["null", "integer"]
        nested_element2:
          type: ["null", "integer"]
After I explicitly configured schema in the tap’s config and enabled flattening in the target config everything worked well and generated two table fields:
top_element__nested_element1
and
top_element__nested_element2
.
p
How about if the tap had a non-default option which during discovery declared nested objects either as
object
type or as a nested structure, and during sync emitted them as nested objects? That would allow them to be properly modeled by targets such pipelinewise-target-snowflake and pipelinewise-target-postgres which do know how to handle these objects. It’s true that JSON / JSONL are dissimilar to the traditional “spreadsheets” paradigm, but this tap has already expanded its scope to partially support these formats, so an option like this would allow it to support them more fully. And I think it’s quite useful for there to be a single multi-storage-system tap which supports a variety of file formats.
d
I pushed a naive implementation a while ago: https://github.com/ets/tap-spreadsheets-anywhere/pull/51, and use a fork in my projects. Also I’ve built separate taps for csv/json/etc with a common transfer lib codebase as a proof of concept. It showed that the format parsers codebase also mostly common between them. So probably it’s a good idea to keep them together in one tap. And I believe the option to regulate nested objects behavior should work as well as an recursive schema discovery (conversion.py), but it requires some research to implement it.