# troubleshooting
a
I’m creating a custom tap for an API that returns an inconsistent data type: the field is an array of strings, but if it has never been set, the API (for some reason) returns false. What is the “most correct” way to handle a case like this? I’ve run some tests with stream_maps to try to map the false rows to an empty array, but simpleeval does not allow operations on lists. As a partial solution, I’m doing this check in post_process, but as I have many similar fields I’m not sure this is the way to go.
This is a response example
{
    "result": [
        {
            "ID": "59",
            "UF_CRM_1689847377": [
                "one value",
                "another value"
            ]
        },
        {
            "ID": "73",
            "UF_CRM_1689847377": false
        },
        {
            "ID": "77",
            "UF_CRM_1689847377": []
        }
    ]
}
v
I think you can do
type: ["array", "boolean"]
If you have an opinion on how it should be modeled then I think
post_process
is a good option as well (make false be an empty array or something)
a
I think you can do
type: ["array", "boolean"]
I considered this option, it seems all records were then mapped to either true or false, so even the records that had the proper array turned out with wrong values 😕
If you have an opinion on how it should be modeled then I think
post_process
is a good option as well (make false be an empty array or something)
Yeah, that’s the way I’m headed right now
v
I considered this option, it seems all records were then mapped to either true or false, so even the records that had the proper array turned out with wrong values 😕
That doesn't make sense, can you elaborate?
a
I was just afraid I could basically be “rewriting” the schema validation all over again
That doesn’t make sense, can you elaborate?
Sure! Let me try to give more context. I’m attaching an image of the API response I’m getting in Postman, showing some entities, including examples with an array of strings, an empty array, and a boolean. When I use the schema for the tap with this configuration:
{
  "properties": {
    "ID": {
      "type": [
        "string"
      ],
      "description": "ID"
    },
    "UF_CRM_1689847377": {
      "type": [
        "array",
        "boolean"
      ],
      "items": {
        "type": [
          "string"
        ]
      }
    }
  }
}
This is how it is being mapped after going through target-jsonl:
{"ID": "59", "UF_CRM_1689847377": true}
{"ID": "73", "UF_CRM_1689847377": false}
{"ID": "77", "UF_CRM_1689847377": true}
{"ID": "111", "UF_CRM_1689847377": false}
{"ID": "129", "UF_CRM_1689847377": false}
{"ID": "191", "UF_CRM_1689847377": false}
{"ID": "305", "UF_CRM_1689847377": false}
When I go back to using only "type": ["array"] instead of "type": ["array", "boolean"]:
{
  "properties": {
    "ID": {
      "type": [
        "string"
      ],
      "description": "ID"
    },
    "UF_CRM_1689847377": {
      "type": [
        "array"
      ],
      "items": {
        "type": [
          "string"
        ]
      }
    }
  }
}
And with post-processing:
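def post_process(self, row: dict, context: Optional[dict] = None) -> dict:
    # The API returns False for a field that was never set; normalize it to an empty array.
    # "is False" avoids matching other falsy values such as 0 or an already-empty list.
    if row.get('UF_CRM_1689847377') is False:
        row['UF_CRM_1689847377'] = []
    return row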
The result is as I expected:
{"ID": "59", "UF_CRM_1689847377": ["one value", "another value"]}
{"ID": "73", "UF_CRM_1689847377": []}
{"ID": "77", "UF_CRM_1689847377": []}
{"ID": "111", "UF_CRM_1689847377": []}
{"ID": "129", "UF_CRM_1689847377": []}
{"ID": "191", "UF_CRM_1689847377": []}
{"ID": "305", "UF_CRM_1689847377": []}
v
When I use the schema for the tap with this configuration:
How does the tap read that? Are you passing it in via the catalog or directly as the schema for your stream?
There's a bug somewhere is my point
e
I think this schema
{
  "properties": {
    "ID": {
      "type": [
        "string"
      ],
      "description": "ID"
    },
    "UF_CRM_1689847377": {
      "type": [
        "array",
        "boolean"
      ],
      "items": {
        "type": [
          "string"
        ]
      }
    }
  }
}
results in boolean values because of how we conform any boolean property in the schema: https://github.com/meltano/sdk/blob/f6bbf0c5ddba689ee1ab6df1f110529f35adf12f/singer_sdk/helpers/_typing.py#L470-L492. One solution for your tap may be to disable conforming, i.e. setting TYPE_CONFORMANCE_LEVEL to TypeConformanceLevel.NONE in the stream(s): https://sdk.meltano.com/en/v0.31.0/classes/singer_sdk.Stream.html#singer_sdk.Stream.TYPE_CONFORMANCE_LEVEL. I personally like the post_process approach better, especially if there's a way to tell which props need to be processed beforehand, e.g. they have a special prefix.
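To make the conforming behavior a bit more concrete, here is a rough sketch of what seems to happen to a value once its schema lists boolean among its types. Note that conform_boolean is a hypothetical, simplified stand-in written for illustration, not the SDK's actual function; it assumes the logic is roughly "None stays None, anything equal to 0 becomes False, everything else becomes True".

```python
from typing import Any, Optional


def conform_boolean(elem: Any) -> Optional[bool]:
    """Hypothetical, simplified stand-in for boolean conforming (not the SDK's code)."""
    if elem is None:
        return None
    # False == 0 in Python, so a real False stays False...
    if elem == 0:
        return False
    # ...but a list, even an empty one, is never equal to 0, so it becomes True.
    return True


# Values taken from the example response earlier in the thread.
samples = {
    "59": ["one value", "another value"],
    "73": False,
    "77": [],
}

conformed = {record_id: conform_boolean(value) for record_id, value in samples.items()}
print(conformed)
```

Under that assumption, records 59 and 77 come out as true and record 73 as false, which is the same pattern as the target-jsonl output posted above.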
a
How does the tap read that? Are you passing it in via the catalog or directly as the schema for your stream?
Right now I’m using it directly as the schema of the stream, but eventually I’ll extract it to be passed via the catalog.
especially if there’s a way to tell which props need to be processed beforehand
Unfortunately I don’t think that’s the case, as I’m extracting data from a highly customizable CRM, so I don’t think I can assume a prefix will always be present. I was only concerned about doing it in post_process because there can be dozens (even hundreds) of fields in this condition 😕
Thanks for the help so far by the way guys! 🙏
r
We've bumped into this exact issue in the past and just ended up using post_process, as we found there isn't really a better solution available at the moment. Check out this PR to tap-auth0 for context. I also made this issue, which contains links off to some Slack discussions and related issues - I'd be interested in your thoughts after having a look through those. 🙂
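For the case where dozens of fields behave this way and no naming prefix is available, one option might be to let the schema itself say which fields need the fix, instead of listing them by hand. The sketch below is only an illustration under that assumption: array_property_names and false_to_empty_array are hypothetical helper names, and it only looks at top-level properties.

```python
from typing import Any, Dict, Set


def array_property_names(schema: Dict[str, Any]) -> Set[str]:
    """Return the names of top-level properties whose type includes "array"."""
    names = set()
    for name, prop in schema.get("properties", {}).items():
        prop_type = prop.get("type", [])
        # "type" may be a single string or a list of strings.
        types = [prop_type] if isinstance(prop_type, str) else prop_type
        if "array" in types:
            names.add(name)
    return names


def false_to_empty_array(row: Dict[str, Any], array_fields: Set[str]) -> Dict[str, Any]:
    """Replace a literal False with [] for every array-typed field in the row."""
    for name in array_fields:
        # "is False" leaves real arrays, None, and missing keys untouched.
        if row.get(name) is False:
            row[name] = []
    return row


# Schema and rows taken from the thread above.
schema = {
    "properties": {
        "ID": {"type": ["string"], "description": "ID"},
        "UF_CRM_1689847377": {"type": ["array"], "items": {"type": ["string"]}},
    }
}
rows = [
    {"ID": "59", "UF_CRM_1689847377": ["one value", "another value"]},
    {"ID": "73", "UF_CRM_1689847377": False},
    {"ID": "77", "UF_CRM_1689847377": []},
]

array_fields = array_property_names(schema)
cleaned = [false_to_empty_array(dict(row), array_fields) for row in rows]
for row in cleaned:
    print(row)
```

Inside a stream, post_process could then call something like false_to_empty_array(row, array_fields), with array_fields computed once from the stream's schema rather than on every row. Nested objects would need a recursive variant, which this sketch does not attempt.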