# singer-tap-development
d
Hi everyone - running into a situation where the JSON responses I get are all string types regardless of what the API docs say they're supposed to be. My worry is this is going to throw all sorts of schema validation errors when I run this with `meltano elt`. So my question is - is there a good pattern for type casting the data in transit based on the JSON schema, or do I just accept that everything is going to land in the database as strings and deal with it later?
a
As a tap developer, if you're confident that a "string" result is always going to be an int, I'd say go ahead and: 1. Type it as an int in your JSON Schema (`schema`) declaration. 2. Convert it to actually be so in `post_process()` if using the SDK.
The challenge here is "what if we're wrong" - as this would create a hard failure for users at runtime. If there's a strong chance that the string will not be convertible to "int" (for instance), then the safe route is just to be pessimistic and send the value(s) as strings. This is a moderate headache for downstream transformations, but generally not too bad in my experience, since the first layer of the dbt/transform layer is generally type casting and naming conventions anyway.
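For what it's worth, here's a rough sketch of what those two steps can look like together in an SDK stream class - the stream and field names are made up, and other required attributes like `url_base` are omitted:
```python
from singer_sdk import typing as th  # JSON Schema typing helpers
from singer_sdk.streams import RESTStream


class WidgetsStream(RESTStream):
    """Hypothetical stream; other required attributes (url_base, etc.) omitted."""

    name = "widgets"
    path = "/widgets"

    # 1. Declare the field as an integer in the JSON Schema.
    schema = th.PropertiesList(
        th.Property("user_id", th.IntegerType),
        th.Property("description", th.StringType),
    ).to_dict()

    # 2. Convert the incoming string so the record matches the declared type.
    def post_process(self, row, context=None):
        if row.get("user_id") is not None:
            row["user_id"] = int(row["user_id"])
        return row
```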
Is this helpful at all? Others probably have their own experiences and suggestions, but this 👆 is how I generally think of it.
d
This is very helpful! I only have a small follow-up question on how to access a stream's schema via the SDK objects so I can safely type cast using `post_process`. I'm going to test this out locally and see how bad it is, but otherwise I'll probably take your suggestion and just deal with it in the database.
a
We have helpers to create JSON Schema dicts, but parsing them out is a harder task which we don't have helpers for as of now. Also, because `post_process()` runs per record, you'd likely get better performance with hard-coded transforms versus a dynamic operation that determines which transformations to apply by checking each field in the schema. So, if you know which properties you want to convert to ints, for example:
```python
def post_process(self, record, context=None):
    # Cast the fields you know should be integers; skip missing/null values.
    for int_field in ["user_id", "num_widgets", "age"]:
        if record.get(int_field) is not None:
            record[int_field] = int(record[int_field])
    return record
```
d
Ah okay, makes sense. I was hoping to do something like this, but it sounds like the JSON schema isn't easily accessible from within the stream object:
```python
def post_process(self, row, context=None):
    # self.schema is the stream's JSON Schema dict
    types = self.schema
    for key in row.keys():
        if types['properties'][key]['type'] == 'integer':
            row[key] = int(row[key])
    return row
```
a
This would totally work, as long as you account for a few different ways the 'integer' type can be expressed - or if you control for it in the way you declare the schema. For instance, we often use 'type' of ['integer', 'null'] to indicate a nullable integer. And technically, there are also 'anyOf' operations.
Totally up to your discretion though. I can see this working - especially since you are the one declaring the schema - but just want to call out that parsing is sometimes messy 🙃
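If you do end up parsing the schema, a rough sketch of a check that tolerates both the `['integer', 'null']` form and simple `anyOf` variants could look something like this - `_is_integer_type` is just a made-up helper, not something in the SDK, and it assumes `self.schema` holds the stream's JSON Schema dict:
```python
def _is_integer_type(prop_schema: dict) -> bool:
    """Return True if a property's JSON Schema declares an integer type."""

    def declares_integer(schema: dict) -> bool:
        declared = schema.get("type", [])
        if isinstance(declared, str):
            declared = [declared]
        return "integer" in declared

    # Covers 'integer' and ['integer', 'null'], then falls back to anyOf.
    return declares_integer(prop_schema) or any(
        declares_integer(sub) for sub in prop_schema.get("anyOf", [])
    )


# Inside your stream class:
def post_process(self, row, context=None):
    for key, prop_schema in self.schema.get("properties", {}).items():
        if _is_integer_type(prop_schema) and row.get(key) is not None:
            row[key] = int(row[key])
    return row
```
That way a property typed as `['integer', 'null']` still gets cast, and null values pass through untouched.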