# singer-tap-development
d
Hi everyone - running into a situation where the JSON responses I get are all string types regardless of what the API docs say they're supposed to be. My worry is this is going to throw all sorts of schema validation errors when I run this with `meltano elt`. So my question is - is there a good pattern for type casting the data in transit based on the JSON schema, or do I just accept that everything is going to land in the database as strings and deal with it later?
a
As a tap developer, if you're confident that a "string" result is always going to be an int, I'd say go ahead and: 1. Type it as an int in your JSON Schema (`schema`) declaration. 2. Convert it to actually be so in `post_process()` if using the SDK.
The challenge here is "what if we're wrong" - as this would create a hard failure for users at runtime. If there's a strong chance that the string will not be convertible to "int" (for instance), then the safe route is just to be pessimistic and send the value(s) as strings. This is a moderate headache for downstream transformations, but generally not too bad in my experience, since the first layer of the dbt/transform layer is generally type casting and naming conventions anyway.
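For what it's worth, here's a rough sketch of what those two steps can look like together in an SDK stream class - the stream and field names are made up, and other required attributes like `url_base` are omitted:
```python
from singer_sdk import typing as th  # JSON Schema typing helpers
from singer_sdk.streams import RESTStream


class WidgetsStream(RESTStream):
    """Hypothetical stream; other required attributes (url_base, etc.) omitted."""

    name = "widgets"
    path = "/widgets"

    # 1. Declare the field as an integer in the JSON Schema.
    schema = th.PropertiesList(
        th.Property("user_id", th.IntegerType),
        th.Property("description", th.StringType),
    ).to_dict()

    # 2. Convert the incoming string so the record matches the declared type.
    def post_process(self, row, context=None):
        if row.get("user_id") is not None:
            row["user_id"] = int(row["user_id"])
        return row
```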
Is this helpful at all? Others probably have their own experiences and suggestions, but this 👆 is how I generally think of it.
d
This is very helpful! I only have a small follow-up question on how to access a stream's schema via the SDK objects so I can safely type cast using `post_process`. I'm going to test this out locally and see how bad it is, but otherwise I'll probably take your suggestion and just deal with it in the database.
a
We have helpers to create JSON Schema dicts, but parsing them out is a harder task which we don't have helpers for as of now. Also, because `post_process()` runs per record, you'd likely get better performance with hard-coded transforms versus a dynamic operation that determines which transformations to apply by checking each field in the schema. So, if you know which properties you want to convert to ints, for example:
```python
def post_process(self, record, context=None):
    # Cast the fields you know should be integers; skip missing/null values.
    for int_field in ["user_id", "num_widgets", "age"]:
        if record.get(int_field) is not None:
            record[int_field] = int(record[int_field])
    return record
```
d
Ah okay, makes sense. I was hoping to do something like this, but it sounds like the JSON schema isn't easily accessible from within the stream object:
```python
def post_process(self, row, context=None):
    # self.schema is the stream's JSON Schema dict
    types = self.schema
    for key in row.keys():
        if types['properties'][key]['type'] == 'integer':
            row[key] = int(row[key])
    return row
```
a
This would totally work, as long as you account for a few different ways the 'integer' type can be expressed - or if you control for it in the way you declare the schema. For instance, we often use 'type' of ['integer', 'null'] to indicate a nullable integer. And technically, there are also 'anyOf' operations.
Totally up to your discretion though. I can see this working - especially since you are the one declaring the schema - but just want to call out that parsing is sometimes messy 🙃
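If you do end up parsing the schema, a rough sketch of a check that tolerates both the `['integer', 'null']` form and simple `anyOf` variants could look something like this - `_is_integer_type` is just a made-up helper, not something in the SDK, and it assumes `self.schema` holds the stream's JSON Schema dict:
```python
def _is_integer_type(prop_schema: dict) -> bool:
    """Return True if a property's JSON Schema declares an integer type."""

    def declares_integer(schema: dict) -> bool:
        declared = schema.get("type", [])
        if isinstance(declared, str):
            declared = [declared]
        return "integer" in declared

    # Covers 'integer' and ['integer', 'null'], then falls back to anyOf.
    return declares_integer(prop_schema) or any(
        declares_integer(sub) for sub in prop_schema.get("anyOf", [])
    )


# Inside your stream class:
def post_process(self, row, context=None):
    for key, prop_schema in self.schema.get("properties", {}).items():
        if _is_integer_type(prop_schema) and row.get(key) is not None:
            row[key] = int(row[key])
    return row
```
That way a property typed as `['integer', 'null']` still gets cast, and null values pass through untouched.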