# singer-tap-development
r
Following up on this thread from a couple months ago. Anyone have any thoughts?
a
In general, I think the two choices for best practice are:

1. Pick a consistent schema that can handle the most complex data needing to be captured. So, in the case that either a string or an array of strings may be sent, conform bare string values to a 1-item string array.
2. Support both alternate formats under different (nullable) property names, so a singular string would be transmitted as `scope` in the example above and an array of strings would be transmitted as `scopes` or `scope_list`.

In both of these approaches, the target gets values that can be deterministically stored according to column type. The first option is easier to consume downstream, since there's just a single schema to handle when serializing. The second can be more efficient in terms of storage of the simpler types, but at the cost of needing to coalesce across the two columns and add logic for variant data in the downstream process. For a simple target like target-jsonl, this might seem inefficient, since JSON can easily handle variant types and doesn't have any internal optimizations that would rely on caring about data types. But for databases and targets that are columnarly compressed and tuned for big data and highly efficient IO (such as Redshift, Snowflake, and Parquet), some kind of conformed schema is necessary to provide efficient queries - and simple string columns will have different compression/tuning techniques versus variant/json/array columns. Is this helpful at all?
This is just my own perspective, to be clear. Others might have different approaches.
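For illustration, here's a minimal sketch of what option 1 might look like in a tap's `post_process` (the helper and field name below are just taken from the `scope` example, not from any real tap):
```
def conform_scope(row: dict) -> dict:
    # Option 1: conform a bare string to a 1-item string array, so every
    # emitted record matches the single declared schema.
    scope = row.get("scope")
    if isinstance(scope, str):
        row["scope"] = [scope]
    return row


assert conform_scope({"scope": "profile"}) == {"scope": ["profile"]}
assert conform_scope({"scope": ["profile", "email"]}) == {"scope": ["profile", "email"]}
```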
r
Yeah, that's super helpful @aaronsteers - thank you! In the past, we have followed the approach you described as option 1. Personally, I'm more of a fan of this way, as it means that a single resource property will only ever be represented by one field for each record, even if the resulting data type does not align with the resource exactly. Either way, both cases can lead to pretty bulky `post_process` implementations if you are trying to account for many fields of varying types. I'm wondering if it would be possible/a good idea for the SDK to be able to coerce values to types defined in a schema. For example... Say `scope` is defined in a schema like this:
```
th.Property("scope", th.ArrayType(th.StringType))
```
If the SDK receives a record with
```
{
  "scope": [
    "profile",
    "email",
    "connect"
  ]
}
```
`scope` should be processed as-is. If the SDK receives a record with
```
{
  "scope": "profile"
}
```
`scope` should be coerced to an array:
```
{
  "scope": [
    "profile"
  ]
}
```
If the SDK receives a record with
```
{
  "scope": "profile email connect"
}
```
and there is some sort of coercion mapping function defined for the `ArrayType` property:
```
th.Property("scope", th.ArrayType(th.StringType, str.split))
```
`scope` should be coerced to an array, using that function:
```
{
  "scope": [
    "profile",
    "email",
    "connect"
  ]
}
```
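To make the three cases concrete, here's a rough sketch of the coercion logic I have in mind (the `coerce_to_string_array` helper and its signature are hypothetical, not existing SDK API):
```
from __future__ import annotations

from typing import Any, Callable


def coerce_to_string_array(
    value: Any,
    mapper: Callable[[str], list[str]] | None = None,
) -> list[str]:
    # Case 1: already an array - pass it through as-is.
    if isinstance(value, list):
        return value
    # Case 3: a coercion mapping function is defined - apply it.
    if mapper is not None:
        return mapper(value)
    # Case 2: a bare scalar - wrap it in a 1-item array.
    return [value]


assert coerce_to_string_array(["profile", "email"]) == ["profile", "email"]
assert coerce_to_string_array("profile") == ["profile"]
assert coerce_to_string_array("profile email connect", str.split) == [
    "profile",
    "email",
    "connect",
]
```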
I've given this some more thought, since it didn't feel right to define coercion mappings in a schema. Maybe it works as a stream class property:
```
jsonpath_property_mappings = {
    "path.to.scope": str.split,
}
```
Although, maybe it just fits better with Meltano in general as a `mapper` plugin?
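And for concreteness, roughly how a stream might apply such mappings in `post_process` (all names here are hypothetical, and a real implementation would need proper JSONPath resolution for nested paths):
```
from typing import Any, Callable

# Hypothetical stream-class attribute: property name -> coercion function.
property_mappings: dict[str, Callable[[Any], Any]] = {
    "scope": str.split,
}


def post_process(row: dict) -> dict:
    # Apply each mapping only when the value doesn't already match the
    # declared array type.
    for prop, fn in property_mappings.items():
        if prop in row and not isinstance(row[prop], list):
            row[prop] = fn(row[prop])
    return row


assert post_process({"scope": "profile email connect"}) == {
    "scope": ["profile", "email", "connect"]
}
```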
a
> maybe it just fits better with Meltano in general as a `mapper` plugin?
We generally wouldn't want the addition of another plugin in the flow to be required for normal workloads, so if we can make this work in a nice way for all users of the tap, I think that's ideal. (Also more performant.) While handling every use case generically would probably not be feasible, I do think coercion from string to list of strings could be handled as a generic, something like the suggestion here:
> If the SDK receives a record with
> ```
> {
>   "scope": "profile email connect"
> }
> ```
> and there is some sort of coercion mapping function defined for the `ArrayType` property:
> ```
> th.Property("scope", th.ArrayType(th.StringType, str.split))
> ```
> `scope` should be coerced to an array, using that function
Performance and implementation-wise there are still some things to figure out, such as whether this should run over every record or only if validation fails. And adding validation on each record/node would have some performance implications also, although perhaps this would be minimal. I could see this being 'validation exception handlers' per node (run only if data doesn't fit) or as pre-processors (run on every record). A few other things to work out also, such as whether these should run only if data exists at the node's path, or if they would run regardless.
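For what it's worth, a rough sketch of the 'validation exception handler' flavor, assuming the jsonschema package (the handler registry and function names are hypothetical):
```
from typing import Any, Callable

from jsonschema import Draft7Validator

SCHEMA = {
    "type": "object",
    "properties": {"scope": {"type": "array", "items": {"type": "string"}}},
}
VALIDATOR = Draft7Validator(SCHEMA)

# Hypothetical per-property handlers, run only on nodes that fail validation.
EXCEPTION_HANDLERS: dict[str, Callable[[Any], Any]] = {
    "scope": lambda v: v.split() if isinstance(v, str) else [v],
}


def validate_or_coerce(row: dict) -> dict:
    # Validate once per record; coercion cost is only paid on failure.
    for error in list(VALIDATOR.iter_errors(row)):
        prop = error.path[0] if error.path else None
        handler = EXCEPTION_HANDLERS.get(prop)
        if handler is not None:
            row[prop] = handler(row[prop])
    return row


assert validate_or_coerce({"scope": "profile email connect"}) == {
    "scope": ["profile", "email", "connect"]
}
```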
r
> We generally wouldn't want the addition of another plugin in the flow to be required for normal workloads,
Agreed - it feels like a bit of an anti-pattern to implement a tap/target-specific mapper. I've opened an issue for this now. Really appreciate your thoughts and feedback, AJ - thank you! 😁