Reuben (Matatika)
07/03/2022, 1:03 AM
aaronsteers
07/03/2022, 1:37 AM
`scope` in the example above and an array of strings would be transmitted as `scopes` or `scope_list`.
In both of these approaches, the target gets values that can be deterministically stored according to column type.
The first option is easier to consume downstream, since there's just a single scheme to handle when serializing. The second can be more efficient in terms of storage for the simpler types, but at the cost of needing to coalesce across the two columns and add logic for variant data in the downstream process.
For a simple target like target-jsonl, this might seem inefficient, since JSON can easily handle variant types and has no internal optimizations that depend on data types. But for databases and targets that use columnar compression and are tuned for big data and highly efficient IO (such as Redshift, Snowflake, and Parquet), some kind of conformed schema is necessary to provide efficient queries - and simple string columns will have different compression/tuning techniques versus variant/json/array columns.
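A minimal sketch of the second option, assuming the string form is sent as `scope` and the array form as `scopes`. The helper names `split_variant` and `coalesce_scopes` are hypothetical and not part of any SDK; they only illustrate the extra coalescing logic the downstream process would need.

```python
from typing import Any, Dict, List, Optional


def split_variant(record: Dict[str, Any]) -> Dict[str, Any]:
    """Route a variant `scope` value into one of two single-typed fields."""
    value = record.get("scope")
    out = {k: v for k, v in record.items() if k != "scope"}
    # Strings stay in `scope`; arrays move to `scopes`. The unused field is null.
    out["scope"] = value if isinstance(value, str) else None
    out["scopes"] = list(value) if isinstance(value, list) else None
    return out


def coalesce_scopes(row: Dict[str, Any]) -> Optional[List[str]]:
    """Downstream logic that recombines the two columns into one list."""
    if row.get("scopes") is not None:
        return row["scopes"]
    if row.get("scope") is not None:
        return [row["scope"]]
    return None
```

Each column then holds exactly one type, which is what a columnar target can store deterministically, while `coalesce_scopes` shows the cost paid later when reading.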
Is this helpful at all?
aaronsteers
07/03/2022, 1:40 AM
Reuben (Matatika)
07/03/2022, 3:32 AM
`post_process` implementations, if you are trying to account for many fields of varying types. I'm wondering if it would be possible/a good idea for the SDK to be able to coerce values to types defined in a schema. For example...
Say `scope` is defined in a schema like this:
th.Property("scope", th.ArrayType(th.StringType))
If the SDK receives a record with
{
"scope": [
"profile",
"email",
"connect"
]
}
`scope` should be processed as-is.
If the SDK receives a record with
{
"scope": "profile"
}
`scope` should be coerced to an array:
{
"scope": [
"profile"
]
}
If the SDK receives a record with
{
"scope": "profile email connect"
}
and there is some sort of coercion mapping function defined for the `ArrayType` property
th.Property("scope", th.ArrayType(th.StringType, str.split))
`scope` should be coerced to an array, using that function:
{
"scope": [
"profile",
"email",
"connect"
]
}
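The three cases above can be sketched in plain Python. This is only an illustration of the proposed behaviour, not existing SDK functionality; `coerce_to_array` is a hypothetical helper, and the optional `coerce` argument stands in for the function passed to `ArrayType` in the proposal.

```python
from typing import Any, Callable, List, Optional


def coerce_to_array(
    value: Any,
    coerce: Optional[Callable[[str], List[str]]] = None,
) -> Any:
    """Coerce a value destined for an array-of-strings property."""
    if isinstance(value, list):
        return value  # already an array: process as-is
    if isinstance(value, str):
        # With a coercion function (e.g. str.split), let it build the array;
        # otherwise wrap the single string in a one-element array.
        return coerce(value) if coerce is not None else [value]
    return value  # leave other types (e.g. None) for schema validation


print(coerce_to_array(["profile", "email", "connect"]))     # ['profile', 'email', 'connect']
print(coerce_to_array("profile"))                           # ['profile']
print(coerce_to_array("profile email connect", str.split))  # ['profile', 'email', 'connect']
```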
Reuben (Matatika)
07/03/2022, 1:42 PM
jsonpath_property_mappings = {
"path.to.scope": str.split,
}
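A rough sketch of how a mapping like this could be applied to a record. It assumes simple dotted paths rather than full JSONPath, and `apply_path_mappings` is a hypothetical helper, not an SDK function. Only string values are transformed here, which is a choice made for the sketch so that values that are already arrays pass through untouched.

```python
from typing import Any, Callable, Dict


def apply_path_mappings(
    record: Dict[str, Any],
    mappings: Dict[str, Callable[[Any], Any]],
) -> Dict[str, Any]:
    """Apply each mapping function to the value at its dotted path, in place."""
    for path, func in mappings.items():
        *parents, leaf = path.split(".")
        node: Any = record
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        # Only transform string values that actually exist at the path.
        if isinstance(node, dict) and isinstance(node.get(leaf), str):
            node[leaf] = func(node[leaf])
    return record


record = {"path": {"to": {"scope": "profile email connect"}}}
apply_path_mappings(record, {"path.to.scope": str.split})
print(record)  # {'path': {'to': {'scope': ['profile', 'email', 'connect']}}}
```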
Although, maybe it just fits better with Meltano in general as a `mapper` plugin?
aaronsteers
07/05/2022, 6:13 PM
> maybe it just fits better with Meltano in general as a `mapper` plugin?
We generally wouldn't want the addition of another plugin in the flow to be required for normal workloads, so if we can make this work in a nice way for all users of the tap, I think that's ideal. (Also more performant.) While handling every use case generically would probably not be feasible, I do think coercion from string to list of strings could be handled generically, something like the suggestion here:
> If the SDK receives a record with
> ```{
> "scope": "profile email connect"
> }```
> and there is some sort of coercion mapping function defined for the `ArrayType` property
> th.Property("scope", th.ArrayType(th.StringType, str.split))
> `scope` should be coerced to an array, using that function
Performance and implementation-wise there are still some things to figure out, such as whether this should run over every record or only if validation fails. And adding validation on each record/node would have some performance implications also, although perhaps this would be minimal. I could see this being 'validation exception handlers' per node (run only if data doesn't fit) or as pre-processors (run on every record). A few other things to work out also, such as whether these should run only if data exists at the node's path, or if they would run regardless.
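To make the two strategies concrete, here is a small sketch under stated assumptions: `is_string_array` is a stand-in for real schema validation, and `as_preprocessor` and `as_exception_handler` are hypothetical names, not SDK hooks. Both variants here run only when the node exists in the record, which is one of the open questions rather than a settled choice.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Any], Any]


def is_string_array(value: Any) -> bool:
    """Stand-in for schema validation of an array-of-strings node."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def as_preprocessor(record: Dict[str, Any], key: str, handler: Handler) -> None:
    """Run the handler on every record where the node exists."""
    if key in record:
        record[key] = handler(record[key])


def as_exception_handler(record: Dict[str, Any], key: str, handler: Handler) -> None:
    """Run the handler only when the node exists and fails validation."""
    if key in record and not is_string_array(record[key]):
        record[key] = handler(record[key])
```

With the exception-handler form, a plain `str.split` is enough for string input, since arrays that already validate never reach it. The pre-processor form needs a handler that tolerates both shapes, for example `lambda v: v.split() if isinstance(v, str) else v`, and pays the handler call on every record.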
Reuben (Matatika)
07/05/2022, 11:08 PM
> We generally wouldn't want the addition of another plugin in the flow to be required for normal workloads,
Agreed - it feels like a bit of an anti-pattern to implement a tap/target-specific mapper. I've opened an issue for this now. Really appreciate your thoughts and feedback, AJ - thank you! 😁
Reuben (Matatika)
11/30/2022, 11:28 PM