# singer-tap-development
s
Hello, we're encountering problems while handling a nested field that is not present in every record. We're using `typing` and building a schema with `PropertiesList` for a stream named "orders". The field raising the warning is named "OrderTotal" and has two child entries, "CurrencyCode" and "Amount". Our structure is the following:
schema = th.PropertiesList(
    th.Property("...", th.StringType),
    ...
    th.Property("OrderTotal", th.ObjectType(
        th.Property("CurrencyCode", th.StringType),
        th.Property("Amount", th.NumberType)
    )),
    ...
)
When we run the tap, we receive the following two warnings, once per run:
time=2021-06-23 13:30:20 name=tap-test level=WARNING message=Catalog entry missing for 'orders':'('properties', 'OrderTotal', 'properties', 'CurrencyCode')'. Using parent value of selected=True.
time=2021-06-23 13:30:20 name=tap-test level=WARNING message=Catalog entry missing for 'orders':'('properties', 'OrderTotal', 'properties', 'Amount')'. Using parent value of selected=True.
If we create an ELT with `target-bigquery`, the table row is correctly created with a nested field, but both "CurrencyCode" and "Amount" are NULL. With `target-csv` we see the "flattened" values, but when the nested field is not present there are no NULL values and the fields get skipped entirely, breaking the CSV. Even using a static JSON schema file gives the same result:
{
    ...,
    "OrderTotal": {
        "type": ["object", "null"],
        "properties": {
            "CurrencyCode": {"type": ["string", "null"]},
            "Amount": {"type": ["number", "null"]}
        }
    },
    ...
}
Could you help us identify the underlying issue?
a
However, that issue is mostly related to the excessive warning logs. In your case, it seems the selection itself is not functioning as expected. I'm taking a closer look now...
After closer inspection of the symptoms you describe, it looks like there might be a mismatch between the schema and record structure.
> If we create an ELT with "target-bigquery" the table row is correctly created with a nested field
This indicates the schema is being passed correctly.
> but both "CurrencyCode" and "Amount" are NULL...
The differences in the CSV loading may be an issue/symptom of target-csv itself. @stefano_nicolai - Can you try with target-jsonl and let me know if you get the correct structure in the jsonl output?
Also - in case of a mismatch of record and schema, you should see a message like:
"Property '{property_name}' was present in the '{stream_name}' stream but not found in catalog schema. Ignoring."
Can you check your logs for a substring of this text, to see if there are any cases of unmapped properties when mapping record data to the schema?
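For example, a check along these lines should surface any such messages (the sample log line and the `tap.log` file name here are illustrative only; substitute your captured tap stderr):

```shell
# Write a one-line sample log as a stand-in for the captured tap stderr
# ("tap.log" is a hypothetical file name)
printf '%s\n' "time=2021-06-28 08:18:48 name=tap-amazon level=WARNING message=Property 'Foo' was present in the 'orders' stream but not found in catalog schema. Ignoring." > tap.log

# Look for the unmapped-property message
grep "was present in the" tap.log

# Broader sweep for the related keywords
grep -iE "Property|Ignoring" tap.log
```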
s
Hi @aaronsteers, thank you for the quick reply. We did more testing, following your suggestions, and our findings are the following:
- there is no message like "Property '{property_name}' was present in the '{stream_name}' [...]" in the logs, nor any "Property" or "Ignoring" keywords
- the jsonl output is consistent with the schema and the records, with a 1-to-1 match of the source values

We think the main problem is the lack of the optional nested field "OrderTotal" in some of the source records, and how the tap handles them. If the tap replicates the source structure, hence entirely skipping some values, the schema is not respected, and this may cause all the problems we are encountering (column mismatch in the CSV, correct schema but with null values in BigQuery). What are your thoughts about this? Should we tackle this problem on our side, or is there any function/procedure already in place that we may have overlooked? If it can be of any help, we can share an anonymized version of the whole log or a sub-sample of the schema and records produced.
a
@stefano_nicolai - Thanks for getting back to us. I'm not entirely sure I understand this part, or what might cause that behavior:
> We think the main problem is the lack of the optional nested field "OrderTotal" in some of the source records, and how the tap is handling them.
And yes, to your question, it would be very helpful to see the schema definition and the logic that populates a record. Can you share those here or in an issue in the SDK issue tracker? (If this is a public repo, you can also share a link.)
s
@aaronsteers thanks for the reply, and apologies for the late answer. We don't have a public repo yet, but we can share a portion of the log to highlight the issue. As you can see, the third order (xx3-xxxxxxx-xxxxxxx) we receive from the Amazon API is missing the whole "OrderTotal" field, including the two nested properties "CurrencyCode" and "Amount". This produces the warnings and the errors we are encountering both with target-csv and target-bigquery (pipelinewise; apparently the adswerve version handles this correctly, but has other bugs).
time=2021-06-28 08:18:48 name=tap-amazon level=INFO message=tap-amazon v0.0.1, Singer SDK v0.2.0)
time=2021-06-28 08:18:48 name=tap-amazon level=INFO message=Skipping parse of env var settings...
time=2021-06-28 08:18:48 name=tap-amazon level=INFO message=Config validation passed with 0 errors and 0 warnings.
time=2021-06-28 08:18:48 name=tap-amazon level=INFO message=Beginning incremental sync of 'orders'...
time=2021-06-28 08:18:48 name=tap-amazon level=WARNING message=Catalog entry missing for 'orders':'('properties', 'OrderTotal', 'properties', 'CurrencyCode')'. Using parent value of selected=True.
time=2021-06-28 08:18:48 name=tap-amazon level=WARNING message=Catalog entry missing for 'orders':'('properties', 'OrderTotal', 'properties', 'Amount')'. Using parent value of selected=True.
{"type": "SCHEMA", "stream": "orders", "schema": {"properties": {"AmazonOrderId": {"type": ["string", "null"]}, "PurchaseDate": {"format": "date-time", "type": ["string", "null"]}, "LastUpdateDate": {"format": "date-time", "type": ["string", "null"]}, "OrderStatus": {"type": ["string", "null"]}, "SellerOrderId": {"type": ["string", "null"]}, "FulfillmentChannel": {"type": ["string", "null"]}, "SalesChannel": {"type": ["string", "null"]}, "ShipServiceLevel": {"type": ["string", "null"]}, "OrderTotal": {"properties": {"CurrencyCode": {"type": ["string", "null"]}, "Amount": {"type": ["number", "null"]}}, "type": ["object", "null"]}, "NumberOfItemsShipped": {"type": ["integer", "null"]}, "NumberOfItemsUnshipped": {"type": ["integer", "null"]}, "PaymentMethod": {"type": ["string", "null"]}, "PaymentMethodDetails": {"items": {"type": ["string"]}, "type": ["array", "null"]}, "IsReplacementOrder": {"type": ["boolean", "null"]}, "MarketplaceId": {"type": ["string", "null"]}, "ShipmentServiceLevelCategory": {"type": ["string", "null"]}, "OrderType": {"type": ["string", "null"]}, "EarliestShipDate": {"format": "date-time", "type": ["string", "null"]}, "LatestShipDate": {"format": "date-time", "type": ["string", "null"]}, "IsBusinessOrder": {"type": ["boolean", "null"]}, "IsSoldByAB": {"type": ["boolean", "null"]}, "IsPrime": {"type": ["boolean", "null"]}, "IsGlobalExpressEnabled": {"type": ["boolean", "null"]}, "IsPremiumOrder": {"type": ["boolean", "null"]}, "IsISPU": {"type": ["boolean", "null"]}}, "type": "object"}, "key_properties": ["AmazonOrderId"], "bookmark_properties": ["LastUpdateDate"]}
{"type": "RECORD", "stream": "orders", "record": {"AmazonOrderId": "xx1-xxxxxxx-xxxxxxx", "PurchaseDate": "2021-01-01T00:00:00Z", "LastUpdateDate": "2021-01-01T00:00:00Z", "OrderStatus": "Shipped", "SellerOrderId": "xx1-xxxxxxx-xxxxxxx", "FulfillmentChannel": "XXX", "SalesChannel": "Amazon.com", "ShipServiceLevel": "Expedited", "OrderTotal": {"CurrencyCode": "EUR", "Amount": "10.01"}, "NumberOfItemsShipped": 1, "NumberOfItemsUnshipped": 0, "PaymentMethod": "Other", "PaymentMethodDetails": ["Standard"], "IsReplacementOrder": true, "MarketplaceId": "XXXXXXXXXXXXX", "ShipmentServiceLevelCategory": "Expedited", "OrderType": "StandardOrder", "EarliestShipDate": "2021-01-01T00:00:00Z", "LatestShipDate": "2021-01-01T00:00:00Z", "IsBusinessOrder": false, "IsSoldByAB": false, "IsPrime": false, "IsGlobalExpressEnabled": false, "IsPremiumOrder": false, "IsISPU": false}, "time_extracted": "2021-01-01T00:00:00Z"}
{"type": "STATE", "value": {"bookmarks": {"ord…
Probably our best approach is to write a recursive function that makes sure every key in the SCHEMA is present inside the RECORD, creating NULL values where needed, making our tap independent from the target and hence not breaking the CSV.
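A minimal sketch of that idea (the function name and the choice to expand a null parent into a full tree of null leaves are ours, not part of the SDK): walk the schema's `properties` recursively and insert explicit `None` values for any key the record omits.

```python
def fill_missing(record: dict, schema: dict) -> dict:
    """Add explicit None values for schema properties absent from the record.

    Sketch only; schema is a plain JSON Schema dict as emitted in the
    SCHEMA message. Mutates and returns the record.
    """
    for name, prop in schema.get("properties", {}).items():
        prop_type = prop.get("type", [])
        types = [prop_type] if isinstance(prop_type, str) else prop_type
        if name not in record:
            record[name] = None
        if "object" in types and isinstance(record.get(name), dict):
            # Recurse into nested objects so leaf keys also get filled
            fill_missing(record[name], prop)
        elif "object" in types and record.get(name) is None and prop.get("properties"):
            # Expand a null parent into a full tree of null leaves
            # (one of the two options discussed; adjust to taste)
            record[name] = fill_missing({}, prop)
    return record


schema = {"properties": {
    "AmazonOrderId": {"type": ["string", "null"]},
    "OrderTotal": {"type": ["object", "null"], "properties": {
        "CurrencyCode": {"type": ["string", "null"]},
        "Amount": {"type": ["number", "null"]}}}}}

record = {"AmazonOrderId": "xx3-xxxxxxx-xxxxxxx"}  # "OrderTotal" omitted at source
fill_missing(record, schema)
# record["OrderTotal"] is now {"CurrencyCode": None, "Amount": None}
```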
a
@stefano_nicolai - It sounds like that approach would probably meet your objective. Along the way in this discussion, I've been trying to identify whether this is a bug in the SDK or in the target - it seems the SDK is working correctly, inasmuch as it is meeting the spec and sending downstream data that at least some targets can understand. So this is not necessarily a bug in the SDK but, that said, the SDK generates output which at least 2 targets did not expect (read: "do not deal well with") - namely fully empty/omitted objects in the stream. Perhaps this is a symptom not just of a missing leaf property but specifically of a missing item which is itself a parent to other properties? Based on your exploration, does it sound like this paragraph 👆 properly describes the issue? If yes, I do think the fix/workaround you describe would be effective, and I think it's worth also considering as a permanent contribution back into the SDK. Perhaps it would be expressed like:
Ensure null parent properties are always created.
or going all the way to the leaf items:
Full property tree is created even when nodes are empty/null.
@douwe_maan - FYI, in case you have additional insight on this. Updating the output could be seen as "incorrect" when those mid-level nodes simply do not exist, but targets and pipelines are generally focused on the leaf properties anyway, so it might still be preferable to create the full record structure in order to improve downstream parsing.
d
I wonder if it makes a difference for these targets whether `OrderTotal` is missing entirely or whether it is present with a `null` value. The schema allows `null` as an alternative to `object`, and I'd expect a missing key to be interpreted the same, but perhaps these targets care. Either way, it's technically a bug in the targets (and I think it deserves an issue there, especially if it's fixed in the adswerve variant but not the pipelinewise ones), but if the SDK can do something extra (that isn't too expensive) to fill in the missing fields with explicit `null`s, that could make sense.
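To make that distinction concrete (a toy check we wrote for illustration, not how any particular target is implemented): JSON Schema's `type` union is only evaluated for keys that are actually present, so a record with an explicit null and a record with the key missing are both schema-valid, yet a target that iterates over record keys sees two different shapes.

```python
# Minimal stand-in for a JSON Schema "type" check (illustration only)
JSON_TYPES = {"object": dict, "string": str, "null": type(None)}

def value_matches(value, allowed_types):
    """True if value's Python type matches any entry in the type union."""
    return any(isinstance(value, JSON_TYPES[t]) for t in allowed_types)

allowed = ["object", "null"]       # declared type of "OrderTotal"
with_null = {"OrderTotal": None}   # key present, value validates against "null"
missing = {}                       # key absent: JSON Schema imposes no constraint

assert value_matches(with_null["OrderTotal"], allowed)
assert "OrderTotal" not in missing  # both records are valid, but a target that
                                    # walks record keys only ever sees one shape
```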
s
Thank you for your answer @aaronsteers. While I fully agree with @douwe_maan that a missing nullable node should be handled by the targets, I don't think it would hurt to have a function that adds null on each missing leaf to ensure maximum compatibility with all the different available targets. I wonder if this approach should belong to the transformation layer, since it can be seen as an operation on the data source. FYI, we tested target-csv while having a null value just for the main node, but it still fails to create the null entries for the two leaves.