Meltano #singer-tap-development

steve_clarke

11/21/2023, 6:29 PM

In tap-rest-api-msdk there is a section in the tap that dynamically discovers a schema based on the data returned from the API: https://github.com/Widen/tap-rest-api-msdk/blob/4f87c1adae00446388ebbe418c70b87c231856dc/tap_rest_api_msdk/tap.py#L532-L609. If the data returned by the API is consistent, i.e. every record has the same set of attributes, the dynamic schema discovery works well. If, however, the API has optional attributes that only appear in certain records, it can lead to data loss. Specifically, I have seen the random sampling of a certain number of records miss some attributes when determining the schema. When the tap is emitting records, Meltano will ignore any attributes that weren't part of the original schema. One way to solve this is to supply a schema manually and skip the automatic discovery. The problem is that if a new attribute is later added to the API and this is missed, the same data loss occurs. So I am wondering: if an attribute is found to be missing from the schema, is there a way to handle it gracefully, e.g. by issuing a further schema message to alter the table (I am assuming a database target, Snowflake) so that the attribute is not lost? Keen to get some thoughts on this as I really love the dynamic discovery. For now I am emitting the raw JSON record and unpacking it in dbt, but I thought there must be a better way.
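A minimal sketch of the failure mode described above, assuming discovery takes the union of top-level keys seen in a sample and that records are later trimmed to the discovered schema. The helpers `infer_properties` and `conform` are hypothetical names for illustration, not functions from tap-rest-api-msdk or the Meltano SDK.

```python
from typing import Any


def infer_properties(sample: list[dict[str, Any]]) -> set[str]:
    """Return the union of top-level keys seen in the sampled records."""
    keys: set[str] = set()
    for record in sample:
        keys.update(record)
    return keys


def conform(record: dict[str, Any], properties: set[str]) -> dict[str, Any]:
    """Keep only attributes named in the schema, dropping everything else."""
    return {key: value for key, value in record.items() if key in properties}


records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c", "discount": 0.1},  # optional attribute, rarely present
]

# Assumption: discovery only happened to sample the first two records.
schema_props = infer_properties(records[:2])

# The optional "discount" attribute is silently dropped from the third record.
emitted = [conform(record, schema_props) for record in records]
```

Sampling more records lowers the chance of missing an optional attribute but cannot rule it out, which is why the thread goes on to discuss enveloping and a strict mode.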

visch

11/21/2023, 10:02 PM

We messed with an idea we called "enveloping", where you wrap the record in an object. See https://github.com/MeltanoLabs/tap-universal-file and its jsonl_type_coercion_strategy setting.

There are negatives to a fully dynamic schema, but sometimes it's what you want.
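A rough sketch of the enveloping idea, assuming the stream declares a single object-typed property and every source record is nested under it. The property name `record` and the helpers `envelope` and `record_message` are illustrative choices here, not taken from tap-universal-file.

```python
import json
from typing import Any

# Fixed schema: one object-typed property that accepts arbitrary keys,
# so new or optional source attributes never fall outside the schema.
ENVELOPE_SCHEMA = {
    "type": "object",
    "properties": {
        "record": {"type": "object", "additionalProperties": True},
    },
}


def envelope(record: dict[str, Any]) -> dict[str, Any]:
    """Wrap the raw source record under a single top-level property."""
    return {"record": record}


def record_message(stream: str, record: dict[str, Any]) -> str:
    """Serialize a Singer-style RECORD message carrying the enveloped record."""
    return json.dumps(
        {"type": "RECORD", "stream": stream, "record": envelope(record)}
    )


message = record_message("orders", {"id": 3, "name": "c", "discount": 0.1})
```

The trade-off is that the target receives one semi-structured column, so unpacking moves downstream (for example into dbt), much like the raw JSON approach mentioned earlier in the thread.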

steve_clarke

11/21/2023, 10:08 PM

Thanks @visch, I will take a look at this. It is one of those things where statically defining a schema can lead to missing fields because you haven't been informed a new one is introduced, and dynamic schema generation can miss stuff because certain fields are optional and your sample has missed an optional field. To quote Eleanor Roosevelt: “Do what you feel in your heart to be right - for you'll be criticized anyway. You'll be damned if you do, and damned if you don't.”

visch

11/21/2023, 10:10 PM

"because you haven't been informed a new one is introduced"

Generally with Meltano's select and a good dynamic query tool you're fine (it works all the time with databases).

"dynamic schema generation can miss stuff because certain fields are optional and your sample has missed an optional field."

Yeah, this one is brutal for some APIs. Luckily most don't have you guessing too much. It's also why I lean towards statically defined schemas for APIs as well. See ClickUp, which doesn't define too much in their API docs: even though we statically defined things and miss some fields once in a while, adding a field now and then isn't a huge issue.

steve_clarke

11/21/2023, 10:25 PM

Let's say I statically define a schema - it will fix the optional fields. In a few months, the API vendor introduces a new field while I am doing a select *, and it will miss data where that field is being populated. In this scenario I'm incrementally ingesting hundreds of thousands of records on a daily basis, so it is likely to matter if I only find this out later on. I wonder if we could optionally have the pipeline fail if it detects a record being emitted with a field that wasn't in the schema. This would give you an opportunity to fix the schema and restart the process. Currently it is just a warning in the Meltano log.

visch

11/22/2023, 12:51 AM

We could easily add a "strict" mode in the SDK instead of just warning here: https://github.com/meltano/sdk/blob/main/singer_sdk/helpers/_typing.py#L385-L388

steve_clarke

11/22/2023, 11:30 PM

Thank you for this explanation, Derek. @edgar_ramirez_mondragon, what are your thoughts about this topic and perhaps having an option for a strict mode in the SDK? I do believe strict mode should be an option, and that it is up to the tap developer to choose whether unmapped_properties should error or warn. My thought is that the default should perhaps be to warn. There could be a setting, set in the tap by the developer, that follows a different conditional path in the code and errors if unmapped properties are discovered.
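A sketch of what the proposed warn-or-error switch might look like, written as standalone Python rather than against the SDK. The `strict` flag, the `UnmappedPropertiesError` class, and both helper functions are hypothetical names for illustration. The sketch only checks top-level properties, whereas a real implementation would presumably also need to walk nested objects.

```python
import logging
from typing import Any

logger = logging.getLogger("strict_mode_sketch")


class UnmappedPropertiesError(Exception):
    """Hypothetical error for record properties that are missing from the schema."""


def find_unmapped(record: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """List top-level record keys that the schema's properties do not declare."""
    known = schema.get("properties", {})
    return sorted(key for key in record if key not in known)


def handle_unmapped(
    record: dict[str, Any],
    schema: dict[str, Any],
    *,
    strict: bool = False,  # assumed setting name; the default keeps warn-only behaviour
) -> list[str]:
    """Warn about unmapped properties, or raise when strict mode is enabled."""
    unmapped = find_unmapped(record, schema)
    if not unmapped:
        return unmapped
    message = f"Properties {unmapped} were present in the record but not in the schema."
    if strict:
        raise UnmappedPropertiesError(message)
    logger.warning(message)
    return unmapped
```

With `strict=False` the pipeline keeps running and only logs, matching the current behaviour described in the thread. With `strict=True` the first record carrying an undeclared field stops the run so the schema can be fixed before more data is lost.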

edgar_ramirez_mondragon

11/22/2023, 11:53 PM

By all means do log a feature request! Is the idea that as a user I may benefit by catching when a new field was introduced to the source's response so I can act accordingly (either nag the tap developer to update the schema or use a catalog override)?

steve_clarke

11/23/2023, 12:01 AM

Yes, those are my thoughts. My concern is about data loss, as we would be missing a newly introduced field which may be important. I will log a feature request - thank you. I am happy to have a go at creating a PR for it if it is approved, but may need some guidance on the name of the variable you would test to decide whether to error rather than warn.

edgar_ramirez_mondragon

11/23/2023, 6:22 PM

fyi: https://github.com/meltano/sdk/issues/2068
