# troubleshooting
c
Hey there, I’m fairly new to the Singer ecosystem, and I’m starting to go crazy with my setup, so a little help would be welcome. I am trying to use tap-github (singer-io variant) with the target-s3 loader. My tap-github config lists several repositories; the tap iterates over each repository and writes the schema and the records to stdout (Singer spec, duh). But I noticed something odd in the tap's output: after each repo iteration the schema changes slightly, in the ordering of the elements of the type list of each property. Here’s the gist of what’s changing:
{"type":"SCHEMA","stream":"issues","schema":{"selected":true,"properties":{"url":{"type":["string","null"]}},"type":["object","null"]},"key_properties":["id"]}
v.s.
{"type":"SCHEMA","stream":"issues","schema":{"selected":true,"properties":{"url":{"type":["null","string"]}},"type":["null","object"]},"key_properties":["id"]}
The problem is that during the loader execution, the loader reports that the schema changed, and ends up writing only the records that come after the last SCHEMA message found in stdout.
root@210c2ae8fa43:/meltano# cat foo.json | poetry run meltano invoke target-s3
2023-08-03T12:26:31.457374Z [info ] Environment 'dev' is active
2023-08-03 12:26:33,786 Target 'target-s3' is listening for input from tap.
2023-08-03 12:26:33,786 Initializing 'target-s3' target sink...
2023-08-03 12:26:33,786 Initializing target sink for stream 'issues'...
2023-08-03 12:26:34,474 Schema has changed for stream 'issues'. Mapping definitions will be reset.
2023-08-03 12:26:34,474 Schema or key properties for 'issues' stream have changed. Initializing a new 'issues' sink...
2023-08-03 12:26:34,474 Initializing 'target-s3' target sink...
2023-08-03 12:26:34,474 Initializing target sink for stream 'issues'...
Does this ring a bell for anyone, and would you have a clue how to fix this stupid list ordering issue?
v
Sounds like a bug in the target to me. You could try a target like target-postgres to see whether it handles that alright or not. Also, a different github variant may not swap schemas on you, which would bypass the target bug 🤷
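Another stopgap, until either side is fixed, could be to normalize the SCHEMA messages between the tap and the target. A minimal sketch, assuming it is acceptable to sort every list-valued `type` keyword so that repeated SCHEMA messages serialize identically; the function names and the script name used below are hypothetical, not part of any tap or target:

```python
import json


def sort_type_lists(node):
    """Recursively sort list-valued "type" keywords inside a JSON schema."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            is_type_list = (
                key == "type"
                and isinstance(value, list)
                and all(isinstance(item, str) for item in value)
            )
            # Only lists of type names are reordered; everything else recurses.
            result[key] = sorted(value) if is_type_list else sort_type_lists(value)
        return result
    if isinstance(node, list):
        return [sort_type_lists(item) for item in node]
    return node


def normalize_line(line):
    """Rewrite SCHEMA messages in canonical form; pass other lines through."""
    try:
        message = json.loads(line)
    except json.JSONDecodeError:
        return line  # not JSON (e.g. stray log output): leave untouched
    if not isinstance(message, dict) or message.get("type") != "SCHEMA":
        return line  # RECORD, STATE, etc. are forwarded as-is
    message["schema"] = sort_type_lists(message.get("schema", {}))
    return json.dumps(message) + "\n"


def normalize_stream(lines):
    """Lazily normalize an iterable of Singer message lines."""
    for line in lines:
        yield normalize_line(line)
```

Saved as a script (say `normalize_schemas.py`, a made-up name) that ends with `import sys` and `sys.stdout.writelines(normalize_stream(sys.stdin))`, it could sit in the pipe: `cat foo.json | python normalize_schemas.py | poetry run meltano invoke target-s3`. This only hides the reordering; it does not explain why the tap emits it.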
u
Yeah, that sounds like something the SDK should try to handle. Right now it's doing a dict comparison, which probably isn't accounting for the ordering of sub-lists: https://github.com/meltano/sdk/blob/6ce7eabcf7dd29c1aebf6d97bdd04d63b8105a4c/singer_sdk/target_base.py#L172C52-L172C52. I wonder if something like https://github.com/seperman/deepdiff could help handle ordering differences. cc @edgar_ramirez_mondragon
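To illustrate the idea without pulling in a dependency, here is a rough sketch of an order-insensitive schema comparison. It is not the SDK's code: `canonical`, `schemas_equivalent`, and the choice of which keywords count as unordered are all assumptions made for the example. (deepdiff's `ignore_order=True` option targets the same kind of comparison.)

```python
import json

# Assumption: these JSON Schema keywords hold lists whose order and
# duplicates carry no meaning. Other lists keep their order.
UNORDERED_KEYWORDS = {"type", "required", "enum"}


def canonical(node, parent_key=None):
    """Return a copy of the schema with unordered lists deduplicated and sorted."""
    if isinstance(node, dict):
        return {key: canonical(value, key) for key, value in node.items()}
    if isinstance(node, list):
        items = [canonical(item) for item in node]
        if parent_key in UNORDERED_KEYWORDS:
            # Key each item by its serialized form so that any JSON value
            # can be deduplicated and sorted deterministically.
            unique = {json.dumps(item, sort_keys=True): item for item in items}
            return [unique[key] for key in sorted(unique)]
        return items
    return node


def schemas_equivalent(old, new):
    """Compare two schemas while ignoring the order of unordered lists."""
    return canonical(old) == canonical(new)


old = {
    "selected": True,
    "properties": {"url": {"type": ["string", "null"]}},
    "type": ["object", "null"],
}
new = {
    "selected": True,
    "properties": {"url": {"type": ["null", "string"]}},
    "type": ["null", "object"],
}

plain_equal = old == new                      # plain dict comparison
relaxed_equal = schemas_equivalent(old, new)  # order-insensitive comparison
```

With the two schemas from the original message, the plain dict comparison reports a difference while the relaxed one does not. A real change, such as `"string"` becoming `"integer"`, is still detected.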
c
Yeah, the SDK should probably handle this situation better, definitely! But I still don’t really understand why the tap would print unordered schemas, as they come from hardcoded JSON files afaik. Should I open an issue on the tap repo to ask them?
e
> Should I open an issue on the tap repo to ask them?
Yeah, I think it's worth asking folks over there.
> Yeah, the SDK should probably handle this situation better, definitely!
Worth logging an issue. I like Pat's proposed solution since it seems to be able to ignore order and duplicates in lists, but I'd worry about performance (probably not a big deal!)
v
fwiw I'm pretty sure both the tap and target here aren't built with the sdk
e
Ah, that's right
u
I think target-s3 is SDK based
v
nice looks like they revamped that target since last time I dove in!