# singer-tap-development
emcp:
A question on contexts in taps: I saw the suggestion that they're for when you have, say, shards of a dataset, and I think I have that situation now. I have an endpoint that I have to query with all permutations of the alphabet from 1 to 5 letters; it then compiles a response for that set of letters, and the whole thing takes about 24 hours to run. I see it running now, but I have coded the calculation of the permutations and the entire set of queries to run within a single "Tap" call. I'm wondering: are there good examples of this use of context for queries? Should I lean towards splitting up my tap by using these alphabet permutations as the "slice" of the dataset's response? Basically, I either hand-squash the results, or push them to PostgreSQL via Python by hand (until just now, when I started using my tap), and I get rid of duplicates myself. But maybe I can start to just let the context calls send the raw response and let PostgreSQL handle the rest? Please excuse any ignorance; I'm a bit of a noob on PostgreSQL and the data engineering side, but super happy to see my tap going. I will wait and see how it goes; it takes about 24 hours to finish.
aaronsteers (AJ):
There are a couple of layers to this question, and it's a great topic.

1. Ignoring orchestration for a second, on the tap stream side you can override the `Stream.partitions` property to return the list of shards you'd like to be able to run as distinct contexts. That would have your tap keep a separate STATE tracker for each partition, and pass the partition dict as `context` to any method that supports it.
2. Also on the tap side, you can accept a config option that's a list of one or more partitions to run. When omitted, presumably, you'd run all the partitions defined above; when set by the user, you'd run only the partition(s) requested.
3. On the orchestration side (for example, from Meltano), you would kick off any number of runners, each with a distinct partition config being passed. For instance, if there are 50 partitions, I could send a list of 5 partitions each to 10 invocations, or 10 partitions each to 5 invocations. You could also send 1 partition each to 50 invocations.
4. Lastly, you'd need a plan to manage state: either keep separate state tracking for each of the 5 or 10 invocations you'd run, or come up with a more advanced way of merging the states back together afterwards.

Does this help at all? This would also be a good discussion for #C01QS0RV78D if you are free tomorrow.
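To make points 1 and 2 concrete, here is a minimal sketch assuming a singer-sdk REST stream. The stream name, URL, config key (`letter_sets`), and partition key (`letter_set`) are hypothetical placeholders rather than anything from the actual tap:

```python
# Minimal sketch of points 1 and 2, assuming a singer-sdk REST stream.
# The stream name, URL, config key ("letter_sets"), and partition key
# ("letter_set") are hypothetical placeholders.
from typing import Any, List, Optional

from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class LetterSetStream(RESTStream):
    name = "my_table"
    url_base = "https://api.example.com"
    path = "/lookup"
    primary_keys = ["id"]

    schema = th.PropertiesList(
        th.Property("id", th.StringType),
        th.Property("letter_set", th.StringType),
    ).to_dict()

    @property
    def partitions(self) -> List[dict]:
        # Point 1: one context dict per shard. The SDK keeps a separate
        # STATE bookmark for each entry and passes it as `context` below.
        all_sets = ["ab", "cd", "ef"]  # ...in reality, all 83,681 letter sets
        # Point 2: optionally let the user narrow the run via config.
        requested = self.config.get("letter_sets") or all_sets
        return [{"letter_set": s} for s in requested]

    def get_url_params(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> dict:
        # Each request covers only the letters of its own partition.
        return {"letters": context["letter_set"]}
```

With something like this in place, a config value such as `{"letter_sets": ["ab", "cd"]}` would run only those two partitions, while omitting it would run them all.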
e
You absolutely nailed the question, AJ. I will dive into these things in the docs. If you have one favorite tap in mind which exhibits the behaviors you've talked about, I'd love to get the name and I can review the source from there; otherwise, I will finish watching the Athena tap video later this week and come back to this. But it sounds like a plan.
aaronsteers (AJ):
@emcp Awesome. Unfortunately, I am not aware of any great sharding/slicing samples out in the wild. If you do build one, I'd love to showcase it as a sample. Also, like I said above, this would make a fantastic #C01QS0RV78D discussion if you wanted to join us live, before or after you start building it. 🙂
emcp:
I just finished work, so I need to go out and fetch some things today, but let me make some progress educating myself on your pointers, and hopefully next office hours can be really fruitful once I get the easy stuff sorted.
Small update @aaronsteers: I did a Google search and found the following taps; I will take a look, but if there are other areas, I will also bring those in. I find that the manual config option just won't scale for me, since we're talking about a static list of
```
26 choose 1 + 26 choose 2 + 26 choose 3 + 26 choose 4 + 26 choose 5

or

83681 different queries
```
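For reference, that count checks out; a quick, purely illustrative check in Python:

```python
# Sum of binomial coefficients C(26, k) for k = 1..5.
import math

print(sum(math.comb(26, k) for k in range(1, 6)))  # 83681
```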
Maybe I can try splitting it up, say, querying the first half in one partition and the second half in a second partition, as a start. I did notice that data doesn't get written until the completion of the transaction, and if there's any error, the whole tap run basically gets wiped out, which to me signals that I will improve performance/reliability by splitting it up even more.
Okay, I think I've found the SDK docs about this: https://sdk.meltano.com/en/latest/partitioning.html. I will try it now. I'm still quite a noob, but I guess this is done in `streams.py`?
Ah, I got it... so in `streams.py` I've added:
```python
partitions = [{"dictkey": "dict1value1"},
              {"dictkey": "dict1value2"}]
```
and in the client I can now see this if I print it out:

```python
print("stream context = " + str(context))
```

which outputs:

```
stream context = {'dictkey': 'dict1value1'}
```
So now it's simply up to me to arbitrarily choose how to split this up into partitions! Okay, great.
Simplistically, this can then start to look like:
```python
partitions = [{"dictkey": ["a", "b"]},
              {"dictkey": ["c", "d"]}]
```
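Rather than hand-writing the groups, the full set of 1-to-5-letter combinations could also be generated programmatically and then batched into however many partitions seem manageable. A rough sketch; the key name `alphabet_partition` and the group size are illustrative choices, not anything prescribed by the SDK:

```python
# Generate every 1-to-5-letter combination, then group them into partitions.
import itertools
import string

letter_sets = [
    "".join(combo)
    for k in range(1, 6)
    for combo in itertools.combinations(string.ascii_lowercase, k)
]  # 83,681 letter sets in total

GROUP_SIZE = 1000  # how many letter sets each partition should cover
partitions = [
    {"alphabet_partition": letter_sets[i : i + GROUP_SIZE]}
    for i in range(0, len(letter_sets), GROUP_SIZE)
]
```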
I tried invoking it now from the GUI, but in the logs I saw this warning:
```
tap-custom        | time=2021-09-12 10:39:00 name=tap-custom level=WARNING message=Property 'alphabet_partition' was present in the 'my_table' stream but not found in catalog schema. Ignoring.
```
Okay, I am googling around, and it seems I need to work on adding catalog discovery somehow to my custom tap, so that Meltano can understand how to split up the queries; when it doesn't get this stuff fed to it, it just runs the entire thing as one. So I am now trying to figure out what a catalog file is and how to fill it in with my partitions. I see some examples, but none place the partitions data anywhere, and I am a bit lost as to whether I need to put JUST the partition data or re-describe everything about my tap's stream: https://hub.meltano.com/singer/spec#catalog-files
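For what it's worth, that warning usually means a field (here `alphabet_partition`) is ending up on emitted records without being declared in the stream's schema; with the SDK, the catalog is generated from the schemas defined in `streams.py`, so a hand-written catalog file generally isn't needed. A hedged sketch of declaring the field, assuming the SDK's typing helpers (the URL and path are placeholders):

```python
# Hedged sketch: declare the partition key in the stream's schema so it is
# part of the discovered catalog. Names mirror the warning message above;
# the URL/path and other fields stand in for whatever the API actually uses.
from singer_sdk import typing as th
from singer_sdk.streams import RESTStream


class MyTableStream(RESTStream):
    name = "my_table"
    url_base = "https://api.example.com"  # placeholder
    path = "/lookup"  # placeholder

    schema = th.PropertiesList(
        # ...the existing response fields go here...
        th.Property("alphabet_partition", th.StringType),
    ).to_dict()
```

If the field is only needed as context and not on the records themselves, the other option is to avoid merging it into the emitted records, in which case the schema doesn't need to change.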