emcp
09/07/2021, 6:14 PM
aaronsteers
09/07/2021, 9:09 PM
Streams.partitions property to be the list of shards you'd like to be able to run as distinct contexts. That would get your tap to keep separate STATE trackers for each partition and to pass the partition dict as context to any method supporting it.
2. On the tap side also, you can accept a config option that's a list of one or more partitions to run. When omitted, presumably, you'd run all the partitions defined above. However, when set by the user, you'd run only the partition(s) requested.
3. On the orchestration side (for example, from Meltano), you would kick off any number of runners, each being passed a distinct partition config. For instance, if there are 50 partitions, I could send a list of 5 partitions each to 10 invocations, or 10 partitions each to 5 invocations. You could also send 1 partition to each of 50 invocations.
4. Lastly, you'd need a plan to manage state: either keep separate state tracking for each of the 5 or 10 invocations you'd run, or come up with a more advanced way of merging the states back afterwards.
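A minimal sketch of steps 2 to 4 in plain Python, in case it helps make the idea concrete. None of this is SDK API: the `shard` key, the helper names, and the flat state shape (partition name mapped to a bookmark) are all made up for illustration.

```python
from typing import Dict, List, Optional

Partition = Dict[str, str]

# Hypothetical full list of shards the tap knows about (step 1).
ALL_PARTITIONS: List[Partition] = [{"shard": f"shard-{i:02d}"} for i in range(50)]


def select_partitions(
    all_partitions: List[Partition], requested: Optional[List[str]]
) -> List[Partition]:
    """Step 2: run every partition unless the config names specific shards."""
    if not requested:
        return list(all_partitions)
    wanted = set(requested)
    return [p for p in all_partitions if p["shard"] in wanted]


def chunk_partitions(
    partitions: List[Partition], n_invocations: int
) -> List[List[Partition]]:
    """Step 3: deal partitions round-robin into one list per invocation."""
    return [partitions[i::n_invocations] for i in range(n_invocations)]


def merge_states(states: List[Dict[str, str]]) -> Dict[str, str]:
    """Step 4 (naive): union the per-partition bookmarks of each invocation.

    Assumes each invocation only reports bookmarks for its own partitions,
    so keys never collide.
    """
    merged: Dict[str, str] = {}
    for state in states:
        merged.update(state)
    return merged


batches = chunk_partitions(ALL_PARTITIONS, 10)
print(len(batches), len(batches[0]))  # 10 invocations, 5 partitions each
```

Each inner list in `batches` would be what the orchestrator passes as the partition config to one invocation.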
Does this help at all? Would be a good discussion for #C01QS0RV78D also if you are free tomorrow.
emcp
09/08/2021, 7:20 AM
aaronsteers
09/08/2021, 3:25 PM
emcp
09/08/2021, 3:32 PM
emcp
09/12/2021, 9:45 AM
26 choose 1 + 26 choose 2 + 26 choose 3 + 26 choose 4 + 26 choose 5
or
83681 different queries
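A quick way to double-check that count, assuming it is simply the number of ways to pick between 1 and 5 of 26 items:

```python
from math import comb

# Sum of "26 choose k" for k = 1..5.
total = sum(comb(26, k) for k in range(1, 6))
print(total)  # 83681
```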
Maybe I can try splitting it up as a start: query the first half in one partition and the second half in a second partition.
I did notice that data doesn't get written until the transaction completes, and if there's any error the whole tap run basically gets wiped out. To me that signals I would improve performance and reliability by splitting it up even more.
emcp
09/12/2021, 9:58 AM
streams.py ?
emcp
09/12/2021, 10:06 AM
streams.py
I've added
partitions = [{"dictkey`": "dict1value1"},
              {"dictkey`": "dict1value2"}]
and in the client I can now see this if I print it out
print("stream context = " + str(context))
emcp
09/12/2021, 10:07 AM
stream context = {'dictkey`': 'dict1value1'}
So now it's simply up to me to arbitrarily choose how to split this up into partitions. Okay, great!
emcp
09/12/2021, 10:08 AM
partitions = [{"dictkey`": ["a", "b"]},
              {"dictkey`": ["c", "d"]}]
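Rather than typing the list out by hand, the same shape could be generated. This is only a sketch: the helper name is hypothetical, and the key name `alphabet_partition` is borrowed from the log line later in the thread.

```python
import string
from typing import Dict, List


def build_alphabet_partitions(
    letters_per_partition: int, key: str = "alphabet_partition"
) -> List[Dict[str, List[str]]]:
    """Split a..z into consecutive chunks, one partition dict per chunk."""
    letters = string.ascii_lowercase
    return [
        {key: list(letters[i : i + letters_per_partition])}
        for i in range(0, len(letters), letters_per_partition)
    ]


partitions = build_alphabet_partitions(2)
print(partitions[0], partitions[1])
print(len(partitions))  # 13 partitions of 2 letters each
```

Calling `build_alphabet_partitions(13)` would give the "first half / second half" split mentioned earlier.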
emcp
09/12/2021, 10:40 AM
tap-custom | time=2021-09-12 10:39:00 name=tap-custom level=WARNING message=Property 'alphabet_partition' was present in the 'my_table' stream but not found in catalog schema. Ignoring.
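The warning reads as though the record carried a key that the stream's schema does not declare, so the key was dropped. A rough illustration of that kind of check follows. It is not the SDK's actual code, and the schema and record here are invented.

```python
from typing import Any, Dict, List, Tuple


def drop_undeclared(
    record: Dict[str, Any], schema: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Keep only keys declared under schema["properties"]; report the rest."""
    declared = schema.get("properties", {})
    kept = {k: v for k, v in record.items() if k in declared}
    dropped = sorted(set(record) - set(declared))
    return kept, dropped


# Hypothetical schema that does not mention the partition key.
schema = {"properties": {"id": {"type": "integer"}}}
record = {"id": 1, "alphabet_partition": ["a", "b"]}

kept, dropped = drop_undeclared(record, schema)
print(kept, dropped)

# Declaring the property in the schema means it is no longer dropped.
schema["properties"]["alphabet_partition"] = {
    "type": "array",
    "items": {"type": "string"},
}
kept_after, dropped_after = drop_undeclared(record, schema)
print(kept_after, dropped_after)
```

If the property is wanted in the output, declaring it in the stream's schema would presumably make the warning go away; if not, the warning looks harmless.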
emcp
09/12/2021, 2:09 PM