# best-practices
Question on parent-child streams within taps.. I wrote my API to favor batches of queries.. but when starting to utilize this parent-child relationship, it seems to feed me one record at a time from the parent.. are there any thoughts on how to build these records into child-targeted batches, to avoid opening and closing the connection to the API over and over?
my first attempt at solving this: I started saving the size of the query in the stream, which tells you how much you can expect to get back, and waiting before executing the query.. but this really doesn't work, as Meltano's runtime expects you to return on every get_records call, and not to "batch" them up and return them all later
does anyone know the best practice to yield basically nothing until I get to the very last get_records call, then execute a batch query and collect the responses that way?
I will try just an empty array or JSON first
okay here's my workaround
import typing as t

from singer_sdk import Stream


class NewsBodyStream(Stream):

    # Class-level state, shared across get_records calls so the batch
    # survives between per-parent-record invocations
    query_size_parent = 0
    query_array_parent = []
    ibkr_thrift_client = None

    def get_records(self, context: dict | None) -> t.Iterable[dict | tuple[dict, dict | None]]:
        self.logger.info(
            "CONFIG = %s parentStream %s schema %s stream context = %s",
            self.config, self.parent_stream_type, self.schema, context,
        )

        if self.query_size_parent == 0:
            self.query_size_parent = context["query_size"]

        self.query_array_parent.append(context)

        # If this is true, we've gathered the entire upstream query and can proceed to execute
        if len(self.query_array_parent) == self.query_size_parent:
            # Execute the query as a batch and yield results once they come back.
            # execute_batch is a placeholder for the actual Thrift batch call.
            yield from self.execute_batch(self.query_array_parent)
        else:
            yield {}
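Stripped of the SDK, the buffer-until-full idea above can be exercised on its own. BatchBuffer and execute_batch are hypothetical names for this sketch (and it yields nothing on intermediate calls instead of an empty dict):

```python
# Standalone sketch of the buffer-until-full pattern; no singer_sdk needed.
# execute_batch is a stand-in for the real Thrift batch call.
class BatchBuffer:
    def __init__(self, execute_batch):
        self.execute_batch = execute_batch
        self.query_size = 0
        self.pending = []

    def feed(self, context):
        """Called once per parent record, like get_records is."""
        if self.query_size == 0:
            self.query_size = context["query_size"]
        self.pending.append(context)
        if len(self.pending) == self.query_size:
            # last parent record seen: flush the whole batch at once
            yield from self.execute_batch(self.pending)
```

The point of the sketch is just that nothing reaches the downstream consumer until the final call, at which point the single batched request fires.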
hmmm even that ended up not working, and instead I had to shoehorn the entire parent's result into an array and pass that as context.. the problem will inevitably be that I would repeat very large queries over and over for each parent record... I am now looking into the docs about batching config
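That shoehorn workaround can be sketched in plain Python (parent_child_contexts, article_id, and fetch_batch are made-up names for illustration): the parent collapses its whole result set into a single child context, so the child issues exactly one batched call instead of one per parent record.

```python
# Hypothetical sketch of the "whole batch as one context" workaround.
def parent_child_contexts(parent_records):
    """Collapse all parent records into a single child context."""
    ids = [r["article_id"] for r in parent_records]
    return [{"article_ids": ids, "query_size": len(ids)}]

def child_get_records(context, fetch_batch):
    """Child stream body: one batched API call per context."""
    yield from fetch_batch(context["article_ids"])
```

The trade-off is exactly the one described above: the context carried in state bookmarks now contains the entire parent result set, which gets large and repetitive.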
I found these https://sdk.meltano.com/en/latest/classes/singer_sdk.Stream.html#singer_sdk.Stream.get_batch_config https://sdk.meltano.com/en/latest/classes/singer_sdk.batch.BaseBatcher.html Not sure how they work, but thanks to the PyCharm debugger documentation I can FINALLY use a proper debugger and step through the Singer SDK to get the hang of the life cycle
but for now I may have to settle for passing each record one-by-one to the service
Well.. I may in fact disentangle my tap, unfortunately.. or just not use parent-child relationships right now.. the behavior isn't clear, but I would like the entire batch of results sent to the child, instead of what's happening now: every record calculated by the parent is being sent to the child one row at a time.. and my Thrift server is basically written (poorly) to disconnect after a single API call, to let the other calls take a client connection..
my one curiosity or hope: in re-reading the docs I found this bit here
If the number of parent items is very large (thousands or tens of thousands), you can optionally set state_partitioning_keys on the child stream to specify a subset of context keys to use in state bookmarks. (When not set, the number of bookmarks will be equal to the number of parent items.) If you do not wish to store any state bookmarks for the child stream, set state_partitioning_keys to [].
https://sdk.meltano.com/en/latest/parent_streams.html
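If it helps anyone, setting that option is just a class attribute on the child stream. A minimal sketch (in a real tap this class would subclass singer_sdk.Stream and set parent_stream_type to the actual parent class; both are stubbed here):

```python
# Hypothetical child stream illustrating the option quoted above.
class NewsBodyStream:
    parent_stream_type = None       # placeholder for the parent Stream class
    state_partitioning_keys = []    # [] = store no per-parent state bookmarks
```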
will try to find an example of this somewhere
okay I think I have a path forward... based on this snippet in the SDK class docstring
A method which should retrieve data from the source and return records
incrementally using the python `yield` operator.
turns out I just thought you should yield in all cases.. I will now switch to returning the data and see if that changes the behavior (fingers crossed)
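For what it's worth, from the caller's side a generator and a returned list are both Iterables, so either form satisfies that signature. A quick sketch with toy functions (not SDK code):

```python
def get_records_yielding(context):
    # generator style: records are produced lazily, one at a time
    for i in range(3):
        yield {"n": i}

def get_records_returning(context):
    # return style: build everything up front, hand it back in one go
    return [{"n": i} for i in range(3)]

# Both satisfy "return an Iterable of record dicts"
assert list(get_records_yielding(None)) == get_records_returning(None)
```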