# best-practices
Question on parent-child streams within taps.. I wrote my API to favor batches of queries.. but when starting to utilize this parent-child relationship, it seems to feed me one record at a time from the parent.. are there any thoughts on how to build these records into child-targeted batches, to avoid opening and closing the connection to the API over and over?
my first attempt at solving this: I started saving the size of the query in the stream, which tells you how much you can expect to get back, and waiting before executing the query.. but this really doesn't work, as Meltano's runtime expects you to return on every get_records call, and not to "batch" them up and return them all later
does anyone know the best practice to yield basically nothing until I get to the very last get_records call, then execute a batch query and collect the responses that way?
I will try just an empty array or JSON first
okay here's my workaround
import typing as t

from singer_sdk import Stream


class NewsBodyStream(Stream):

    # Class-level state, shared across get_records calls so the batch
    # survives between per-parent-record invocations
    query_size_parent = 0
    query_array_parent = []
    ibkr_thrift_client = None

    def get_records(self, context: dict | None) -> t.Iterable[dict | tuple[dict, dict | None]]:
        self.logger.info(
            "CONFIG = %s parentStream %s schema %s stream context = %s",
            self.config, self.parent_stream_type, self.schema, context,
        )

        if self.query_size_parent == 0:
            self.query_size_parent = context["query_size"]

        self.query_array_parent.append(context)

        # If this is true, we've gathered the entire upstream query and can proceed to execute
        if len(self.query_array_parent) == self.query_size_parent:
            # Execute the query as a batch and yield results once they come back.
            # execute_batch is a placeholder for the actual Thrift batch call.
            yield from self.execute_batch(self.query_array_parent)
        else:
            yield {}
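Stripped of the SDK, the buffer-until-full idea above can be exercised on its own. BatchBuffer and execute_batch are hypothetical names for this sketch (and it yields nothing on intermediate calls instead of an empty dict):

```python
# Standalone sketch of the buffer-until-full pattern; no singer_sdk needed.
# execute_batch is a stand-in for the real Thrift batch call.
class BatchBuffer:
    def __init__(self, execute_batch):
        self.execute_batch = execute_batch
        self.query_size = 0
        self.pending = []

    def feed(self, context):
        """Called once per parent record, like get_records is."""
        if self.query_size == 0:
            self.query_size = context["query_size"]
        self.pending.append(context)
        if len(self.pending) == self.query_size:
            # last parent record seen: flush the whole batch at once
            yield from self.execute_batch(self.pending)
```

The point of the sketch is just that nothing reaches the downstream consumer until the final call, at which point the single batched request fires.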
hmmm even that ended up not working, and instead I had to shoehorn the entire parent's result into an array and pass that as context.. the problem will inevitably be that I would repeat very large queries over and over for each parent record... I am now looking into the docs about batching config
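That shoehorn workaround can be sketched in plain Python (parent_child_contexts, article_id, and fetch_batch are made-up names for illustration): the parent collapses its whole result set into a single child context, so the child issues exactly one batched call instead of one per parent record.

```python
# Hypothetical sketch of the "whole batch as one context" workaround.
def parent_child_contexts(parent_records):
    """Collapse all parent records into a single child context."""
    ids = [r["article_id"] for r in parent_records]
    return [{"article_ids": ids, "query_size": len(ids)}]

def child_get_records(context, fetch_batch):
    """Child stream body: one batched API call per context."""
    yield from fetch_batch(context["article_ids"])
```

The trade-off is exactly the one described above: the context carried in state bookmarks now contains the entire parent result set, which gets large and repetitive.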
I found these https://sdk.meltano.com/en/latest/classes/singer_sdk.Stream.html#singer_sdk.Stream.get_batch_config https://sdk.meltano.com/en/latest/classes/singer_sdk.batch.BaseBatcher.html Not sure how they work, but thanks to the PyCharm debugger documentation I can FINALLY use a proper debugger and step through the Singer SDK to get the hang of the life cycle
but for now I may have to settle for passing each record one-by-one to the service
Well.. I may in fact disentangle my tap, unfortunately.. or just not use parent-child relationships right now.. the behavior isn't clear, but I would like the entire batch of results sent to the child, instead of what's happening now: every record calculated by the parent is being sent to the child one row at a time.. and my Thrift server is basically written (poorly) to disconnect after a single API call, to let the other calls take a client connection..
my one curiosity or hope: in re-reading the docs I found this bit here
If the number of parent items is very large (thousands or tens of thousands), you can optionally set state_partitioning_keys on the child stream to specify a subset of context keys to use in state bookmarks. (When not set, the number of bookmarks will be equal to the number of parent items.) If you do not wish to store any state bookmarks for the child stream, set state_partitioning_keys to [].
https://sdk.meltano.com/en/latest/parent_streams.html
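If it helps anyone, setting that option is just a class attribute on the child stream. A minimal sketch (in a real tap this class would subclass singer_sdk.Stream and set parent_stream_type to the actual parent class; both are stubbed here):

```python
# Hypothetical child stream illustrating the option quoted above.
class NewsBodyStream:
    parent_stream_type = None       # placeholder for the parent Stream class
    state_partitioning_keys = []    # [] = store no per-parent state bookmarks
```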
will try to find an example of this somewhere
okay I think I have a path forward... based on this snippet in the SDK class docstring
A method which should retrieve data from the source and return records
incrementally using the python `yield` operator.
turns out I just thought you should yield in all cases.. I will now switch to returning the data and see if that changes the behavior (fingers crossed)
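For what it's worth, from the caller's side a generator and a returned list are both Iterables, so either form satisfies that signature. A quick sketch with toy functions (not SDK code):

```python
def get_records_yielding(context):
    # generator style: records are produced lazily, one at a time
    for i in range(3):
        yield {"n": i}

def get_records_returning(context):
    # return style: build everything up front, hand it back in one go
    return [{"n": i} for i in range(3)]

# Both satisfy "return an Iterable of record dicts"
assert list(get_records_yielding(None)) == get_records_returning(None)
```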