Somewhat urgent question: For child streams, does ...
# troubleshooting
f
Somewhat urgent question: for child streams, does an incremental sync keep querying a child stream for a parent record that the parent stream no longer returns? We deleted an account, and the API no longer returns it, so get_child_context is not even called for it, let alone the tap emitting anything, as the record is simply not there. However, on incremental runs the child stream still queries for that old account_id. I checked the Meltano DB, and the job table still has the account_id in the state info (in the payload column). Do we really have to manually edit the job table to remove old account_ids?
a
Hi, @fred_reimer. This is an interesting use case. Is this a publicly available tap I could look at? What comes to mind is this line of code, which falls back to `partitions` if `context` is not set. The `partitions` list is seeded from the last `STATE` message, so the behavior you describe would make sense, but only if `context` is missing/empty. That said, as long as you still have the parent-child relationship intact, I don't know why `context` from the parent would not be used.
context_list = [context] if context is not None else self.partitions
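To make the fallback concrete, here is a minimal sketch of what that one-liner does, written as a standalone function rather than the SDK's actual method. The function name `resolve_context_list` and the sample account IDs are made up for illustration; only the conditional expression comes from the line quoted above.

```python
from typing import Optional


def resolve_context_list(
    context: Optional[dict], partitions: Optional[list]
) -> Optional[list]:
    """Standalone copy of the quoted expression (hypothetical helper name).

    If the parent passed a context, only that context is synced.
    Otherwise the stream falls back to the partitions stored in state.
    """
    return [context] if context is not None else partitions


# Partitions as they might be seeded from the last STATE message,
# including an account that has since been deleted upstream.
stored_partitions = [{"account_id": "A1"}, {"account_id": "DELETED"}]

# Parent supplies a context: only the live account is queried.
with_context = resolve_context_list({"account_id": "A1"}, stored_partitions)

# No context supplied: every stored partition is queried, stale ones included.
without_context = resolve_context_list(None, stored_partitions)
```

If the child stream is ever synced without a context from the parent, the second case would explain the stale account_id being queried.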
Would be helpful to look at the code, but short of that, can you help me understand the parent-child relationship and if you have set any value for `ignore_parent_replication_keys` and/or `state_partitioning_keys`? I would not expect the old `STATE` partitions to be cleaned out, but I also would not expect the parent's children to be continually queried when the parent does not exist.
If the number of parent records (`n`) is not very large, you can avoid partition-level bookmarks by setting `state_partitioning_keys` to a higher-level granularity, or to `[]` to track just a single bookmark for the whole stream.
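As a rough illustration of the difference, the sketch below models how a bookmark could be stored with and without partitioning keys. It is a simplified model written for this thread, not the SDK's implementation; the helper names `state_partition_context` and `write_bookmark` and the state layout are assumptions.

```python
from typing import Optional


def state_partition_context(
    context: Optional[dict], state_partitioning_keys: Optional[list]
) -> Optional[dict]:
    """Reduce a context to the keys used for state partitioning (assumed logic)."""
    if context is None:
        return None
    if state_partitioning_keys is None:
        # No override: the full context identifies the partition.
        return context
    return {k: v for k, v in context.items() if k in state_partitioning_keys}


def write_bookmark(stream_state: dict, context: dict, keys: Optional[list], value: str) -> None:
    """Store a replication key value per partition, or stream-wide if no keys remain."""
    part = state_partition_context(context, keys)
    if not part:
        # Empty partition context: a single bookmark for the whole stream.
        stream_state["replication_key_value"] = value
        return
    partitions = stream_state.setdefault("partitions", [])
    for entry in partitions:
        if entry["context"] == part:
            entry["replication_key_value"] = value
            return
    partitions.append({"context": part, "replication_key_value": value})


# Default behavior (keys=None): one bookmark per account_id.
per_account: dict = {}
write_bookmark(per_account, {"account_id": "A1"}, None, "2024-01-01")
write_bookmark(per_account, {"account_id": "A2"}, None, "2024-01-02")

# state_partitioning_keys = []: no partitions list is written at all.
single: dict = {}
write_bookmark(single, {"account_id": "A1"}, [], "2024-01-01")
write_bookmark(single, {"account_id": "A2"}, [], "2024-01-02")
```

The trade-off in this model is that all accounts share one bookmark, so there is no per-account list in state that could go stale.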
f
The tap is not public, but it's not particularly proprietary. Just a tap for a SaaS solution that we utilize as a customer/partner. Basic structure is:
• accounts stream
  ◦ primary_keys ["account_id"]
  ◦ get_child_context - return {"account_id": record["account_id"]}
• account_info stream
  ◦ parent_stream_type accounts stream
  ◦ ignore_parent_replication_keys True
  ◦ primary_keys ["account_id"]
  ◦ replication_key "timestamp"
  ◦ uses context to access account_id to make queries, set account_id in record, etc.
So it's all fairly straightforward. The n is not very large now, accounts is maybe a dozen or two, but it will grow (not to thousands). We are not doing anything fancy here. It's just that when accounts no longer processes a record for a deleted account_id, the child stream still tries to do an incremental sync for it. That is, until we manually edited the job record in the DB and updated the payload for the last id for the job_id, which worked. But this can't be a manual process. This has to work automatically....
@aaronsteers done, but if you are on your weekend enjoy. I appreciate the second set of eyes, and any recommendations you may have. Thanks!
I can look into publishing this tap. Like I said, it's not particularly proprietary, and is for a third-party SAAS service. I'll let you know.
a
Thanks for sharing this detail. For the short term, I do think there's a workaround: set `state_partitioning_keys = []` on the account_info child stream. Do you mind testing this if feasible to do so? And also, could you open an issue so we can look into the root cause?
The account id key sticking around in the state partitions list in the job record is expected. But that partition definition driving the list of account IDs is not expected.
(As a one-time operation, you may also need to remove the old partitions, or just start a new job id for this test.)
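For that one-time cleanup, something along these lines could prune stale partitions from a state document instead of hand-editing it. This is only a sketch: the function name `prune_partitions` is made up, and the layout it expects (`bookmarks` → stream name → `partitions`, each with a `context` dict) is an assumption about the state payload that should be checked against what is actually stored in the job record.

```python
import copy


def prune_partitions(state: dict, stream_name: str, key: str, live_values: set) -> dict:
    """Return a copy of `state` keeping only partitions whose context[key] is still live.

    Assumed layout: state["bookmarks"][stream_name]["partitions"] is a list of
    dicts, each holding a "context" dict such as {"account_id": "..."}.
    """
    new_state = copy.deepcopy(state)
    stream_state = new_state.get("bookmarks", {}).get(stream_name)
    if stream_state and "partitions" in stream_state:
        stream_state["partitions"] = [
            p
            for p in stream_state["partitions"]
            if p.get("context", {}).get(key) in live_values
        ]
    return new_state


# Example state with one live and one deleted account (illustrative values).
old_state = {
    "bookmarks": {
        "account_info": {
            "partitions": [
                {"context": {"account_id": "A1"}, "replication_key_value": "2024-01-01"},
                {"context": {"account_id": "DELETED"}, "replication_key_value": "2023-06-01"},
            ]
        }
    }
}

cleaned = prune_partitions(old_state, "account_info", "account_id", {"A1"})
```

The input is left untouched, so the cleaned copy can be compared with the original before anything is written back.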
f
I can likely test this next week. Stay tuned...