Hey guys, I'm seeing some duplicate entries in my...
# singer-tap-development
s
Hey guys, I'm seeing some duplicate entries in my database when using parent -> child interactions going through Stitch. Has anyone ever seen this issue or know how to deal with it? Here is how I implemented the interaction (not the prettiest code, I'm sorry): Parent:
class DealsStream(HubspotStream):
    """Define custom stream."""
    name = "deals"
    path = "/crm/v3/objects/deals"
    primary_keys = ["id"]
    partitions = [{"archived": True}, {"archived": False}]

    def get_url_params(self, context: Optional[dict], next_page_token: Optional[Any]) -> Dict[str, Any]:
        params = super().get_url_params(context, next_page_token)
        params['properties'] = ','.join(self.properties)
        params['archived'] = context['archived']
        return params

    @property
    def schema(self) -> dict:
        if self.cached_schema is None:
            self.cached_schema, self.properties = self.get_custom_schema()
        return self.cached_schema

    def get_child_context(self, record: dict, context: Optional[dict]) -> dict:
        """Return a context dictionary for child streams."""
        return {
            "deal_id": record["id"],
            "archived": record["archived"]
        }
Child:
class AssociationsDealsToCompaniesStream(HubspotStream):
    name = "associations_deals_companies"
    path = "/crm/v4/objects/deals/{deal_id}/associations/companies"
    deal_id = ""
    replication_method = "FULL_TABLE"
    replication_key = ""
    parent_stream_type = DealsStream

    ignore_parent_replication_keys = True

    def get_url_params(
        self, context: Optional[dict], next_page_token: Optional[Any]
    ) -> Dict[str, Any]:
        """Return a dictionary of values to be used in URL parameterization."""
        params = super().get_url_params(context, next_page_token)
        self.deal_id = context['deal_id']
        return params

    def parse_response(self, response: requests.Response) -> Iterable[dict]:
        data = response.json()['results']
        ret = []
        for e in data:
            elem = e
            elem['id'] = self.deal_id
            ret.append(elem)

        return ret
Follow-up question: How should we deal with the use case of partitions working with child interactions?
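To make the partition question concrete, here is a rough sketch in plain Python (not the SDK itself) of how the two `archived` partitions on the parent would fan out into child contexts, assuming the parent is synced once per partition and `get_child_context` is called once per parent record. `FAKE_DEALS`, `get_parent_records`, and `iter_child_contexts` are hypothetical stand-ins for illustration only.

```python
from typing import Dict, Iterator, List

# Same partitions as on DealsStream above.
PARTITIONS: List[Dict[str, bool]] = [{"archived": True}, {"archived": False}]

# Hypothetical stand-in for what the deals endpoint returns per partition.
FAKE_DEALS: Dict[bool, List[dict]] = {
    True: [{"id": "1", "archived": True}],
    False: [{"id": "2", "archived": False}, {"id": "3", "archived": False}],
}


def get_parent_records(context: dict) -> List[dict]:
    """Return the (fake) parent records for one partition context."""
    return FAKE_DEALS[context["archived"]]


def get_child_context(record: dict, context: dict) -> dict:
    """Mirror of DealsStream.get_child_context from the thread."""
    return {"deal_id": record["id"], "archived": record["archived"]}


def iter_child_contexts() -> Iterator[dict]:
    """Assumed flow: one parent pass per partition, one child context per record."""
    for partition in PARTITIONS:
        for record in get_parent_records(partition):
            yield get_child_context(record, partition)


child_contexts = list(iter_child_contexts())

# The child path template is filled in from each context.
child_paths = [
    f"/crm/v4/objects/deals/{ctx['deal_id']}/associations/companies"
    for ctx in child_contexts
]
```

Under these assumptions, each deal yields exactly one child context regardless of which partition it came from, so the partitions alone would not produce repeated child requests unless the same deal appeared in both partitions.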
e
Hi @Stéphane Burwash. I've never seen this 🤔. Do the duplicate records come from the parent or child?
s
Child. Maybe it's a replication key issue?
e
"duplicate entries in my database"
So you're only seeing the duplicates in the target database. I see the child stream is missing a primary key, right? That could mean records are being appended instead of upserted.
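A minimal sketch of the append-versus-upsert point above, in plain Python rather than the SDK or a real target. The `load_append` and `load_upsert` helpers and the sample records are hypothetical, and the `toObjectId` field is an assumption about what the associations response contains. Because the child overwrites `id` with the deal ID, the sketch uses a composite key so that each deal-to-company pair stays distinct.

```python
from typing import Dict, List, Sequence, Tuple


def load_append(batches: Sequence[List[dict]]) -> List[dict]:
    """Target with no key: every record from every sync is kept."""
    table: List[dict] = []
    for batch in batches:
        table.extend(batch)
    return table


def load_upsert(batches: Sequence[List[dict]], key_fields: Sequence[str]) -> List[dict]:
    """Target with a key: a later record replaces an earlier one with the same key."""
    table: Dict[Tuple, dict] = {}
    for batch in batches:
        for record in batch:
            key = tuple(record[field] for field in key_fields)
            table[key] = record
    return list(table.values())


# One FULL_TABLE sync of the child stream for deal "1" (hypothetical data).
one_sync = [
    {"id": "1", "toObjectId": 10},
    {"id": "1", "toObjectId": 11},
]

# Running the same sync twice, as would happen on consecutive runs.
appended = load_append([one_sync, one_sync])
upserted = load_upsert([one_sync, one_sync], key_fields=["id", "toObjectId"])
```

Without a key the second run doubles the rows (four instead of two); with the composite key the table stays at two rows. On the tap side, the equivalent change would be declaring `primary_keys` on the child stream, assuming the chosen fields really are unique per record.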
a
@Stéphane Burwash - do you have the option of using the primary key on the child stream for dedupe? The level of the state_partitioning_keys can also have an impact here. All settings are "safe" in terms of sending records "at least once" but lower grains of state partitioning will have fewer dupes at the cost of more bookmarks being tracked and a larger state object.
I think if you've not overridden the state partitioning keys (and I don't see it overwritten above) then you probably have the default behavior, which is one state record per parent context, which is the highest grain and least duplication.
@visch had recently proposed we document this better. First attempt here, would appreciate any feedback: Add 'at least once' implementation info to SDK docs (!301) · Merge requests · Meltano / Meltano SDK for Singer Taps and Targets · GitLab
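As a rough illustration of the state partitioning grain described above, the sketch below counts how many bookmarks would be tracked for a child stream under different `state_partitioning_keys` settings. It is plain Python, not the SDK: `state_partition` and `count_bookmarks` are hypothetical helpers, and the rule they encode (`None` keeps the full context, a list keeps only the named keys, an empty list collapses to one stream-level bookmark) is my reading of the SDK's behavior, so treat it as an assumption.

```python
from typing import Dict, List, Optional


def state_partition(context: dict, keys: Optional[List[str]]) -> dict:
    """Reduce a child context to the part assumed to identify its bookmark."""
    if keys is None:
        # Assumed default: one bookmark per full parent context.
        return dict(context)
    return {k: context[k] for k in keys}


def count_bookmarks(contexts: List[dict], keys: Optional[List[str]]) -> int:
    """Count distinct state partitions across all child contexts."""
    seen = {tuple(sorted(state_partition(c, keys).items())) for c in contexts}
    return len(seen)


# 1000 hypothetical deals, half of them archived.
contexts: List[Dict[str, object]] = [
    {"deal_id": str(i), "archived": i % 2 == 0} for i in range(1000)
]

per_parent = count_bookmarks(contexts, None)          # one per deal
per_archived = count_bookmarks(contexts, ["archived"])  # one per partition value
stream_level = count_bookmarks(contexts, [])            # a single bookmark
```

With 1000 parent records this gives 1000, 2, and 1 bookmarks respectively, which is the trade-off mentioned above: finer state means fewer re-sent records after an interruption, but a larger state object to write on every update.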
s
Thank you for all the responses! As @edgar_ramirez_mondragon pointed out, it's probably a primary key issue; I'll keep you guys updated though. @aaronsteers I'll take a look at the document and give you any feedback I can 😄
Is it normal that when queuing up replication, my Meltano terminal shows longer and longer "updating state" messages? Update: Never mind, I figured it out. This is linked to having a large number of parent records, each with its own state entry => https://sdk.meltano.com/en/latest/parent_streams.html?highlight=child#if-you-do-want-to-utilize-parent-child-streams