# best-practices
s
Question for the community: I have duplicate records for a certain id caused by user changes (e.g. a deal changes its primary company, which creates a duplicate record since the primary key is set as the association id-toObjectId). We have an
_sdc_received_at
flag, so it is possible to establish which record is the "most recent". Now my question is: what would be better?
1. Delete the old, deprecated records in my raw data
2. Only query the new, updated records in my staging data
Any other suggestions would be appreciated
t
I would generally recommend never deleting raw data unless you have a good reason. I would have a “source” table that is a cleaned-up version of the raw table: it handles the basics of fixing data types, cleaning up column names, and doing minor transformations like deduping, hashing, and splitting if needed.
c
I agree with Taylor. Deleting data in a transactional system would be considered bad practice by most people. In my case, I do the same transformations as Taylor mentioned, except that I actually like to keep the deduplication step in an intermediary model in dbt just after the staging model. But that's just personal preference.
s
Thank you so much! I'll definitely try the source model approach
t
I do recommend following the data-source approach in dbt, where each folder corresponds to a data source (API, database, etc.). GitLab is a good example here: https://gitlab.com/gitlab-data/analytics/-/tree/master/transform/snowflake-dbt/models/sources
c
The guide for structuring your dbt project is also hugely valuable. Folder structure and naming conventions for your models help immensely when scaling the number of models and marts (and team members): https://docs.getdbt.com/guides/best-practices
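For context, the source() calls used later in this thread need a sources definition in the project. A minimal sketch of what that file could look like under the folder-per-data-source layout above (the file path and the schema name are assumptions on my part, not taken from the thread):

```yaml
# models/sources/hubspot/src_hubspot.yml -- assumed path, following the
# folder-per-data-source convention discussed above
version: 2

sources:
  - name: hubspot
    schema: stitch_hubspot  # assumption: the schema Stitch loads into
    tables:
      - name: associations_deals_contacts
        loaded_at_field: _sdc_extracted_at
```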
s
Update: I decided to follow @taylor’s advice regarding GitLab's implementation. Here is the format of my new query, directly in staging:
with source as (
    select
        *,
        row_number() over (partition by id order by _sdc_extracted_at desc) as row_number
    from {{ source('hubspot', 'associations_deals_contacts') }}
),

renamed as (
    select
        cast(id as string) as id,
        cast(toObjectId as string) as toObjectId,
        associationTypes.value.category as association_types_category,
        associationTypes.value.typeId as association_types_type_id,
        associationTypes.value.label as association_types_label,
        _sdc_extracted_at
    from source,
        unnest(associationTypes) as associationTypes
    where row_number = 1
)

select
    id,
    toObjectId,
    association_types_category,
    association_types_type_id,
    association_types_label,
    _sdc_extracted_at
from renamed
t
@Stéphane Burwash which data warehouse are you using?
s
BigQuery
With Stitch as a loader
t
ah - if you were using Snowflake I was going to recommend https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
s
That is a glorious, glorious query option
s
I'm not sure I understand the difference with a simple where clause though
Ah! You can write it directly as the query condition
t
it’s just cleaner. I think you could do
with source as (
    select *
    from {{ source('hubspot', 'associations_deals_contacts') }}
    qualify row_number() over (partition by id order by _sdc_extracted_at desc) = 1
),
as your first CTE and drop the where clause in the second one
s
That's an awesome idea, thanks!
Update update: https://github.com/dbt-labs/dbt-utils#deduplicate-source dbt already offers an optimized deduplication macro, which is pretty awesome!
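For anyone finding this later, a minimal sketch of how the macro could be dropped into the staging model above (partition and order columns are borrowed from the earlier query in this thread; treat this as an untested sketch, not a verified implementation):

```sql
-- hedged sketch: dedupe the raw association records with dbt_utils,
-- keeping the most recently extracted row per id
with source as (

    {{ dbt_utils.deduplicate(
        relation=source('hubspot', 'associations_deals_contacts'),
        partition_by='id',
        order_by='_sdc_extracted_at desc'
    ) }}

)

select * from source
```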
t
I’d be curious if that actually is more performant or if qualify would help
I see they use qualify for snowflake
c
Was just going to suggest
dbt_utils.deduplicate
😁
a
@taylor I tested this a while back and found
qualify
is better. This code in dbt-utils comes from before they introduced the keyword, I think. Before "qualify" was introduced, you would typically put the
row_number() over ... as rn
in a subquery, and then the outer query would `select * except(rn) where rn = 1`; in that case, what they did in dbt-utils may have been more performant. But I don't think it is in today's BQ. It's anecdotal on my end and not properly benchmarked, so try both for yourself 😄 and let us know
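To make the comparison concrete, a sketch of the two BigQuery patterns being discussed (table and column names are borrowed from earlier in the thread; not benchmarked here either):

```sql
-- pattern 1: pre-QUALIFY style -- compute row_number() in a subquery,
-- then filter and drop the helper column in the outer query
select * except (rn)
from (
    select
        *,
        row_number() over (partition by id order by _sdc_extracted_at desc) as rn
    from {{ source('hubspot', 'associations_deals_contacts') }}
)
where rn = 1;

-- pattern 2: QUALIFY style -- same result, no subquery or helper column
select *
from {{ source('hubspot', 'associations_deals_contacts') }}
qualify row_number() over (partition by id order by _sdc_extracted_at desc) = 1;
```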