# best-practices
s
Question for the community: I have duplicate records for a certain id caused by user changes (e.g. a deal changes its primary company, which creates a duplicate record since the primary key is set as the association id-toObjectId). We have an
_sdc_received_at
flag, so it is possible to establish which record is the "most recent". Now my question is: what would be better?
1. Delete the old, deprecated records in my raw data
2. Only query the new, updated records in my staging data
Any other suggestions would be appreciated
t
I would generally recommend never deleting raw data unless you have a good reason. I would have a “source” table that is a cleaned-up version of the raw table: it handles the basics of fixing data types, cleaning up column names, and doing minor transformations like deduping, hashing, and splitting if needed.
c
I agree with Taylor. Deleting data in a transactional system would be considered bad practice by most people. In my case, I do the same transformations as Taylor mentioned, except that I actually like to keep the deduplication step in an intermediary model in dbt just after the staging model. But that's just personal preference.
s
Thank you so much! I'll definitely try the source model approach
t
I do recommend following the data-source approach in dbt, where each folder corresponds to a data source (API, database, etc.). GitLab is a good example here: https://gitlab.com/gitlab-data/analytics/-/tree/master/transform/snowflake-dbt/models/sources
c
The guide for structuring your dbt project is also hugely valuable. Folder structure and naming conventions for your models help immensely when scaling the number of models and marts (and team members): https://docs.getdbt.com/guides/best-practices
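For context, the source() calls used later in this thread need a sources definition in the project. A minimal sketch of what that file could look like under the folder-per-data-source layout above (the file path and the schema name are assumptions on my part, not taken from the thread):

```yaml
# models/sources/hubspot/src_hubspot.yml -- assumed path, following the
# folder-per-data-source convention discussed above
version: 2

sources:
  - name: hubspot
    schema: stitch_hubspot  # assumption: the schema Stitch loads into
    tables:
      - name: associations_deals_contacts
        loaded_at_field: _sdc_extracted_at
```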
s
Update: I decided to follow @taylor’s advice regarding GitLab's implementation. Here is the format of my new query, directly in staging:
with source as (
    select
        *,
        row_number() over (partition by id order by _sdc_extracted_at desc) as row_number
    from {{ source('hubspot', 'associations_deals_contacts') }}
),

renamed as (
    select
        cast(id as string) as id,
        cast(toObjectId as string) as toObjectId,
        associationTypes.value.category as association_types_category,
        associationTypes.value.typeId as association_types_type_id,
        associationTypes.value.label as association_types_label,
        _sdc_extracted_at
    from source,
        unnest(associationTypes) as associationTypes
    where row_number = 1
)

select
    id,
    toObjectId,
    association_types_category,
    association_types_type_id,
    association_types_label,
    _sdc_extracted_at
from renamed
t
@Stéphane Burwash which data warehouse are you using?
s
BigQuery
With Stitch as a loader
t
ah - if you were using Snowflake I was going to recommend https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
s
That is a glorious, glorious query option
s
I'm not sure I understand the difference with a simple where clause though
Ah! You can write it directly as the query condition
t
it’s just cleaner. I think you could do
with source as (
    select *
    from {{ source('hubspot', 'associations_deals_contacts') }}
    qualify row_number() over (partition by id order by _sdc_extracted_at desc) = 1
),
as your first CTE and drop the where clause in the second one
s
That's an awesome idea, thanks!
Update update: https://github.com/dbt-labs/dbt-utils#deduplicate-source dbt already offers an optimized deduplication macro, which is pretty awesome!
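For anyone finding this later, a minimal sketch of how the macro could be dropped into the staging model above (partition and order columns are borrowed from the earlier query in this thread; treat this as an untested sketch, not a verified implementation):

```sql
-- hedged sketch: dedupe the raw association records with dbt_utils,
-- keeping the most recently extracted row per id
with source as (

    {{ dbt_utils.deduplicate(
        relation=source('hubspot', 'associations_deals_contacts'),
        partition_by='id',
        order_by='_sdc_extracted_at desc'
    ) }}

)

select * from source
```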
t
I’d be curious if that actually is more performant or if qualify would help
I see they use qualify for snowflake
c
Was just going to suggest
dbt_utils.deduplicate
😁
a
@taylor I tested this a while back and found
qualify
is better. This code in dbt-utils comes from before they introduced the keyword, I think. Before "qualify" was introduced, you would typically put the
row_number() over ... as rn
in a subquery, and then the outer query would `select * except(rn) where rn = 1`; in that case, what they did in dbt-utils may have been more performant. But I don't think it is in today's BQ. It's anecdotal on my end and not properly benchmarked, so try both for yourself 😄 and let us know
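To make the comparison concrete, a sketch of the two BigQuery patterns being discussed (table and column names are borrowed from earlier in the thread; not benchmarked here either):

```sql
-- pattern 1: pre-QUALIFY style -- compute row_number() in a subquery,
-- then filter and drop the helper column in the outer query
select * except (rn)
from (
    select
        *,
        row_number() over (partition by id order by _sdc_extracted_at desc) as rn
    from {{ source('hubspot', 'associations_deals_contacts') }}
)
where rn = 1;

-- pattern 2: QUALIFY style -- same result, no subquery or helper column
select *
from {{ source('hubspot', 'associations_deals_contacts') }}
qualify row_number() over (partition by id order by _sdc_extracted_at desc) = 1;
```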