Anyone have some good resources on how to approach...
# best-practices
j
Anyone have some good resources on how to approach identity resolution in a data warehouse? Specifically if you want to make your warehouse a CDP. We’ve got a lot of separate customer records from different taps, and we want to be able to write rules to link and merge these into a global customer id. We’ll probably start with simple rules like email matching, but I want to have a plan for where more complex rules will live and grow as we improve this process. I’ve read a few good articles on the topic, but none of them exactly propose a schema. The last one is close, but is much more oriented to analytics data than relational data, and I’m having trouble picturing how to adjust the structure. For example, there’s a lot of discussion around linking an anonymous id to a user id retroactively after login. But our problem is slightly different; we have a Shopify user id and an application database user id and a Mixpanel distinct ID, etc. some of which are duplicated (there might be two Shopify customer rows from different orders who are actually the same user).
Perhaps trying to build CDP functionality around relational data is the first problem? Like perhaps we should transform e.g.
shopify_orders
into
events where event_name = 'Order placed'
?
s
This might not be helpful at all, but what I've in general experienced to being helpful when thinking about related problems is the following idea: Keep the mindset of data being "the truth", and your task is to create a new computed entity. You're not trying to match & link things that are "duplicates". Rather you're trying to provide a new computed entity that unifies several true pieces of data. So it would be totally fine to simply build a new model called "users_using_ER_algo1" and then another one for a different algo,... . It's just a mindset shift, but it helped me realize what I'm really doing (not trying to find some "truth" but rather to create something new that is a good abstraction of something else). Hope that makes any sense.
j
That mindset shift is helpful! It’s easy to think that way with simple transforms, but I think you’re right that we shouldn’t lose that mindset just because the transforms are more complex
It makes me think of like formal logic/proofs. At each step, you’re transforming one truth into another which is more convenient 🤯
s
To give an actual example, I worked in a company with multiple "customer" concepts, 1. like the "customer" entity the registration component produced 2. the CRM Systems' concept of a "customer" 3. A third systems concept of a customer. There were 3-4 heuristics to identify a true human behind all of these, and through these heuristics, these could be linked up like 50% of the time. So instead of acting like we had one unified "customer" dataset inside the data warehouse, we created four customers: 1. customers_by_registration 2. customers_by_crm 3. customers_by_3rd 4. (!) analytics_customers Number 4 used the heuristics and had a few extra things like a unique identifier created by us, explanations and transparancy around how duplicates where matched up. So users had the choice of which kind of customer they wanted to work with.
v
Awesome stuff here, what I've found in my experience has been that I tend to want to go fix the source systems and "merge" the records together as it makes life a lot easier for everyone. So we tended to make duplicate detection systems and then have folks go merge the duplicates in the source systems as we controlled them. 🤷 Doesn't' work in all cases but if you can get away with that it's probably easier
c
"fix the source, silly" ... F.T.S.S. ?? not quite as catchy as K.I.S.S
v
The real tragedy imo is that there isn't a lot of sharing around duplicate detection that I've found but it is highly context dependent. I've found SOUNDEX is chef kiss for name duplicate detection (and it's including with mssql) 🤷
j
Unfortunately I think Sven is right here; it’s not that the source data is wrong; it’s truthful that there were two orders with different email addresses. The de-duplication is purely a heuristic optimization we apply using additional context. So even if it were a system we control (which many like Shopify we don’t), it wouldn’t make sense to fix at the source
v
Yeah whatever works for you! I personally think that the order in your source system should be related to a single customer and dedupped there but that doesn't make sense in all cases. No right/wrong imo 🤷
s
but what might still make sense, and it might be what @visch is referring to is: You can still ask your stakeholders to add identifiers to the source systems so your calculated model becomes better (In my above case the registration component carries a "CID" and people started to put in the "CID" as reference into the CRM system to make our model better)
v
Yeah I tend to think working holistically with the other business departments to get them on board (as they probably have issues not knowing clients are the same people either) is better than just fixing it in your analytics model. It's harder work because you have to herd cats but for the biz it's almost always better for everyone. depends of course on what you're after and why 🤷 I've seen folks do both but I tend to recommend talking to people as it tends to drive value immediately for the source system people
h
I used to work at an offiical statistics agency (the ones doing the census, consumer price index etc), and there was a whole literature on this. Entity resolution, entity matching, fuzzy matching, etc. It goes by many names. A quick google scholar search: https://scholar.google.com/scholar?hl=no&as_sdt=0%2C5&q=entity+matching+official+statistics&oq=entity+matching+official+s But maybe a scholarly deep-dive isn’t really on the table, so the one short answer I have, is that in addition to soundex there are some other neat algorithms like edit distance and levenshtein distance that can find close names (especially good for when there’s just an initial missing or a spelling mistake). I know some databases have these functions built in, and if not you can find python libraries.
j
Thanks! I’ll check out these materials. In general, my question is less about the rules themselves (I’m familiar with levenshtein distance and the like; to start our rules are going to be really simple like case-insensitive string matches) and more about table schemas. In case anyone is curious or stumbles across this question in the future, we’re going to start with a solution like this: 1. Make a table of known identifier pairs by union-ing all sources that have customers, transforming common identifiers like email addresses and phone numbers with a shared sanitize/normalize function. (I’m showing source names and values as tuples but would be separate columns)
Copy code
| (application_user_id, 5)     | (email, <mailto:julian@example.com|julian@example.com>)     |
| (application_user_id, 5)     | (phone, +14045551234)           |
| (shopify_customer_id, 12345) | (email, <mailto:julian@example.com|julian@example.com>)     |
| (shopify_customer_id, 67890) | (email, <mailto:julian@example.com|julian@example.com>)     |
| (mixpanel_distinct_id, ABCD) | (email, <mailto:julian.alt@example.com|julian.alt@example.com>) |
| (mixpanel_distinct_id, ABCD) | (application_user_id, 5)        |
| (application_user_id, 6)     | (email, <mailto:othercustomer@foo.com|othercustomer@foo.com>)  |
2. Attach to these a generated
global_customer_id
(ideally deterministically so that they are more stable), assigning the same id to all rows that have values in common
Copy code
| (application_user_id, 5)     | (email, <mailto:julian@example.com|julian@example.com>)     | 1 |
| (application_user_id, 5)     | (phone, +14045551234)           | 1 |
| (shopify_customer_id, 12345) | (email, <mailto:julian@example.com|julian@example.com>)     | 1 |
| (shopify_customer_id, 67890) | (email, <mailto:julian@example.com|julian@example.com>)     | 1 |
| (mixpanel_distinct_id, ABCD) | (email, <mailto:julian.alt@example.com|julian.alt@example.com>) | 1 |
| (mixpanel_distinct_id, ABCD) | (application_user_id, 5)        | 1 |
| (application_user_id, 6)     | (email, <mailto:othercustomer@foo.com|othercustomer@foo.com>)  | 2 |
3. Attach the
global_customer_id
to any source tables desired by joining on their identifier
Copy code
SELECT shopify_customers.*, global_customers.global_customer_id
FROM shopify_customers
LEFT JOIN global_customers on global_customers.source = 'shopify' and global_customers.value = shopify_customers.shopify_customer_id
Step 2 is a bit hand-wavy currently; I think the SQL there is more complicated than it sounds. But we’ll see how it goes