So we have a list of users from our shareholder po...
# getting-started
f
So we have a list of users from our shareholder portal, as well as user data from our online store and we want to build a small internal tool for CRM and data analytics. Right now i am syncing and cleaning up the data from our online store DB (mysql) as well as our Shareholder portal DB (Postgres) into a new postgres db with meltano, that works great. Now i wanted to know if it is possible (and the right tool for the job) to do data merging and deduplication. For example, in the online store users can order with their account, and also as guests with just their email -- but often the same people sometimes order with their account and sometimes as a guest (i know thats annyoing but well it it s how it is). So I want to merge users with the same email, decide which data to use (for example addresses, preferably use the address from the account/guest with that email that ordered most recently) and sum up their total count and invoice amount of their orders. Then i want to merge this again with our shareholder database, to result in a single user table that contains all the information. This then needs to run periodically, say once an hour to update data. Is meltano the right tool for the job here?
v
The tool most people recommend for "transformations" is dbt. https://www.getdbt.com/ Now what you're asking for is more like "entity resolution" so I don't have an exact answer for you but if it's sql queries you want to run dbt is the way. Meltano can run dbt
https://docs.meltano.com/getting-started/meltano-at-a-glance#extracting--loading-data walks through it. EL is the piece you're doing right now with meltano. T is the piece you're asking about. Most folks go use dbt, I use dbt and sometimes python (and pandas) to pull and push data back into my database when sql isn't the tool I want to use for the job
f
thank you! That helps me a lot, especially "entity resolution" makes it easier for me to continue my research (no i know what to google 😉 )
v
I had no idea what to google for that either, I still even after googling just use sql queries and a little python to keep it simple for my use cases 🙂
s
Btw. since I#ve done something like this in the past, I've seen three different approaches to this you might choose: 1. Use SQL (= dbt) and do some hand coding and matching (you can get quite far with very little SQL) 2. Use custom Python and build a small "model" (could be based on ML, doesn't have to be), and push that data into your data warehouse (as a matching table) 3. Use custom solutions like Zigg/Zing AI (open source) to do larger scale entity resolution. I'd order 1->3 in order of difficulty and would usually start at the top, if the stuff is useful to people, go down the chain.
f
thanks! Yeah i think i'll go with dbt at first and see how far i can get. I've got some ML experience so i will have a look at tools like dedupe or zingg later 🙂