I am trying to dedupe a contact database. There are many fields, name, phone, email, address. Many of the records are duplicates … which to keep? How can I compare the rows to see which are exact matches, and which have differences?
this is a recurring issue that - sooner or later - tends to happen in data models that grow without specific blueprints.
Let’s see if this helps:
- Determine what it means duplicated i.e. what are the set of rules to assess if B is copy or a variation of A. Usually there are a set of fields that act as a unique key e.g. first name, last name (and/or username, email…)
- Implement a way to generate that unique key: a concatenation of those fields into a new column, for instance
- Highlight duplicates.
- Remove/merge them: if you already have some rules, this can even de done automatically. Otherwise, a manual check on duplicated rows is necessary.
I’d assume you can’t share your table (but if you can, it would be useful).
So I provided a very basic sample:
Let me know if its a good starting point.
@Federico_Stefanato yes this is a great starting point. The key model seems to solve it and allows me to choose the key fields to compare.
regexreplace() to remove spaces //
trim() would work?
Trim() just stripes off spaces at the beginning and at the end of a string. Usually (in these cases) it’s better to “compress” all the spaces so that the char sequence is easier to match