How to compare two rows to dedupe?

I am trying to dedupe a contact database. There are many fields, name, phone, email, address. Many of the records are duplicates … which to keep? How can I compare the rows to see which are exact matches, and which have differences?

Hi @Johg_Ananda,
this is a recurring issue that - sooner or later - tends to happen in data models that grow without specific blueprints.

Let’s see if this helps:

  1. Determine what it means duplicated i.e. what are the set of rules to assess if B is copy or a variation of A. Usually there are a set of fields that act as a unique key e.g. first name, last name (and/or username, email…)
  2. Implement a way to generate that unique key: a concatenation of those fields into a new column, for instance
  3. Highlight duplicates.
  4. Remove/merge them: if you already have some rules, this can even de done automatically. Otherwise, a manual check on duplicated rows is necessary.

I’d assume you can’t share your table (but if you can, it would be useful).
So I provided a very basic sample:

Let me know if its a good starting point.

2 Likes

@Federico_Stefanato yes this is a great starting point. The key model seems to solve it and allows me to choose the key fields to compare.

Is the regexreplace() to remove spaces // trim() would work?

Actually, Trim() just stripes off spaces at the beginning and at the end of a string. Usually (in these cases) it’s better to “compress” all the spaces so that the char sequence is easier to match