Highlight duplicates in a column using a conditional format

Hi Tom, sorry I was traveling for a bit. Did you figure this one out?

sure didn’t! have been working on other things! If you have any insight that would be great though! hope you had an awesome trip :slight_smile:

The formula above should work for you - give you the number of times a a name has been listed and you can filter to remove the ones that are less than 1. Or use the filter Evan used.
However, I am happy to take a look at your particular document - share it with support@coda.io or mallika@coda.io and I can take a look.

(I did have a wonderful time traveling and am back ready to look at docs and more docs! )

Maybe because of some changes in Coda, I think there a couple things to update about this:

  • missing a close bracket here after “thisrow.Column)”
  • when trying to use this formula in my doc, it keeps autocorrecting out the “thisrow.”, and works without it

The new formula should read:
Table.Filter(Column.Contains(Column)).Count()>1

@mallika would mind updating the original post for others’ later reference?

2 Likes

Thank you that was great help! It works for me but with a quirk.

I’d like to add some more detail for others that might come across this situation.
Seems like Table.Column and thisRow.Column are shown in the same way currently in Coda. But you have to make sure that you choose the correct version.

thisRow.Column for the first instance
image

And Table.Column for the second instance
image

1 Like

How would I tweak this to only highlight duplicates within the same group?

1 Like

Only thing that worked for me was to use @evan’s filter in the conditional formatting.

@Guttorm_Flatabo Welcome!
I am glad you got the filter working. Do you want to elaborate on what did not work for you?

Does anyone know how to edit the formula so that it doesn’t recognize blank cells as “duplicates”?

Hi, @Brittany_Dufresne!

Add AND thisRow.Column.IsNotBlank() inside the filter.

This formula checks if value of the column “Column” for the currently evaluated row is not blank (that it has any value).

Full formula:

Table.Filter(Column.Contains(thisRow.Column) AND thisRow.Column.IsNotBlank()).Count() > 1

5 Likes

This is awesome, thank you!

Hello,

This formula causes big performance problems on a 825 rows table (5 minutes 37 seconds loading time). Is there a more optimal way to proceed?

I tried moving the formula to a checkbox column, then applying conditional formatting based on that column, since I understood that it was better to put formulas in columns to optimize performance. But it doesn’t seem to change anything.

Thanks in advance!

1 Like

Hi @Mata-i ,

Since performance matters to you, I suggest using an alternative approach to finding duplicates. The ones involving Filter and Count/CountIf will give you the correct answer, but they have to loop through the whole data set fully for each item in the data set (in computational complexity terms, this is an O(n^2) algorithm because it touches each item n^2 times. This is VERY slow!

Instead, first sort the data. Then, compare each value (besides the last) to the one after. If equal, it’s a duplicate. If not, it’s not. This results in an O(n*log(n)) algorithm (the slowest part is the sorting), which is much faster than an O(n^2) algorithm.

For comparison, on a ~20k record data set, the first approach (using CountIf and Filter) takes about 40 minutes. The second approach takes less than one second. Both are equally correct.

Here is a sample canvas formula (it ignores duplicate blanks, but that would be easy to change):

WithName([Table With Duplicates].FieldToMatch.Sort(), SortedValues,
  Sequence(1,SortedValues.Count()-1).FormulaMap(
    WithName(SortedValues.Nth(CurrentValue), curVal,
    WithName(SortedValues.Nth(CurrentValue+1), nextVal,
    If(curVal!=nextVal,"",curVal)))))
.Unique().Filter(CurrentValue.IsNotBlank())

If you require an “Is Duplicate?” column or a conditional formula, you can name the canvas formula above “Duplicates” and then use a formula like: thisRow.FieldToMatch.In(Duplicates), which runs in O(n) time (i.e., very quickly).

If you want to find duplicate table rows based on some of the values of multiple columns, rather than duplicates within a single column, you can also use this more complex version of the formula:

WithName(
  [Table With Duplicates]
    .Sort(True,[Table With Duplicates].Field1)
    .Sort(True,[Table With Duplicates].Field2)
    .Sort(True,[Table With Duplicates].Field3),
  SortedValues,
  Sequence(1,SortedValues.Count()-1).FormulaMap(
    WithName(SortedValues.Nth(CurrentValue), curVal,
    WithName(SortedValues.Nth(CurrentValue+1), nextVal,
    If(Or(
      curVal.Field1 != nextVal.Field1,
      curVal.Field2 != nextVal.Field2,
      curVal.Field3 != nextVal.Field3),
      "", List(curVal,nextVal))))))
.ListCombine().Unique().Filter(CurrentValue.IsNotBlank())

and thisRow.In(Duplicates) for the column / conditional format formula.

9 Likes

Dear @Matthew_Strax-Haber, thanks a lot sharing this very valuable perspective :trophy:
I like also your very practical/clear naming in “WithName()”, so it’s very easy to understand the code.

Always eager to learn :student:

I’m also going to link my solution here in case someone finds this thread :slight_smile:

It doesn’t highlight duplicates with conditional formatting or any other live recalculation but does so in manually run passes. And provides some gorgeous UI for resolving those duplicates:

1 Like

Thank you a lot for this solution, which is indeed much faster on my doc!

Can there be an updated post regarding this? I am using the formula and it works, but for my case I need to filter TWO columns not one.
Eg I have ‘first name’ and ‘last name’ columns, I want to filter out duplicates of people’s registration. I cannot do only one column as many people share the same firstname/last name and they should NOT be considered a duplicate but with this formula does. Any quick solutions for this with the formula without creating a separate column?

Hello @Fawziyah_Nabeelah ,

That is actually not that complicated anymore. You don’t need a helper column (although you can), but you do need one helper field, usually hidden on a settings page. For clarity I put it on my sample page. The switch is optional, but I like it to find duplicate in long lists.

Copy my demo doc and explore the formulas:

1 Like

Can you please make this doc public? or allow me access? I’m unable to view it!
and thankyou!!

Yes, just changed the settings. In order to see the formulas you need to make your own copy after opening this doc.