Button to Delete Duplicates

I often find myself wanting a single button that will delete all rows with duplicate values.

For example, imagine I have this table:
(image: the example table)

I want to remove either “Ben H” or “Ben Howit” because they have the same email address. But I don’t want to remove both.

How can I do this in Coda? Well, I’m not the first to ask:

To varying degrees, these don’t work for me. One of my criteria is that I don’t want to add a column to my existing table, so no use of rowNumbers like in Paul’s answer.

Note: Why avoid table-based checks?

Searching for duplicates in a table is computationally intensive. It can really slow down a doc if you have tons of rows.

The main issue is that each row has to search all previous rows to see if it has a duplicate value. (In computer science speak we’d call this an O(n^2) algorithm, which basically means “slow”.)
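To put rough numbers on it: in a 1,000-row table, each row checking every previous row works out to 1,000 × 999 ÷ 2 ≈ 500,000 comparisons, and that count roughly quadruples every time the table doubles in size.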

There’s no way to avoid running the algorithm if you’re trying to get rid of duplicates, but the downside of the table approach is that each time you add a row, the table has to rerun the entire calculation.

That can really slow you down. If you put the check in a separate button instead, the calculation is only triggered when you hit the button. It still slows you down temporarily, sure, but only when the button is pressed, not all the time.

Thanks to WithName, it’s now possible to do all of this in one formula.

Formula

Here’s the formula. Just replace all instances of:

  1. Table with your table name (shows up in lines 2 and 3). In our case this is Emails
  2. ColumnWithDuplicateValues with the name of the column that contains duplicates you want to delete (shows up in lines 2, 3, 6, and 8). In our case this is Email
WithName(
  Table.Filter(WithName([ColumnWithDuplicateValues], Value,
    Table.CountIf([ColumnWithDuplicateValues] = Value) > 1
  )), AllDuplicates,
  WithName(
    AllDuplicates.[ColumnWithDuplicateValues].Unique().FormulaMap(
      WithName(CurrentValue, UniqueValue, 
        AllDuplicates.Filter(CurrentValue.[ColumnWithDuplicateValues]=UniqueValue).First()
      )
    ),
    FirstInstances,
    AllDuplicates.Filter(FirstInstances.Contains(CurrentValue).Not())
  )
).DeleteRows()
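
For concreteness, here’s the same formula with those swaps already made for our example (the Emails table and its Email column):

WithName(
  Emails.Filter(WithName(Email, Value,
    Emails.CountIf(Email = Value) > 1
  )), AllDuplicates,
  WithName(
    AllDuplicates.Email.Unique().FormulaMap(
      WithName(CurrentValue, UniqueValue, 
        AllDuplicates.Filter(CurrentValue.Email = UniqueValue).First()
      )
    ),
    FirstInstances,
    AllDuplicates.Filter(FirstInstances.Contains(CurrentValue).Not())
  )
).DeleteRows()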

If you just want to list all the duplicate rows after the first occurring duplicate row, remove the .DeleteRows() from the end.

Here’s what that output looks like:

It’s pulling out the second instance of benh@example.com and the second and third instances of test@test.com.

By putting this formula in a button, and leaving the .DeleteRows() at the end, we would delete these three rows.

How Does It Work?

Glad you asked! I’ll put this in a hidden section, since many people might be satisfied with having the formula.

AllDuplicates

First we want to find all the duplicate rows in our table. This formula does that.

WithName(
  Table.Filter(WithName([ColumnWithDuplicateValues], Value,
    Table.CountIf([ColumnWithDuplicateValues] = Value) > 1
  )), AllDuplicates
...

Here’s what AllDuplicates contains (every row whose value appears more than once, first instances included):

This is a nice formula to have in your back pocket anyway.
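
If you ever want just that list on its own, drop the outer WithName and use it as a standalone formula:

Table.Filter(WithName([ColumnWithDuplicateValues], Value,
  Table.CountIf([ColumnWithDuplicateValues] = Value) > 1
))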

FirstInstances

Next we need to identify all the places where a value shows up for the first time. Remember, we don’t want to delete the first instances, just all the rest!

...
  WithName(
    AllDuplicates.[ColumnWithDuplicateValues].Unique().FormulaMap(
      WithName(CurrentValue, UniqueValue, 
        AllDuplicates.Filter(CurrentValue.[ColumnWithDuplicateValues]=UniqueValue).First()
      )
    ),
    FirstInstances,
...

Here’s what FirstInstances contains (just the first row for each duplicated value):

Only the Duplicates

Finally, we want to remove those first instances (the ones in FirstInstances) from the list of duplicates we found (the ones in AllDuplicates). Here’s how we do that:

...
  AllDuplicates.Filter(FirstInstances.Contains(CurrentValue).Not())
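
This Contains().Not() pattern is just list subtraction. A toy version of the same pattern, with made-up values:

List(2, 4, 5, 6).Filter(List(2, 5).Contains(CurrentValue).Not())

returns 4 and 6: everything in the first list that isn’t in the second. Here, AllDuplicates plays the role of the first list and FirstInstances the second.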

And that’s it! That’s how we get our beautiful result:
(image: the final result)

14 Likes

Wow, really thorough post! So great how you not only give a solution but also explain all the background.
Thank you

1 Like

Hi Connor,
I had a pretty big table and your formula just gave me a spinning wheel.
So I made another formula that worked for me. Here it is:

[Table].FormulaMap(
  WithName(CurrentValue, CurrentRow,
    WithName([Table].Filter([ColumnWithDuplicateValues] = CurrentRow.[ColumnWithDuplicateValues]), DuplicateRows,
      If(DuplicateRows.Count() > 1,
        Sequence(2, DuplicateRows.Count()).FormulaMap(
          DuplicateRows.Nth(CurrentValue).DeleteRows()
        ),
        _Noop()
      )
    )
  )
)
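
(For anyone copying this: for each row it gathers every row sharing that row’s value into DuplicateRows, and the Sequence(2, DuplicateRows.Count()) is what skips the first instance and deletes the rest.)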
2 Likes

This post popped up for me, and it’s fun because that’s exactly what I’m doing now: building a pluggable solution that finds duplicates and either tries to merge the data or offers a merging interface if there’s a conflict:

Here’s the demo (recorded earlier, I’m improving this template now)

And here’s how I was building it live:

The final template will be available to my Patreon subscribers who are active in November :slight_smile: so if you wish to get it, please subscribe by the end of this month. I may also sell it through the Doc Gallery eventually, but it will be $25-ish then.

3 Likes

P.S. @Connor_McCormick good call on the column formula performance. Frankly, I don’t remember why my answers in those posts you linked to went that way. Something about the context, I guess — maybe I was just trying to help those people fix whatever they already had.

You fell into this tunnel vision yourself, btw :wink: the assumption that it should be a button to delete duplicates. IMO it shouldn’t be just a button — it should be a process. Because in real scenarios there’s more to it than just deleting repeated entries:

  • How do we determine which ones to keep and which ones to delete? Does the earlier one win? Or the one edited most recently?
  • What if there’s valuable info in the duplicate we’re about to delete, e.g. fresher data or missing values from the original record?
  • What about row activity and comments? How would those affect which one we keep?
  • How do we deal with broken references? Some records in our DB could be linked to one copy and some to another.

But yes, having a column that actively recalculates an “is duplicate” flag on any change is indeed gonna get horrendously slow. Three better ways:

  1. Have a button on each row that will search for duplicates upon click only for that row
  2. Have an external button like @Connor_McCormick suggests.
  3. Implement an iterator table, like I usually do for this and like you see above.

The first solution is best suited to checking for duplicates as you’re entering data into an already big table. I did this for a client that had a table of 8,000+ contacts and only wanted to check whether any of the recently added batch overlapped with the existing ones. They would enter a name and an email and press a button inline in that table. It would show whether anything overlapped with the current row, and they could decide whether to keep filling out the row or discard it and move on to the next one.
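
As a minimal sketch of that inline check (assuming a hypothetical Contacts table with Name and Email columns; the exact client setup isn’t shown here), the overlap lookup behind such a button column could be:

Contacts.Filter(
  CurrentValue != thisRow
  AND (Name = thisRow.Name OR Email = thisRow.Email)
)

If that returns anything, the row being typed overlaps an existing contact.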

The second solution is the fastest because it runs the whole check in memory in a single formula. The big disadvantage, though, is that you cannot debug what it’s doing: if you see something calculated wrongly, you’ll have to guess-fix the formula and rerun the whole check.

I often advocate for the third solution because it’s the most flexible:

  • You can split your process down into separate clear steps (but you don’t have to)
  • If split into steps, you can manually test each step (e.g. unit-test how your button that merges two contacts would do so on two arbitrary contacts that you’ve picked, not run the whole process)
  • You can build extra indication, e.g. a progress bar to show you how dupe checking is going
  • It’s good practice to separate logic from data


3 Likes

Ah yeah, slicing the sequence to find the values to delete is a good idea @Tomislav_Mamic! Thank you for sharing.

I think I wanted to have a modular formula because sometimes I find myself wanting the Unique Values, sometimes I want the First Instances, and sometimes I want All Duplicates. But I wonder if I’d be better served by making separate formulas for each of those.
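
Sketching that modular split (names illustrative, using the same placeholders as above), the pieces could live as named canvas formulas that build on each other:

AllDuplicates:
Table.Filter(WithName([ColumnWithDuplicateValues], Value,
  Table.CountIf([ColumnWithDuplicateValues] = Value) > 1
))

UniqueDuplicateValues:
AllDuplicates.[ColumnWithDuplicateValues].Unique()

FirstInstances:
UniqueDuplicateValues.FormulaMap(WithName(CurrentValue, UniqueValue,
  AllDuplicates.Filter(CurrentValue.[ColumnWithDuplicateValues] = UniqueValue).First()
))

The delete button would then just be AllDuplicates.Filter(FirstInstances.Contains(CurrentValue).Not()).DeleteRows().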

How much slower do you think my formula is than yours @Tomislav_Mamic and @Paul_Danyliuk? What is the fastest way to find duplicates?

I haven’t looked closely at your formula because that’s not how I’d do it anyway :slight_smile:

I would never trust a button to automatically delete data without giving the user a chance to review what’s about to happen. Even in my upcoming dupe finder and merger, I’ll first write out the list of all duplicate pairs (original and dupe), then let the person review and swap A & B where applicable, then let them auto-resolve what’s possible with a single click (delete where no data loss would happen, merge wherever there’s no conflict, etc.), and only then let them manually merge conflicting duplicates.

The iterator table approach is the slowest because there are many ModifyRows calls in the process. But a) it gives you the progress bar, i.e. some visibility into how long it will take, and b) it allows you to test and work through things step by step. IMO a) is very important: it’s better if a process runs for 5 minutes but the user can see it’s not stuck and can go grab a coffee meanwhile, than if it runs for 2 minutes while the user sits clueless about whether they can touch anything else or whether they should refresh the tab…

CountIf() is a formula to avoid; the docs list it among the non-optimized ones.
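
If that’s a concern in the one-shot formula above, the CountIf() can be swapped for a plain Filter().Count(), which should give the same result (I haven’t benchmarked the difference):

Table.Filter(WithName([ColumnWithDuplicateValues], Value,
  Table.Filter([ColumnWithDuplicateValues] = Value).Count() > 1
))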

1 Like

Totally agree that it’s better to show people the items you’re about to delete.

Looking forward to seeing your doc if you choose to share it with the community!

1 Like

I’m putting insane effort into this template, particularly into making it easily pluggable into whatever doc. I know I said $25-ish earlier, but now I see it’s easily in the $100-ish ballpark. Everything is black magic, and there are only two buttons a user would have to adjust (actually, one of those is optional).

Once it’s plugged into whatever database, the template will self-adjust to merge columns from those tables. It also has an auto-merge capability that can combine two records quickly and automatically (picking non-blank values from either and combining multi-select lists), leaving conflicting records (where any column is non-blank and differs between the two) for manual review.

It’s still going to be this month’s Patron Special, so for the remaining week you can subscribe to snag it for $10.

2 Likes

Okay, for the record, I finally finished it. Please watch the demo :slight_smile:

2 Likes

What about sorting all the rows of the table by the column in memory, then doing a single pass over them to count consecutive duplicates? O(n^2) → O(n log n), which works out to roughly 99% fewer iterations at 500 rows and up.

1 Like

@louca.developer , how would you formulate your suggestion in the Coda Formula Language?

That’s an interesting idea to test out. I’m not 100% sure it would work faster, though. Coda creates indices for filters, essentially making them operate in O(1). I’m not sure that applies in this situation, but I’ve seen docs where I tried to optimize around filters only to realize the straightforward approach actually worked best.

UPD: Besides, this approach only works if you’re looking for duplicates by a single column (e.g. same name). If you want to look for duplicates more eagerly, e.g. same name OR same email OR same SSN, then it simply won’t work, because the potential matches won’t be sequential rows (you’d have everyone sorted by name, but candidates with the same email and different names would be far apart).

And in my experience it’s best to always look for potential duplicates by more criteria, then list them all and let the end user decide which ones to discard as false positives and which ones to act upon. Hence why I built my interface the way I did: I actually built it for a client and have already integrated it into their doc (and they’re loving it), so it’s a real-world scenario.

Oh, and one last sentiment. Yeah, in that real client’s doc my merger runs for ~2 minutes on a ~1,000-record database. Not ideal, but:

  1. There’s a progress bar indication that’s super helpful: it lets the person estimate the remaining time and go AFK to do something else in the meantime.
  2. In the grand scheme of things this is not a catastrophe. It’s an action that runs perhaps once a week, or once after a significant data dump. If speed ever became an issue, I’d rather optimize around not re-checking the rows we’ve already checked and merged and only running on the recently added ones (a rough sketch below) than try to find a better algo to run over the whole dataset over and over.
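
That rough sketch, assuming a hypothetical Contacts table with an Email column plus two made-up helper columns, Dupe Checked and Possible Dupe:

RunActions(
  Contacts.Filter([Dupe Checked].Not()).FormulaMap(
    WithName(CurrentValue, NewRow,
      If(Contacts.Filter(Email = NewRow.Email).Count() > 1,
        NewRow.ModifyRows(Contacts.[Possible Dupe], true),
        _Noop()
      )
    )
  ),
  Contacts.Filter([Dupe Checked].Not()).ModifyRows(Contacts.[Dupe Checked], true)
)

Only the unchecked rows get scanned on each run; everything already processed is skipped next time.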
1 Like

Hi @Paul_Danyliuk @Tomislav_Mamic,

First of all, thank you for all of your posts! I’ve learned a lot about Coda’s hidden formulas and tricks by reading them.

While looking for an answer to something else, I came across this thread. I had previously solved this problem in a way similar to what @louca.developer described. It works for both single-column and multi-column comparisons. If you have multiple columns, all you need to do is sort by each of them before running the Sequence/Nth-based iterator. This still runs in O(n*log(n)), vs. O(n^2) for the Filter and Count/CountIf approach.

For comparison, on a ~20k record data set, the O(n*log(n)) Sequence/Nth-based approach takes less than one second, versus ~40 minutes for the O(n^2) Filter/Count/CountIf-based approach. Both yield the same correct result.

Here is a set of sample formulas:

Canvas formula (called “Duplicates”) – runs in O(n*log(n)) time:

WithName(
  [Table With Duplicates]
    .Sort(True,[Table With Duplicates].Field1)
    .Sort(True,[Table With Duplicates].Field2)
    .Sort(True,[Table With Duplicates].Field3),
  SortedValues,
  Sequence(1,SortedValues.Count()-1).FormulaMap(
    WithName(SortedValues.Nth(CurrentValue), curVal,
    WithName(SortedValues.Nth(CurrentValue+1), nextVal,
    If(Or(
      curVal.Field1 != nextVal.Field1,
      curVal.Field2 != nextVal.Field2,
      curVal.Field3 != nextVal.Field3),
      "", List(curVal,nextVal))))))
.ListCombine().Unique().Filter(CurrentValue.IsNotBlank())

Column formula called “Is duplicate?” (can also be used for conditional formatting):
thisRow.In(Duplicates)

Summary of the approach:

  1. Sort the data by the values you want to use for comparison – O(n*log(n))
  2. Compare each value (besides the last) to the one after. – O(n)
  3. If equal, it’s a duplicate. If not, it’s not. – O(1) per comparison

Simpler version if you only want to find duplicate values within a single column:

WithName([Table With Duplicates].FieldToMatch.Sort(), SortedValues,
  Sequence(1,SortedValues.Count()-1).FormulaMap(
    WithName(SortedValues.Nth(CurrentValue), curVal,
    WithName(SortedValues.Nth(CurrentValue+1), nextVal,
    If(curVal!=nextVal,"",curVal)))))
.Unique().Filter(CurrentValue.IsNotBlank())

plus the corresponding column / conditional format formula:
thisRow.FieldToMatch.In(Duplicates)

Hopefully that helps someone save time both creating formulas and running them :slight_smile:

8 Likes

Resolved :smile:

Hey Connor,

I need your help. I am brand new to Coda, with no Coda or Excel formula background whatsoever, learning on the go. I tried your formula, but I’m having an issue. I ran your formula from a button on a table with multiple duplicates. However, each time it only deletes anything past the first two dupes and retains those first two, so for each duplicated value it always keeps at least one duplicate. Let me know if you can help.

Before action:

After action:

Okay, there were some spaces at the end of the part numbers. Once those were cleared, the duplicates were gone. All set! Thanks, team.