Button to Delete Duplicates

That’s an interesting idea to test out. I’m not 100% sure it would work faster though. Coda creates indices for filters, essentially making them operate in O(1). Not sure if this applies in this situation though, but I’ve seen this in some docs where I tried to optimize around filters only to realize the straightforward approach actually worked the best.

UPD. Besides this approach only works if you’re looking for duplicates by a single column (e.g. same name). If you want to extend it to look for duplicates more eagerly, e.g. same name OR same email OR same SSN or something, then of course this simply won’t work because the potential matches won’t be sequential rows (e.g. you’d have everyone sorted by name, but candidates with the same email but different names would be far apart.)

And in my experience it’s best to always look for potential duplicates by more criteria, then list them all and let the end user decide which ones to discard as false positives and which one to act upon. Hence why I did my interface just as I did it — I actually built it for a client and already integrated it in their doc (and they’re loving it), so it’s the real world scenario.

Oh, and last sentiment. Yeah, in that real client’s doc my merger runs for ~2 minutes on a ~1000 record database. Yeah, not ideal, but:

  1. there’s a progress bar indication that’s super helpful: it lets the person estimate remaining time and allow to go afk do something else in the meantime.
  2. in the large scheme of things this is not a catastrophe. It’s an action that is run perhaps once in a week or once after a significant data dump. If speed ever became an issue, I would rather optimize around the idea of not running checks on the rows we’ve already checked and merged but only on the recently added ones instead of trying to find a better algo to run over the whole dataset over and over.
1 Like