Dedupe an array of arrays based on a common key/object with formulamap()

I have a cell with several arrays in it, each holding a vendor/product/price/email

[acme, 123 main street, car, 50, contact@acme.co]
[acme, 123 main street, truck,100, contact@acme.co]
[tesla, 456 north street, car, 200, contact@tesla.co]

I want to use the array in formulamap, but filtered to dedupe based on email, so my result would be one row for acme and one row for tesla. I don’t care which record leaves since I will have my email at the end. Thoughts?

1 Like

@Johg_Ananda

What should the final result look like after FormulaMap() has run?

@Ander Ultimately what I want is:

[acme, 123 main street, contact@acme.co]
[tesla, 456 north street, contact@tesla.co]

Such that we are de-duping based on email, so it somehow filters one of the ‘duplicate’ email/rows out and I have a resulting array that has the unique email addresses but ALSO has other information, in this case the physical address is retained. The car details are specific to the rows and they are randomly omitted - which is ok in this case because I want to addresses and emails paired. Make sense?

Is that an array of arrays in one cell?

Or two cells, each holding one array?

In both examples I provided it is one cell holding the arrays (array of arrays) such that it can be used in formulamap().

And do you want to run formulamap() over two items (each of the two arrays in the array)?
***[acme, 123 main street, contact@acme.co],
***[tesla, 456 north street, contact@tesla.co]

Or do you want to run it over 6 items (three items per array x 2 arrays)?
[***acme, ***123 main street, ***contact@acme.co],
[***tesla, ***456 north street, ***contact@tesla.co]

@Johg_Ananda

Without understanding your workflow, here is one sequence of steps that can build up to something like what you’re describing (I think). Do any of these components move you forward?

hey @Ander thanks for putting that together, but I don’t think its addressing it. Take a look at this example I made and let me know if this explains it better:

1 Like

@Johg_Ananda

Added your example to the doc. What do you think about the logic that builds to the blue columns on Page 2? Not the presentation, just the logic. Any closer?

@Ander thank you very much for moving through this with me! :heart: Your document made me rethink my example and how I am presenting it to you and as such I have recomposed it. Please take a look at its current form and see if you can follow where I want to go. Thanks!! :pray:

1 Like

@Johg_Ananda

I suspect your actual schema could be optimized somewhere. That said, see how Page 3 grabs you.

@Ander is right and the only right way to address this is to restructure your data. Seems like you’ve both got this so I didn’t look too much into the solution.

But given my love for esoteric challenges, I couldn’t just walk past and not try to tackle the original problem :slight_smile:

Here’s how you can dedupe a list of lists with a live formula. Warning: uses a list of lists of lists (as a workaround for CurrentValue scope problems) :slight_smile:

1 Like

THIS DID IT!! :heart: :partying_face:

OK so the main innovation was to add the [inventory] column to dealerships, which you kind of slipped in there, but I found it. Being able to use that to provide the list of just the dealers allowed me to then combine that array with the row data to great effect! Many thanks for helping me think through it and get to the solution. My app is now VERY powerful :smiley: :smiley: :smiley:

1 Like

@Paul_Danyliuk I was wondering if this problem would attract your genius. Your solution is … amazing. Would you consider doing an audio commentary / screen recording or something moving through the steps? You lost me at the remainders … its very hard to intuit what you’re doing, especially since i can only access the formulas from this small embed (if I open in coda) the doc is view only and I can’t see formulas. I’d love to understand your process here. Thanks!! :pray:

I’ll sure post the explanation tomorrow. Meanwhile you can add ?viewMode=embedplay to the URL to be able to play and examine the doc in the new tab

1 Like

Alright, here’s an explanation (also noticed a problem with my solution while writing this post, lol, so fixed and also simplified).

Normally the solution would look like this:

  1. Get the list of unique emails as

    [List of lists].FormulaMap(
      CurrentValue.Last()
    )
    

    given that emails are last in each list.

  2. Run a simple formula map:

    [Unique emails].FormulaMap(
      [List of lists].Filter(CurrentValue.Last() = uhm… CurrentValue).First()
    )
    

    i.e. for each email (CurrentValue) filter the list of lists down to only those entries that correspond to this email and out of those matched rows select the first one.

But I guess you can see the problem already. Since the CurrentValue variable is overwritten within Filter() with a record from the list of lists, there’s no way to access the outer CurrentValue that would mean the current email to find records for.

So the trick here is to make a full product of M emails and N records and then filter/slice based on that. This way we can iterate over all combinations of inner and outer CurrentValue (because our CurrentValue will essentially be a List(OuterCurrentValue, InnerCurrentValue). And because of the cyclic nature of this new list (items 1…N correspond to 1st email, items N+1…2N correspond to 2nd email etc) it’s easy to extract and only use the portion of the list we’re interested in within this iteration.

So here’s what I’m doing:

  1. Inner and outer count. This is straightforward:
    image

  2. I’m composing a full product of two lists, i.e. all combinations of items from the right and items from the left here:


    The math is like this:

    • For M x N times (starting at zero for convenience)
    • Make a list out of two elements
    • Where the first element will be taken from the list of emails, but for the first N rows it will be the first email, for the second N rows the 2nd email and so on. That’s the RoundDown(CurrentValue / [Inner count]) plus 1 to get index back to 1-based.
    • And the 2nd parameter is a corresponding item from the list of records, repeated with a cycle of N. That’s the Remainder() formula.
  3. In the same cell (because why not) I’m replacing items where the join fails (i.e. email CurrentValue.First() doesn’t match the one in the record CurrentValue.Last().Last()) with blanks, and unwrapping the rest:


    Now we got our records aligned in an MxN matrix in such way that we know:

    • the first N items (records or blanks) correspond to the first email
    • the second N items correspond to the second email, and so on.
  4. So all that’s left is to select one record from each sequence of N items:

    • For each of M emails (and we don’t care about the emails anymore — we just know to take the next N items for M times) slice the corresponding sublist:


      i.e., items 1–3 for the first email, items 2–6 for the 2nd email etc

    • filter out the blanks

    • and finally take the First() one of those remained (could’ve taken Last(), whatever).

Hope this explains the trick. There’s not much genius in this — I think any person with some CS and competitive programming background would’ve considered this approach.

3 Likes