Extract data from complex JSON with regular expressions

This is more of a lesson in efficiency. Inspired by this post:

Task: there’s a JSON with extremely nested structure. Need to extract all values for the key "text" anywhere in the JSON tree.

The first instinct would be to parse the JSON into objects and try to traverse the tree somehow. However, in this very scenario, "text" entries don’t all appear at the same level of nesting. So traversing and collecting them would require composing some complicated logic in Coda.

Efficient solution: treat JSON as plain text and simply extract pieces of data with regular expressions:

MyJSON.RegexExtract('(?<="text"\ *:\ *")(?:\\"|[^"])*', "g")

Explaining regular expressions is well beyond the scope of this topic. But knowing them is a very useful skill. There are lots of tutorials on regular expressions, and websites like regex101.com that let you compose and test regular expressions online with explanations.

What this regex above means is:

  1. Read the text until you find a spot just after "text" : " (spaces are optional)
  2. Now read the text until you find the closing quote " (and don’t stop at escaped quotes \").
    This will be the string we want to extract (i.e. the value after the "text" key)
  3. Repeat this for the rest of the text (as instructed by the ‘Global’ g flag)

RegexExtract is an experimental hidden formula. It is handy because it returns a List of strings, which you can then process, e.g. unescape newlines and special chars, and join all strings together with newlines:

thisRow.[My JSON].RegexExtract('(?<="text"\ *:\ *")(?:\\"|[^"])*', "g").FormulaMap(
  CurrentValue.RegexReplace("\\n", Character(10)).RegexReplace("\\(.)", '$1')
).Join(Character(10))

It is possible to achieve the same logic with RegexReplace to avoid using an experimental formula. But that would be a bit trickier and less of a presentation in how to use outside-the-box approaches to solve one’s problems more efficiently.

Example (with a very basic JSON really — the above would work with a much more nested tree too):

3 Likes

Paul,

A very clever use of internal Coda capabilities. As I read this, it occurred to me that you can learn so much from this one hack. It reminds us all that with RegEx, we can transform almost any cul-de-sac into a thru-street. This is a nicely crafted approach that has deep educational value.

However, another way to describe this is potentially a Brittle Solution. :slight_smile:

One of the most important aspects of JSON is the ability for tools and services to make changes without breaking pre-existing dependencies. I think it’s fine to create hacks where there are no other options. But ideally, a hack like this may encourage the Coda developers to consider adding a JSON parser along with the ability to retrieve (or POST) JSON data effortlessly to and from other services.

.

.

1 Like

Thanks!

Not sure what’s so brittle about this though. It will keep on working even if the tree structure changes (e.g. it becomes even more nested), whereas e.g. path-based queries would fail. So this solution may in fact be a Sturdy one :slight_smile:

There is a JSON parser in Coda: ParseJson(json, path)

But again, it either parses the whole structure into objects and lists, or a predefined path (or a list of paths IIRC). It would be awesome if the path parameter supported queries like JSONPath — but it feels like it doesn’t:
image


But you’re right. The point of this exercise is to remind Coda makers about the power of regex, and that it often may be a good idea to explore non-trivial solutions to tricky problems that come.

1 Like

Ha ha! Shows you what I know! :slight_smile: I’m old - I get it wrong a lot.

I used the phrase “potentially brittle” because anything in the scraping realm tends to have risks. But everything has risks as you point out.

For example, what happens when test contains 30 items? Or when the collections array in test is expanded to include additional arrays?

Your regex may continue to work, but I’ll bet there are changes that could create some turbulence where no such impact would be felt by the use of the parser.

I was so happy to see RegEx in Coda when I first used the product. It is the Swiss Army knife for getting us out of jams. It makes me look like MacGyver, well feel like MacGyver.

Nothing would change; regex doesn’t care about the {}s and []s in the text. It is set up to only match literally the longest subsequence that doesn’t contain an unescaped quote, which comes right after a "text" : " sequence

Or, in other words, a value for the text entry anywhere. All of them.

But I get what you’re saying. What if we only need some of the text entries and not all of them. Then yeah, discriminating with regex would be much harder.

Hi @Paul_Danyliuk, thanks again for your detailed explanations and the reference to regex101.com

The approach is similar to this post:


While yours is tailored to the JSON code

Just from a curiosity point of view, I will try to find the soonest some time just to play around because as always you keep inspiring me with your creativity and search to bring things down to a more “basic” level, although even the “basic” level is not always that easy for me to get my head around.

:large_blue_diamond: Different age and background, but still eager to learn new dimensions :large_orange_diamond:

Right. That’s kind’a separate from the potential of a breaking change that stops it from working. Once you get to a certain point, element filtering (while possible with RegEx) becomes quite tedious and JSON parsers come to the rescue.

But the idea of exploiting RegEx is the real story here - this is a very powerful feature that we need to know and it seems we never fully realize its potential to solve parsing challenges. RegEx is one of the foundations of building smart documents and solutions - it is literally at the doorstep of AI.