Search within PDF and Images (OCR)

Lars_Kristian_Aasbrenn · January 20, 2023, 9:02am

As mentioned in the community, a reason for keeping stuff in Evernote is the ability to search for text in images and in PDF documents. It’s a great function which makes Evernote very hard to let go. If I had to pick one of them, I would go for PDF search, which I guess is less complicated to implement.
I have been waiting for this function in Coda, but just realised that it was not mentioned in the Suggestion Box.
So here it is! Vote on

Piet_Strydom · January 20, 2023, 11:27am

I don’t use this often, but when I need it, I need it bad…

Bill_French · January 21, 2023, 10:24pm

We get around this limitation with a home-grown system that was patterned after this approach. Using a custom Pack we were able to use the S3 repository as the definitive source of all PDF-based information.

I think it’s safe to say that there are many more ways to do this better and cheaper given the improvements to Pack capabilities and PDFs pushed through an ML pipeline. This guy actually had GPT-3 write the code to summarize and extract keywords for PDFs.

Lars_Kristian_Aasbrenn · January 22, 2023, 2:13pm

Fun to hear there are workarounds, as Coda people often find. Though this is a bit above me, sorry.

Bill_French · January 22, 2023, 2:26pm

Indeed; it’s a lot of machinery. You might want to take a look at Bardeen. I think they recently added some AI features that would allow you to capture summaries and keywords from a PDF and then add them to a Coda table.

Don’t let the label on this automation component mislead you - their image-to-text feature also works with PDF documents.

With this, you could build a workflow recipe that is kicked off when viewing a PDF, and then harvests the text which could be added to a Coda table in full, thus achieving a full-text search inside Coda. It could also use any of Bardeen’s other AI components to extract keywords, summations, analytics, etc. and add those to the table row as well.

Text Blaze may be able to do this as well, although a mental sketch of the solution is not as obvious. Both Text Blaze and Bardeen are #no-code integration tools, so you might really enjoy them for this and many other automated workflows internal and external to Coda.

Patricia_Hoffmann · May 9, 2023, 12:01pm

OMG! It would be awesome to have this feature! I voted for it

Bill_French · May 9, 2023, 6:09pm

Something about this says it will likely be made obsolete with multi-modal AI support.

Patricia_Hoffmann · May 15, 2023, 9:12am

Have you found an easy solution so far?
My use case is the following: I have a table with lots of PDFs. I want to search within the PDFs for certain keywords to find the PDFs that contain those keywords.

Patricia_Hoffmann · May 15, 2023, 12:02pm

@coda_account Is there anything planned or in the development pipeline that will allow searching within a PDF, or rather a global keyword search on a table with many PDF files?

Lars_Kristian_Aasbrenn · May 16, 2023, 7:29am

There are two ways text can exist in PDFs: as real text and as images of text.
For real text there is a kind of workaround: add all the PDFs to a table, then:
Make a Text column called “Content”, open the PDF, copy all the text (select all + copy), and paste it into Content.

Patricia_Hoffmann · May 16, 2023, 10:35am

This always has to be done manually though, right? There is no way to automate it so that the text of a PDF is automatically pasted into a column like “Content”?

Lars_Kristian_Aasbrenn · May 16, 2023, 8:18pm

Not with Coda or any Packs as far as I know. But the manual process is actually not that bad, it only takes a couple of seconds:
Open the PDF in the table, press CTRL+A, CTRL+C, close the PDF, click the column beside it, and press CTRL+V.

Again, this works only as long as the PDF has actual text and not an image of text.

Bill_French · May 17, 2023, 1:12am

You need to dream bigger. What if you know some keywords that are similar, but not the ones that are present in the PDF? You’re describing a search system of the previous century. You need to be able to state queries with greater abstraction and still have it work.

With AI, this is also known as “text”.

This is a terrible approach.

There is. More on that in a moment.

It’s literally that bad. It’s also brittle.

It misses the texts in tables, charts, figures, call-outs and footnotes. It is an extremely weak search corpus if built this way.
When 20 of the 80 PDFs in your library are updated, how will you know which ones need this manual update process repeated?
When your PDFs triple in number, how will anyone perform this manual task with consistency or in a timely fashion?
Have you tried copying the text of a multi-column document or one with embedded figures that wrap texts? It’s a mess, and this will lead to formatting issues that reduce findability. The time required to do this well is not seconds. It’s minutes and likely lots of them.
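The "which ones changed?" problem in the list above is the easiest part to automate. A minimal sketch, assuming the PDFs sit in a local folder and that you keep your own record of a content hash per file (the `known` dictionary here is hypothetical bookkeeping, not something Coda provides):

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_stale(folder: Path, known: dict[str, str]) -> list[str]:
    """List PDF file names whose content differs from the recorded digest.

    `known` maps file name -> digest recorded when the text was last copied.
    New files (no recorded digest) are reported as stale too.
    """
    stale = []
    for pdf in sorted(folder.glob("*.pdf")):
        if known.get(pdf.name) != file_digest(pdf):
            stale.append(pdf.name)
    return stale
```

This only tells you which files need attention; it does nothing about the extraction quality problems described above.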

What’s the Remedy?

Coda is (or should be) the remedy. Building search systems by anyone except the platform vendor is challenging to say the least. There are security issues, cross-document sharing issues, and a variety of latency issues, not to mention the question of where the index lives. However, AGI may offer some relief. I can think of three ways for businesses to overcome this issue. Here’s one…

Imagine an AI process that reads your PDFs and converts all the content to plain text. Further, it allows you to dump all sorts of documents into a sausage grinder and out the other end comes a ChatGPT-like application. That application can be embedded in any Coda document. And, as your PDFs change, just upload them and the entire system uses the latest information; that actually does take seconds. Lastly, this approach has an API, so you could build a search UI, or a reporting system, or integrate its AI capabilities with other systems. If pervasive discovery, understanding, and full utilization of PDF resources truly matter to the health and competitive posture of your business, you should get a free trial account and prove this approach to yourself.

This is CustomGPT. It’s pricey (like $100/mo) but very powerful.

Two other approaches come to mind and parts of these ideas I’ve already experimented with. One leans on an inverted search index (designed like ElasticSearch) and the other uses a similar approach to CustomGPT, but with Google’s new PaLM 2 LLMs.

And you probably thought I was all hat, no cattle! This is not my first search rodeo.

No time to explain - gotta dash.

Patricia_Hoffmann · May 17, 2023, 9:46am

Wow @Bill_French I’m impressed!
I just looked at the CustomGPT and it seems like a super powerful tool - but yep, pricey.
Could you share some more details on the two approaches you’ve designed? It sounds and looks awesome!

Bill_French · May 17, 2023, 11:50am

Indeed, on paper. In practice, probably not so much. What’s the value to an organization that can give its workers the ability to find anything or ask questions and get detailed, thoughtful answers? What’s the value of answers to questions that are presently impractical to parse?

Enterprise workers (on average) spend almost three hours a day probing information systems to get their hands on the data needed to do their work. What if you cut that by just a third? For a five-person company, it’s about 950 hours a year. At a fully burdened cost of $65/hr, that’s about $60k. Is that $1,200 SaaS investment still too much?
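For anyone who wants to check the arithmetic above or plug in their own numbers, here is a small sketch. The 190 working days per year is an assumption back-solved from the "about 950 hours" figure; it is not stated in the post.

```python
HOURS_SEARCHING_PER_DAY = 3     # "almost three hours a day"
WORKERS = 5                     # "a five-person company"
WORKDAYS_PER_YEAR = 190         # ASSUMPTION: back-solved from ~950 hours
COST_PER_HOUR = 65              # fully burdened, in dollars
SAAS_COST_PER_YEAR = 100 * 12   # roughly $100/mo


def yearly_savings(hours_per_day, workers, workdays, cost_per_hour):
    """Hours and dollars saved per year if search time drops by a third."""
    hours_saved = (hours_per_day / 3) * workers * workdays
    return hours_saved, hours_saved * cost_per_hour


hours, dollars = yearly_savings(
    HOURS_SEARCHING_PER_DAY, WORKERS, WORKDAYS_PER_YEAR, COST_PER_HOUR
)
print(f"{hours:.0f} hours, ${dollars:,.0f} saved vs ${SAAS_COST_PER_YEAR:,} spent")
```

With these inputs the result is 950 hours and $61,750, which matches the "about $60k" in the post.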

One is the inverted index, a web service that probes Coda docs nightly and builds an index using LUNR. It’s a lot of work to build one and it’s not ideally suited for integration into Coda.
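To make "inverted index" concrete, here is a minimal sketch in plain Python. It is not LUNR and not the web service described above; it only shows the core idea of mapping each term to the documents that contain it. The document names and texts are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of documents containing ALL query terms."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "invoice.pdf": "Invoice for OCR services",
    "notes.pdf": "Meeting notes about OCR and search",
}
index = build_index(docs)
print(search(index, "ocr search"))
```

A real engine adds stemming, ranking, and persistence on top of this, which is where most of the work goes.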

The other is an AI solution that uses text embeddings and completions to formulate a similar approach to CustomGPT. This is also not ideal because Coda, as you may know, has not exposed the complete text of documents through its API.
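The retrieval half of an embeddings approach boils down to "turn texts into vectors, rank by cosine similarity". The sketch below shows only that ranking step. As a loud assumption, `toy_embed` is a bag-of-words counter standing in for a real embedding model, so unlike a real model it will not match synonyms or paraphrases; the completions step is not shown at all.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """ASSUMPTION: stand-in for a real embedding model (word counts only)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Score every document against the query, best match first."""
    q = toy_embed(query)
    scored = [(doc_id, cosine(q, toy_embed(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Swapping `toy_embed` for vectors from an actual embedding model is what would let a query find documents that use different words for the same idea.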

Without access to entire document texts, any third-party project whose intent is to help you find stuff is a non-starter in my view, because it cannot see all the stuff. As long as this is the case, both of these integrated approaches are unlikely to provide comprehensive findability and discovery.

Inasmuch as I believe any manual processes are brittle and prone to failure, exporting all Coda documents as PDFs and uploading them into CustomGPT appears to be the best we can hope for. But the payoff is significant if you can transform your data to take advantage of AGI.

Jean_Pierre_Traets · May 17, 2023, 7:22pm

This has nothing to do with Coda, but I found it valuable in the context of the discussion above.

Note: I am not affiliated, nor do I have knowledge of the details.

Keep on dreaming, as the sky isn’t the limit

Rickard_Abraham · July 16, 2023, 9:55pm

This would be amazing to have natively in Coda! But for now, I can at least assist with the part about reading text in PDFs with my PDF Pack.

A true OCR pack would be super cool, though it seems to be pretty computationally expensive.

Rickard_Abraham · October 13, 2023, 3:00am

This birthed the idea that became my new OCR Pack
