Search within PDF and Images (OCR)

You need to dream bigger. What if the keywords you know are similar to, but not the same as, the ones actually present in the PDF? You’re describing a search system of the previous century. You need to be able to state queries with greater abstraction and still get useful results.

With AI, this is also known as “text”. :wink:

This is a terrible approach.

There is. More on that in a moment.

It’s literally that bad. It’s also brittle.

  • It misses the text in tables, charts, figures, call-outs and footnotes. It is an extremely weak search corpus if built this way.
  • When 20 of the 80 PDFs in your library are updated, how will you know which ones need this manual update process repeated?
  • When your PDFs triple in number, how will anyone perform this manual task with consistency or in a timely fashion?
  • Have you tried copying the text of a multi-column document, or one with embedded figures that text wraps around? It’s a mess, and this will lead to formatting issues that reduce findability. The time required to do this well is not seconds. It’s minutes, and likely lots of them.

What’s the Remedy?

Coda is (or should be) the remedy. Building search systems is challenging, to say the least, for anyone except the platform vendor. There are security issues, cross-document sharing issues, and a variety of latency issues, not to mention the question of where the index lives. However, AGI may offer some relief. I can think of three ways for businesses to overcome this issue. Here’s one…

Imagine an AI process that reads your PDFs and converts all the content to plain text. Further, it allows you to dump all sorts of documents into a sausage grinder and out the other end comes a ChatGPT-like application. That application can be embedded in any Coda document. And, as your PDFs change, just upload them and the entire system uses the latest information; that actually does take seconds. :wink: Lastly, this approach has an API, so you could build a search UI, or a reporting system, or integrate its AI capabilities with other systems. If pervasive discovery, understanding, and full utilization of PDF resources truly matter to the health and competitive posture of your business, you should get a free trial account and prove this approach to yourself.

This is CustomGPT. It’s pricey (like $100/mo) but very powerful.

Two other approaches come to mind, and I’ve already experimented with parts of both. One leans on an inverted search index (designed like Elasticsearch) and the other uses a similar approach to CustomGPT, but with Google’s new PaLM 2 LLMs.
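
For anyone unfamiliar with the inverted index idea, here is a minimal sketch in Python of the general technique. To be clear, this is only an illustration and not the experiment mentioned above; the function names (`tokenize`, `build_index`, `search`) and the sample documents are made up, and it assumes the PDF text has already been extracted to plain strings by some other step.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index, query):
    """Return the documents that contain every term in the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample data: text assumed to be already extracted from PDFs.
docs = {
    "pricing.pdf": "Annual pricing table for enterprise plans",
    "handbook.pdf": "Employee handbook with travel policy footnotes",
}
index = build_index(docs)
print(sorted(search(index, "pricing table")))
```

Note that a bare index like this only matches exact terms, so a query using similar-but-different keywords finds nothing. That is the gap the LLM-based approaches are meant to close.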

And you probably thought I was all hat, no cattle! This is not my first search rodeo.

No time to explain - gotta dash.
