PDF Extract Pack

Hi Coda community! I wanted to share a small side project I worked on a few months ago: the PDF Extract Pack. This free Pack lets you extract text from PDF files uploaded to Coda. When combined with formulas, Coda AI features, and other Packs, it lets you build Coda docs that offer insights into your PDF files or extract specific structured data from them.

The Pack does all processing itself without connecting to external services, so your data doesn’t leave Coda’s servers.

For more info on how it’s built, how to use it, and known limitations, see Using the PDF Extract Pack. Hope this helps!


Absolutely wonderful. Can’t wait to use it

Hi @oleg,
Thanks for sharing this Pack, it feels like an important step in working with the ‘doc part’ of Coda.
I used the sample files in your demo doc and noticed that the bitcoin file was extracted fine when I used the Extract() formula directly, but it failed in a button when the extraction was part of a ModifyRows() that saves the content in a canvas column. It failed because of what you see below. This seems unrelated to your Pack and directly related to ModifyRows(), and it is the first time I have seen it. I thought it was worth mentioning for other users.

Questions:

  • What is the limit for Coda to extract, any idea?
  • Could you add a function for reading the number of pages in the PDF? When I have some contracts I don’t want to open all of them, but when I notice long contracts I can create a kind of warning and break up the process.

Merci, Cheers, Christiaan

This is subject to change and depends on other factors like language and encoding, but you might start to get these errors if a modification is over ~85 KB in size (roughly 85,000 characters). That should be sufficient for most purposes, but is probably smaller than the contents of a full PDF research paper.
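To illustrate why language and encoding matter here, below is a rough Python sketch (outside Coda, just for reasoning about sizes). It assumes the limit is counted in UTF-8 bytes and uses 85,000 as the threshold; both are assumptions based on the approximate figure above, not documented values, and the function names are made up for this example.

```python
# Assumed threshold, taken from the approximate ~85 KB figure above.
# The real limit may differ and is subject to change.
APPROX_LIMIT_BYTES = 85_000


def encoded_size(text: str, encoding: str = "utf-8") -> int:
    """Return how many bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


def likely_too_large(text: str, limit: int = APPROX_LIMIT_BYTES) -> bool:
    """Guess whether writing this text in one modification could hit the limit."""
    return encoded_size(text) > limit


# Plain ASCII uses one byte per character, accented or non-Latin text uses more,
# so the same number of characters can land on either side of the threshold.
ascii_text = "a" * 50_000
accented_text = "é" * 50_000
print(encoded_size(ascii_text), likely_too_large(ascii_text))
print(encoded_size(accented_text), likely_too_large(accented_text))
```

So a 50,000-character extract in plain English text would fit comfortably, while the same number of accented characters could already be over the assumed threshold.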

I’d suggest using the firstPage and lastPage optional parameters of the Extract() formula to break up the PDF into smaller parts (you can use Info() to see the number of pages in the PDF), or you can try using ExtractFull() as well to go chunk by chunk (depends on how the PDF defines its text areas). You can then update your button to make smaller modifications sequentially by using a combination of RunActions() and FormulaMap() to insert the contents of individual pages one by one.
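The page-splitting idea can be sketched in Python as a planning step: given an estimated size per page, group consecutive pages into ranges that each stay under the limit. This is only an illustration of the logic, not how the Pack works internally. The function name, the 85,000 threshold, and the 1-based page numbering are all assumptions made for this example.

```python
from typing import List, Tuple


def plan_page_ranges(page_sizes: List[int], limit: int = 85_000) -> List[Tuple[int, int]]:
    """Group consecutive pages into (first_page, last_page) ranges.

    Pages are numbered from 1 (an assumption). Each range's total size stays
    within `limit`, except that a single page larger than `limit` still gets
    a range of its own, since it cannot be split at page level.
    """
    ranges: List[Tuple[int, int]] = []
    start = 1      # first page of the range being built
    running = 0    # accumulated size of the range being built
    for page, size in enumerate(page_sizes, start=1):
        # Close the current range if adding this page would exceed the limit.
        if running and running + size > limit:
            ranges.append((start, page - 1))
            start = page
            running = 0
        running += size
    if page_sizes:
        ranges.append((start, len(page_sizes)))
    return ranges


# Hypothetical per-page sizes for a four-page PDF.
print(plan_page_ranges([40_000, 40_000, 40_000, 10_000]))  # [(1, 2), (3, 4)]
```

Each resulting pair would correspond to one smaller modification, with the two numbers passed as the firstPage and lastPage parameters of Extract(), and the modifications run one after another.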

Hi @oleg,

Thanks for coming back to me.
The curious thing is that when I use the formula in a text column the bitcoin paper is extracted fully, but not when I use it inside ModifyRows().
I was unaware of the Info() formula; that will help to get an idea.
As I said before, I am glad we are getting more tools to work on the document side of the application, and I am looking forward to seeing more tools for processing text in an intelligent manner.
Cheers, Christiaan

So the difference is that when using the formula to read the PDF, the Pack is just returning output and the limits for that are higher.

When using a button or automation (in tandem with ModifyRows()), that creates an operation, with the contents from the PDF passed through and written into the operation itself, and that has the smaller ~85 KB limit.
