Search Faster in Large PDFs

Based on an article originally published in the DesignGeek e-zine.

Does this sound familiar? You open a huge PDF with no bookmarks and no linked TOC, and you need to quickly find the page containing the topic you’re interested in.
Your Google-ized instincts immediately reach for the Find (Command/Control-F) field to enter the word or phrase you’re looking for. Acrobat (or Adobe Reader, doesn’t matter) finds the first couple of instances in a reasonable amount of time, but soon it slows to a crawl as it hits a dry patch.
The little read-out says, "Searching 342 of 575… 343 of 575… 344… 345… 346…" Two minutes later and you’re still staring at the page progression, hypnotized, waiting for a hit: "517 of 575… 518… 519… 520…"
Agh! Snap out of it, man!
By choosing one little command in Acrobat Pro version 8, you can put an end to this misery and instantly find things in even the most massive of PDFs.
Embed an Index
Using Acrobat Pro, you can create a full-text index of the contents of a single PDF and (new to version 8) embed it into the PDF. Then when you Find or Search, Acrobat or Reader searches the index, not the PDF. Since the index file is much smaller, operations are lightning-quick. And, since the index knows which page numbers words appear on, the end result is the same.
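To see why looking a word up in an index beats scanning every page, here is a minimal Python sketch of the general idea: a table that maps each word to the page numbers it appears on. This is not Acrobat's actual index format, which isn't described here; the function names and sample pages are made up for illustration.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9]+")


def build_index(pages):
    """Map each lowercased word to the sorted 1-based page numbers containing it."""
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def linear_search(pages, term):
    """Scan every page in order, the way a search without an index must."""
    term = term.lower()
    return [
        page_number
        for page_number, text in enumerate(pages, start=1)
        if term in WORD_PATTERN.findall(text.lower())
    ]


def indexed_search(index, term):
    """A single dictionary lookup, no matter how many pages there are."""
    return index.get(term.lower(), [])


# Hypothetical three-page document standing in for a large PDF.
pages = [
    "Use the Blend tool to mix colors.",
    "Tables and frames.",
    "Blend modes affect transparency.",
]
index = build_index(pages)
```

Building the index costs one full pass over the document, but that happens once, when the index is embedded. After that, `indexed_search(index, "blend")` returns the same pages as `linear_search(pages, "blend")` without touching the page text at all.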
We’ve been able to create indexes in Acrobat Pro for many versions now using the Catalog feature (Advanced> Document Processing> Full Text Index with Catalog) (Figure 1).

Figure 1. The Catalog feature has been in Acrobat for several versions.
PDF content providers typically index a folder full of PDFs so that a single Search (Command/Control-Shift-F) can hunt down the search text in a whole collection of PDFs. And I suppose you could use Catalog to create an index of a single PDF too, though I never bothered.
All that is still possible in Acrobat Pro 8, and the old ways of associating an index with a particular PDF still work.
But here’s Acrobat Pro 8’s new twist: Indexes are embeddable in a PDF. Once they’re embedded, you no longer have to keep track of the separate .pdx and .idx files generated for each PDF’s index, or make sure that they always travel with the file. End users don’t have to figure out how to tell Reader to use the index during Finds and Searches, since Reader 8 and Acrobat 8 automatically use it if it’s embedded. (Earlier versions of Reader and Acrobat ignore the embedded index.)
Cool, huh? Best of all, it’s dead-simple:

  1. Open the PDF in Acrobat Pro 8 and choose Advanced> Document Processing> Manage Embedded Index.
  2. The resulting dialog box will tell you that "the document does not contain an embedded index." Ignore that and click the Embed Index button.
  3. An alert pops up, saying that Acrobat is about to save and close the document; build a search index for it; embed the index; and reopen the document. Click the OK button. The PDF closes, and after a few seconds of watching a progress bar create the index (Figure 2), it opens up again.


Figure 2. It didn’t take long to index this large PDF.
Before and After
For my guinea pig test file, I downloaded the InDesign CS3 "full documentation" PDF from Adobe’s Web site. This puppy tips the scales at 46.35 MB and 762 pages. Whoa, mama!
Before I indexed it, I ran a search (Edit> Search) for the term "blend" and timed it. On my late-model Compaq, Acrobat Pro 8 took 24 seconds to display the 153 matches in its Results window.
After embedding the index (which added 2.8 MB to the file size), and purging the Search cache (see below) to keep things fair, I ran the same search. This time, Acrobat took about, oh, a nanosecond to display the same 153 matches. I had the same blink-of-an-eye results in Reader 8, on both platforms.
You can bet that from now on, I’ll be routinely embedding indexes in all of the larger PDFs on my hard drive.
If you post large PDFs for your customers to download, such as catalogues or periodicals, you might want to do the same.
About that Search Cache
Both Acrobat and Reader already do something similar when you’re repeatedly hunting for terms in the same PDF. They cache the text and save it in a file so that subsequent Finds and Searches are faster. You can adjust the size of the cache, or purge it, in Preferences> Search.
But embedding an index in a PDF ensures that Finds and Searches are always fast in Reader 8 or Acrobat 8, regardless of the state of the user’s cache, even if it’s the first time they need to find something quickly.
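The difference between a cache and an embedded index can be sketched in a few lines of Python. This is a simplified, in-memory illustration, not how Acrobat or Reader implement their cache (which, as noted above, is saved to a file); the `CachedSearcher` class and `fake_extract` function are hypothetical names invented for the example.

```python
class CachedSearcher:
    """Keeps each document's extracted page text after the first search."""

    def __init__(self, extract_text):
        # extract_text is a hypothetical slow function: path -> list of page strings.
        self._extract_text = extract_text
        self._cache = {}
        self.extractions = 0  # counts how often the slow step actually ran

    def search(self, path, term):
        if path not in self._cache:
            # Cold cache: the first search of a document pays the full cost.
            self._cache[path] = self._extract_text(path)
            self.extractions += 1
        term = term.lower()
        return [
            page_number
            for page_number, text in enumerate(self._cache[path], start=1)
            if term in text.lower()
        ]

    def purge(self):
        """Empty the cache, like purging it in the Search preferences."""
        self._cache.clear()


def fake_extract(path):
    # Stand-in for reading the text out of a real PDF.
    return ["Blend colors", "Tables", "blend modes"]


searcher = CachedSearcher(fake_extract)
first = searcher.search("manual.pdf", "blend")   # slow: extracts the text
second = searcher.search("manual.pdf", "tables")  # fast: reuses cached text
```

The cache only helps from the second search onward, and only on the machine that built it. An embedded index travels inside the file, so the very first search is already fast.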

Anne-Marie “Her Geekness” Concepción is the co-founder (with David Blatner) and CEO of Creative Publishing Network, which produces InDesignSecrets, InDesign Magazine, and other resources for creative professionals. Through her cross-media design studio, Seneca Design & Training, Anne-Marie develops ebooks and trains and consults with companies who want to master the tools and workflows of digital publishing. She has authored over 20 courses on lynda.com on these topics and others. Keep up with Anne-Marie by subscribing to her ezine, HerGeekness Gazette, and contact her by email or on Twitter @amarie.
  • kvc801 says:

    I have scanned a 369-page book and created a large (40 MB) PDF file. I used the OCR feature and can now search for text within the PDF file. However, I cannot create an embedded index and I don’t know why.

    When I try to create an embedded index, the message indicates that the index was embedded, so I save the file. When I open the file, the message says there is no embedded index.

    I can create a Catalog index outside of the PDF and that works perfectly.

    Is it possible that the embedded index doesn’t work because it is scanned text converted by OCR?

    Any suggestions would be appreciated.

    • M Cowlishaw says:

      It seems you have to save it under a different name (Save As). A plain Save decides there has been no change and does not write to disk.
      [I know this is later :-) .. but may help someone.]

  • Anonymous says:

    You made my day with that trick. I have an election database PDF of 75,000 pages, and you can imagine how long it takes to search for a name in it. I also failed to convert it to any other form (Excel, etc.) because of the table format.

    now it’s really fast…

    man you rock

  • John Efkolo says:

    Is there any way to search indexed files on Android?
