Searching PDF Files
About PDF and the Web
PDF is the Portable Document Format used by Adobe Acrobat. It is designed for brochures, magazines, forms, reports and other materials with complex visual designs which will be printed on PostScript (tm) printers. The format was created to remove machine and platform dependence for the documents, and its goals include design fidelity and typographic control. It was never designed for interactive online reading. However, many word processors, page layout and other programs can create PDF files easily, so many sites are now serving them online.
Adobe has a PDF Plug-In for browsers and some development tools to allow servers to send PDF in chunks ("byte-range serving") rather than downloading the entire file. This improves the user experience of receiving PDF files, but they still lack the speed, simplicity and user control of HTML.
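To illustrate what byte-range serving involves on the server side, here is a minimal sketch that interprets a single HTTP `Range: bytes=...` header value. The function name `parse_byte_range` is hypothetical, and the sketch deliberately ignores multi-range requests; it is not a description of Adobe's tools.

```python
def parse_byte_range(header, total):
    """Return (start, end), inclusive, for a single 'bytes=' range.

    Hypothetical helper for illustration. Returns None when the header
    is malformed, asks for several ranges, or cannot be satisfied.
    """
    unit, _, spec = header.partition("=")
    if unit.strip().lower() != "bytes" or "," in spec:
        return None
    first, sep, last = spec.strip().partition("-")
    if not sep:
        return None
    try:
        if first == "":
            # Suffix form "bytes=-500": the last 500 bytes of the file.
            length = int(last)
            if length <= 0:
                return None
            start, end = max(total - length, 0), total - 1
        else:
            # "bytes=0-499" or the open-ended form "bytes=500-".
            start = int(first)
            end = int(last) if last else total - 1
    except ValueError:
        return None
    end = min(end, total - 1)
    if start < 0 or start > end:
        return None
    return start, end
```

A server that gets a usable pair back would send only that slice of the PDF file instead of the whole document.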
PDF files have a specified page size, for example, and do not reflow in smaller windows, so people with small screens spend a lot of time scrolling around the window. In addition, copying text from a PDF file is very difficult, as sidebar text is included, and selections cannot cross page breaks.
If at all possible, you should serve both HTML and PDF versions of files, designing the HTML for onscreen use and searching, and PDF for printing only. That provides your users with the best format for their task, rather than making too many compromises on one side or the other. HTML files are better for searching as well!
As of July 2003, Jakob Nielsen has an article about the problems of reading PDF files online, including results from usability tests and quotes from users who dislike this format intensely. Everyone seems happy to print from PDF, just not read it on their computer. In his follow-up article, he recommends either generating an HTML version or at least providing a "gateway" page, which includes a summary, a warning that clicking the link will bring up the PDF file, and the page count and file size. In addition, he recommends that sites avoid having either internal or external search engines index the PDF files.
PDF to HTML Conversion Tools
- Adobe's online PDF Converter will convert PDF to HTML, for free, one file at a time.
- pdftohtml is a free open source converter that runs on the Unix command line. Search engines sometimes incorporate it as part of the indexing process.
- Xpdf is free open source software that includes a viewer and components for parsing PDF documents.
- Adobe recommends BCL Magellan, which converts PDF files to HTML, preserving the structure of the page, graphics, lines, hyperlinks and so on.
- Clickcat-P2H is another converter program, which offers a downloadable trial version and various special features.
- Very PDF.com's PDF2HTML application can do on-the-fly or batch conversions. It also has a free trial version, and source code is available.
PDF and Web Site Searching
As mentioned above, PDF files are hard on search engines, and HTML pages are much easier for them to deal with. However, if you must have PDF, please follow these procedures.
Preparing PDF Files for Searching
- Make sure each PDF file has correct document properties, especially the title. An incorrect title makes it difficult for a person viewing search results to tell if this file is useful to them.
- Check the PDF file format version number and make sure your search engine can read that version. Acrobat 5 uses the PDF 1.4 format.
- If possible, break long PDF files into smaller single-subject files, such as book sections, chapters or even chapter sections. That way, no one will accidentally download a very long document just because a word has been matched.
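The version check can be sketched in a few lines. A PDF file starts with a header of the form `%PDF-1.4`, so reading the first bytes is usually enough. The function names and the default limit of "1.4" below are assumptions for illustration; newer files may also record a version elsewhere in the file, which this sketch ignores.

```python
import re


def pdf_header_version(data):
    """Return the version from a '%PDF-x.y' header, or None if absent."""
    # The header is expected near the start of the file, so only the
    # first 1024 bytes are examined.
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def indexer_can_read(data, max_supported="1.4"):
    """Hypothetical check: is the header version within the indexer's limit?"""
    version = pdf_header_version(data)
    if version is None:
        return False

    def as_tuple(text):
        # Compare "1.10" and "1.4" numerically rather than as strings.
        return tuple(int(part) for part in text.split("."))

    return as_tuple(version) <= as_tuple(max_supported)
```

In practice the bytes would come from something like `open(path, "rb").read(1024)`, and `max_supported` would be set from your search engine's documentation.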
PDF and Metadata
Metadata is defined as "information about information". For simple search engines, that generally constitutes the document title, description, keywords, file size and modification date, but it can be much richer than that, providing many more ways to describe an object, and to search for that object. For more information, see the SearchTools Report on Metadata.
When search tools index PDF files, they can get the text from the PDF information fields, such as a document title and additional keywords. If the document creator didn't enter that information, the indexer may attempt to generate a title, or may just use the file name of the document.
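The fallback described above might look something like the following sketch. The function name `display_title` is hypothetical, and turning underscores and hyphens into spaces is just one plausible way to make a file name readable; real indexers differ.

```python
from pathlib import PurePosixPath


def display_title(info_title, url_path):
    """Pick a title for a search result (illustrative sketch only).

    Prefer the title stored in the PDF information fields; when it is
    missing or blank, fall back to a cleaned-up file name.
    """
    title = (info_title or "").strip()
    if title:
        return title
    stem = PurePosixPath(url_path).stem
    cleaned = stem.replace("_", " ").replace("-", " ").strip()
    return cleaned or url_path
```

For example, `display_title(None, "/docs/annual_report-2002.pdf")` yields `annual report 2002`, which is still much less helpful to a searcher than a title entered by the document creator.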
With Acrobat 5.0 and new releases of other products, Adobe is supporting a new eXtensible Metadata Platform (XMP, previously called XAP). This allows the files to contain substantially more information about themselves, including Dublin Core data such as author, description, actual modification date and so on. This has not been widely used and we know of no search engines that take advantage of this metadata.
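As a rough idea of how an indexer could read this metadata, the sketch below looks for an XMP packet (XML wrapped in `<?xpacket begin ...?>` and `<?xpacket end ...?>` markers) in the raw bytes of a file and pulls out a Dublin Core field. It assumes the packet is stored uncompressed and is well-formed XML; the function name `xmp_dublin_core` is hypothetical, and no particular search engine is known to work this way.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URIs for Dublin Core and RDF, in ElementTree's {uri} form.
DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
PACKET = re.compile(rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=", re.DOTALL)


def xmp_dublin_core(data, field="title"):
    """Return the values of one Dublin Core field from the first XMP packet.

    Illustrative sketch: returns an empty list when no packet is found
    or when the packet cannot be parsed.
    """
    match = PACKET.search(data)
    if match is None:
        return []
    try:
        root = ET.fromstring(match.group(1).strip())
    except ET.ParseError:
        return []
    values = []
    for element in root.iter(DC + field):
        # Values are often wrapped in rdf:Alt, rdf:Seq or rdf:Bag lists.
        items = [li.text for li in element.iter(RDF + "li") if li.text]
        if items:
            values.extend(items)
        elif element.text and element.text.strip():
            values.append(element.text.strip())
    return values
```

Calling `xmp_dublin_core(data, "creator")` would look for author names in the same way.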
Common PDF File Indexing Problems
PDF files usually have both text, and graphical representations of the text, with indications of exactly where that text should be displayed. However, there are several cases where this does not work for searching:
- Documents which were scanned directly into PDF may only have the graphic portion: there may be no computer-readable text at all. These documents are not searchable.
- Documents that were scanned and converted from graphic display to digital text using OCR (optical character recognition) may have significant numbers of errors. This is more common if the original document is old or was not perfectly aligned. In this case, many search terms will not be matched although the words were in the original printed or typed text, because they were not correctly interpreted. Some search terms may be falsely matched if the OCR software incorrectly interpreted the original text.
- Documents with multiple columns which were converted to PDF by some layout programs will display correctly and contain the correct digital text, but they miss the text flow: the words don't come in the correct sequence. Therefore the search engines will fail to match phrase queries because the phrases were wrapped on the next line of the column in the original, but that relationship was not stored in the PDF.
- Documents generated by some applications will contain partial words due to hyphenation, incorrect coding of ligatures and extended characters (diacriticals and letters beyond the basic 26), and other unusual situations. These mangled words will not match queries, although the words were in the original text.
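Some of the hyphenation and ligature damage can be repaired after text extraction and before indexing. The sketch below is one possible cleanup pass, not a fix used by any specific search engine; the function name is hypothetical. It cannot help with scanned images, OCR errors or lost column flow, and rejoining line-end hyphens will also merge genuinely hyphenated words that happened to break at a line end.

```python
import re
import unicodedata


def normalize_extracted_text(text):
    """Clean up text extracted from a PDF before indexing (sketch only)."""
    # NFKC expands compatibility ligatures: U+FB01 becomes the two
    # letters "fi", so "\ufb01rst" can match a query for "first".
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words split by a hyphen at the end of a line:
    # "docu-\nment" becomes "document".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse remaining line breaks and runs of spaces.
    return re.sub(r"\s+", " ", text).strip()
```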
Displaying PDF Files in Search Results
If you must index PDF files, there are several ways to improve the user experience. Each PDF file is a single entity, often very large, and when the searcher clicks on a link, they suddenly discover that they are downloading a file and may be asked to install a browser plug-in.
Make sure that your search engine results listing does the following:
- Has an icon or text indicating that the item is in PDF.
- Displays the file size in the search results.
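Both points are easy to apply when formatting a results listing. The sketch below assumes the search engine can supply a title, a URL and a size in bytes for each hit; the function names and the `[PDF]` label are illustrative choices, and detecting PDF by file extension is a simplification.

```python
def format_size(num_bytes):
    """Render a byte count in a form a searcher can judge at a glance."""
    if num_bytes < 1024:
        return f"{num_bytes} bytes"
    if num_bytes < 1024 * 1024:
        return f"{num_bytes / 1024:.0f} KB"
    return f"{num_bytes / (1024 * 1024):.1f} MB"


def result_line(title, url, num_bytes):
    """Build one line of a results listing (hypothetical layout)."""
    # Flag PDF items so the searcher knows before clicking.
    label = "[PDF] " if url.lower().endswith(".pdf") else ""
    return f"{label}{title} ({format_size(num_bytes)})"
```

For a 2,621,440-byte file at `/docs/ar.pdf` titled "Annual Report", this produces `[PDF] Annual Report (2.5 MB)`.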
Some search engines, such as PDF WebSearch, will actually display some extracted text from the PDF file, and open the file to the correct location for the matched text. If you have a lot of PDF files, consider these search engines first.
PDF-Compatible Site Search Tools
Acrobat does not include a search engine for web sites to search PDF. Adobe discusses this issue in the page Searching the Contents of PDF files on a Web Site.
Acrobat version 6 now incorporates the Onix search engine for searching PDF content on local disks and networks. This toolkit is also available for licensing and use in other contexts, such as web site and intranet search engines.
This list is not complete. Almost every search engine updated in the last few years now indexes PDF files, some using open-source converters, others attempting to perform more sophisticated and specialized text extraction.
- AltaVista Search
- Boolean Search
- Dieselpoint Search
- DolphinSearch KnowledgeBox
- 80-20 Discovery
- Elan Web Search (PDF only)
- Excalibur RetrievalWare
- Hummingbird (formerly Fulcrum)
- IBM Intelligent Miner for Text
- Inktomi Search Software (now Ultraseek)
- JXTA Search
- Microsoft Index Server
- mnoGoSearch (formerly UdmSearch)
- NextPage (LivePublish)
- Open Text LiveLink
- PDF WebSearch (has special features to download only the pages with matched text)
- Phantom 2.2 (possible bug with linking to the PDF file once the document has been shown as a match)
- re.se@rch suite
- Verity Search '97 (the Cold Fusion version of the Verity engine does not search PDFs very well)
- Web Server 4D Site Search
- Webinator (additional charge, paid versions only)
- WebSTAR Search
- Windex Search