February 27th, 2009


Tika: open source access to text in many formats

Search engines need text to index: this may seem obvious, but the devil is in the details. Extracting text is easy when working with txt, html or xml files, but much more difficult for binary files, including MS Office and archive formats. So search indexers need to use file format parsers, also called "filters". These can read the binary file formats, extracting the text and keeping track of whatever structure is there. Some file parsers are better than others, and all of them need updating from time to time: when Microsoft switched from its proprietary binary format to its XML-based one (doc -> docx), search indexers needed updated filters to read the new format.
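To make the doc -> docx point concrete, here is a rough sketch of what a very small docx filter might look like. It relies on the fact that a docx file is a zip archive whose main text lives in word/document.xml. The function names extract_docx_text and make_minimal_docx are made up for illustration, and the test archive it builds is far too bare for Word to open. A real filter handles much more (tables, headers, footnotes, metadata).

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by the elements in word/document.xml
W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_NS = "{" + W_URI + "}"


def extract_docx_text(data: bytes) -> str:
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; the text sits in w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)


def make_minimal_docx(paragraphs) -> bytes:
    """Hypothetical helper: build a bare-bones docx-like archive in memory.

    It contains only word/document.xml, which is enough to exercise the
    extractor above but not enough to be a valid Word document.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_URI}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_minimal_docx(["First paragraph", "Second paragraph"])
    print(extract_docx_text(sample))
```

The older binary doc format has no such shortcut, which is why indexers lean on dedicated parsers for it.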

Tika is the Lucene open source project for calling format parsers and returning the result as XHTML. It's a well-designed, standardized interface that makes use of existing open source file parsers, including Apache POI for Microsoft Office documents (old and new) and PDFBox for PDF files. A dozen other formats are already supported, and the API makes it easy to add a custom file parser without having to write any special code in the indexer.
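The general idea of a single entry point that dispatches to format-specific parsers and always hands back XHTML can be sketched in a few lines. This is not Tika's actual API (Tika is a Java library with its own classes); the names register_parser, parse_to_xhtml and plain_text_parser are invented here purely to illustrate the design.

```python
from html import escape
from typing import Callable, Dict

# Hypothetical convention: a parser takes raw bytes and returns an XHTML string.
Parser = Callable[[bytes], str]

_PARSERS: Dict[str, Parser] = {}


def register_parser(mime_type: str, parser: Parser) -> None:
    """Make a parser available for one MIME type."""
    _PARSERS[mime_type] = parser


def parse_to_xhtml(mime_type: str, data: bytes) -> str:
    """Single entry point for the indexer: pick the parser, return XHTML."""
    try:
        parser = _PARSERS[mime_type]
    except KeyError:
        raise ValueError(f"no parser registered for {mime_type}") from None
    return parser(data)


def plain_text_parser(data: bytes) -> str:
    """Turn blank-line-separated plain text into XHTML paragraphs."""
    text = data.decode("utf-8", errors="replace")
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return f'<html xmlns="http://www.w3.org/1999/xhtml"><body>{body}</body></html>'


# Adding a format is one registration call; the indexer code does not change.
register_parser("text/plain", plain_text_parser)

if __name__ == "__main__":
    print(parse_to_xhtml("text/plain", b"Hello\n\nWorld"))
```

The indexer only ever calls the one entry point, so supporting a new format means registering one more parser rather than touching the indexing code.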

But Tika is not limited to Lucene and related projects. Because it's open source, any search engine indexer can use it to access file parsers (within the limits of the Apache license). This simplifies everyone's life considerably, and creates a framework for open-source file parsers that is stable and documented. People can work on improving the code of the file parsers, or write their own and know that it will be compatible. There are other open-source file parsers, but the Tika framework and toolkit are likely to be dominant as long as they keep working.

On the commercial side, two main packages are included with almost every enterprise search system: Outside-In (acquired by Oracle) and Keyview (acquired by Autonomy). Microsoft also has one they use on Windows. I believe these packages have active development and support. They have both input and output APIs, so customers can create additional custom file format parsers. I don't expect much change in these packages.

For historical notes, technical details, and some interesting context, see "Where Have All the Filters Gone?" by Mark Bennett (from June 2007).