SearchTools Blog
Tika: open source access to text in many formats 
27th-Feb-2009 03:05 pm
Search engines need text to index: this may seem obvious, but the devil is in the details. Extracting text is easy when working with txt, html or xml files, but much more difficult for binary files, including MS Office and archive formats. So search indexers need to use file format parsers, also called "filters". These read the binary file formats, extracting the text and keeping track of whatever structure is there. Some file parsers are better than others, and all of them may need updating: when Microsoft switched from its proprietary binary format to its XML-based one (doc -> docx), search indexers needed updated filters to read the new formats.
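To make the doc -> docx point concrete, here is a minimal sketch of what a filter for the newer format has to do. A .docx file is a ZIP archive whose main text lives in the part word/document.xml, with the visible text in w:t elements grouped under w:p paragraphs. The function names docx_to_text and make_sample_docx are made up for this illustration, and the sample builder produces a stripped-down archive that is enough for this extractor but is not a complete file that Word would open. A real filter would also handle headers, footnotes, tables and text boxes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(data: bytes) -> str:
    """Return the paragraph text of a .docx file, one paragraph per line."""
    # A .docx is a ZIP archive; the body text is in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Each paragraph holds runs; the text of each run sits in w:t elements.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs):
    """Hypothetical helper: build a stripped-down in-memory .docx for the demo."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_sample_docx(["Search engines need text.", "Filters supply it."])
    print(docx_to_text(sample))
```

The older .doc format has no such shortcut: it is a binary container, which is why indexers lean on dedicated parsers for it.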

Tika is the Lucene open source project for calling format parsers and returning the result as XHTML. It's a well-designed standardized interface, making use of existing open source file parsers, including Apache POI for Microsoft Office documents (old and new), and PDFBox for Adobe Acrobat-type files. There are a dozen other formats already supported, and the API makes it easy to add a custom file parser without having to write any special code in the indexer.
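The design idea here (one entry point, pluggable parsers, uniform XHTML output) can be sketched in a few lines. This is not Tika's API, which is Java; it is a toy Python illustration, and every name in it (PARSERS, register, to_xhtml, the two sample parsers) is hypothetical. It only shows why an indexer that calls a single function does not need to change when a new format parser is plugged in.

```python
from html import escape
from typing import Callable, Dict

# Hypothetical registry: MIME type -> function turning raw bytes into plain text.
PARSERS: Dict[str, Callable[[bytes], str]] = {}


def register(mime_type: str):
    """Decorator that plugs a parser into the registry for one MIME type."""
    def decorator(func):
        PARSERS[mime_type] = func
        return func
    return decorator


@register("text/plain")
def parse_plain(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


@register("text/csv")
def parse_csv(data: bytes) -> str:
    # Toy parser: turn commas into spaces so cells index as separate words.
    return data.decode("utf-8", errors="replace").replace(",", " ")


def to_xhtml(data: bytes, mime_type: str) -> str:
    """Dispatch to the registered parser and wrap its text in minimal XHTML."""
    try:
        parser = PARSERS[mime_type]
    except KeyError:
        raise ValueError(f"no parser registered for {mime_type}") from None
    # One <p> per non-blank line, with markup characters escaped.
    paragraphs = [line for line in parser(data).splitlines() if line.strip()]
    body = "".join(f"<p>{escape(line)}</p>" for line in paragraphs)
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml">'
        f"<body>{body}</body></html>"
    )


if __name__ == "__main__":
    print(to_xhtml(b"title,author\nTika,Apache", "text/csv"))
```

Adding a custom format means writing one more function with the register decorator; the indexer keeps calling to_xhtml and never sees the difference.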

But Tika is not limited to Lucene and related projects. Because it's open source, any search engine indexer can use it to access file parsers (within the limits of the Apache license). This simplifies everyone's life considerably, and creates a framework for open-source file parsers that is stable and documented. People can work on improving the code of the file parsers, or write their own and know that it will be compatible. There are other open-source file parsers, but the Tika framework and toolkit are likely to be dominant as long as they keep working.

On the commercial side of things, there are two main packages which are included with almost every enterprise search system. These are Outside-In (acquired by Oracle) and Keyview (acquired by Autonomy). Microsoft also has one they use on Windows. I believe that these packages have active development and support. These packages have both input and output APIs, so customers can create additional custom file format parsers. I don't expect much change in these packages.

For historical notes, technical details, and some interesting context, see Where Have All the Filters Gone? by Mark Bennett (from June 2007)
4th-Mar-2009 03:38 am (UTC)
Should be a nice thing. It is interesting that more and more libraries are available in Java only (not in C++ sigh)
As a side note:
Tika's url should be: http://lucene.apache.org/tika/
The link to "Where Have All the Filters Gone" should be:
http://www.ideaeng.com/tabId/98/itemId/136/Where-Have-All-the-Filters-Gone.aspx (i.e. without trailing \)
5th-Mar-2009 08:51 pm (UTC)
Thank you! I will post corrections at once.
12th-Mar-2009 06:33 pm (UTC) - Tika and MS iFilters
Good stuff Avi.

Otis G has a whole chapter on Tika in his Lucene in Action book, in the SECOND Edition. The first edition has a chapter on filters as well, but not focused on Tika.

There are a few other filters, including the Microsoft iFilters framework if you happen to be on Windows. It's not open source, but it is free.
12th-Mar-2009 10:37 pm (UTC) - Re: Tika and MS iFilters
I added some more info in the next entry but your article covered the Microsoft filters better than I could.
7th-Apr-2009 07:49 pm (UTC)
Aperture. Don't know how good it is. All of the extraction comes in RDF form, so maybe it is easier to work with.

5th-Jun-2009 06:18 pm (UTC)
That does look really good, I will add it, thanks!
5th-Jun-2009 02:44 pm (UTC) - Use OpenOffice/JOD Converter as a converter
http://www.artofsolving.com/opensource/jodconverter is an open source tool which uses the OpenOffice service to convert just about any office-like document:

OpenDocument Text (*.odt)
OpenOffice.org 1.0 Text (*.sxw)
Rich Text Format (*.rtf)
Microsoft Word (*.doc)
WordPerfect (*.wpd)
Plain Text (*.txt)
HTML (*.html)
OpenDocument Spreadsheet (*.ods)
OpenOffice.org 1.0 Spreadsheet (*.sxc)
Microsoft Excel (*.xls)
Comma-Separated Values (*.csv)
Tab-Separated Values (*.tsv)
OpenDocument Presentation (*.odp)
OpenOffice.org 1.0 Presentation (*.sxi)
Microsoft PowerPoint (*.ppt)
OpenDocument Drawing (*.odg)

And when you use the latest version of OpenOffice, all Office 2007 formats are supported as well. You can convert to pdf and use pdf2html to convert to a parsable format.

Maarten Rooseboom - Qweery
5th-Jun-2009 06:12 pm (UTC) - Re: Use OpenOffice/JOD Converter as a converter
Good point, I hadn't thought of that. Though running docx through PDF seems somewhat less than optimal. Still, in a pinch... Thanks!
5th-Jun-2009 09:30 pm (UTC) - Re: Use OpenOffice/JOD Converter as a converter
Well... it converts to a lot of formats, but I like the setup of converting everything to pdf and then that to html, since not all formats can be converted to html directly (this way you only need one function).

27th-Oct-2009 05:16 pm (UTC) - Alternatives to KeyView Export

I'm trying to find alternatives to KeyView Export. Can anyone recommend software of this sort for enterprise-level use?

12th-Oct-2010 10:05 am (UTC)
Thanks for sharing all the information! With all this being said I think I need a software to open files, I've read about it and they support multiple file formats. Did you have any experience with these?
12th-Oct-2010 09:42 pm (UTC)
That program is more matching file contents with the correct application, rather than opening foreign files for which one doesn't have the application. I'm a bit surprised anyone's selling software to do that, but there may be a need.