February 25th, 2010


Indexing web pages with Solr, like magic

Once upon a time, not so very long ago, the only people who could get Lucene/Solr to index HTML and plain text files were wizards skilled in the ways of compiler compatibility, library dependencies, and the dreaded make.

Fear no more! As if by magic, it's now quite simple.
  • Firstly, the Solr project has incorporated the Apache Tika code, which can open and read text from most of the popular file formats, including plain text, XML, PDF, Microsoft Office formats, and even HTML. Tika can do more; it's a full content analysis toolkit, but for Solr purposes, it's the opening and reading that matter.

    The Solr interface for Tika is Solr Cell (in the source code, ExtractingRequestHandler), and it just works. You call Solr update with the addition of the RESTful path /extract, give it a file name and a few parameters, and zing! it's indexed. If you use a corresponding schema, not only is the text indexed, but internal metadata (like the title tag) and external metadata (like file name and size) are also stored as fields which can be indexed, stored, and searched.
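    A call like that can be sketched with curl; this is a minimal sketch, assuming the stock example Solr running locally with the handler at its default /update/extract path, and the document id and file name here are placeholders:

    ```shell
    # Send a file to Solr Cell for extraction and indexing.
    # literal.id supplies the unique id field; commit=true makes the
    # document immediately searchable. "myfile" and "tutorial.html"
    # are placeholder names.
    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
         -F "myfile=@tutorial.html"
    ```

    Tika sniffs the file type on its own, so the same command works for PDF or Office documents as well as HTML.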
  • Secondly, I have written a tutorial on exactly how to use Solr Cell to index text and HTML files, using the curl command line utility. It walks through these steps:

    • changing the example LucidWorks or Solr Tutorial schema to store full text
    • test-indexing a local XML file
    • indexing a local text file
    • indexing all text files in a folder
    • indexing a local HTML file
    • indexing a remote HTML file as a web page

      It also has lots of suggestions on where to go next.
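That last step can be sketched by chaining two curl calls; again a minimal sketch, assuming a local example Solr, with the page URL and document id as placeholders:

```shell
# Fetch a remote page, then stream it to Solr Cell from stdin.
# "@-" tells curl -F to read the part from standard input, and
# ";type=text/html" tells Tika what it is receiving.
curl -s http://www.example.com/ |
  curl "http://localhost:8983/solr/update/extract?literal.id=webdoc1&commit=true" \
       -F "myfile=@-;type=text/html"
```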
The tutorial is free to everyone, thanks to Lucid Imagination, who paid me to write it. It is on their site in the solutions section as Indexing Text and HTML Files with Solr (registration required). Thanks also to the Solr Cell committers, who made everything so much easier than before.