SearchTools Blog
Indexing web pages with Solr, like magic 
25th-Feb-2010 12:16 pm
searchtools.com
Once upon a time, not so very long ago, the only people who could get Lucene/Solr to index HTML and plain text files were wizards skilled in the ways of compiler compatibility, library dependencies, and the dreaded make.

Fear no more! As if by magic, it's now quite simple.
  • Firstly, the Solr project has incorporated the Apache Tika code, which can open and read text from most of the popular file formats, including plain text, XML, PDF, Microsoft Office formats, and even HTML. Tika can do more (it's a content analysis toolkit), but for Solr purposes, it's the opening and reading that matters.

    The Solr interface for Tika is Solr Cell (in the source code, ExtractingRequestHandler), and it just works. You call Solr update with the addition of the RESTful path /extract, give it a file name and a few parameters, and zing! it's indexed. If you use a corresponding schema, not only is the text indexed, but internal metadata (like the title tag) and external metadata (like file name and size) are also stored as fields, which can be indexed, stored, and searched.
  • Secondly, I have written a tutorial on exactly how to use Solr Cell to index text and HTML files, using the curl command line utility. It walks through these steps:

    • changing the example LucidWorks or Solr Tutorial schema to store full text
    • test-indexing a local XML file
    • indexing a local text file
    • indexing all text files in a folder
    • indexing a local HTML file
    • indexing a remote HTML file as a web page

    It also has lots of suggestions on where to go next.
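
For readers who would rather script the /extract call than type curl commands, here is a minimal Python sketch of the same idea. It is not taken from the tutorial. The server address and the /solr/update/extract path are assumptions based on a default local Solr example install, and the helper names (build_extract_url, build_request, index_folder) are made up for illustration. It uses the file name as the document id via the literal.id parameter and posts the raw file bytes with a guessed Content-Type, so check the parameters against your own schema before relying on it.

```python
import mimetypes
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed location of a local Solr instance; adjust to your own setup.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, commit=True, base_url=SOLR_EXTRACT_URL):
    """Return the /extract URL with the unique id passed as a literal field."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return base_url + "?" + urlencode(params)


def build_request(path, base_url=SOLR_EXTRACT_URL):
    """Prepare (but do not send) a POST of one file's raw bytes to Solr Cell."""
    path = Path(path)
    # Fall back to a generic type when the extension is unknown.
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    return Request(
        build_extract_url(path.name, base_url=base_url),
        data=path.read_bytes(),
        headers={"Content-Type": content_type},
        method="POST",
    )


def index_folder(folder, pattern="*.txt"):
    """Send every matching file in a folder to Solr; needs a running server."""
    for path in sorted(Path(folder).glob(pattern)):
        with urlopen(build_request(path)) as response:
            print(path.name, response.status)
```

Calling index_folder("docs") would cover the "all text files in a folder" step, and index_folder("docs", "*.html") the local HTML case. Both only work against a running Solr whose schema accepts the extracted fields.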
The tutorial is free to everyone, thanks to Lucid Imagination, who paid me to write it. It is on their site in the solutions section as Indexing Text and HTML Files with Solr (registration required). And thanks to the Solr Cell committers, who made everything so much easier than before.
Comments 
26th-Feb-2010 05:40 pm (UTC)
Great! I'm personally going to switch to Solr at some point. Then your tutorial will be very useful. Thank you for writing it.
22nd-Mar-2010 06:36 pm (UTC)
I enjoyed it -- please tell all your friends :-)
22nd-Mar-2010 07:44 pm (UTC)
If it is a success, I certainly will. We shall see. It looks like it is both efficient and highly customizable.
22nd-Mar-2010 01:05 pm (UTC)
Anonymous
It is indeed a refreshing change that average developers will be able to implement such functionality. These latest advancements allow us to explore the indexing technique comprehensively. I did try Solr and found that it works decently. Its indexing techniques can be reviewed at

http://www.lucidimagination.com/search/?q=indexing
22nd-Mar-2010 06:41 pm (UTC)
Thank you for your appreciation.
11th-Feb-2011 11:40 am (UTC) - HTML indexing in SOLR
Anonymous
Hello everyone... I'm a layman in terms of Solr and have very little knowledge of indexing in Solr. I can index an XML file, but I need your guidance to index an HTML page with hyperlinks in it. Please guide me, in an elaborated way, through the prerequisites that I need in order to crawl an HTML page in Solr. THANK YOU.
16th-Feb-2011 12:08 am (UTC) - Re: HTML indexing in SOLR
I would recommend checking out http://constellio.com for an open-source implementation of Solr which comes with a web crawler and admin interface. That's the best way to come up to speed quickly.