June 7th, 2004

searchtools.com

Product Report: Nutch

This is an off-site copy of the corresponding product report page on the SearchTools.com website, designed to let you comment on the product and/or the reporting. For more information about search and search tools, visit SearchTools.com, where you can browse many articles, in-depth analyses and overviews of external resources.

Nutch (open source web-scalable search engine)

Information

Platform: Java (Tomcat)
Price: free


Nutch is a project headed by Doug Cutting (formerly of Apple VTwin, Xerox PARC and Excite, and the creator of Lucene) to build an open-source search engine scalable enough to index the entire web. It can also be used for smaller projects such as site, multi-site and intranet searching. It includes a Java crawler, plus an indexer and search engine based on the Lucene open-source search code library. The project is still in development and is not complete; it is partly supported by Yahoo Research and the Internet Archive.

Features

  • Robot crawler, can use proxy
  • Includes hosts via grep, exclusion by host names and suffixes
  • Continuous indexing
  • FTP indexing login option
  • Index logging options
  • Flexible query parsing
  • Includes link-analysis module (mainly for multi-site search)
  • Includes approximately fifteen relevance quality adjustment options
  • Caches original page for display

Articles & Reviews

  • Doug Cutting Interview : May 28, 2004, by Philipp Lenssen
    Discussion of search engine architecture, Nutch, Lucene, open-source search engines, web search, spamming, speed, and the future of search.

  • Building Nutch: Open Source Search : April 2004, by Mike Cafarella and Doug Cutting
    Describes the issues and challenges of designing a hugely scalable search engine, the advantages of open-source projects, spam techniques and responses, and cost-effectiveness.

  • Free Search (Doug Cutting's Blog)
    Interesting issues of search and the web.

Examples