May 2nd, 2002

Product Report: Harvest

This is an off-site copy of the corresponding Product Report page on the website, designed to let you comment on the product and on the report itself. For more information about search tools, visit the website, where you can browse articles, in-depth analysis, and overviews of external resources.


Price: Free
Platforms: Windows NT 4, Unix

Usenet newsgroup: comp.infosystems.harvest

Harvest is an integrated set of tools to gather, extract, organize, search, cache, and replicate relevant information across the Internet. With modest effort, users can tailor Harvest to digest information in many different formats from many different machines, and offer custom search services on the web.

Harvest: Current Versions

Simon Wilkinson has re-energized the Harvest project by rewriting the spider in Perl. The robot gathers documents, summarizes them, and stores the summaries in a database for later indexing and searching. It supports metadata gathering and will offer resource descriptions in both the Harvest gatherd format (SOIF) and the W3C's RDF. More information is available at Tardis.
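To give a feel for what a SOIF resource description looks like, here is a minimal Python sketch that serializes one summary record. It is not taken from the Harvest code. It assumes the commonly documented SOIF layout: a template type and URL on the opening line, then one attribute per line written as a name, a byte count in braces, a colon, a tab, and the value. The function name soif_record, the sample attribute names, and the example URL are hypothetical, chosen only for illustration.

```python
def soif_record(url, attributes, template_type="FILE"):
    """Serialize one resource summary as a SOIF-style record (illustrative sketch).

    Assumed layout: '@TYPE { URL', then 'Name{byte-count}:<TAB>value' for each
    attribute, then a closing '}' on its own line.
    """
    lines = [f"@{template_type} {{ {url}"]
    for name, value in attributes.items():
        # The count is the size of the value in bytes, not characters.
        size = len(value.encode("utf-8"))
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical summary of a single gathered page.
    print(soif_record(
        "http://www.example.com/",
        {"Title": "Hello", "Keywords": "harvest gatherer soif"},
    ))
```

Because each value carries an explicit byte count, a reader of the record can take exactly that many bytes and does not need to escape newlines or braces inside the value.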

Kang-Jin Lee is continuing development of other parts of Harvest in C, and distributes the newer versions via Harvest at SourceForge.

Harvest-NG, also known as WebHarvest, is also on SourceForge.

Yet another version, created by Richard Leon Stajsic, compiles properly under Linux and marks the matched words with red text. It's distributed via ftp:

Harvest Original

Funding for the original Harvest project ended in August 1996, and the project is now officially over. The original Harvest was the basis for Netscape's Catalog Search server and may also be the basis for Compass Search. Parts of Harvest remain in WebGlimpse, and other parts are being commercialized as NetCache.

Articles (see also report on SOIF and RDM)

  • Comparing Open Source Indexers Infomotions Musings; May 29, 2001 by Eric Lease Morgan
    Describes the history and features of eight open-source search engines: freeWAIS-sf (aging code and hard to install, but good for searching email and public domain etexts); Harvest (powerful gathering features for frequently-changing data stores, good with structured documents); ht://Dig (tricky to configure, no phrase searching, automatic stemming and match word highlighting); Isearch (weak documentation and support, easy to install, dated interface, Z39.50 support); MPS Information Server (zippy indexing of both text and structured data, Z39.50 support, Perl API, limited documentation); SWISH-E (simple to install engine, CGIs in Perl and PHP still beta, good for HTML pages, recognizes new META tags, sorts results by field); WebGlimpse (easy to install and configure, requires commercial version for customized output); Yaz/Zebra (mainly Z39.50, no Perl API, mainly a toolkit to index and respond to distributed client queries). The article also points out that chaotic information is less than helpful, and encourages organization, structure, and vocabulary control.

  • Indexing, indexing, indexing by Eric Lease Morgan (April, 1998)
    Explains text and HTML search tools in the context of library use, with a short history of indexing technology and short reviews of Harvest, freeWAIS-sf, SWISH-E and ht://Dig.

  • Harvest Crossroads: September 1995 by Sarah Burcham
    Describes the structure of the Harvest system and its ability to reduce network traffic, server load, and index space requirements compared with WAIS.

  • The Harvest Information Discovery and Access System Proceedings of the Second World Wide Web Conference, 1994 by C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, and Michael F. Schwartz
    Short description of the early implementations of the Harvest system.