September 19th, 2002

searchtools.com

Product Report: ASPseek

This is an off-site copy of the corresponding product report page on the SearchTools.com website; it is designed to let you comment on the product and/or the reporting. For more information about search and search tools, visit SearchTools.com, where you can browse many articles, in-depth analyses, and overviews of external resources.

ASPseek

Product Information

Platforms: Linux, BSD (compiles on AIX and Solaris, but untested there).
Price: free, open source, under GNU GPL

Features

  • Open Source, written in C++ using STL, some use of SQL database for storage
  • Command line / config file search administration (see the indexing sketch after this list)
  • Can index multiple sites at the same time using threads and asynchronous DNS lookups
  • Option to show indexing progress (-N)
  • Indexing over HTTP, via an HTTP proxy, and over FTP
  • Search engine operates while index is updating.
  • Option for near-real-time index updating
  • Will index documents protected by SSL using HTTPS
  • Supports basic authentication (user name and password)
  • Indexes HTML and text documents
  • Requires external programs or scripts to index other file formats
  • Language support includes Unicode for mixing character sets in the index, charset guessing, and language mappings for Czech, Danish, Dutch, English, French, German, Italian, Norwegian, Polish, Portuguese, Russian, Slovak, Spanish, Turkish and Ukrainian, as well as Arabic, Greek, Hebrew, Japanese, Chinese (Big5 and GB2312) and Korean.
  • Duplicate detection
  • Very scalable, to several million documents
  • Zone searching (limit to a site or section of a site)
  • Standard and advanced search capabilities, including phrase search, Boolean queries and wildcard searches.
  • Spellchecking with ispell
  • Optional stemming for search results.
  • Hit highlighting in search results
  • Weight given to inbound links (pointing at a page) in relevance ranking
  • Local caching of indexed pages
  • Easy to customize results pages
  • Can cluster search results by site
  • Some code is based on mnoGoSearch, but the two projects have since diverged
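
Because ASPseek is administered from the command line and a configuration file, indexing runs are easy to automate. The Python sketch below shows one way a site might wrap the indexer in a script, using the -N progress option from the feature list above; the binary name and install path are assumptions for illustration, so check your own ASPseek installation for the real locations.

    # A minimal sketch of driving the ASPseek indexer from a script. Only the
    # -N progress option comes from the feature list above; the binary name
    # and install path below are assumptions, not taken from ASPseek's docs.
    import subprocess

    ASPSEEK_INDEXER = "/usr/local/aspseek/sbin/index"  # assumed install location

    def run_indexing(show_progress: bool = True) -> int:
        """Start a crawl/index run and return the indexer's exit code."""
        cmd = [ASPSEEK_INDEXER]
        if show_progress:
            cmd.append("-N")  # progress option mentioned in the feature list
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout, end="")
        return result.returncode

    if __name__ == "__main__":
        raise SystemExit(run_indexing())

Since the search engine keeps operating while the index is updating, a wrapper like this can be run from cron without taking the search front end offline.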

Examples

searchtools.com

Product Report: ht://Dig

ht://Dig

Product Information

Price: free, open source under the GNU General Public License
Platforms: Solaris, HP-UX, IRIX, SunOS, Linux, Mac OS X, Mac OS 9 (from Tenon)

Potential security flaw: users of versions 3.1.0b2 through 3.1.5 and 3.2.0b3 should upgrade

Features

  • Free and open-source, written in C++
  • Extremely helpful user group
  • Designed for source-level modification and customization
  • ConfigDig: a template-based HTML front end for easy search administration from any browser. This allows remote configuration by search admins who are not expert at Unix command-line interfaces.
  • Can handle multiple sites and over 100,000 pages
  • Index spider is quite robust and handles error conditions gracefully.
  • Metadata indexing is configurable; Dublin Core (DC) tags are easy to add.
  • Can index PDF, MS Word, and PowerPoint via external converters (see the Indexing File Formats notes under Articles & Reviews below)
  • Many options for indexing and searching non-exact matches, including stemming, soundex and fuzzy matching (a soundex illustration follows this list).
  • Note: version 3.2 will add phrase matching; the current version supports only AND and OR Boolean searches.
  • Searching on field or metadata contents not yet implemented.
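
The fuzzy-matching options listed above include a soundex algorithm. As a rough illustration of what soundex-style matching does, the Python sketch below (a simplified version of the classic algorithm, not ht://Dig's own code) maps words to short phonetic codes so that similar-sounding spellings collide:

    # A simplified classic soundex, shown only to illustrate the kind of fuzzy
    # matching listed above; it is not ht://Dig's implementation.
    SOUNDEX_DIGITS = {
        **dict.fromkeys("bfpv", "1"),
        **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"),
        "l": "4",
        **dict.fromkeys("mn", "5"),
        "r": "6",
    }

    def soundex(word: str) -> str:
        """Return a 4-character phonetic code, e.g. soundex('Robert') == 'R163'."""
        word = "".join(c for c in word.lower() if c.isalpha())
        if not word:
            return ""
        first = word[0].upper()
        digits = [SOUNDEX_DIGITS.get(c, "") for c in word]
        code = []
        prev = digits[0]
        for d in digits[1:]:
            if d and d != prev:
                code.append(d)
            prev = d
        return (first + "".join(code) + "000")[:4]

    # Words with the same code are treated as fuzzy matches:
    assert soundex("Robert") == soundex("Rupert") == "R163"

In a search engine, such codes let a query term also match differently-spelled words that share the same code.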

Versions

  • Version 3.2 still in beta test
  • 3.1.6, October 2001: fixes a potential security flaw; 3.1.x users should update to this version.
  • 3.1.5, February 2000: fixes bugs including a serious security-related bug in all previous releases.
  • 3.1.3, September 1999: fixes additional bugs in META robots parsing, compound-word handling, etc.
  • 3.1.2, April 1999 (Red Hat updates August 1999): improved Acrobat PDF compatibility and META description tag display; many bug fixes including Y2K improvements.
  • Version 3.1 released February 9, 1999.

Articles & Reviews

  • The Open Road: Using ht://Dig UnixReview, April 2002 by Joe "Zonker" Brockmeier
    Part 1 is a short but helpful discussion of how the indexing and search work, formatting results, scheduling and configuration. Part 2 talks about tuning the search engine for speed and efficiency.

  • Comparing Open Source Indexers Infomotions Musings; May 29, 2001 by Eric Lease Morgan
    Describes the history and features of eight open-source search engines: freeWAIS-sf (aging code and hard to install, but good for searching email and public-domain etexts); Harvest (powerful gathering features for frequently-changing data stores, good with structured documents); ht://Dig (tricky to configure, no phrase searching, automatic stemming and match-word highlighting); Isearch (weak documentation and support, easy to install, dated interface, Z39.50 support); MPS Information Server (zippy indexing of both text and structured data, Z39.50 support, Perl API, limited documentation); SWISH-E (simple-to-install engine, CGIs in Perl and PHP still beta, good for HTML pages, recognizes new META tags, sorts results by field); WebGlimpse (easy to install and configure, requires commercial version for customized output); and Yaz/Zebra (mainly Z39.50, no Perl API, mainly a toolkit to index and respond to distributed client queries). The article also points out that chaotic information is less than helpful and encourages organization, structure, and vocabulary control.

  • I love it when a plan comes together PalmPower magazine: March 2001 by David Gewirtz
    Rambling but cheerful description of setting up a search engine for ZATZ web sites using ht://Dig, indexing only the appropriate articles and not the alternate forms or contents pages. Some digressions into robots.txt, Linux and PHP.

  • Indexing File Formats: from the ht://Dig mailing list, 19 December 2000, by David Adams
    • ht://Dig can index PowerPoint using ppt2html, though perhaps not the 2000 format.
    • It can also index Microsoft Word documents using wp2html ("which extracts the 'subject' from the document summary and puts it in the header, which might be a problem") and catdoc (which "does often include gibberish in its output, and you could find removing the -b option in the call of catdoc an improvement").
    • Doc2html.pl uses pdfinfo to extract the title of the .PDF file, and Adams reports seeing .PDF documents where the title is 'þÿ ' (apparently the UTF-16 byte-order mark showing through as Latin-1 text). You might need to modify doc2html.pl to suppress such titles; a hedged cleanup sketch appears at the end of this Articles & Reviews section.

  • Search Engines: The Hunt Is On Network Computing Magazine: October 16, 2000 by Avi Rappoport
    In-depth discussion of search engines for e-commerce and other web sites, covering features and future trends, software vs. services, database vs. text searching, natural-language searching, and open-source search engines, including ht://Dig and mnoGoSearch.

  • Search This! Developer Shed, March 15, 1999 by Colin Viebrock
    Helpful hints and information about installing, configuring, indexing, searching, and displaying results, specifically for those running PHP servers.

  • ht://Dig: Recognized META information. How to set ht://Dig to recognize meta keywords, email addresses, and other information.

  • ht://Dig survey results
    List of more than 35 ht://Dig installations, including number of servers, documents, and words; update frequency; hits per day; index size; primary use (intranet, educational, etc.); and problems encountered.

  • Entscheidung gegen den Compass Server ("Decision Against the Compass Server"; in German) late 1998, minor updates 1999, by Walter Hafner
    Evaluation compares Compass with ht://Dig and PLWeb. Praises Compass for its browser-based administration system, filtering, virtual hosts, and realtime monitoring; downsides are the proprietary database format and the price. PLWeb has some text configuration files, very efficient customization, good indexing speed and scheduling, and support for multiple indexes, and it is free; however, it had problems with multiple servers, domains, and virtual hosts. ht://Dig is open source and free, but can be hard to configure, is slow to index and search large indexes, and provides minimal monitoring features.

  • Web Developer.Com Guide To Search Engines, by Wes Sonnenreich and Tim MacInta; John Wiley & Sons, February 1998; ISBN 0471246387; $34.99.
    A wide-ranging book covering everything from the beginnings of the robot spiders crawling and indexing the web, to analysis of the major webwide search engines, to detailed information on installing and configuring six local site search tools. The programs covered are AltaVista Search Intranet, Excite for Web Servers, Harvest, ht://Dig, Phantom, and Ultraseek. Also describes BDDBot, an ongoing collaborative project to create a Java web server and search spider, released as open source under the GNU General Public License.

  • Summary of Site Search Packages WebReview Nov. 21, 1997
    Chart covering Excite, Microsoft Index Server, ht://Dig, Verity Search97, Netscape Catalog and SWISH.

  • Web Site Search Engines (Appendices) The PIPER Letter: November, 1996
    Covers Basis Webserver, Excerpt, Excite, Folio, freeWAIS-sf, FrontPage, Fulcrum, Glimpse, ht://Dig, Ice, Isearch, PL Web, Swish, TEAMate, WebFind/WebIndex, and WebSite.
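
Returning to the Indexing File Formats notes above: ht://Dig relies on external converters for non-HTML formats, and David Adams's advice amounts to post-processing their output before it is indexed. The Python sketch below illustrates the kind of title cleanup he suggests; the function name and the fallback title are assumptions for illustration, not code from ht://Dig or doc2html.pl.

    # A hedged sketch of the kind of title cleanup David Adams suggests for
    # doc2html.pl output. The function name and the "Untitled document"
    # fallback are illustrative assumptions, not code from ht://Dig.
    import re

    def clean_title(raw_title: str) -> str:
        """Discard junk titles such as 'þÿ' (a UTF-16 byte-order mark rendered
        as Latin-1 text), which Adams reports pdfinfo sometimes emits."""
        title = raw_title.strip()
        if not title or title in ("þÿ", "\ufeff"):
            return "Untitled document"
        if not re.search(r"\w", title):  # no letters or digits at all
            return "Untitled document"
        return title

    # Example: a converter wrapper would call this before writing the <title>
    # element of the generated HTML that the ht://Dig spider indexes.
    assert clean_title("þÿ ") == "Untitled document"
    assert clean_title("Annual Report 1999") == "Annual Report 1999"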

Examples