July 7th, 2003


SearchTools Report: Taxonomies, Classification, Categorization

Taxonomies, Categorization, Classification, Categories, and Directories for Searching


The terms taxonomy, ontology, directory, cataloging, categorization and classification are often confused and used interchangeably. These are all ways of organizing information (or things or animals) into categories.

There are a number of applications that can help people create taxonomies and place information objects within their categories, although the amount of automation can vary. Some programs simply allow anyone to manually add a URL to a specific category by submitting a site. Others allow human catalogers to create sophisticated rules to specify certain words and phrases which will place a page in a category. Others accept a "training set" within an existing taxonomy, and will place documents in categories based on similarities. Still others attempt to automate the entire process, grouping pages into topics based on programmatic evaluation of the contents.
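To make the rules-based approach concrete, here is a minimal sketch in Python; the category names and trigger phrases are invented for illustration, not taken from any product.

    # A toy rules-based categorizer: assign a page to every category
    # whose trigger words or phrases appear in its text.
    RULES = {
        "Sports > Running": ["cross-trainer", "marathon", "running shoes"],
        "Health > Sports Medicine": ["sports injury", "physical therapy"],
    }

    def categorize(text):
        """Return every category whose trigger terms appear in the text."""
        text = text.lower()
        return [category for category, terms in RULES.items()
                if any(term in text for term in terms)]

Real cataloging tools add weights, boolean combinations and exclusion rules, but the core idea is the same: match words and phrases, then assign categories.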

When evaluating these applications, remember that they are simply software. No matter how elegant the algorithms, a computer program can never truly understand the concepts in a page as a human can, and will sometimes place pages in the wrong categories. For example, one very automated system had an "Arts and Humanities" category which included links to an Internet services consulting company and a singer-songwriter's personal home page (along with many more appropriate pages). To serve your site or intranet users, plan for a significant amount of human cataloging and editing.


Glossary and Definitions

A directory is an organized set of links, like those on Yahoo or the Open Directory Project, which allows a web site to display the scope and focus of its content. A directory can cover a single host, a large multi-server site, an intranet or the Web. At each level, the category names provide instant context information to users. Rather than a simple list, such as the results of a search, drilling down into more and more specific categories (for example Shopping > Clothing > Footwear > Athletic) shows how the pages fit into the larger set of information.

Categorization is the process of associating a document with one or more subject categories. So the entry for a page on cross-trainer shoes could go into Running, Manufacturing, Sports Medicine, or Rushkoff, Douglas! All of these are legitimate, depending on the context.

Cataloging and Classification come from libraries, where specialists enter the metadata (such as author, date, title and edition) for a document, apply subject categories to it, and place it into a class (such as a call number) for later retrieval. These tend to be used interchangeably with Categorization.

Clustering is the process of grouping documents based on similarity of words, or the concepts in the documents as interpreted by an analytical engine. These engines use complex algorithms including Natural Language Processing, Latent Semantic Analysis, Bayesian statistical analysis, and so on.
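As a toy illustration of word-similarity clustering, here is a single-pass sketch in Python using a bag-of-words model and cosine similarity; the threshold and grouping strategy are assumptions for the example, and real engines use the far more sophisticated methods named above.

    import math
    from collections import Counter

    def vector(text):
        # Bag-of-words term frequencies.
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def cluster(documents, threshold=0.3):
        # Join each document to the first cluster it resembles closely
        # enough; otherwise start a new cluster.
        clusters = []
        for doc in documents:
            v = vector(doc)
            for group in clusters:
                if cosine(v, group[0]) >= threshold:
                    group.append(v)
                    break
            else:
                clusters.append([v])
        return clusters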

A Thesaurus is a set of related terms describing a set of documents. This is not hierarchical: it describes the standard terms for concepts in a controlled vocabulary. Thesauri include synonyms and more complex relationships, such as broader or narrower terms, related terms and other forms of words.
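The standard thesaurus relationships can be pictured as a small data structure. The entry below is invented for illustration, using the conventional codes UF (use for), BT (broader term), NT (narrower term) and RT (related term):

    thesaurus = {
        "footwear": {
            "UF": ["shoes"],                    # synonyms: use "footwear" for "shoes"
            "BT": ["clothing"],                 # broader term
            "NT": ["athletic shoes", "boots"],  # narrower terms
            "RT": ["socks"],                    # related terms
        },
    }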

Taxonomy is the organization of a particular set of information for a particular purpose. The term comes from biology, where it's used to define the single location for a species within a complex hierarchy. Biologists argue about where various species belong, although DNA analysis can resolve most of the questions. In informational taxonomies, items can fit into several taxonomic categories.

Ontology is the study of the categories of things within a domain. It comes from philosophy and provides a logical framework for academic research on knowledge representation. Work on ontologies involves schema and diagrams for showing relationships in Venn diagrams, trees, lattices and so on.


Resources

Organizational Theory section of the Argus Center for Information Architecture
A splendid annotated bibliography of the best works on organizing information and classification as well as indexing, thesaurus construction and controlled vocabularies.

Content Wire Taxonomies News
Links to relevant articles, updated several times a month.

Guided Tour of Ontology
Definitions and background information for ontologies as part of the Semantic Web.

Taxonomy Primer from Lexonomy
Discusses controlled vocabularies for web sites. Includes recommendations for buying and building vocabularies, and applying them to navigation and searching. Notes that special search features include "synonym rings," and that a hierarchical arrangement is often known as a taxonomy. Provides additional tips and suggestions based on extensive experience.

Classification Society of North America

International Federation of Classification Societies

Bibliography on Automatic Text Categorization
Fabrizio Sebastiani, a research scientist at Consiglio Nazionale delle Ricerche (Italian National Research Council) provides a listing of scientific research articles on automatic classification and categorization.

Classification, Indexing, Metadata and Thesauri - link page at UMass Amherst

Taxonomy, Classification and Metadata Resources - link page at Scottish electronic Staff Development Library

See also: Faceted Metadata and Search


Articles and Overviews

  • Taxonomies for Practical Information Management NIE Enterprise Search Newsletter, April 25, 2003 by John Lehman
    A clear and concise description of the kinds of categories used in business (such as Industry Segment, Technologies, Geography) and a helpful checklist of the useful elements of a taxonomy. By a leader in the field of classification.
  • Ten taxonomy myths Montague Institute, November 27, 2002
    Taxonomy experts discuss the many kinds of taxonomies, separating content and user-oriented taxonomies and the process of creating taxonomies. Very useful advice based on long experience.

  • Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
    This report describes the huge amounts of unstructured data in enterprise computer networks, the time wasted re-creating this information, and the lack of tools equivalent to data mining, business intelligence and OLAP. Identifies four application sectors: content, document and knowledge management; search and retrieval; categorization, taxonomy and data visualization; and XML databases. The analysts evaluate these sectors from a business perspective, defining strengths, trends, pricing and leading vendors. Points out that the Web expanded the size of the search market but did not sustain it, while the categorization market is volatile, with many small and recently-acquired companies. The analysts believe that the leading relational database vendors (IBM, Oracle and Microsoft) may be able to lead the unstructured data market as well. Describes Verity K2 as an integrated search and taxonomy system, preferring it over Autonomy, a "Rolls Royce" company in knowledge management and collaboration, and over Ultraseek (Inktomi Enterprise Search), a lower-value search engine even with the Quiver taxonomy engine. In categorization and visualization, they feel that InXight is the best among a field that includes Antarctica, Applied Semantics, ClearForest, IBM Lotus Discovery, Mohomine, Entrieva (Semio), Stratify and The Brain.

  • Standards Target Categorization eWeek, July 15, 2002, by Jim Rapoza
    Describes the value of standards for categorization, and the value of RDF and other standards which describe the most important terms and categories for a document. While not yet widely adopted or supported, this will reduce the demands on content categorization engines.

  • Three Paths to Sorting Content eWeek, July 15, 2002 by Jim Rapoza
    Describes testing three categorization engines and compares approaches: Applied Semantics' Auto-Categorizer 1.1, Interwoven's MetaTagger 3.0 and Thunderstone's Texis Categorizer 4.1. Used three sets of content: science educational materials, government health documents, and analyst reports, in a variety of file formats. Worked through the process step-by-step with vendors. Finds that they address different aspects and require technical integration in different ways. Recommends adding some category suggestions during authoring, and adjusting standard taxonomies to individual company usage.

  • The difficulty of categorization Knowledge Management Connection, spring 2002
    Proposes several ways to make classification easier in organizations, including focusing on information seekers, leveraging existing taxonomies, using the right people and tools, and avoiding up-front universal structures.

  • Taxonomy & Content Classification: Delphi Group Report (guest or customer access required) Delphi, April 11, 2002
    Defines the increasing need for tools to organize information and avoid overload, especially with ambiguous words. Covers the evolution of technologies from simple to more complex search engines with extreme recall, metadata search, link ranking, and taxonomy-creation software for hierarchical structures. Describes how people find information, both for known-item searches and for discovery about a topic, where arrangements of subject categories trigger associations and relationships. Interactive and iterative browsing within categories helps locate dynamic data which otherwise might be hard to find; searching within a category is more focused and avoids false matches. Taxonomies can integrate with internal applications such as proposal generation, CRM and data mining. Provides a useful checklist for analyzing taxonomy tools. A survey of 450 executives, IT staff and managers at large enterprise organizations with at least 50,000 documents shows that most spend more than 2 hours per day searching for information, 73% say that finding information is difficult, and the main impediments were "bad tools" (28%) and "data changes" (35%). Describes the principles of automatic classification and taxonomy-building, with details on eleven products. NOTE: vendors paid Delphi to be included in this report.

  • Tools of the Trade eWeek April 1, 2002 by Lisa Vaas
    Describes the value of good taxonomies in the context of a portal. Describes the situation at the U.S. Geological Survey's Center for Biological Information, where, despite a biological taxonomy and expert staff, the project is stymied by internal issues such as security. The cost of automatic portal taxonomy building is a major obstacle for the law firm of Hunton & Williams, as is the difficulty of defining content organization.

  • Taxonomies are what? Free Pint, October 4, 2001 (issue 97) by Liz Edols
    Some people now say that 'taxonomies are chic'; this article provides excellent coverage of varying definitions of and articles about taxonomies, examples, and relationships to classification and indexing. Also lists taxonomy software.

  • Taxonomy: Creating Knowledge from Chaos IT-Director.com; April 18, 2001 by "the Snark"
    Defines taxonomy as "the creation of structure (arrangement) and labels (name) to aid location of relevant information," which can be expressed in hierarchical categories. Within the context of web searching, sees XML as an enabling technology to express taxonomies. An example is Microsoft, which created a complete taxonomy of all information (manually) and reported a 62% reduction in the number of clicks required to find information and an 11% increase in the task success rate.

  • Managing Taxonomies Strategically Montague Institute Review, March 2001
    Summary outline of a longer article defining taxonomies and describing how they work.

  • Little Blue Folders Argus Center for Information Architecture; July 10 2001 by Peter Morville
    Discussion of the hyperbole and reality of automated categorization, with a strong recommendation for combining software-generated classification and human editorial judgment.

  • Online Taxonomies: The Next Shift for Information Science Content Wire; March 2001 by Paola Di Maio
    Meditation on the nature of online taxonomies, which should be dynamic, current, "referenceable" and scalable.

  • Word Wranglers Intelligent KM January 1, 2001 by Katherine C. Adams
    Survey of automatic classification software, as a support function for Enterprise Knowledge Management. Background includes the value of classification within the enterprise, how taxonomies facilitate browsing, topical relevance, how classification technology works, and so on. Divides approaches into rules-based, example-based and statistical clustering. Predicts that "cyborg" cataloging will combine automated tools with human judgment, taking advantage of XML metadata and new technology such as Support Vector Machines (SVM). Table of features covers Autonomy, Cartia Themescape, Inxight, Metacode [bought by Interwoven], Mohomine, Semio and Verity.

  • Automatic Categorization of Magazine Articles Proceedings of Informatiewetenschap 1999; 12 November 1999 by Marie-Francine Moens and Jos Dumortier.
    Academic report on results of tests with automated classification. Compares their c2 system with Bayesian and Rocchio algorithms; not surprisingly, theirs comes out better. Most interesting is the evaluation of how well the categorization performed compared to human classifiers. Examples, in Dutch, are included in the appendix.

  • Practical Taxonomies: Hard-won wisdom for creating a workable knowledge classification system (from Knowledge Management: Built to Order) January 1999 by Sarah L. Roberts-Witt
    Describes the challenges of creating a knowledge-classification system and the results of a survey of enterprise administrators who have done it. Includes quotes from consulting firms, government agencies and accounting firms. Recommends "broad and flat" organization as more straightforward, avoiding a search for perfection and allowing for evolution. Suggests a balance between human and automatic categorization.

  • Complexity In Indexing Systems -- Abandonment And Failure: Implications For Organizing The Internet ASIS Conference, October 1996 by Bella Haas Weinberg
    Describes the history of classification schemes from Dewey and UDC through Ranganathan and CIFT and the changes in the US Library of Congress Subject Headings approaches. Complex and individualized indexing schemes tend to move towards the "synthetic", providing standard subdivisions to be applied as appropriate, because of cost, training requirements, human inconsistency and rigidity due to theoretical flaws. Describes problems with applying traditional approaches to the vast and varying content of the Internet, and recommends applying back-of-the-book index approaches to the Web.

Classification Tools

Now listed on the Classification Tools page.

SearchTools Listing: Classification and Taxonomy Tools

Tools for Taxonomies, Browsable Directories, and Classifying Documents into Categories



For Definitions, Articles and Resources, see the Taxonomies and Classifiers page which discusses the entire concept of automated classification, categories, taxonomies, clustering, hierarchies, and browsable listings. See also the Visualization Tools report.

Classification Tools & Services

Applied Semantics Auto-Categorizer
Combines automated categorization with editorial tools for human judgment in building taxonomies, with no training documents or rules required. Categorization works in real time. Processing is based on a linguistically created ontology combining millions of words, meanings and conceptual relationships. Automatically categorizes given content and then allows administrators to create unique taxonomies with a Windows client console, mapping categories and subcategories to the main ontology. Includes a test tool to view content in the ontology. Works across many languages, scales quickly, returns results immediately, and integrates using XML, with APIs for C, Java, Perl and Visual Basic.
Report: Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
Discusses the business aspects of the company, pricing, markets (domain names, paid placement for search terms, and publishing), competition and the technology. Finds the company profitable, successful in its niches, but not clear about scaling.
Report: Searching for Value in Search Technology (subscribers only) Gilbane Report Vol 10, Num 7, September 2002 by Sebastian Holt
Highlights solution providers, praises Applied Semantics for starting with their proprietary ontology and applying technology to solve real problems in the target markets.
Article: Three Paths to Sorting Content: AutoCategorizer 1.1, eWeek, July 15, 2002 by Jim Rapoza
Description of the approach to categorization, praises ease of control, support for publishing industry. Approx. $150,000.
Article: Finding the right piece of information, Structure and meaning improve search results KMworld Magazine, September 2001 by Judith Lamont
Describes several intranet and commerce search solutions which attempt to use structure and language to improve search results. Certified General Accountants of Ontario, a professional association for accountants, used Hummingbird's KnowledgeServer to find content, reducing call center inquiries. Coldwater Creek used EasyAsk for its online store, to improve responsiveness to customers. Applied Semantics categorizer and summarizer can augment search engine results.
 
AskJeeves JeevesOne (link to SearchTools Listing Report)
Categories created to answer questions rather than provide a simple listing of page matches.
Examples: AskJeeves, Dell Online Support: Ask Dudley

Autonomy Categorizer (see also SearchTools Listing for Autonomy Search)
Creates concept maps and topic clusters without human intervention by using Bayesian probability and pattern-recognition. Users have complained that the system can be slow and the automation difficult to understand.  
Report: Taxonomy & Content Classification: Delphi Group Report (guest or customer access required) Delphi, April 11, 2002
Vendor-submitted information about the automatic cross-lingual clustering and classification, tagging with metadata, and generating hierarchical trees.
 
Bravo engine from Global Wisdom
Gathers feedback from end-users to update taxonomies and create topic maps. 
 
BrightPlanet Deep Web Directory
An automatic portal indexer and classifier service that places high-priority documents into the portal directory structure. Also performs metasearch on local and external sources such as webwide search engines and web-searchable databases.
 
Carrot2
Carrot2 is a research framework for experimenting with automated querying of various data sources (such as search engines), processing search results, and visualizing them. Unfortunately, the system does not work with Mac OS X browsers. Thanks to Gary Price of Resource Shelf for the link.
 
Convera RetrievalWare (link to SearchTools Listing Report)
Categorization using semantic models, handles multimedia content.
Article: Taxonomy & Content Classification: Delphi Group Report (guest or customer access required) Delphi, April 11, 2002
Vendor-submitted information about categorization using synchronized taxonomies, semantic networks in many languages.
 
DarWin Set (link to SearchTools Listing Report)
Dynamic Categorization with categories created on the fly. There are no predefined or established categories.

EasyAsk (link to SearchTools Listing Report)
Works with database structures to generate categories, and puts items from multiple search engines into topical categories.

80-20 Discovery (link to SearchTools Listing Report)
Uses neural net algorithms to create categories on the fly.
 
Entrieva (formerly Semio)
Data mining software that uses linguistic analysis and rules to extract concepts from textual information and displays the concepts and relationships in a 3-D map. Creates taxonomies by fitting content into existing categories in the fields of defense, drugs, health care and technology.
Report: Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
Discusses the business aspects of the company, pricing, markets, competition and the technology. The recent acquisition is a cause for concern, but also an opportunity to create more standard APIs and support Web Services.
Report: Taxonomy & Content Classification: Delphi Group Report (guest or customer access required) Delphi, April 11, 2002
Vendor-submitted information covers linguistic analysis, concept extraction and statistical clustering techniques. Works with lexicons and existing taxonomies. Comes with vertical market taxonomies and thesauri.
Article: Semio Brings Concepts To Web Search System WebWEEK, 1997/02/03 by Jeremy Carl
Describes Semio's attempt to improve text searching by viewing conceptual relationships in a 3-D model.

GammaSite
Uses Machine Learning with a small training set to create taxonomies and add documents to categories. Allows human oversight of structure and easy maintenance.
Article: Get ready for the digital librarian Jerusalem Post; July 1, 2001 by Gwen Ackerman
Describes features of the software, quotes the president on the advantages of machine learning. Mentions a successful installation at the UK Daily Telegraph and winning a test run by the Encyclopedia Britannica.
Article: Content Taxonomy Talk Content Wire; September 6, 2001
Interview with the company officers describes the value of categorization specialization, winning the Britannica test, minimal manual labor for customers, and taxonomy flexibility.
   
GuideBeam
Reformulates queries and post-processes search results to cluster them by category. Example uses public search engines, but could easily be applied to site and Intranet search. 
 
H5 Technologies Content Categorization
Automatic organization based on "aboutness" using a proprietary algorithm that does not rely on linguistic analysis.
 
Hummingbird Knowledge Server (link to SearchTools Listing Report)
Automated clustering, categorization and visualization tools, along with a search engine, in a knowledge management suite.
Article: Hummingbird Fulcrum KnowledgeServer 3.5 CRN ChannelWEB Test Center April 6, 2001
Summary of test results indicates that automatic clustering "is not an exact science" and requires manual processing. Also describes features of crawling, hierarchical display, distributed searching, custom weighting and search query options.

HyperSeek (InteractiveWeb)
Database-oriented link catalog application includes customizable HTML for categories, control of search results listings, admin tools.
Examples: Custom version at SearchKing

IBM Intelligent Miner for Text (link to SearchTools Listing Report)
Includes linguistic analysis, vector-based automatic clustering and/or classifying documents into predetermined categories.
 
IBM Lotus Discovery Server
Automated tools include statistical analysis, evaluation of relationships among people and documents known as social networks, and clustering of content to create a taxonomy. Can access Notes databases, Domino.doc management system, web sites and intranets, Microsoft Exchange, etc. Creates a Knowledge Map to display categories and hierarchy for browsing, also integrates with Lotus search engines. Allows for human editorial oversight.
Article: Discovery Server Helps Take the Taxing Out of Taxonomies ePro magazine, November 13, 2002 by Jim O'Donnell
Description of Lotus Discovery Server which makes a taxonomy for navigating portals or intranets. Analyst Andrew Warzecha of Meta Group points out that taxonomies are just difficult to create, and that IT and corporate librarians must work together to do a good job. However, companies are looking for information systems and there is increasing awareness of the value of taxonomies.
Report: Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
Discusses the business aspects of the product, sales channels, markets (mainly manufacturing and financial services), competition and the technology. Integrated tightly with Notes and Exchange, does not have any pre-built taxonomies, has a low profile.
 
IBM Text Analyzer Business Component
Performs high-volume text document categorization quickly, based on training, rules and natural-language parsing.
 
ic-classify
Linguistic analysis for sorting documents into categories based on content similarity to training set. Uses natural-language processing, lexical knowledge and semantic categories to create information hierarchies and taxonomies.
 
Inxight Categorizer (link to SearchTools Listing Report)
Using linguistic and statistical technology from parent company Xerox PARC, Inxight's categorizer automatically classifies content and organizes by subject. Identifies entities such as people, places, companies and products. Can integrate with personalization tools to build individual categories. Can display results in a Star Tree visualization or more traditional text list. Taxonomy manager application provides interactive control for editors, training sets and specific rules also apply.
Report: Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
Discusses the business aspects of the product, sales channels, markets (mainly enterprise, publishers, government, OEMs), competition and the technology. Considers this to be the best of the categorization and visualization tools.
 
Klarity
Analyzes large collections of documents and generates metadata conceptual terms based on seed documents about a topic.

LexiQuest Mine (see also LexiQuest SearchTools Report)
Uses linguistic analysis tools to categorize and classify unstructured data.
 
LexTek Profiler Engine and RouteX (see also LexTek Onix SearchTools Report)
Toolkits for automatic classification of document stores and routing of incoming documents, newsfeeds, email. C++ on many platforms.

Links (Gossamer Threads, Inc.)
A set of Perl scripts with an excellent browser administration interface with automated submissions and approvals. Free for personal use.
 
MAI (Machine-Aided Indexer)
Assists catalogers and classification experts by extracting concepts from documents and suggesting appropriate index terms.
 
MetaTagger for Interwoven
Generates a taxonomy or uses an existing one, categorizes documents into one or more taxonomies, and extracts summaries, keywords and custom data such as dates and currency. Integrates with multimedia file formats. Provides an interactive interface for editorial control.
Article: Three Paths to Sorting Content: MetaTagger 3.0, eWeek, July 15, 2002 by Jim Rapoza
Detailed description of the approach to categorization, including advantages and disadvantages of tight integration with TeamSite. Tests show effective categorization during publishing workflow, easy corrections, generation of categories on the fly. Cost is $85,000 to $110,000 per server in addition to TeamSite.
 
Mohomine MohoClassifier
Uses pattern recognition and statistical algorithms such as support vector machines and feature selection to distinguish among categories, offers APIs for integration with other software, and scales to millions of documents. Claims to need very small example sets.
Report: Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
Discusses the business aspects of the product, sales channels, pricing, markets (mainly resume processing for HR and various defense uses), competition and the technology. Foresees an acquisition of this company in the future.
Article: Taxonomy & Content Classification: Delphi Group Report (guest or customer access required) Delphi, April 11, 2002
Vendor-submitted information describes neural network approach based on pattern recognition and machine learning. Populates customer-defined taxonomies, small example sets, multiple categories per document.
Article: Knowledge Management: Mohomine FinancialWeb; January 25, 2001 by Kristen Rosen.
Interview with the CEO, Neil Centuria, about the source and background of the company: it was started as a search engine and classification tool to support the Source Bank code snippet archive site and extended to other structured data. It has special features to support multiple terms for the same meaning (such as B and BK for black), populate vertical portals with directed crawling and fast updating. Claims the classification software can eliminate the editors needed to create a directory.
Article: Create User Loyalty by Improving Search Capability ClickZ Today October 18, 2000 by Paul Bruemmer
Inspired by the Forrester search report, this article includes marketing information from several search engine developers, including AltaVista, Mohomine, and Twirlix.
Article: In Search Of... If you want good search on your site, commit to doing it right unless you want to alienate visitors The Industry Standard, October 9, 2000 by G. Patrick Pawling
Describes problems with site search, such as searches which expand names too far (Seger to Segarra for example) and summarizes the Forrester Report on Search. Reports that one company put "buy" buttons on search results pages and found 30% of its orders came from there. Mentions Mercado options to adjust search results for e-commerce purposes. Describes Mohomine automated summarization and categorization tools. Quotes Jupiter Communications analyst Lydia Loizides as estimating the cost at between $50,000 and $2 million. Describes alternate approaches, such as a conversational or interview-driven search, and choosing an area, such as multimedia, to reduce the number of inappropriate matches.
 
MondoSearch (link to SearchTools Listing Report)
Creates classes based on the server file path, which can be adjusted and renamed by administrators.
 
Muscat Structure (link to SearchTools Listing Report)
Automatic realtime categorisation using a rulebuilder tool to specify documents which do or do not fall into any specific category in a taxonomy. Rules can be built automatically, manually or both, can be general or specific. See also Muscat Discovery search engine page.
 
Netscape Compass Server now part of iPlanet Portal (link to SearchTools Listing Report)
Automatically creates a customized category tree to show the hierarchical organization of the data. Users have complained that the categorization can be erratic.
  
Endymion OpenBridge (Formerly ZNOW, see also SearchTools Listing for OpenBridge)
Automated classification puts pages into topics based on common words among a set of search results. Hierarchical clustering lets users choose more and more specific topics. Linguistic problems sometimes appear, such as using the term "booking" instead of "books".
 
PortalAuthor
Java-based organizational tool for Intranets and corporate portals.
 
 
 
Readware
Sophisticated classification system with a ConceptBase filled with fundamental structures of knowledge, then correlated to queries and documents.
 
Recommind Categorizer
Using an existing classification or taxonomy, this software can automatically classify documents based on semantics and probabilistic analysis.
Customer response to MindServer Recommind Press Release, June 25 2002
Quotes the head of IT Development at ZDF, Europe's largest television station, saying that MindServer expands capacity and improves quality and accessibility. Other materials quote a program manager at the Department of Energy's Office of Scientific and Technical Information (OSTI), saying that the categorization tool "significantly outperformed our human experts in terms of accuracy and consistency."

Roads Project Software
Free academic classification system which also connects distributed cooperative databases.
Example: Biz/Ed - Business Education online in the UK.
 
Sageware
Allows catalogers to set up sophisticated topic definitions, inserts documents matching the rules automatically.
Article: Sageware: Creating the Categories for Information Retrieval Patricia Seybold Group Snapshot, December 1997 by Geoffrey Bock.
 
Saqqara ContentWorks
Includes automatic classification and a taxonomy manager, designed mainly for b2b databases of products.
Article: Product Content Management From E-Catalogs Supplier Content Wire, December 10 2001
 
Stratify (formerly PurpleYogi)
Unstructured data management system includes a Classification server. Can crawl or integrate with a robot crawler from a search engine. Taxonomies can be imported, created from the top down, clustered from the bottom up or interactively designed. Uses rules, statistics and pattern matching to classify documents into taxonomy categories, subject to editorial control. Can categorize search engine results. Had partnered with Inktomi before that company acquired Quiver.
 
Report: Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
Discusses the business aspects of the product, sales channels, pricing, markets (mainly manufacturing, government, financial services, etc.), competition and the technology.
 
Super Site Server (link to SearchTools Listing Report)
Search engine and classification tool combined.
Example: Thunderseek.com - a selective portal for the 'best of the best'
Example: E Fetch - animal-oriented directory and portal
 
Thinkmap
A tool for displaying complex information using an animated multidimensional display designed for user interaction using Java.
Example: Plumb Design Visual Thesaurus

Thunderstone Automated Categorization Engine (see also Thunderstone Webinator search engine)
This application works with the Webinator indexer to add pages to categories automatically. Based on training sets for each high-level category, with additional adjustment options available. Uses the Texis SQL database and Vortex scripting language for configuration, browser admin for maintenance. Runs on Unix, Linux and Windows.
Article Three Paths to Sorting Content: Texis Categorizer 4.1: eWeek, July 15, 2002 by Jim Rapoza
Describes product as providing good categorization, easy to install for many Web developers, flexible, interoperable, and inexpensive ($10,000 for the Texis engine and $10,000 for Categorizer).
Example: Thunderstone Web Site Catalog
 
 
TopicalNet
Provides automated systems to classify pages with a customer's taxonomy. Uses the web to fit 40 million pages into 70,000 subtopics within 15,000 categories. Runs on Linux or Windows 2000; can classify 1.5 million pages per day from text, HTML or XML sources, with conversions available for MS Office, PDF and other formats.
Press Release: Inmagic Gatherer and Classifier Released June 10, 2002
InMagic announces a partnership with TopicalNet to integrate classification with the content management and search engine.
Article: Taxonomy & Content Classification: Delphi Group Report (guest or customer access required) Delphi, April 11, 2002
Vendor-submitted information describes the value of the pre-built taxonomy and integration with existing taxonomies. Uses both semantic and syntactic knowledge.

Verity Knowledge Organizer
Creates knowledge trees and category hierarchies using rules.
Article: Taxonomy & Content Classification: Delphi Group Report (guest or customer access required) Delphi, April 11, 2002
Vendor-submitted information describes creating new taxonomies or working with existing systems, creating an automatic or hybrid model, and populating it with documents.
Article: Verity add-on makes portals easier to build, navigate Infoworld, March 22 1999 by Emily Fitzloff
Product announcement for Verity Knowledge Organizer, which lets catalogers classify data and add pages to categories automatically. 

Verity Ultraseek Advanced Classifier (formerly Quiver)
Provides a taxonomy management system, designed to integrate automatic classification with editorial control and workflow.
Article: Taxonomy & Content Classification: Delphi Group Report (guest or customer access required) Delphi, April 11, 2002
Vendor-submitted information describes hybrid automatic and human classification, based on Naïve Bayes algorithms, machine cleaning and tokenization.
 
Verity Ultraseek CCE (Content Classification Engine) (see also Ultraseek Search Report)
Extends the Ultraseek (previously Inktomi Search Software) engine by allowing administrators to import a site map and specify top-level categories and subcategories, then create rules using wildcards and regular-expression matching to assign pages to those categories.
Example: Northstar - State of Minnesota Government Information
 
Vivisimo
Post-processing clustering organizes search results from another engine into folders by topic. This is very dependent on the quality of search results, and can miss topics entirely or group them into generic categories such as "products".
Article: Vivisimo Meta Search Engine FreePint, September 29, 2000 by Simon Collery
Web search expert evaluates the clustering results and gives qualified approval to the process.

WordMap
Enterprise taxonomy management program, available as software or remote service (ASP). Interactive tool presents the user with a list of all possible meanings, then reformulates the query to improve both recall and precision. Options allow the user to send queries to multiple search engines. Company also develops subject taxonomies and offers consulting.
Example: WordMap web search
This demo shows how the query disambiguation interface works. Try "bank" or "lotus".
Article: Wordmap Launches New Taxonomy-Building Service Information Today, January 2002
Describes taxonomy services, including multilingual versions, links among subject areas and metasearching.
Announcement: Wordmap-Software used by DaimlerChrysler for enterprise taxonomy February 6, 2002
Describes the plans for using Wordmap within the automobile company's content architecture. 
 
XFML - eXchangeable Faceted Metadata Language
An XML format for interchange of faceted metadata, mainly within hierarchical taxonomies. It allows people to tag a set of topics and associated URLs so that other applications supporting this format can recognize the relationships.

See also:

CGI Resources Site - Link Indexing Scripts


Guide to Search Tools: Why Searches Fail

Why Searches Fail


 

No matter how carefully you design a search engine, there will always be some searches that fail. Our search log analysis shows that it's often a simple mistake or misunderstanding, so there's no one to blame. Some search engines are better than others at finding useful matches and providing helpful pages when nothing can be found.

Top Five Reasons Why Searches Find No Matches

1. Empty searches

Amazing numbers of people just click the search button or press Return when the cursor is in the search field. So the search engine gets an empty query, which is usually treated as a search failure.

What to do: either choose a search engine which brings up a helpful search page when the search is empty, or make sure that you have wording on your no-matches page to explain what is going on.

2. Wrong Scope (trying to search the whole Web)

Whether it's a site, Intranet or small portal, your search engine covers only the topics on your site, not the entire Web. For example, searching for shot put on a site about medieval art is just as useless as searching for marginalia on a sports portal! Despite this, people often see the search field or button, and think they're getting a webwide search engine. If you look in your search logs, you'll see the wide variety of searches which clearly are not directed correctly.

What to do: Several search engines provide versions with options to search the entire web, as well as the local site or portal. In any case, the no-matches page should explain the scope of the site and explain what materials the search index covers.

3. Vocabulary Mismatch

Searches containing terms which are too specific, too general or just not used. For example, on a medical site, someone searching for doctor may need to look for physician; a horse site may talk about Paso Finos instead of walking horses; and a beginning web design portal may not include any sites that get into details on onblur tags, though they may have some links to JavaScript pages.

What to do

Some search engines perform linguistic stemming -- they search for several forms of a word instead of just the exact match. For example, a search for run might also find running, runs, ran, or runner; a search for goose might also look for geese. This makes it more likely that a search will find a match, but must be implemented carefully so it doesn't display inappropriate matches at the top of the results list. Note that multilingual sites must be very careful in implementing the correct stemming algorithm for each page's languages, or they may return some very odd results.
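A deliberately naive suffix-stripping sketch in Python shows both the idea and why careful implementation matters; the suffix list here is an assumption for illustration, and real stemmers such as Porter's use many more rules.

    SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")

    def stem(word):
        # Strip the first matching suffix, keeping at least three letters.
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    # stem("runs") -> "run", but stem("runner") -> "runn": a crude rule
    # set produces wrong stems, which is why results placement needs care
    # and why irregular forms (goose/geese) need explicit word lists.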

Adding metadata to pages can help match search words, especially for broader and narrower terms. If you have pages describing DSL and cable modems but never use the word broadband, adding that term to the META Keywords tag content will allow the search engine to find those pages.
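For example, a hypothetical keywords tag for those pages (the terms are invented for illustration) might read:

    <meta name="keywords" content="broadband, DSL, cable modem">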

Search engines may allow search administrators to set up synonym lists or thesauri to provide appropriate alternate terms. For direct synonyms, such as urticaria as the technical term for hives, it's appropriate to simply add that to the query. For less obvious equivalents, such as red for crimson, and for broader and narrower terms, the search engine should display them and allow the searcher to click on them rather than typing them in.
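A minimal sketch of both behaviors, assuming hypothetical synonym tables: direct synonyms are added to the query silently, while looser equivalents are only offered as clickable suggestions.

    SYNONYMS = {"hives": ["urticaria"]}     # safe to add automatically
    SUGGESTIONS = {"crimson": ["red"]}      # offer these, don't assume

    def expand_query(query):
        terms = query.lower().split()
        expanded = list(terms)
        for term in terms:
            expanded.extend(SYNONYMS.get(term, []))
        offered = [s for term in terms for s in SUGGESTIONS.get(term, [])]
        return expanded, offered

    # expand_query("hives treatment")
    # -> (["hives", "treatment", "urticaria"], [])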

4. Spelling Mistakes

People make mistakes in spelling all the time. For simple typos, log file analysis shows that searchers will re-enter the word correctly. However, they often cannot remember the spelling of unfamiliar terms such as diseases, product codes, and names in general.

What to do

  • Synonym lists (described above) can help with common misspellings of important words on a site -- the engine can simply translate the bad spelling to the correct version and continue.

  • A spellchecker can provide a list of correctly-spelled words, allowing searchers to switch to the right spelling.

Complex Solutions

Information Retrieval theorists have come up with some clever solutions to spelling errors, but they can be frustrating to users unless presented properly. Be careful when implementing a search with fuzzy matching: use usability testing and search logs to track whether the results are substantially better than exact matching.

  • Fuzzy matching techniques try to reduce words to their core and then match all forms of the word. For example, searching for serach would properly locate search, but a search for locks might find looks, whether or not that was wanted.

  • Phonetic, sound-alike or "Soundex" matching uses linguistics to search for words which may sound similar. This is particularly useful for names, so a search for licos will find Lycos, but also brings up odd results: a search for fuzzie may match fees or face.
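The classic Soundex code reduces a word to its first letter plus three digits for the following consonant sounds. This sketch implements the traditional rules, which is why licos and Lycos both reduce to the same code:

    def soundex(word):
        codes = {}
        for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                               ("l", "4"), ("mn", "5"), ("r", "6")):
            for ch in letters:
                codes[ch] = digit
        word = word.lower()
        if not word:
            return ""
        result = word[0].upper()
        previous = codes.get(word[0], "")
        digits = []
        for ch in word[1:]:
            digit = codes.get(ch, "")
            if digit and digit != previous:
                digits.append(digit)
            if ch not in "hw":      # h and w do not break a run of sounds
                previous = digit
        return (result + "".join(digits) + "000")[:4]

    # soundex("licos") == soundex("Lycos") == "L220"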

5. Query Requirements Not Met

If the search engine automatically searches for all terms or a phrase, or the query includes operators such as + or NOT, there may be no pages which fulfill all the requirements. Examples include brown bear (both terms), "Olympic gold medal" (phrase), claymation +British (required term), and MP3 NOT Napster (excluded term).

This can also happen when searching in a particular section or zone of a site for words which are used in different sections of the site. Yahoo and CNet are good examples where people can limit searches to a subsection, but searching for roses in the tropical fish area will never find a match.

What to do: Search engines which show the number of pages matched for each term are particularly helpful in this case. The no-matches page can clarify which of the terms caused the problem and provide advice on how to enlarge the search.

Other Causes of Search Failure

  • Problems with Query Syntax: some search engines are very picky about how the query must be entered. If a searcher puts a space after an operator such as +, or uses NOT instead of AND NOT, the search may fail.
    • The search engine should provide helpful error messages and instructions in this case, or better yet, be more accepting in its query parsing. For example, if a close parenthesis ")" is missing, the engine can simply add it at the end of the query, with an explanatory message in the results page (a small sketch of this kind of repair appears after this list).
  • Capitalization and Extended Characters: some search engines require exact matches for capital letters and diacritical characters (such as ü, ß, ñ). In that case, searches for pokemon will never find pages with the word Pokémon.
    • If your search engine is strict about these elements, make sure you explain the problem on the no-matches page.

  • Stopwords: to save index space and time, some search engines just don't include common words in their index. This can range from prepositions and conjunctions (such as a, an, the, with, from) to words which are extremely common within the index (baseball and TV on a sports site, for example). Unfortunately, people often search for these terms (think of As You Like It) and are confused when they can't find the pages which contain them.
    • If your search engine must use stopwords, make sure the no-matches page explains what they are and how to search around them.

  • Short Words: some search engines have a minimum word length for indexing, so they don't have to store thousands of entries for the word I and other short words. However, if those are required parts of a search, omitting them from the index means that the search may not find any matches (to be or not to be, for example).
    • Whenever possible, index everything. If your search engine will not index short words, make sure the no-matches page explains how to search around them.

  • Numbers: some search engines don't index numbers, either because they are short or just because they are not words. But people search on them all the time!
    • If your search engine will not index them, make sure the no-matches page explains that numbers can't be searched.
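As mentioned in the first item above, forgiving query parsing can repair common syntax slips before giving up. A minimal sketch, with repair rules that are assumptions for illustration rather than any engine's actual behavior:

    import re

    def repair_query(query):
        fixed = query.strip()
        # Treat a bare NOT as AND NOT (then collapse any doubled AND).
        fixed = re.sub(r"\bNOT\b", "AND NOT", fixed)
        fixed = re.sub(r"\bAND\s+AND\b", "AND", fixed)
        # Close any unbalanced parentheses at the end of the query.
        missing = fixed.count("(") - fixed.count(")")
        if missing > 0:
            fixed += ")" * missing
        return fixed

    # repair_query("MP3 NOT (Napster") -> "MP3 AND NOT (Napster)"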

Avoiding Future Failure

Be sure to track your search logs so you can see why and how your customers are having problems with search. Watch those searches that find 0 matches like a hawk, and do your best to add new synonyms, terms and information that addresses these questions.



SearchTools Analysis: Recommendations for Search Results

Recommending Pages for Special Searches

Search engines use statistical and lexical analysis to match query terms to indexed text, but often, human judgment is more effective. Many search queries are for names, items or even ID numbers. For the most frequent questions on a site, it makes sense to manually identify the best page or pages, and integrate these into the search results. You may want to do this for the top 25 or 100 questions, depending on the traffic on the site. Instead of just relying on the search engine's ability to match terms, you can add human understanding to make the results more useful for the common cases, making best use of both approaches.
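A minimal sketch of the hybrid: look the query up in a hand-maintained recommendations table and place those pages ahead of the engine's own results. The table entries and the engine_search function are placeholders, not any product's API.

    BEST_BETS = {
        "linux": ["/solutions/linux/overview.html"],   # editor-chosen pages
    }

    def search_with_recommendations(query, engine_search):
        recommended = BEST_BETS.get(query.strip().lower(), [])
        # Avoid listing a recommended page twice.
        ordinary = [r for r in engine_search(query) if r not in recommended]
        return recommended + ordinary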

Dell does a pretty good job of this: they make sure that a set of appropriate introductory pages shows up when users type the query "linux" on their site. If they had not done that, the search engine would have shown only the Large Business Linux pages first, and people in other situations would have thought that there was no Linux solution for them.

Another way to approach this is to consider the most common searches on a site as candidates for a Knowledge Base, and the manual links in the search results as pointers into that database or FAQ.

The 2001 Forrester Search Report quotes a telecom company site administrator as saying "About half of our visitors come to the site looking for a specific product." When a site visitor searches by a code, such as an error message or product ID, the search engine should recognize the pattern and make sure that the appropriate types of pages, such as product specs, troubleshooting or FAQs, come first in the search results.

For example, at the Sharp USA site, searching for a laser printer using Atomz search-and-promote tools brings up a special offer.

Articles on This Topic

  • Intranet Usability: The Trillion-Dollar Question Alertbox at Useit.com, November 11 2002 by Jakob Nielsen
    An international usability study of intranets by the NNGroup finds that many companies are wasting employee time by failing to provide usable intranets. The study found that "search usability accounted for an estimated 43% of the difference in employee productivity between intranets with high and low usability." It recommends that intranets make sure that the main search engine indexes all pages, show results in relevance order with manual recommendations at the top, and encourage useful page titles and descriptions. The NNGroup finds that tasks requiring 27 hours annually on a usable intranet could take as long as 196 hours on a less-usable one.
  • Beyond the spider: the accidental thesaurus. Searcher Magazine, October 2002, by Rich Wiggins
    An excellent article providing examples of manual recommendations at the AT&T web site and AOL's Keywords. A case study of creating the Keywords feature at Michigan State University describes the process of designing and maintaining a manual recommendations database. Also analyzes the distribution of queries in the search logs, which conform to the laws of Pareto, Bradford and Zipf, showing that concentrating on a very few common cases will provide the best return for effort. A similar function, implemented at Bristol-Myers Squibb, has had great success.
  • Why search is not a technology problem: Case Study, BBCi Search (follow link to 3.4 MB PowerPoint file) ASIST IA Summit February 2002 by Matt Jones
    Describes how the British Broadcasting Corporation added a search engine to their web site, offering access to both all the information in the BBC site and web-wide information. The group created some "use models" to show examples of information needs and find ways to add context to search results. They chose to create a taxonomy based on the most frequent searches, and test the process over the course of several months, starting with paper prototypes. They found significant user sensitivity to wording and layout, and to functions that break with their expectations. Showing manual recommendations was successful once they integrated with the other results, and people loved the search zones tabs once they found them (after several visits). Recommends early testing and use models to learn how people really use search engines.

  • New HP Wetware Product leads to Smarter Search Louis Rosenfeld's Blog, September 13, 2001
    Describes the recommendations links at the Hewlett Packard site for such common topics as handhelds, and suggests that the hybrid of manual "best bets" for popular queries with standard search results is a hopeful trend.

Examples of Sites with Recommendations

Search Tools Supporting Manual Recommendations

  • AskJeeves - originally a question-answering customer service search engine
  • Atomz - recognizes special meta tags for top ranking, Atomz Promote provides powerful tools to control exactly what appears for specific search results.
  • Google Search Appliance - uses a list of KeyMatches for manual answers before normal results.
  • FusionBot - search service uses a list with query terms and results weights.
  • MondoSearch - for specified search terms, sends user directly to answer page.
  • PicoSearch - allows "promotion" and "demotion" of results based on patterns.
  • ResultPage - allows manual editorial control over search results.
  • SharePoint (Microsoft) - "Best Bets" listings appear before results.
  • Ultraseek - Quick Links listing for recommended answers before results, also adjustable weighting for search results by metadata and URL paths.

SearchTools Report: Simple URLs for Search Engine Robots

Generating Simple URLs for Search Engines


Search engines generally use robot crawlers to locate searchable pages on web sites and intranets (robots are also called crawlers, spiders, gatherers or harvesters). These robots, which use the same requests and responses as web browsers, read pages and follow links on those pages to locate every page on the specified servers.

Dynamic URLs

Search engine robots follow standard links with slashes, but dynamic pages, generated from databases or content management systems, have dynamic URLs with question marks (?) and other command punctuation such as &, %, + and $.

Here's an example of a dynamic URL which queries a database with specific parameters to generate a page:

http://store.britannica.com/escalate/store/CategoryPage?
pls=britannica&bc=britannica&clist=03258f0014009f&cc
=eb_online&startNum=0&rangeNum=15

All those elements in the URL? They're parameters to the program that generates the page. You probably don't even need most of them.

Here is a theoretical version of that URL rewritten in a simple form:

http://store.britannica.com/Cat/B-online/03258f0014009f/s0-e15/

If you look at Amazon's URLs, you'll see they contain no indication to a robot that they're pointing into a database, but of course they are.

Problems with Dynamic URLs

Some public search engines and most site and intranet search engines will index pages with dynamic URLs, but others will not. And because it's difficult to link to these pages, they will be penalized by engines which use link analysis to improve relevance ranking, such as Google's PageRank algorithm. In Summer 2003, Google had very few dynamic-URL pages in the first 10 pages of results for test searches.

When pages are hidden behind a form, they are even less accessible to search spiders. For example, Internet Yellow Pages sites often require users to type a business name and city. For a search engine to index these pages requires special-case programming to fill in those fields. Most webwide search engines will simply skip the content. This is sometimes referred to as the "deep web" or the "invisible web" -- valuable content pages that are invisible to search engine robots (internal and external), do not get indexed, and therefore can't be found by potential customers and users.

Search engine robot writers are concerned about their robot programs getting lost on web sites with infinite listings. Search engine developers call these "spider traps" or "black holes" -- sites where each page has links to many more programmatically-generated pages, without any useful content on them. The classic example is a calendar that keeps going forward through the 21st Century, although it has no events set after this year. This can cause the search engine to waste time and effort, or even crash your server.

Readable URLs are good for more than being found by local and webwide search engine robots. Humans feel more comfortable with consistent and intuitive paths, recognizing the date or product name in the URL.

Finally, by abstracting the public version of the URL, it will not be dependent on the backend software. If your site changes from Perl to Java or from CFM to .Net, the URLs will not change, so all links to your pages will remain live.

Static Page Generation or URL Rewriting?

The simplest solution is to generate static pages from your dynamic data and store them in the file system, linking to them using simple URLs. Site visitors and robots can access these files easily. This also removes a load from your back end database, as it does not have to gather content every time someone wants to view a page. This process is particularly appropriate for web sites or sections with archival data, such as journal back issues, old press releases, information on obsolete products, and so on.

For rapidly-changing information, such as news, product pages with inventory, special offers, or web conferencing, you should set up an automatic conversion system. Most servers have a filter that can translate incoming URLs with slashes to internal URLs with question marks -- this is called URL rewriting (on Apache, for example, the mod_rewrite module does this with pattern rules).
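The translation itself is simple string work. A sketch of the idea in Python, using the invented path layout from the Britannica example above; the real parameter mapping would depend on your application, and a production version would run in the server's rewrite filter rather than application code.

    import re

    def rewrite(path):
        # /Cat/<catalog>/<category-list>/s<start>-e<end>/ -> query-string form
        match = re.match(r"^/Cat/([^/]+)/([^/]+)/s(\d+)-e(\d+)/$", path)
        if not match:
            return path                    # not one of ours: pass through
        catalog, clist, start, end = match.groups()
        return ("/escalate/store/CategoryPage?cc=%s&clist=%s"
                "&startNum=%s&rangeNum=%s" % (catalog, clist, start, end))

    # rewrite("/Cat/eb_online/03258f0014009f/s0-e15/") returns the long
    # dynamic form, while ordinary static paths pass through unchanged.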

For either system, you must make sure that each of the rewritten pages has at least one incoming link. Search engine robots will follow these links and index your pages.

If you have database access forms, dynamic menus, Java, JavaScript or Flash links, you should set up a system to generate listings with entries for everything in your database. This can be chronological, alphabetical, by product ID, or any other order that suits you. Search engine robots can only follow links they can find, so be sure to keep this listing up to date.

URL Rewriting and Relative Link Problems

Pages which include images and links to other pages should use links relative to the root rather than to the current page. This is because the browser requests each image by combining the URL of the current page with the relative link (images/design44.gif), following the Unix file-name conventions of current directory, child and parent (../) directories. Because URL rewriting mimics directory path structures, relative links resolve against path segments that are not real directories, which confuses the browser.

If your original page linked to the local images directory, the link would break after rewriting.

Original URL for this page:

www.example.com/prod?int+7=2&2&4

The rewritten URL looks like this:

www.example.com/prod/int/7/224/

If the relative link to the images directory in the dynamic version is:

dyn/images/design44.gif

then in the rewritten page, this would be interpreted as:

www.example.com/prod/int/7/224/dyn/images/design44.gif

instead of the correct path, which would be something like this:

www.example.com/dyn/images/design44.gif

Because the path to the images subdirectory is wrong, the image is lost.

Solution One: Use Absolute Links

To avoid confusion, use links that start at the host root; that is, paths which all start at the main directory of your site. A slash at the start tells the browser not to look for files relative to the current directory, but simply to put the host name and the path together. The disadvantage is that if you move or change the directory hierarchy, you'll need to change every link which includes that path.

Using the same example above, change the relative (local) link within the generated page from this:

images/design44.gif

to an absolute link that points at the correct directory:

/dyn/images/design44.gif

Similarly, a link to a page in a related directory (either static or rewritten)

perpetualmotion/moreinfo.html

would require the entire path:

/prod/int/perpetualmotion/moreinfo.html

Note that if you change the directory name from "prod" to "products" or take the "int" directory out of the hierarchy, you'll have to change every one of those URLs.

Solution Two: Rewrite the Links Dynamically

Using a mechanism like URL rewriting, you can generate a path to the correct directory and program your server to create absolute links within the pages as it's generating them.

For example, if all the images are in the directory

www.example.com/dyn/images/

You could create a variable with the path "/dyn/images/" and the server would put that before all the relative URLs to images.

Checking URLs on Your Site

    1. Perform a sanity check - make sure that your site does not generate infinite information. For example, if you have a calendar, see if it shows pages for years far in the future. If it does, set limits (see the sketch after this list).

    2. Generate static pages, or choose and implement a URL rewriting system - see below for articles and products.

    3. Check relative links for images and other pages.

    4. Create automatic link-list pages so there are links to the pages generated dynamically.

    5. Test with your browser.

    6. Test with a robot. You can use a link checker, local search crawler or site-mapping tool to make sure these URLs work properly, don't have duplicates, and don't generate infinite loops.

Articles About URL Rewriting

  • Search Engines and Dynamic Pages (members only) SearchEngineWatch.com, updated April, 2003 by Danny Sullivan
    Very clear description of the process of rewriting URLs for getting pages indexed by public search engines, and includes links to articles and rewrite tools.

  • Towards Next Generation URLs Port 80 Software Archive, March 27, 2003 by Thomas A. Powell & Joe Lima
    Helpful explanation of how to address problems with URLs, from domain name spelling through generating static pages and rewriting query strings.

  • Making Dynamic and E-Commerce Sites Search Engine Friendly SearchDay (SearchEngineWatch), October 29, 2002 by Catherine Seda
    A report from a panel at the Search Engine Strategies 2002 conference provides strong justification for simplifying URLs, and strategies to work around the problem when the dynamic URLs must remain.

  • Using ForceType For Nicer Page URLs DevArticles.com, June 5, 2002 by Joe O'Donnell
    Apache's ForceType directive doesn't require access to the main configuration file; rather, it uses a local ".htaccess" file for rewriting the URLs. The article includes excellent examples for implementing this with PHP.

  • Making "clean" URLs with Apache and PHP evolt.org, March 29, 2002 by stef
    Gives some context for dynamic sites, describes a solution using both Apache ForceType and PHP. Comments offer some interesting thoughts about the value of stable URLs.

  • Search Engine Friendly URLs (Part II) evolt.org, November 5, 2001 by Bruce Heerssen
    Describes how to generate dynamic absolute path links using PHP.

  • How to Succeed with URLs A List Apart; October 21, 2001 by Till Quack
    Using the Apache .htaccess to direct requests to a PHP script which converts the URL to an array, ready to send the query to the database or other dynamic source. Includes helpful comments, checks for static pages, default index pages, skipping password-protected URLs, and handling non-matching requests. Also covers security protections, showing how to strip inappropriate hacking commands from URLs.

  • Search Engine Friendly URLs with PHP and Apache evolt.org, August 21 2001 by Garrett Coakley
    Very simple and easy to understand introduction.

  • Optimization for Dynamic Web Sites Spider Food site, August 13, 2001
    Overview, with examples, of the value of converting dynamic URLs to static ones.

  • Search Engine-Friendly URLs PromotionBase SitePoint, August 10, 2001 by Chris Beasley
    Three ways to convert dynamic URLs to simple URLs using PHP with Apache on Linux. These include using the $PATH_INFO variable, the .htaccess error handling, or Apache's .htaccess ForceType directive, using the path text to filter certain URLs to the PHP application handler.

  • Invite Search Engine Spiders Into Your Dynamic Web Site Web Developer's Journal; February 28, 2001 by Larisa Thomason
    Nice introduction to simple URL rewriting, useful warning about relative link problems.

  • Building Dynamic Pages With Search Engines in Mind PHPBuilder.com, June 2000 by Tim Perdue
    Details of a PHP setup which can scale up to 200,000 pages and 150,000 page views per day. Automatic generation of the page header also includes meta tags. In this example, the pages are arranged by country, state, city and topic, so the URLs supply those parameters for the database queries. Comments recommend sending a success HTTP status header before every page returned: Header("HTTP/1.1 200 OK"); and discuss running as an Apache module vs. CGI on Windows, use of the ForceType directive, and Apache 2 compatibility.

  • URLs! URLs! URLs! A List Apart; June 30, 2000 by Bill Humphries
    Recommends creating a simple system for URLs and mapping them to the backend. Describes using Apache's mod_rewrite component, optionally using .htaccess.

URL Rewriting Tools

  • Apache
  • Perl
    • Using the environment variables Path_Info and Script_Name provides access to the dynamic URL including the "query string" (the part after the question mark). Converting the query information into a single code and adding it to the path creates a static URL.
  • PHP
    • see articles above, especially the one on Using ForceType, and PortalPageFilter below
  • Microsoft IIS and Active Server Pages (ASP)
    These filters recognize slash-delimited URLs and convert them to internal formats before the server gets them.
    • ASPSpiderBait - package converts the PATH_INFO part of an HTTP request: the user doesn't see the punctuation, just placeholder letters, and the filter replaces the placeholders with punctuation before the server sees them. $100 per server.
    • ISAPI_Rewrite - IIS ISAPI filter written in C/C++, with simple rules for dynamic conversion. Lite version is free; Full version, single server, $46; enterprise license, $418
    • IISRewrite - IIS filter, works much like mod_rewrite, examples include converting dynamic to simple URLs. $199.00 per server
    • PortalPageFilter - a high-priority C++ ISAPI filter for ASP
    • XQASP - high-performance NT IIS C++ or Unix Java filter for ASP. $99 to $2,100 depending on scope
  • ColdFusion (CFM)
    • May have an option to reconfigure the setup, replacing the ? with a /
    • Use the Fusebox framework's <cf_formurl2attributes> tag.
  • Lotus Domino Servers
  • WebSTAR, AppleShareIP, etc. (Macintosh OS 9 and OS X)
    • Welcome module by Andreas Pardeike (included in WebSTAR 4.5 and 5)

Product Report: Verity K2 and other Search Products

This is an off-site copy of the corresponding Product report page on the SearchTools.com website, and it is designed to allow you to comment on the product and/or the reporting. For more information about the topic of search and tools visit SearchTools.com where you can browse many articles, in-depth analysis and overviews of external resources.

Verity K2 and other Search Products

Verity Products Information, see also SearchTools - Knowledge Management

Verity purchased the Inktomi (Ultraseek) enterprise search engine on November 13, 2002.

Price: not provided, generally $100,000 and up
Platforms: Windows NT 4, 2000; Unix: Solaris, HP-UX, AIX, IRIX, DEC Unix, Linux

Major Features

  • Verity K2 Developer April 2002
    • OEM toolkit for search, content organization and social networks
  • Indexing crawler and file-system indexer recognize several hundred file formats.
  • Powerful security interfaces at both the collection and document level.
  • Indexes content from many sources, notably Lotus Notes collections.
  • Extensive language support: Arabic, Bulgarian, Chinese (traditional and simplified), Czech, Danish, Dutch, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish and more.
  • Queries using Boolean search with options for stemming, fuzzy search and concept extraction.
  • Integrates with Verity Classification for category browsing.
  • Provides WebTop portal interface.
  • Enables communities of users and expert location.
  • Can scale to multiple servers for size, responsiveness and fault tolerance.

Articles and Reviews

  • Command Line K2 using Windows Scripting NIE Enterprise Search, April 25, 2003, by Miles Kehoe new
  • Search On Information Week, January 20, 2003 by Tony Kontzer
    Discusses the need for useful search engines in corporate intranets. Describes experiences in Ford's Learning Network using Autonomy; Bank One and Kaiser Permanente using the Google Search Appliance for simplicity, low cost and speed; KPMG UK's use of Verity K2 for sophisticated taxonomy and social networking; Gateway's implementation of iPhrase for support technicians; and EDS using Recommind for role-related search results.

  • Verity: True Value EContent Magazine, July 2002 by Sandy Serva
    Quotes from company president describe success with the Siemens worldwide web search, Cisco CD-based documentation and Southwest Bell tariff data handling.

  • Adding Search To Enterprise Apps InformationWeek, May 6 2002 by Tony Kontzer
    Information on the new K2 Developer version, which includes personalization and categorization capabilities for integration with OEMs. Reports that FileNet plans to replace Hummingbird Fulcrum with Verity K2 in its CMS, and that Bell Globemedia Interactive has replaced AltaVista with Verity for text search, supplementing Virage video search.

  • Verity Soars to new Heights IT-Director.com, October 1 2001 by "Caterpillar"
    Praises features of K2Enterprise, especially the "Social Networks" feature which is supposed to identify subject experts by their behavior and link them to internal information and other experts. Also mentions high-profile Verity installations. Reader comments are more skeptical.

  • Seeking far and wide for the right data InfoWorld, August 27 / September 3, 2001 by Cathleen Moore.
    Describes the value of search engines and categorization as essential elements of corporate portal infrastructures, to handle the "deluge" of information within enterprises. Quotes Aberdeen analyst Guy Creese who points out that without a good way to search, corporations would be "blowing their investment in the content". Covers recent announcements of search and categorization features by Autonomy, Verity, AltaVista, iPhrase, and Smartlogik (Muscat).

  • Search technology gains recognition InfoWorld, July 30, 2001 by Cathleen Moore
    Covers search and categorization technology for corporations from Verity and Smartlogik (Muscat). Includes analyst quotes about the importance of search in handling huge amounts of content. Describes Verity "Social Networks" using business rules to make taxonomies more useful. Smartlogik's Muscatdiscovery search engine stresses natural-language searching, while Muscatstructure provides rules-based categorization; both can integrate into application servers using Java or COM+.

  • Portal links buyers, sellers eWeek, February 5, 2001 by Grant Du Bois
    Coverage of Verity K2 Catalog, e-commerce portal framework for categorization, display and personalization of product information. Can search by price or part number, suggest complementary products. Allows search admin to control the products for specified strings, based on their search log analysis.

  • Portals Tame Chaotic Content -- Packaged software helps Nasdaq parent rein in multiple systems InternetWeek July 10, 2000 by Ted Kemp
    Description of the National Association of Securities Dealers intranet site, which is using the Verity Portal One software to customize the portal, index, search and categorize files.

  • [Choosing a Search Engine for INRIA]: Rapport Final (in French) INRIA, July 8, 1999 by Francis Avnaim
    Report of a committee charged with choosing a search engine for INRIA (French National Institute for Research in Computer Science and Control), covering over 200,000 pages. Mainly compares AltaVista Search with Verity.

  • Verity add-on makes portals easier to build, navigate Infoworld, March 22 1999 by Emily Fitzloff
    Product announcement for Verity Knowledge Organizer, which lets catalogers classify data and add pages to categories automatically.

  • Verity Rebounds, Ships Document Management Duo PCWeek Online, December 7, 1998 by Jim Kerstetter
    Describes introduction of KeyView Pro 6.5 and HTML Export 2.0, and covers corporate problems in previous quarters.

  • CDnow Improves Search Tool tipworld.com Internet Daily, July 14, 1998
    Even if you can't spell "potato," new search capabilities in the music database at online retailer CDnow Inc. will make it easier for you to find your music. Using new search technology from Verity Inc., visitors to the CDnow (CDNW) site will be able to locate artists' information even if they can't spell the name correctly. "Can't find Alanis Morisette at other music stores? CDnow is the only online music store that will find you Alanis Morissette," said Chief Executive Jason Olim.

  • Verity search toolkit debuts InfoWorld, April 27, 1998 by Torsten Busse
    Describes the new search toolkit, Verity K2, based on multi-node symmetric multiprocessor hardware and multithreaded OS architectures. It allows topic searches across multiple data sources, returning results in less than two seconds even for hundreds of thousands of documents.

  • Lotus, Verity settle lawsuit, sign licensing agreement. InfoWorld Electric, June 8, 1998 by Clare Haney
    Verity had accused Lotus of breaking a license agreement and incorporating advanced search features in Lotus Notes 5.0. In the settlement, Lotus will license the KeyView filtering technology.

  • K2 Aids Data Retrieval PC Week, April 24, 1998 by Christy Walker
    Describes the Verity K2 toolkit for knowledge management.

  • Serving Up Quality Searches Internet Computing, March, 1998 by Kevin Railsback
    Comprehensive comparative review of AltaVista eXtension 97, Integrated I-Search 3.0, Microsoft Index Server 2.0, Innotech Net Results 1.2, Verity Search 97 Information Server and Netscape Compass 3.0 (recommended).

  • Verity at TREC-6 (PostScript GZ file): NIST Special Publication 500-240: The Sixth Text REtrieval Conference (TREC 6) 1997, by J.O. Pedersen, C. Silverstein, C.C. Vogt (Verity, Inc.) p. 259-274. Describes how the Verity search engine performed in standard tests of information retrieval, specifically targeting recall quality.

Partners