SearchTools Blog (searchtools) wrote,

SearchTools Report: Taxonomies, Classification, Categorization

Taxonomies, Categorization, Classification, Categories, and Directories for Searching


The terms taxonomy, ontology, directory, cataloging, categorization and classification are often confused and used interchangeably. These are all ways of organizing information (or things or animals) into categories.

There are a number of applications that can help people create taxonomies and place information objects within their categories, although the amount of automation can vary. Some programs simply allow anyone to manually add a URL to a specific category by submitting a site. Others allow human catalogers to create sophisticated rules to specify certain words and phrases which will place a page in a category. Others accept a "training set" within an existing taxonomy, and will place documents in categories based on similarities. Still others attempt to automate the entire process, grouping pages into topics based on programmatic evaluation of the contents.
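
As a rough illustration of the rule-based approach described above, here is a minimal Python sketch. The category names and phrases are hypothetical, and real products support far richer rules than plain substring matching.

```python
# Hypothetical rules: each category lists the phrases that place a page in it.
RULES = {
    "Footwear": ["running shoe", "cross trainer", "sneaker"],
    "Sports Medicine": ["injury", "sprain", "physical therapy"],
}


def categorize(text, rules=RULES):
    """Return every category that has at least one rule phrase in the text."""
    lowered = text.lower()
    return sorted(
        category
        for category, phrases in rules.items()
        if any(phrase in lowered for phrase in phrases)
    )


page = "Choosing a cross trainer to avoid ankle injury"
print(categorize(page))  # ['Footwear', 'Sports Medicine']
```

A page can match several categories at once, and a page that matches no rule is left uncategorized for a human cataloger to review.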

When evaluating these applications, remember that they are simply software. No matter the elegance of the algorithms, a computer program can never truly understand the concepts involved in a page, as a human can, and will sometimes place pages in the wrong categories. For example, one very automated system had an "Arts and Humanities" category which included links to an Internet services consulting company and a singer-songwriter's personal home page (along with many more appropriate pages). To serve your site or intranet users, plan for a significant amount of human cataloging and editing.

Glossary and Definitions

A directory is an organized set of links, like those on Yahoo or the Open Directory Project, which allows a web site to display the scope and focus of its content. A directory can cover a single host, a large multi-server site, an intranet or the Web. At each level, the category names provide instant context information to users. Unlike a flat list, such as the results of a search, a directory lets users drill down into more and more specific categories (for example Shopping > Clothing > Footwear > Athletic), which shows how the pages fit into the larger set of information.
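
One way to picture a directory is as a nested tree of categories. The sketch below assumes a made-up tree stored as nested dictionaries; it looks up a category and prints the drill-down path that gives users their context.

```python
# Hypothetical directory: each category maps to its subcategories.
DIRECTORY = {
    "Shopping": {
        "Clothing": {
            "Footwear": {"Athletic": {}, "Dress": {}},
            "Outerwear": {},
        },
    },
    "Sports": {"Running": {}},
}


def find_path(tree, target, trail=()):
    """Depth-first search; return the category path to target, or None."""
    for name, children in tree.items():
        path = trail + (name,)
        if name == target:
            return path
        found = find_path(children, target, path)
        if found:
            return found
    return None


print(" > ".join(find_path(DIRECTORY, "Athletic")))
# prints: Shopping > Clothing > Footwear > Athletic
```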

Categorization is the process of associating a document with one or more subject categories. So the entry for a page on cross trainer shoes could go into Running, Manufacturing, Sports Medicine, or Rushkoff, Douglas! All of these are legitimate, depending on the context.

Cataloging and Classification come from libraries, where specialists enter the metadata (such as author, date, title and edition) for a document, apply subject categories to it, and place it into a class (such as a call number) for later retrieval. These tend to be used interchangeably with Categorization.

Clustering is the process of grouping documents based on similarity of words, or the concepts in the documents as interpreted by an analytical engine. These engines use complex algorithms including Natural Language Processing, Latent Semantic Analysis, Bayesian statistical analysis, and so on.
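
To make the idea of grouping by word similarity concrete, here is a deliberately simple sketch. It uses plain word counts, cosine similarity, and a single greedy pass with an arbitrary threshold; it is not Natural Language Processing, Latent Semantic Analysis, or Bayesian analysis, and the sample documents are invented.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed document is
    similar enough, otherwise start a new cluster."""
    clusters = []  # list of (seed_vector, [document indexes])
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


docs = [
    "running shoes for trail running",
    "trail running shoes review",
    "bayesian statistics tutorial",
    "tutorial on bayesian statistics methods",
]
print(cluster(docs))  # [[0, 1], [2, 3]]
```

Note that the clusters come out as unlabeled groups; naming them and fitting them into a taxonomy still takes human judgment.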

A Thesaurus is a set of related terms describing a set of documents. This is not hierarchical: it describes the standard terms for concepts in a controlled vocabulary. Thesauri include synonyms and more complex relationships, such as broader or narrower terms, related terms and other forms of words.
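
A thesaurus entry can be sketched as a small record per preferred term. The terms and field names below (`use_for`, `broader`, `narrower`, `related`) are illustrative assumptions, not taken from any particular product or standard.

```python
# Hypothetical controlled vocabulary, keyed by preferred term.
THESAURUS = {
    "footwear": {
        "use_for": ["shoes", "sneakers"],   # synonyms that map to this term
        "broader": ["clothing"],
        "narrower": ["athletic footwear"],
        "related": ["socks"],
    },
    "clothing": {
        "use_for": ["apparel", "garments"],
        "broader": [],
        "narrower": ["footwear"],
        "related": [],
    },
}


def preferred_term(word, thesaurus=THESAURUS):
    """Map a word to its standard term, or None if it is not in the vocabulary."""
    word = word.lower()
    if word in thesaurus:
        return word
    for term, entry in thesaurus.items():
        if word in entry["use_for"]:
            return term
    return None


def expand_query(word, thesaurus=THESAURUS):
    """Expand a search word with its preferred term, synonyms and narrower terms."""
    term = preferred_term(word, thesaurus)
    if term is None:
        return [word.lower()]
    entry = thesaurus[term]
    return [term] + entry["use_for"] + entry["narrower"]


print(preferred_term("Sneakers"))  # footwear
print(expand_query("shoes"))
```

Used this way, a search for "shoes" can also match pages indexed under the standard term or its narrower terms.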

Taxonomy is the organization of a particular set of information for a particular purpose. It comes from biology, where it's used to define the single location for a species within a complex hierarchy. Biologists have arguments about where various species belong, although DNA analysis can resolve most of the questions. In informational taxonomies, items can fit into several taxonomic categories.

Ontology is the study of the categories of things within a domain. It comes from philosophy and provides a logical framework for academic research on knowledge representation. Work on ontologies involves schemas and diagrams that show relationships as Venn diagrams, trees, lattices and so on.


Organizational Theory section of the Argus Center for Information Architecture
A splendid annotated bibliography of the best works on organizing information and classification as well as indexing, thesaurus construction and controlled vocabularies.

Content Wire Taxonomies News
Links to relevant articles, updated several times a month.

Guided Tour of Ontology
Definitions and background information for ontologies as part of the Semantic Web.

Taxonomy Primer from Lexonomy
Discusses controlled vocabularies for web sites. Includes recommendations for buying and building vocabularies, and applying them to navigation and searching. Special search features include "synonym rings", while a hierarchical arrangement is often known as a taxonomy. Provides additional tips and suggestions based on extensive experience.

Classification Society of North America

International Federation of Classification Societies

Bibliography on Automatic Text Categorization
Fabrizio Sebastiani, a research scientist at Consiglio Nazionale delle Ricerche (Italian National Research Council) provides a listing of scientific research articles on automatic classification and categorization.

Classification, Indexing, Metadata and Thesauri - link page at UMass Amherst

Taxonomy, Classification and Metadata Resources - link page at Scottish electronic Staff Development Library

See also: Faceted Metadata and Search

Articles and Overviews

  • Taxonomies for Practical Information Management NIE Enterprise Search Newsletter, April 25, 2003 by John Lehman
    A clear and concise description of the kinds of categories used in business (such as Industry Segment, Technologies, Geography) and a helpful checklist of the useful elements of a taxonomy. By a leader in the field of classification.
  • Ten taxonomy myths Montague Institute, November 27, 2002
    Taxonomy experts discuss the many kinds of taxonomies, separating content and user-oriented taxonomies and the process of creating taxonomies. Very useful advice based on long experience.

  • Unstructured Data Management: the elephant in the corner (guest or customer access required) the451 Report, November 2002 by Nick Patience and Rachel Chalmers
    This report describes the huge amounts of unstructured data in enterprise computer networks, the time wasted re-creating this information, and the lack of tools equivalent to data mining, business intelligence and OLAP. Identifies four application sectors: content, document and knowledge management; search and retrieval; categorization, taxonomy and data visualization; and XML databases. The analysts evaluate these sectors from a business perspective, defining strengths, trends, pricing and leading vendors. They point out that the Web expanded the size of the search market but did not sustain it, while the categorization market is volatile, with many small and recently-acquired companies. The analysts believe that the leading relational database vendors (IBM, Oracle and Microsoft) may be able to lead the unstructured data market as well. Describes Verity K2 as an integrated search and taxonomy system, preferred over both Autonomy, which is a "Rolls Royce" company in knowledge management and collaboration, and Ultraseek (Inktomi Enterprise Search), a lower-value search engine even with the Quiver taxonomy engine. In Categorization and Visualization, they feel that InXight is the best among a field that includes Antarctica, Applied Semantics, ClearForest, IBM Lotus Discovery, Mohomine, Entrieva (Semio), Stratify and The Brain.

  • Standards Target Categorization eWeek, July 15, 2002, by Jim Rapoza
    Describes the value of standards for categorization, and the value of RDF and other standards which describe the most important terms and categories for a document. While not yet widely adopted or supported, this will reduce the demands on content categorization engines.

  • Three Paths to Sorting Content eWeek, July 15, 2002 by Jim Rapoza
    Describes testing three categorization engines and compares approaches: Applied Semantics' Auto-Categorizer 1.1, Interwoven's MetaTagger 3.0 and Thunderstone's Texis Categorizer 4.1. Used three sets of content: science educational materials, government health documents, and analyst reports, in a variety of file formats. Worked through process step-by-step with vendors. Finds that they address different aspects, require technical integration in different ways. Recommends adding some category suggestions during authoring, adjusting standard taxonomies to individual company usage.

  • The difficulty of categorization Knowledge Management Connection, spring 2002
    Proposes several ways to make classification easier in organizations, including focusing on information seekers, leveraging existing taxonomies, using the right people and tools, and avoiding up-front universal structures.

  • Taxonomy & Content Classification: Delphi Group Report (guest or customer access required) Delphi, April 11, 2002
    Defines the increasing need for tools to organize information and avoid overload, especially with ambiguous words. Covers the evolution of technologies from simple to more complex search engines with extreme recall, metadata search, link ranking, and taxonomy-creation software for hierarchical structures. Describes how people find information, both for known item searches and for discovery about a topic, where arrangements of subject categories trigger associations and relationships. Interactive and iterative browsing within categories helps locate dynamic data which otherwise might be hard to find. Searching within categories is more focused and avoids false matches. Taxonomies can integrate with internal applications such as proposal-generation, CRM and data mining. Provides a useful checklist for analyzing taxonomy tools. Survey results from 450 executives, IT or managers at large enterprise organizations with at least 50,000 documents show that most spend more than 2 hours per day searching for information, 73% say that finding information is difficult, and the main impediments were "bad tools" (28%) and "data changes" (35%). Describes the principles of automatic classification and taxonomy-building, with details on eleven products. NOTE: vendors paid Delphi to be included in this report.

  • Tools of the Trade eWeek, April 1, 2002 by Lisa Vaas
    Describes the value of good taxonomies in the context of a portal. Describes the situation at the U.S. Geological Survey's Center for Biological Information, where, despite a biological taxonomy and expert staff, the project is stymied by internal issues such as security. The cost of automatic portal taxonomy building is a big block to the law firm of Hunton & Williams, as is the difficulty of defining content organization.

  • Taxonomies are what? Free Pint, October 4, 2001 (issue 97) by Liz Edols
    Some people now say that 'taxonomies are chic'; this article provides excellent coverage of varying definitions and articles about taxonomies, examples and relationships to classification and indexing. Also lists taxonomy software.

  • Taxonomy: Creating Knowledge from Chaos; April 18, 2001 by "the Snark"
    Defines taxonomy as "the creation of structure (arrangement) and labels (name) to aid location of relevant information," which can be expressed in hierarchical categories. Within the context of web searching, sees XML as an enabling technology to express taxonomies. An example is Microsoft, which created a complete taxonomy of all information (manually) and reported a 62% reduction in the number of clicks required to find information and an 11% increase in the task success rate.

  • Managing Taxonomies Strategically Montague Institute Review, March 2001
    Summary outline of a longer article defining taxonomies and describing how they work.

  • Little Blue Folders Argus Center for Information Architecture; July 10, 2001 by Peter Morville
    Discussion of the hyperbole and reality of automated categorization, with a strong recommendation for combining software-generated classification and human editorial judgment.

  • Online Taxonomies: The Next Shift for Information Science Content Wire; March 2001 by Paola Di Maio
    Meditation on the nature of online taxonomies, which should be dynamic, current, "referenceable" and scalable.

  • Word Wranglers Intelligent KM January 1, 2001 by Katherine C. Adams
    Survey of automatic classification software, as a support function for Enterprise Knowledge Management. Background includes the value of classification within enterprise, how taxonomies facilitate browsing, topical relevance, how classification technology works, and so on. Divides approaches up into rules-based, example-based and statistical clustering. Predicts that "cyborg" cataloging will combine automated tools with human judgment; taking advantage of XML metadata, and new technology such as Support Vector Machines (SVM). Table of features covers Autonomy, Cartia Themescape, Inxight, Metacode [bought by Interwoven], Mohomine, Semio and Verity.

  • Automatic Categorization of Magazine Articles Proceedings of Informatiewetenschap 1999; 12 November 1999 by Marie-Francine Moens and Jos Dumortier.
    Academic report on results of tests with automated classification. Compares their χ² (chi-square) system with Bayesian and Rocchio algorithms; not surprisingly, theirs comes out better. Most interesting is the evaluation of how well the categorization performed compared to human classifiers. Examples, in Dutch, are included in the appendix.

  • Practical Taxonomies: Hard-won wisdom for creating a workable knowledge classification system (from Knowledge Management: Built to Order) January 1999 by Sarah L. Roberts-Witt
    Describes the challenges of creating a knowledge-classification system and the results of a survey of enterprise administrators who have done it. Includes quotes from consulting firms, government agencies and accounting firms. Recommends "broad and flat" organization as more straightforward, avoiding a search for perfection and allowing for evolution. Suggests a balance between human and automatic categorization.

  • Complexity In Indexing Systems -- Abandonment And Failure: Implications For Organizing The Internet ASIS Conference, October 1996 by Bella Haas Weinberg
    Describes the history of classification schemes from Dewey and UDC through Ranganathan and CIFT and the changes in the US Library of Congress Subject Headings approaches. Complex and individualized indexing schemes tend to move towards the "synthetic", providing standard subdivisions to be applied as appropriate, because of cost, training requirements, human inconsistency and rigidity due to theoretical flaws. Describes problems with applying traditional approaches to the vast and varying content of the Internet, and recommends applying back-of-the-book index approaches to the Web.

Classification Tools

Now listed on the Classification Tools page.
