Searching for Text Information in Databases
Databases provide the content storage for many sites, which dynamically create web pages around them, including ecommerce catalog sites, online news, and even entertainment sites. Intranets often contain large amounts of text stored in databases as well.
These databases generally have their own search functions, which may appear to take the place of a full-text search engine. But that's not always the case. Database search is not oriented towards text search and relevance ranking: it is great for locating widgets by part number and listing the inventory of leather slippers, but not so good at helping site visitors find the articles on widget quality or comparing leather and fleece slippers. Text search engines are often the right solution, moving the processing load from the database to the search engine.
Database search vs. text search
Searching Many Fields and Tables
Databases store their information neatly organized into fields, such as product name, category, description, price and so on. However most people don't like to choose a field before searching: who knows whether widget is in the name, category, or description field? While databases can set up complex queries to find the search words in all applicable fields, this makes them slower to respond, requires more memory, and is more difficult to program. Text search engines store this information in a single index and can find words in any field for a record. Many high-end search engines can also store field information, so searches can be limited to a specific field as well.
Simple Search Commands
In many cases, users must type in complex Boolean or SQL (Structured Query Language) commands for searching. Some databases, such as MySQL, are limited to "Or" searches -- they will return all records which match any of the search terms. Others default to exact phrase matching, so a search for fuzzy slippers would not find slippers, fuzzy.
While some programmers and librarians enjoy the control that query languages give them, most people do not. This is particularly important for searching multiple words: if someone types in brown bear, they probably want to see records with bear with a brown coat in them as well, without typing bear AND brown. Search logs show clearly that most people will not do anything complicated while searching -- they will give up if the search is too hard to use. Full-text search engines offer simple and flexible search options, with most providing an Advanced Search feature for power users.
Flexible Search Processing
Database search functions tend to look for exact matches in capitalization and characters. If someone searches for pokemon, they won't find records with Pokemon or pokémon in them. Many text search engines will automatically match lowercase searches with any text, and will adjust for extended and diacritical characters. Some also allow search administrators to set up synonyms for searching (doctor and physician), and automatically perform stemming (find octopi when searching for octopus). A new trend in text search engines is to incorporate spelling checkers, which are particularly useful for queries which do not find any matches. By checking the words already in the index, the search engine spelling checker can offer suggestions which apply to this particular content, which is extremely useful.
Some databases even return a list of everything they have if a user enters an empty search string -- this requires significant server processing time. Search engines tend to just report that they found no matches.
Many databases are not designed for easy searching. In a wine database, for example, it may be difficult to find wines from a certain region, such as the Napa valley of California, because the location is stored in a table far away from the wine names. Some text search engines, such as dtSearch, FAST Data Search, Ultraseek, and Verity K2, provide a rich data set for searching. Mercado's IntuiFind provides options to entirely rearrange a database structure, if that's what's necessary for searching.
Response Time and Database Resources
Databases are optimized to search for exact words and phrases, and they tend to respond very slowly otherwise. So if a searcher wants to find sheepskin or shearling in the same search, databases will tend to do two searches and then merge the results. Full text search engines are designed to store these words in a single index, so they can perform these kinds of searches efficiently and return quickly.
In addition to the time, searching databases requires additional back-end and server resources. By storing a search index on a separate server and searching that rather than the live data, a text search engine can perform queries without additional demands on the database itself.
Results Per Page
Many database search engines will happily display all results on the same page, whether there are 8 or 8,000 records. Text search engines have a mechanism for dividing up the results and providing navigation from the first page to following pages (and back).
Sorting by Relevance
Text search engines sort by relevance, as determined by the number and location of matched words in the result page or record. Database search functions sort by size, or price, or date, or the order in which the items were entered in the database!
Many text search engines can sort results by date as an option, and a few of them can sort by price, size, geographic location, etc.
Connecting Full Text Search Indexer to the Database
To add a text search engine to a database-generated site, the engine can connect to the database directly, using SQL, ODBC, JDBC or a native connector. This is efficient because it reduces the overhead of going through the HTTP server and creating a new session for every record. It also contains only the content text, rather than navigation links, copyright information and other inappropriate content. In some cases, the database can track new and changed records, and only give this new data to the search engine indexer, on demand or as part of an application interface.
If there is no way to connect directly to the database, a search engine spider can crawl the pages generated from the database, viewing the pages as a browser would. In some cases, this can be a special crawl for pages designed just for indexing, whether detected by the server or as a special set of URLs that are then converted to user-viewable URLs. Again, the database should provide only new and changed records, rather than all records every time.
Tools for Extracting Database Content
- Quigo Intelligence - interfaces with databases and makes the content visible to search engine crawlers or as XML feeds while retaining the valuable structures.
- Your Amigo - has indexing agents which can correlate database access forms with backend database APIs, either for the internal search engine or as a feed to other search engines.
- Albert web - interfaces to Oracle, Sybase, mySQL, PostgreSQL, Informix, MS-SQL Server, ODBC and dBase formats
- Aurweb combines database and free-text search.
- Convera RetrievalWare and Web Express - native bridges to Oracle, Sybase, Informix and MS SQL, ODBC bridge
- Dieselpoint Search - JDBC connector to databases including DB2, Oracle, Informix, MS SQL, MySQL, PostgreSQL, and Cloudscape.
- Domino (Lotus) Extended Search - can query Lotus Notes, IBM DB2, Oracle, Sybase, MS SQL Server, MS Access, etc.
- dtSearch Web - ODBC
- EasyAsk - All ODBC Data Sources; native SQL for Oracle, Sybase, DB2, Informix and SQL Server; supports Flat Files and Database Views
- Endeca - faceted metadata search, dynamic taxonomies with database attributes
- FAST Data Search - JDBC and ODBC connectors to Oracle and other DBMSs.
- IMP Database Search Engine - Replaces Microsoft SQL Server 2000's full-text search.
- interMedia (Oracle multimedia search engine)
- Isys:web - ODBC
- Mercado IntuiFind - ODBC, JDBC and more
- MondoSearch - Microsoft Commerce Server partner
- MPS Information Server - various external parsers
- Muscat - Oracle gateway
- OMNIDEX - ODBC, JDBC, JNI and the OmniAccess API. This search engine does not require Boolean or SQL, groups many fields, case insensitive, diacritical conversion, etc.
- Oracle Search - text and multimedia search engines
- Plumtree Portal Search Server (formerly Ripfire) - JDBC to Oracle, Sybase, Informix and SQL Server, etc.
- Siderean Seamark Metadata Search (formerly Teapot) - faceted parametric search shows alternate aspects from database fields
- Thunderstone Webinator and Texis RDBMS - text search integrated into high-end database software
- Triplehop MatchPoint (proprietary crawler)
- Ultraseek - JDBC, also ODBC on Windows NT/2000 and direct Oracle access on Unix.
- Verity - ODBC, Sybase has a special API
- YourAmigo - flexible and customizable tools to extract content from complex databases, JDBC and ODBC tools.
Databases with Search Engine Interfaces
These databases incorporate a full-text search function in some way.
- IXIASOFT TEXTML database -XML database with real-time index synchronization, grouping of elements and attributes in indexes. Search features include Boolean, proximity and frequency operators. Results sorting, site is not clear on their relevance algorithm.
- Jasmine (Computer Associates) - AmTSG Controlled Vocabulary Search
- MySQL - handles simple and Boolean queries, returns results sorted by relevance, using a TV/IDF ranking algorithm.
- Oracle Search - various text and multimedia search engines
- SQL Server
- Sample Full-Text Search Engine SQL Server Magazine, July 2000 by David Jones
Describes how to implement a searchable table.
- Sybase - Verity API
- Thunderstone Texis and Webinator - high-end RDBMS with integrated search functionality.