All About Search Indexing Robots and Spiders
Many search engines use programs called robots to gather web pages for indexing. These programs are not limited to a pre-defined list of web pages: they can follow links on the pages they find, which makes them a form of intelligent agent. The process of following links is called spidering, wandering, or gathering.
- Controlling Robot Indexing
- Robot spiders cannot index unlinked files, so they will ignore all the miscellaneous files you may have in your web server directory. Webmasters can control which directories the robots should index by editing the robots.txt file, and web page creators can control robot indexing behavior using the Robots META tag.
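As an illustration of how a robot evaluates those rules, Python's standard urllib.robotparser can apply a robots.txt file the same way an indexing robot would. The rules and directory names below are hypothetical, a minimal sketch rather than a real site's policy:

```python
from urllib import robotparser

# A hypothetical robots.txt telling all robots to skip two directories
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Disallow: /tmp/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://example.com/index.html"))      # allowed
print(rp.can_fetch("*", "http://example.com/private/a.html"))  # disallowed
```

For page-level control, the equivalent META tag looks like `<meta name="robots" content="noindex,nofollow">`, placed in the page's HEAD section.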
- Following Links
- Local search robot spider indexers locate files to index by following links, just like webwide search engine spiders. You specify the starting page, and these indexers will request it from the server and receive it just like a browser would. The indexer will store every word on the page and then follow each link on that page, indexing the linked pages and following each link from those pages.
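The link-following step can be sketched with Python's standard html.parser, which pulls the href value out of each anchor tag so a spider could queue it for fetching. The page content here is a stand-in string, not a real fetch:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, as a spider would before queueing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about.html">About</a> <a href="contact.html">Contact</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about.html', 'contact.html']
```

A real spider would then resolve relative links like `contact.html` against the page's own URL (urllib.parse.urljoin does this) before requesting them.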
- Link Problems
- Dynamic Elements
- Robot spider indexers receive each page exactly as a browser would, with all dynamic data from CGIs, SSI (server-side includes), ASP (Active Server Pages) and so on. This is vital to some sites, but others may find that the presence of these dynamic elements triggers re-indexing even though none of the actual text of the page has changed.
Most site search engines can handle dynamic URLs (including question marks ? and other punctuation). However, most webwide search engines will not index these pages: for help building plain URLs, see our page on Generating Simple URLs.
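One simple test of the kind a cautious webwide robot might apply is to treat any URL carrying a query string as dynamic and skip it. This is a sketch of that heuristic, not any particular engine's actual rule, using Python's standard urllib.parse:

```python
from urllib.parse import urlsplit

def is_dynamic(url):
    """Crude heuristic: a URL with a query string is considered dynamic."""
    return bool(urlsplit(url).query)

print(is_dynamic("http://example.com/catalog?item=42"))      # True
print(is_dynamic("http://example.com/catalog/item42.html"))  # False
```

URL rewriting, as described on the Generating Simple URLs page, converts the first form into the second so that such robots will index the page.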
- Server Load
- Because they use HTTP, robot spider indexers can be slower than local file indexers and can put more load on your web server, since they must request each page individually.
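To limit that load, well-behaved robots spread their requests out per host. A minimal sketch of such scheduling, where the one-second delay is an arbitrary assumption rather than any standard value:

```python
class PoliteScheduler:
    """Tracks the earliest time each host may be contacted again."""
    def __init__(self, delay=1.0):
        self.delay = delay
        self.next_allowed = {}  # host -> earliest time of next request

    def schedule(self, host, now):
        """Return the time at which a request to host should be sent."""
        when = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = when + self.delay
        return when

s = PoliteScheduler(delay=1.0)
print(s.schedule("example.com", now=0.0))  # 0.0 -- first request goes at once
print(s.schedule("example.com", now=0.0))  # 1.0 -- same host waits one second
print(s.schedule("other.org", now=0.0))    # 0.0 -- different host, no wait
```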
- Updating Indexes
- To update the index, some robot spiders query the web server about the status of each linked page by asking for the HTTP header alone, using a "HEAD" request (the usual request for an HTML page is a "GET"). For HEAD requests, the server may be able to send the page header information from an internal cache, without opening and reading the entire file, so the interaction can be much more efficient. The indexer then compares the modified date from the header with its own date for the last time the index was updated. If the page has not changed, it doesn't have to update the index. If it has changed, or if it is new and has not yet been indexed, the robot spider sends a GET request for the entire page and stores every word. An alternate solution is for robot spiders to send an "If-Modified-Since" request: this HTTP header option allows the web server to send back a short status code (304 Not Modified) if the page has not changed, and the entire page if it has changed.
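The date comparison at the heart of this scheme can be sketched with Python's standard email.utils, which parses the date format used in HTTP Last-Modified headers. The dates below are invented for illustration:

```python
from email.utils import parsedate_to_datetime

def needs_refetch(last_modified_header, last_indexed_header):
    """Compare a page's Last-Modified header against the date the index
    last stored the page; re-fetch only if the page is newer."""
    page_date = parsedate_to_datetime(last_modified_header)
    index_date = parsedate_to_datetime(last_indexed_header)
    return page_date > index_date

print(needs_refetch("Tue, 15 Jun 1999 10:00:00 GMT",
                    "Mon, 01 Mar 1999 00:00:00 GMT"))  # True: changed since indexing
print(needs_refetch("Tue, 15 Jun 1999 10:00:00 GMT",
                    "Thu, 01 Jul 1999 00:00:00 GMT"))  # False: index is newer
```

With If-Modified-Since, this comparison moves to the server: the robot sends its stored date in the request header, and the server either answers 304 Not Modified or returns the full page.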
- Duplicate Files
- Robots must contain special code to check for duplicate pages, due to server mirroring, alternate default page names, mistakes in relative file naming (./ instead of ../, for example), and so on. Some search indexers have powerful algorithms to identify these duplicates and only store and search one copy.
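One common approach to the duplicate check, sketched here in Python with invented URLs, is to fingerprint each page's text with a hash and store only the first URL seen for each fingerprint. Collapsing whitespace and case first means trivially reformatted mirror copies hash the same:

```python
import hashlib

def fingerprint(text):
    """Hash the page text after normalizing whitespace and case."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

pages = [
    ("http://example.com/index.html", "Welcome to  Example"),
    ("http://mirror.example.com/index.html", "Welcome to Example"),  # mirror copy
    ("http://example.com/about.html", "About Example"),
]

seen = {}      # fingerprint -> first URL stored
unique = []
for url, text in pages:
    fp = fingerprint(text)
    if fp not in seen:
        seen[fp] = url
        unique.append(url)
print(unique)  # the mirror duplicate is dropped
```

Real indexers use more powerful similarity measures as well, since mirrors often differ by more than whitespace, but exact-fingerprint matching catches the easy cases cheaply.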
For more information, see the SearchTools Indexing Robot Checklist.
- Web Robots Pages
- Compilation of information about robots by the author of the Robots Exclusion Protocol, Martijn Koster. Slightly obsolete as of mid-1999. Includes the helpful Web Robots FAQ for Web users, authors and Robot implementors, Web Robots Database of User-Agent names and contact information, the Guidelines for Robot Writers and more.
Note that this is now hosted at the new site run by Martijn at www.robotstxt.org.
- W3C HTML 4.0 Specification, Appendix B, Notes on helping search engines index your Web site
- Standard information about data for indexing, robots.txt and META Robots tag.
SearchTools Robots Pages
- robots.txt Page
- Describes the robots.txt file format and implications for search indexing robots.
- META Robots Tag Page
- Describes the META Robots tag contents and implications for search indexing robots.
- Indexing Robot Checklist
- A list of important items for those creating robots for search indexing.
- Generating Simple URLs for Search Engine Robots
- All about URL Rewriting to convert dynamic URLs to simple path-based URLs.
- List of Robot Source Code
- Links to free and commercial source code for robot indexing spiders
- List of Robot Development Consultants
- Consultants who can provide services in this area.
- List of Books and Articles about Search Indexing Robots
- Links to writings about Robots, Spiders and Crawlers.
Listings of Robot "User Agent" Names
- Web Robots Database
- List at robotstxt.org, may not be current.
- SearchEngineWatch SpiderSpotting Chart
- Displays User Agent and host names for webwide search engine robot spiders.
- Agents and Robots List - WebReference.com
- Lightly annotated listing of agent and robot software.
- Search Engine Robots
- Lists of search engines, agent names and their information links, updated fairly frequently.
- Other Good Sites
- Contains many listings of robots on the Web, links, articles and bibliographies, but is not well organized and is rarely updated.
- SearchEngineWorld tests of Robots.txt
- Common problems found when testing their robots.txt validator.
- Robots Mailing List - for writers of web robots
- To subscribe, send a message to firstname.lastname@example.org with the words subscribe robots (your name) in the message body.
For mailing list help, see the Listserv help message.
- To view earlier messages, see the Archive of discussions, 1995-1997
- Breadth-first crawling yields high-quality pages, Proceedings of the WWW10 Conference, May 2-5, 2001, Hong Kong, by Marc Najork and Janet L. Wiener
- When crawling large sites, you may not be able to get every page right away. Should you follow links in the order encountered, or build a priority list? This paper describes an experiment comparing four crawl orderings: breadth-first (pages crawled in the order discovered); backlink (pages with the most known incoming links crawled first); PageRank (highest PageRank crawled first, as defined by Google founders Page and Brin); and random (next page selected at random). The experiment covered 328 million pages on 7 million host servers, and found that the first few million pages encountered by the breadth-first crawl were the best, because "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates." Therefore, there is little need to compute PageRank for all available pages in order to crawl the best ones first.
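The orderings the paper compares can be sketched on a toy link graph. The graph and quality scores below are invented for illustration; the point is only the difference in visit order between a discovery-order queue and a priority queue:

```python
from collections import deque
import heapq

links = {  # hypothetical toy web: page -> pages it links to
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

def breadth_first(start):
    """Crawl pages in the order they are discovered (the paper's winner)."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def priority_crawl(start, score):
    """Always crawl the highest-scoring known page next; backlink counts
    or PageRank values would supply the scores in the paper's experiments."""
    order, seen = [], {start}
    heap = [(-score[start], start)]  # negate: heapq is a min-heap
    while heap:
        _, page = heapq.heappop(heap)
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-score[nxt], nxt))
    return order

scores = {"A": 1, "B": 2, "C": 4, "D": 3}       # invented quality scores
print(breadth_first("A"))           # ['A', 'B', 'C', 'D']
print(priority_crawl("A", scores))  # ['A', 'C', 'D', 'B']
```

The paper's finding is that on the real web the simple breadth-first order already reaches high-quality pages early, so the extra bookkeeping of the priority version buys little.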