May 11th, 2005

Search Indexing Spiders and Date Problems

Document Date Issues for Search Indexing

Search indexing spiders (also known as robots and crawlers) follow links in HTML pages to find new pages. They also re-check known indexed pages to see whether the content has changed. Generally, they do this in one of three ways: fetching the whole page again (HTTP GET); more efficiently, fetching just the headers (HTTP HEAD); or, most efficiently of all, sending a conditional GET with an "If-Modified-Since" header, which returns the whole page only if it has been updated since they last asked about it.

If the date reported is far in the past, in the future, or exactly the instant the indexer requests the page, it makes the indexer waste cycles re-indexing unchanged content. Worse, it lies to searchers about the content's currency, which is a vital element in assessing the value of a search result. Dates on web servers are not reliable, which is one reason Google's and Yahoo's web search results rarely even show a page date. Enterprise search can do better, if you can make the required changes on the server or publishing side.
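One defensive move a crawler can make is to sanity-check the reported date before trusting it. The sketch below is my own illustration of that idea; the function name and thresholds are assumptions, not anything prescribed in this article:

```python
# A hypothetical sanity check a crawler might apply to a server's
# reported page date; the thresholds here are illustrative.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def date_is_plausible(last_modified, now=None):
    """Return False for dates a crawler should distrust: unparseable,
    in the future, implausibly far in the past, or equal to the moment
    of the request (a sign the server just echoes the current time)."""
    now = now or datetime.now(timezone.utc)
    try:
        when = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return False                    # unparseable date string
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if when > now:
        return False                    # future date: server clock is wrong
    if when < datetime(1991, 1, 1, tzinfo=timezone.utc):
        return False                    # predates the web itself
    if (now - when).total_seconds() < 1:
        return False                    # "modified" at the instant of request
    return True
```

A page failing this check can still be indexed, but its date should not be shown to searchers or used for freshness ranking.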

Details of Search Indexing and Page Date Problems

Search Indexing Crawler Tests

SearchTools Test Suite provides a set of test pages that measure how well search indexing robots handle robot rules and complex linking. Many robots are easily confused by anything beyond a simple URL, so these tests help us identify the ones with more sophistication.

In addition, these tests will tell us how many robots can handle text in ALT and Comment tags, HTML header tags such as Meta Keywords, and more.

To try out this system, we've coded each page with "RTest" and, more specifically, with "RTestGood" for pages that should be indexed successfully and "RTestProblem" for pages that should not be indexed.
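Given that marker convention, checking an engine is mechanical: search it for "RTest" and count which markers come back. A minimal sketch of that scoring step (the function name and inputs are my own illustration):

```python
# Score an engine's results against the RTest marker convention:
# every "RTestGood" page should be indexed, no "RTestProblem" page should be.
def score_index(indexed_pages):
    """Given the text of pages the engine returned for a search on
    'RTest', count correctly indexed pages and leaked problem pages."""
    good = sum("RTestGood" in page for page in indexed_pages)
    leaked = sum("RTestProblem" in page for page in indexed_pages)
    return good, leaked
```

A perfect engine returns every RTestGood page and zero leaked RTestProblem pages.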

NOTE: I know many of these tests are now funky, and have been since the site changed servers. I will be updating them as I have time and energy. Please give me suggestions and comments and offers of help by replying to this blog entry.

Indexing Test Suite