May 6th, 2008

A Taxonomy for "Search Log Junk"

Search logs contain wonderful bits of data about what the customers and visitors to a web site think they are looking for, and search log analysis is how to synthesize that data. But search logs also contain a lot of weird entries, which can have a significant effect on search log analysis because they skew the statistics without adding any insight. Having looked at tens of thousands of lines of search log entries, and with input from colleagues Lee Romero and Shaun Ryan, I have listed some of the least useful kinds of log entry, which I call Search Log Junk, inspired by Edward Tufte's chart junk. I haven't heard of any search engines that filter these entries before reporting, so it's up to search admins to clean our data for analysis.

Empty Queries
Queries without any usable query text or parameters can appear when people just click the "Search" button. Or perhaps search is the first form field, so the cursor gets into that field, and users press Return: it's a bit of a mystery. These entries can skew the search analytics, because they tend to return either no matches or all matches. Recommendations:

  • Track traffic and response time metrics both with and without empty queries.
  • Track referring URLs for empty queries and analyze them periodically: if there are upward trends in specific locations, something in a page may be causing problems.
  • Remove empty queries from the base data set of queries and no-matches, for both metrics and analytics.
  • Design a user interface for empty queries, and test it. Should the result of an empty query be a no-matches page? A simple search page? A JavaScript error message? Should JavaScript catch the event and do nothing? It depends on your users, so whatever is proposed must be tested.
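
The first three recommendations can be sketched in a few lines of Python. This is only a sketch, assuming the log has already been parsed into dictionaries; the field names query, response_ms and referer, and the sample entries, are made up for illustration and will differ from one search engine to another.

```python
from collections import Counter


def is_empty_query(query):
    """True when the query has no usable text (None, empty, or only whitespace)."""
    return query is None or query.strip() == ""


def split_empty(entries):
    """Split log entries into (usable, empty) lists using the 'query' field."""
    usable, empty = [], []
    for entry in entries:
        if is_empty_query(entry.get("query")):
            empty.append(entry)
        else:
            usable.append(entry)
    return usable, empty


def mean_response_ms(entries):
    """Average response time in milliseconds, or None for an empty list."""
    if not entries:
        return None
    return sum(e["response_ms"] for e in entries) / len(entries)


# Hypothetical sample entries; a real log will need its own parsing step.
log = [
    {"query": "emerald green widgets", "response_ms": 120, "referer": "/products"},
    {"query": "", "response_ms": 900, "referer": "/home"},
    {"query": "   ", "response_ms": 880, "referer": "/home"},
    {"query": "widget manual", "response_ms": 140, "referer": "/support"},
]

usable, empty = split_empty(log)

# Metrics both with and without empty queries.
print("all queries:  ", len(log), mean_response_ms(log))
print("without empty:", len(usable), mean_response_ms(usable))

# Referring URLs for the empty queries, most frequent first.
empty_referers = Counter(e["referer"] for e in empty)
print(empty_referers.most_common())
```

Keeping the empty entries in their own list, rather than discarding them, makes it possible to report both sets of numbers and to watch the referring URLs over time.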

Prompt Text Queries
If there's a prompt in the search field, such as "search this site" or "catalog search", and no one thought it through, clicking the search button or pressing Return in the field may send that prompt text as the query by default. If so, treat it like the empty query above, and fix it quickly.
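
Until the form is fixed, prompt text can be caught in the log with a simple comparison. A minimal sketch, assuming the prompt strings are known in advance; the two prompts here are just the examples from the paragraph above, and the function name is made up.

```python
# Assumed prompt strings; replace with whatever text the site's search box shows.
PROMPT_TEXTS = {"search this site", "catalog search"}


def is_prompt_text_query(query, prompts=PROMPT_TEXTS):
    """True when the query is exactly a known prompt, ignoring case and extra spaces."""
    normalized = " ".join(query.lower().split())
    return normalized in prompts


print(is_prompt_text_query("Search this site"))        # exact prompt, different case
print(is_prompt_text_query("catalog search widgets"))  # a real query that starts the same way
```

Matching the whole query rather than a substring keeps real searches that merely contain the prompt words in the dataset.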

Repeat Queries
It's common to see multiple identical query strings sent to the search engine from the same IP address or user ID. Some automated client or other software may be calling for a refresh automatically -- my favorite was thousands of queries over months for two dots: .., but there are also random words and even complex query syntax. While a human user might re-open a search URL from time to time, anything more than five refreshes (not including page navigation) in ten minutes is likely to be junk. Recommendations:

  • Create a list of junk queries and patterns, with referer URLs and IP address or other identifying information -- though this may change. It's useful to include the dates they appeared, as well.
  • Track traffic and response time metrics both with and without these queries.
  • Remove these queries from the dataset for analytics.
  • To avoid wasting search resources on them, identify the worst offending IP addresses, and work with the IT department or web hosting service to block them from the search engine.
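
The "more than five refreshes in ten minutes" rule of thumb can be sketched as a sliding window over the log. This is one possible reading, not a definitive implementation: it assumes entries are sorted by time, that page-navigation requests have already been excluded, and that the first request in a window is the original search while the rest count as refreshes. The field names client, query and time are hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_repeat_junk(entries, max_refreshes=5, window=timedelta(minutes=10)):
    """Return the set of (client, query) pairs refreshed more than max_refreshes
    times within any single window. Entries must be sorted by 'time'."""
    recent = defaultdict(deque)  # (client, query) -> timestamps still inside the window
    flagged = set()
    for entry in entries:
        key = (entry["client"], entry["query"])
        times = recent[key]
        times.append(entry["time"])
        # Drop timestamps that have fallen out of the sliding window.
        while entry["time"] - times[0] > window:
            times.popleft()
        # The first request is the original search; the rest are refreshes.
        if len(times) - 1 > max_refreshes:
            flagged.add(key)
    return flagged


# Hypothetical sample: one client sends ".." every minute, another searches twice.
start = datetime(2008, 5, 6, 9, 0)
log = [{"client": "10.0.0.9", "query": "..", "time": start + timedelta(minutes=i)}
       for i in range(8)]
log += [{"client": "10.0.0.5", "query": "widgets", "time": start + timedelta(minutes=i)}
        for i in (0, 3)]
log.sort(key=lambda e: e["time"])

junk = find_repeat_junk(log)
clean = [e for e in log if (e["client"], e["query"]) not in junk]
print(junk)
print(len(log), "entries,", len(clean), "after removing repeat junk")
```

The flagged pairs are a good starting point for the junk list described above, and the clients that appear in them most often are the candidates for blocking.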

Robot Crawler Queries
For public sites, having search engine robots and other intelligent agents crawl search results may be a good thing. Incoming links are always good, and it may be that the URL of the search results on your site for emerald green widgets is number one in webwide search results and drives good traffic. However, these requests are more like incoming links than search queries, and you may want to remove them from the query dataset. In other cases, there may be random robots sending hundreds of requests and wasting your search engine cycles: treat those as repeat queries.

Server Hack Queries
Search engines are hit by the standard web server hacking probes, such as phpmyadmin and inurl, and by search requests sent with PUT or POST instead of GET. There may also be huge amounts of text in an attempt to overflow the server's buffer. Treat these as repeated query patterns for removal, but also work with your security team, web hosting service, and/or search vendor to test the search engine against these attacks.
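
A rough filter for these entries might look like the sketch below. Everything in it is an assumption to adjust: the field names query and method, the 512-character length limit, and the idea that only GET is a legitimate method (some sites do submit search forms with POST).

```python
import re

# The two probe strings mentioned above; extend the pattern from your own logs.
ATTACK_PATTERN = re.compile(r"phpmyadmin|\binurl\b", re.IGNORECASE)
MAX_QUERY_CHARS = 512  # hypothetical limit; pick one that fits real queries on the site


def hack_reasons(entry, allowed_methods=("GET",), max_chars=MAX_QUERY_CHARS):
    """Return a list of reasons this entry looks like an attack (empty if none)."""
    reasons = []
    query = entry.get("query") or ""
    if entry.get("method", "GET").upper() not in allowed_methods:
        reasons.append("method")
    if len(query) > max_chars:
        reasons.append("length")
    if ATTACK_PATTERN.search(query):
        reasons.append("pattern")
    return reasons


print(hack_reasons({"query": "inurl:admin.php", "method": "GET"}))
print(hack_reasons({"query": "x" * 2000, "method": "POST"}))
print(hack_reasons({"query": "emerald green widgets", "method": "GET"}))
```

Recording the reason alongside each removed entry gives the security team something concrete to look at, rather than a bare count.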

Search Spamdexing
Spammers often insert fake comments with URLs into guestbooks, blogs and wikis (and there's a Wikipedia page: Spam in blogs). Many of them do the same with search fields, which explains why logs contain bizarre queries with spaces, HTML formatting, square brackets, and URLs in them. These can distort query length metrics and analytics in general.

It's fairly easy to identify these queries with simple regular expressions looking for href, http and domain name patterns. Treat these as repeat queries and remove them before generating analytics reports.
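
As a sketch of what such a regular expression might look like in Python: the three alternatives below cover href, http and bare domain names. The short list of top-level domains is an assumption for illustration, not a complete list, and the function name is made up.

```python
import re

SPAM_PATTERN = re.compile(
    r"""
    href\s*=                                         # HTML link attribute
    | https?://                                      # explicit URL scheme
    | \b(?:[a-z0-9-]+\.)+(?:com|net|org|info|biz)\b  # bare domain name (assumed TLD list)
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_spam_query(query):
    """True when the query contains link markup, a URL, or a domain-like token."""
    return SPAM_PATTERN.search(query) is not None


samples = [
    "<a href='x'>buy now</a>",
    "[url=http://pills.example]cheap[/url]",
    "visit Cheap-Pills.COM now",
    "emerald green widgets",
]
for query in samples:
    print(is_spam_query(query), query)
```

The domain-name alternative will also catch legitimate queries for things that are named like a domain (asp.net, for instance), so it's worth reviewing the matches before removing them from the dataset.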

Test Queries
Automated testing, or even heavy manual testing, can change the search log significantly -- especially given how quickly the Long Tail shows up. As above, test queries add to traffic and response time, but should be removed from all other analysis. It's usually easy to identify internal testers and disallow them by IP address or user ID. For ad-hoc external tests, I recommend that everyone start by searching for their own name and/or a special testing string. This may not be enough to specify a formal search session, but it does indicate that something's happening.
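
Both approaches can be combined in one small check. A minimal sketch: the IP address, user ID and testing string are hypothetical placeholders, as are the field names ip, user_id and query.

```python
# Hypothetical values; fill these in from your own testers and conventions.
TESTER_IPS = {"192.0.2.10"}
TESTER_USER_IDS = {"qa-robot"}
TEST_MARKER = "zzsearchtest"  # the agreed special testing string


def is_test_query(entry):
    """True for entries from known testers or containing the testing string."""
    query = (entry.get("query") or "").lower()
    return (
        entry.get("ip") in TESTER_IPS
        or entry.get("user_id") in TESTER_USER_IDS
        or TEST_MARKER in query
    )


log = [
    {"ip": "192.0.2.10", "user_id": None, "query": "widgets"},
    {"ip": "203.0.113.7", "user_id": "qa-robot", "query": "manual"},
    {"ip": "203.0.113.8", "user_id": None, "query": "ZZSearchTest run 4"},
    {"ip": "203.0.113.9", "user_id": None, "query": "emerald green widgets"},
]
analysis_set = [e for e in log if not is_test_query(e)]
print(len(log), "entries,", len(analysis_set), "kept for analysis")
```

The marker check only removes the entries that contain the testing string; queries that follow it from the same client still need a look by hand, since the marker alone doesn't define a session.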

In Praise of Search Log Data Cleansing

Search Metrics and Analytics provide amazing insights into user information needs, but only when they reflect actual user searches. Log junk, as defined above, skews metrics and wastes analysis time and effort. So it's worth setting up processes to clean out the junk and concentrate on the meaningful log entries.

Updated 2011-1-21
