SearchTools Blog
A Taxonomy for "Search Log Junk" 
6th-May-2008 10:57 am

Search logs contain wonderful bits of data about what the customers and visitors to a web site think they are looking for, and search log analysis is how to synthesize that data. Search logs also contain a lot of weird entries, which can have a significant effect on search log analysis because they skew the statistics without adding any insight. Having looked at tens of thousands of lines of search log entries, and with input from colleagues Lee Romero and Shaun Ryan, I have listed some of the least useful kinds of log entry, which I call Search Log Junk, inspired by Edward Tufte's chart junk. I haven't heard of any search engines that filter these entries out before reporting, so it's up to search admins to clean our data for analysis.

Empty Queries
Queries without any usable query text or parameters can appear when people just click the "Search" button. Or perhaps search is the first form field, so the cursor gets into that field, and users press Return: it's a bit of a mystery. These entries can skew the search analytics, because they tend to return either no matches or all matches. Recommendations:

  • Track traffic and response time metrics both with and without empty queries.
  • Track referring URLs for empty queries and analyze them periodically: if there are upward trends in specific locations, something in a page may be causing problems.
  • Remove empty queries from the base data set of queries and no-matches, for both metrics and analytics.
  • Design a user interface for empty queries, and test it. Should the result of an empty query be a no-matches page? A simple search page? A JavaScript error message? Should JavaScript catch the event and do nothing? It depends on your users, so whatever is proposed must be tested.
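
The first three recommendations can be sketched in a few lines of Python. This is only an illustration, assuming the log has already been parsed into dictionaries with hypothetical "query" and "referrer" keys; your search engine's log format will differ.

```python
from collections import Counter

def is_empty_query(query):
    """True when the query has no usable text (missing, empty, or only whitespace)."""
    return query is None or query.strip() == ""

def split_empty(entries):
    """Separate parsed log entries into (kept, empty) lists."""
    kept, empty = [], []
    for entry in entries:
        if is_empty_query(entry.get("query")):
            empty.append(entry)
        else:
            kept.append(entry)
    return kept, empty

def empty_referrers(empty_entries):
    """Count which referring URLs the empty queries came from."""
    return Counter(entry.get("referrer", "") for entry in empty_entries)

# Hypothetical sample data, not taken from a real log.
entries = [
    {"query": "widgets", "referrer": "/home"},
    {"query": "   ", "referrer": "/home"},
    {"query": "", "referrer": "/catalog"},
    {"query": None, "referrer": "/home"},
]
kept, empty = split_empty(entries)
print(len(entries), "total,", len(kept), "without empty queries")
print(empty_referrers(empty).most_common())
```

Keeping both lists makes it possible to report traffic with and without the empty queries, while only the kept list feeds the analytics.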
Prompt Text Queries
If there's a prompt in the search field, such as "search this site" or "catalog search", and no one thought it through, clicking on the search button or pressing Return in the field may send that prompt text as the query by default. If so, treat it like the empty query above, and fix it quickly.
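
For log cleanup, a small check like the following may be enough. It is a sketch that assumes you know the exact prompt strings your site has used; the two below are just the examples from this post.

```python
# Prompt strings the search field has displayed (assumed; list your own).
PROMPT_TEXTS = {"search this site", "catalog search"}

def is_prompt_text(query, prompts=PROMPT_TEXTS):
    """True when the query is nothing but the default prompt text."""
    # Lowercase and collapse whitespace so minor variations still match.
    normalized = " ".join(query.lower().split())
    return normalized in prompts

print(is_prompt_text("  Search This Site "))
print(is_prompt_text("search this site for widgets"))
```

Matching the whole normalized string, rather than a substring, avoids throwing away real queries that happen to contain the prompt words.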
Repeat Queries
It's common to see multiple identical strings in queries to the search engine from the same IP or user ID. Some automated client or other software may be calling for a refresh automatically -- my favorite was thousands of queries over months for two dots: .., but there are also random words and even complex query syntax. While a human user might re-open a search URL from time to time, anything more than five refreshes (not including page navigation) in ten minutes is likely to be junk. Recommendations:

  • Create a list of junk queries and patterns, with referer URLs and IP address or other identifying information -- though this may change. It's useful to include the dates they appeared, as well.
  • Track traffic and response time metrics both with and without these queries.
  • Remove these queries from the dataset for analytics.
  • To avoid wasting search resources on them, identify the worst offending IP addresses, and work with your IT department or web hosting service to block them from the search engine.
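
The rule of thumb above can be turned into a sliding-window check. This sketch assumes entries are (timestamp, client, query) tuples sorted by time, that requests for later result pages have already been excluded, and that "more than five" is counted as occurrences of the identical query; the threshold and window are parameters so they can be tuned.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def find_repeat_junk(entries, max_repeats=5, window=timedelta(minutes=10)):
    """Return the (client, query) pairs seen more than max_repeats times within window.

    entries: iterable of (timestamp, client, query), sorted by timestamp.
    """
    recent = defaultdict(deque)  # (client, query) -> timestamps still inside the window
    flagged = set()
    for ts, client, query in entries:
        key = (client, query)
        times = recent[key]
        times.append(ts)
        # Drop timestamps that have fallen out of the window.
        while ts - times[0] > window:
            times.popleft()
        if len(times) > max_repeats:
            flagged.add(key)
    return flagged

# Hypothetical sample: one client repeating ".." every minute, one normal user.
start = datetime(2008, 5, 6, 10, 0)
sample = [(start + timedelta(minutes=i), "10.0.0.1", "..") for i in range(7)]
sample.append((start + timedelta(minutes=3), "10.0.0.2", "emerald green widgets"))
sample.sort(key=lambda entry: entry[0])
print(find_repeat_junk(sample))
```

The flagged pairs are a starting point for the junk list, not a verdict: it is worth looking at each one before removing it or asking for an IP block.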
Robot Crawler Queries
For public sites, having search engine crawlers and intelligent agents crawl search results may be a good thing. Incoming links are always good, and it may be that the URL of the search results on your site for emerald green widgets is number one in webwide search results and drives good traffic. However, these are more like incoming links than search queries, and you may want to remove them from the query dataset. In other cases, there may be random robots sending hundreds of requests and wasting your search engine cycles: treat those as repeat queries.
Server Hack Queries
Search engines are attacked by the standard web server hacking probes, such as phpmyadmin and inurl, and by search requests sent with PUT or POST instead of GET. There may also be huge amounts of text sent in an attempt to overflow the server's buffer. Treat these as repeated query patterns for removal, but also work with your security team, web hosting service, and/or search vendor to test the search engine against these attacks.
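
A rough way to flag these entries for removal is sketched below. It assumes the log records the HTTP method, that your search form normally submits with GET, and that 512 characters is a sensible upper limit for a real query; all three are assumptions to adjust. This only cleans the log and does nothing to protect the server.

```python
PROBE_TERMS = ("phpmyadmin", "inurl:")  # examples from the text; extend from your own logs
MAX_QUERY_LENGTH = 512                  # assumed cutoff, tune to your site
EXPECTED_METHODS = {"GET"}              # assumes the search form submits with GET

def attack_reasons(method, query):
    """Return a list of reasons this request looks like a probe (empty if none)."""
    reasons = []
    if method.upper() not in EXPECTED_METHODS:
        reasons.append("unexpected method")
    lowered = query.lower()
    if any(term in lowered for term in PROBE_TERMS):
        reasons.append("probe term")
    if len(query) > MAX_QUERY_LENGTH:
        reasons.append("oversized query")
    return reasons

print(attack_reasons("GET", "emerald green widgets"))
print(attack_reasons("POST", "/phpmyadmin/index.php"))
```

Recording the reasons, rather than a plain yes or no, gives the security team something concrete to work with.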
Search Spamdexing
Spammers often insert fake comments with URLs into guestbooks, blogs and wikis (and there's a Wikipedia page: Spam in blogs). Many of them do the same with search fields, which explains why logs contain bizarre queries with spaces, HTML formatting, square brackets, and URLs in them. These can distort query length metrics and analytics in general.

It's fairly easy to identify these queries with simple regular expressions looking for href, http and domain name patterns. Treat these as repeat queries and remove them before generating analytics reports.
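
One possible regular expression is sketched here. The list of top-level domains and the square-bracket [url] pattern are my assumptions about what typical spam looks like; check them against your own logs before relying on them.

```python
import re

SPAM_PATTERN = re.compile(
    r"""
      <\s*a\b         # opening HTML anchor tag
    | \bhref\s*=      # href attribute
    | \[/?url\b       # square-bracket [url] markup
    | https?://       # explicit URL scheme
    | \bwww\.         # bare www. prefix
    | \b[a-z0-9-]+\.(?:com|net|org|info|biz|ru)\b   # domain-like token (assumed TLD list)
    """,
    re.IGNORECASE | re.VERBOSE,
)

def is_spam_query(query):
    """True when the query contains link markup or something shaped like a URL."""
    return SPAM_PATTERN.search(query) is not None

for query in ["emerald green widgets",
              '<a href="http://example.com">cheap pills</a>',
              "[url=http://example.com]click[/url]"]:
    print(is_spam_query(query), query)
```

The domain pattern will also catch legitimate queries such as a product name ending in .net, so review what it flags on sites where people search for domain names or technical terms.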
Test Queries
Automated testing, or even heavy manual testing, can change the search log significantly -- especially given how quickly the Long Tail shows up. As above, test queries add to traffic and response time, but should be removed from all other analysis. It's usually easy to identify internal testers and exclude them by IP address or user ID. For ad-hoc external tests, I recommend that everyone start by searching for their own name and/or a special testing string. This may not be enough to specify a formal search session, but it does indicate that something's happening.
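
A filter along these lines might look like the sketch below. The IP address, user ID and testing string are placeholders I made up, and the entries are assumed to be dictionaries with "ip", "user" and "query" keys.

```python
TEST_IPS = {"192.0.2.10"}     # hypothetical internal tester address
TEST_USERS = {"qa-team"}      # hypothetical tester user ID
TEST_MARKER = "zzsearchtest"  # hypothetical special testing string

def is_test_entry(entry):
    """True when the entry comes from a known tester or contains the testing string."""
    query = (entry.get("query") or "").lower()
    return (
        entry.get("ip") in TEST_IPS
        or entry.get("user") in TEST_USERS
        or TEST_MARKER in query
    )

# Hypothetical sample data, not taken from a real log.
log = [
    {"ip": "192.0.2.10", "user": None, "query": "widgets"},
    {"ip": "198.51.100.7", "user": None, "query": "ZZSearchTest widgets"},
    {"ip": "198.51.100.8", "user": None, "query": "emerald green widgets"},
]
real = [entry for entry in log if not is_test_entry(entry)]
print(len(log), "entries,", len(real), "after removing test queries")
```

This only catches the marker query itself. Deciding which later queries from the same tester belong to the test is a separate judgment call.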

In Praise of Search Log Data Cleansing

Search Metrics and Analytics provide amazing insights into user information needs, but only when they reflect actual user searches. Log junk, as defined above, skews metrics and wastes analysis time and effort. So it's worth setting up processes to clean out the junk and concentrate on the meaningful log entries.

Updated 2011-1-21

11th-May-2008 10:18 pm (UTC) - Default Text in search box
This is a good post Avi,
One small point to add to the point you made about the empty queries. If you have some text in your search box - like "enter your search here" then that will inevitably be one of the most popular queries in your search logs and should be treated the same as an empty query.
12th-May-2008 10:18 pm (UTC) - Re: Default Text in search box
Good point, though depressing.

A much much better User Experience would be a JavaScript to catch that case and bring up a confirmation dialog (Do you want to search this site? OK/Cancel) and redirect to a useful page...
14th-May-2008 02:31 am (UTC) - Re: Default Text in search box
I agree with you about the JavaScript. Potentially it should also catch the case for when there is an empty query and advise the user that they should enter something in the search box then press the search button. This should eliminate the empty queries in the box.
17th-Oct-2008 01:03 am (UTC)
When I enter a search term into the search box the popup suggestion box that drops down covers the "go" and "search" buttons.
17th-Oct-2008 08:52 pm (UTC)
Interesting, I hadn't thought of that problem with autocomplete. This is Wikipedia, right? That menu is pretty much a Go menu, because it's giving article title matches. As soon as you type some text (except spaces) that doesn't match an article title, the menu goes away. But they ought to have a special case if you type more spaces, to recognize that you want something outside the menu.

Also I noticed that their autocomplete shows all the redirect pages, even those with minor upper/lowercase variations. Four different items pointing at one page, for "Rachel Wei", not exactly a great user experience.

Most site searches don't have a Go concept, and if they do autocomplete, they have more room for the Search button.
17th-Oct-2008 03:55 am (UTC)
All you have to do as a site owner is put up your site, get links, and let the crawlers in. They'll do the magic.
7th-Jul-2008 01:43 pm (UTC) - Search Engines
Thanks for this post, it is very useful. Do you have any experience with search engines like ESP, from FAST (http://www.fast.no), or i411, which is now Intelligenx (http://i411.com)? Can you recommend one of them?
Thanks again!

Helcio Filho!!!
8th-Jul-2008 11:52 pm (UTC) - Re: Search Engines
I have worked a bit with FAST (now bought by Microsoft) and quite extensively with Intelligenx. Other faceted metadata enterprise search engines include Endeca, Siderean, DieselPoint and the open-source Solr search engine.

Which one is best for you depends a lot on your information needs, your current and anticipated content size, your users, your developers and tools platforms, etc.

You are welcome to contact me directly for a bit more information, and I am available for consulting at a reasonable fee.

Avi Rappoport
Search Tools Consulting
30th-Jul-2008 04:21 am (UTC) - Thanks for sharing
I loved your post. We once faced the repeat query problem with one of our clients' site. Thankfully, our programmers sorted out the problem.
30th-Jul-2008 04:12 pm (UTC) - Re: Thanks for sharing
You're very welcome, I'm glad to hear it helped.