Search logs contain a lot of weird things, and some of them can have a significant effect on search log analysis. Having looked at tens of thousands of lines of search log entries, I offer this first attempt at defining some of the weirdest and least useful kinds of log entry, which I call "Search Log Junk". Here are the types of junk that I've seen most frequently:
- Empty Queries
- Queries without any query text or usable parameters. These can appear when people think the "Search" button is important in and of itself. Or perhaps the search field is the first form field on the page, so the cursor lands there and users press Return. These are often sent from the home page, according to the referer fields I've seen.
The first thing is to make sure that the search engine is doing something reasonable in this case. This could be bringing up a helpful search page, adding a script that shows an error dialog, or adding a script that ignores the empty query. I'm leaning towards the last option.
I've found only a couple of ways to use this information. Empty queries are still useful for traffic and response time metrics, and I think it's useful to check the top referring pages occasionally. A lot of empty queries from a page deep within a site may indicate some navigation problems.
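Checking the top referring pages for empty queries can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the log has already been parsed into (search URL, referring page) pairs, the query text lives in a parameter named `q`, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def is_empty_query(url, param="q"):
    """Return True if the search URL carries no query text in `param`.

    `param` is an assumed parameter name; use whatever your engine expects.
    """
    query_string = urlsplit(url).query
    values = parse_qs(query_string, keep_blank_values=True).get(param, [])
    # Whitespace-only text counts as empty too.
    return not any(value.strip() for value in values)

def empty_query_referrers(entries, param="q"):
    """Count referring pages for empty queries.

    `entries` is an iterable of (search_url, referring_page) pairs.
    """
    return Counter(ref for url, ref in entries if is_empty_query(url, param))
```

Calling `empty_query_referrers(entries).most_common(10)` would then list the pages that send the most empty queries, which is where I would look first for navigation problems.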
- Repeat Queries
- Multiple identical queries to the search engine from the same IP or user ID. My best guess is that the client is calling for a refresh automatically -- my favorite was thousands of queries over months for two dots: "..".
Again, this is useful for traffic metrics and possibly for identifying really weird incoming links. For most situations, it won't affect the statistics in any important way. But if there are hundreds of repeat queries by the same client, removing them from the database allows you to concentrate on the real data. You may also want to ban that IP address.
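A rough way to find the heavy repeaters is to count identical (client, query) pairs and flag those above a threshold. The sketch below assumes the log has been reduced to (client, query) pairs, where the client is an IP address or user ID; the threshold of 100 and the function names are placeholders of mine, not recommendations.

```python
from collections import Counter

def _key(client, query):
    # Treat queries that differ only in case or surrounding spaces as identical.
    return (client, query.strip().lower())

def heavy_repeaters(entries, threshold=100):
    """Return {(client, query): count} for pairs seen at least `threshold` times."""
    counts = Counter(_key(client, query) for client, query in entries)
    return {pair: n for pair, n in counts.items() if n >= threshold}

def drop_repeats(entries, flagged):
    """Return the entries whose (client, query) pair was not flagged."""
    return [(c, q) for c, q in entries if _key(c, q) not in flagged]
```

The flagged dictionary doubles as a list of candidates for an IP ban, while the output of `drop_repeats` is what goes into the analysis.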
- Robot crawlers
- Having search and intelligent agents crawl search results may be a good thing. Incoming links are always good, and it may be that the search results page on your site for "emerald green widgets" is number one in webwide search results and drives good traffic. However, there may be other robots wasting your search engine cycles: for those, a combination of robots.txt and banning their IP address will help.
- Server hacks
- Search engines are attacked with the standard web server hacking parameters, such as "phpmyadmin". They may also be subject to buffer overflow and other attacks, so search engines should be included in standard website security audits and checklists.
- Guestbook spam
- There are automated advertising services that insert fake comments with URLs into form fields, guestbooks, blogs and wikis (and there's a Wikipedia page about them). Many of them do the same with search fields, which explains why logs contain bizarre queries with spaces, HTML formatting and URLs in them.
For sites with light search traffic, these meaningless entries can cause problems with both traffic metrics and top query listings. Even for sites with thousands of queries per day, they can distort statistics about the average length of query, so removing them from your analysis database is a good idea.
It's fairly easy to identify these queries with simple regular expressions looking for href, http and .com. I haven't heard of any search engines that filter these queries, though some may be doing it without bothering their customers about it.
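As a minimal sketch of such a filter, here is one regular expression covering the three markers mentioned above. The exact pattern is my own guess at a reasonable starting point: it looks for `href=`, `http://` or `https://`, and `.com` at the end of a word, and it will need tuning against your own logs. The names `SPAM_PATTERN` and `looks_like_spam` are hypothetical.

```python
import re

# Assumed pattern: an href attribute, a URL scheme, or a ".com" domain suffix.
SPAM_PATTERN = re.compile(r"href\s*=|https?://|\.com\b", re.IGNORECASE)

def looks_like_spam(query):
    """Return True if the query text contains link-style spam markers."""
    return bool(SPAM_PATTERN.search(query))

def split_spam(queries):
    """Separate queries into (clean, spam) lists for review."""
    clean, spam = [], []
    for query in queries:
        (spam if looks_like_spam(query) else clean).append(query)
    return clean, spam
```

Keeping the spam list instead of discarding it makes it possible to spot-check for false positives, such as a real user searching for a company name that ends in ".com".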
- Internal testing queries
- For light-traffic sites, any kind of automated testing, or even heavy manual testing, can change the search log significantly -- especially given how quickly the Long Tail shows up. Remove queries from testers by user ID or IP address to look at real user data.
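Filtering out testers might look something like the sketch below. It assumes each log entry is a dictionary with optional `user_id` and `ip` keys, and that you keep your own list of tester IDs and office or test-lab networks; those field names and the helper names are illustrative, not from any particular search engine's log format.

```python
import ipaddress

def is_tester(entry, tester_ids, tester_networks):
    """Return True if the entry comes from a known tester ID or network."""
    if entry.get("user_id") in tester_ids:
        return True
    try:
        ip = ipaddress.ip_address(entry.get("ip") or "")
    except ValueError:
        # Missing or malformed address: cannot match a tester network.
        return False
    return any(ip in network for network in tester_networks)

def real_user_entries(entries, tester_ids, tester_networks):
    """Return only the entries that do not come from testers."""
    return [e for e in entries if not is_tester(e, tester_ids, tester_networks)]
```

For example, `tester_networks` could be `[ipaddress.ip_network("192.0.2.0/24")]` for a test lab, with `tester_ids` a set such as `{"qa-bot"}`.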