SearchTools Blog
About SearchTools
This is the blog for SearchTools.com, a free and unbiased site about web site, enterprise and intranet search engines. These are smaller than public web search engines, such as Google or Yahoo -- they do the work behind the search fields on commerce sites, newspaper archives, government sites, etc.

For more information, please see the main SearchTools Site.

Search Tools Consulting offers advice to companies and individuals who need help choosing and configuring search engines. For more information, please leave a comment for any post, see the Consulting page or fill in the contact form.

6th-May-2008 10:57 am - A First Taxonomy for "Search Log Junk"

Search logs contain a lot of weird things, and some of them can have a significant effect on search log analysis. Having looked at tens of thousands of lines of search log entries, I offer this first attempt at defining some of the weirdest and least useful kinds of log entry, which I call "Search Log Junk". Here are the types of junk that I've seen most frequently:

Empty Queries
Queries without any query text or usable parameters. These can appear when people think the "Search" button is important in and of itself. Or perhaps search is in the first page form, and the cursor gets into that field and users press Return. These are often sent from the home page, according to the referer fields I've seen.

The first thing is to make sure that the search engine is doing something reasonable in this case. This could mean bringing up a helpful search page, running a script that shows an error dialog, or running a script that ignores the empty query. I'm leaning towards the last option.

I've found only a couple of ways to use this information. Empty queries are still useful for traffic and response time metrics, and I think it's useful to check their top referring pages occasionally. A lot of empty queries from a page deep within a site may indicate some navigation problems.
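As a rough illustration of checking the top referring pages for empty queries, here is a minimal Python sketch. The log entry layout (dicts with "query" and "referer" keys) and the sample data are assumptions made for the example; real search logs will need their own parsing.

```python
from collections import Counter

def empty_query_referers(entries):
    """Count empty queries per referring page, most frequent first.

    `entries` is an iterable of dicts with hypothetical keys
    "query" and "referer"; adapt this to the real log format.
    """
    counts = Counter()
    for entry in entries:
        # Treat missing, None, and whitespace-only queries as empty.
        query = (entry.get("query") or "").strip()
        if not query:
            counts[entry.get("referer") or "(none)"] += 1
    return counts.most_common()

# Made-up sample entries, for illustration only.
sample_log = [
    {"query": "", "referer": "/"},
    {"query": "widgets", "referer": "/"},
    {"query": "   ", "referer": "/products/deep/page.html"},
    {"query": None, "referer": "/"},
]
top_referers = empty_query_referers(sample_log)
```

A deep page that shows up near the top of this list would be a candidate for a closer look at its navigation.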

Repeat Queries
Multiple identical queries to the search engine from the same IP or user ID. My best guess is that the client is calling for a refresh automatically -- my favorite was thousands of queries over months for two dots: "..".

Again, this is useful for traffic metrics and possibly for identifying really weird incoming links. For most situations, it won't affect the statistics in any important way. But if there are hundreds of repeat queries by the same client, removing them from the database allows you to concentrate on the real data. You may also want to ban that IP address.
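One possible way to set heavy repeaters aside before analysis is sketched below in Python. The entry layout ("client" and "query" keys), the sample data, and the threshold of 100 are all assumptions for the example, not values taken from any particular search engine's log format.

```python
from collections import Counter

def drop_heavy_repeats(entries, threshold=100):
    """Split out identical queries repeated many times by one client.

    `entries` is a list of dicts with hypothetical keys "client"
    (IP address or user ID) and "query". Returns (kept, flagged),
    where `flagged` is the set of (client, query) pairs seen at
    least `threshold` times and `kept` excludes those entries.
    """
    pair_counts = Counter((e["client"], e["query"]) for e in entries)
    flagged = {pair for pair, n in pair_counts.items() if n >= threshold}
    kept = [e for e in entries if (e["client"], e["query"]) not in flagged]
    return kept, flagged

# Made-up sample: one client sending ".." 150 times, one real user.
sample = [{"client": "10.0.0.5", "query": ".."}] * 150 + [
    {"client": "10.0.0.9", "query": "emerald green widgets"},
    {"client": "10.0.0.9", "query": "emerald green widgets"},
]
kept, flagged = drop_heavy_repeats(sample, threshold=100)
```

The flagged pairs double as a short list of clients to review before deciding whether to ban an address.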

Robot crawlers
Having search and intelligent agents crawl search results may be a good thing. Incoming links are always good, and it may be that your site's search results page for emerald green widgets is number one in webwide search results and drives good traffic. However, there may be other robots wasting your search engine cycles: for those, a combination of robots.txt and banning their IP addresses will help.

Server hacks
Search engines are attacked by the standard web server hacking parameters, such as "phpmyadmin". They may also be subject to buffer overflow and other attacks, so they should be included in standard website security audits and checklists.

Guestbook spam
There are automated advertising services that insert fake comments with URLs into form fields, guestbooks, blogs and wikis (and there's a Wikipedia page about them). Many of them do the same with search fields, which explains why logs contain bizarre queries with spaces, HTML formatting and URLs in them.

For sites with light search traffic, these meaningless entries can cause problems with both traffic metrics and top query listings. Even for sites with thousands of queries per day, they can distort statistics about the average length of query, so removing them from your analysis database is a good idea.

It's fairly easy to identify these queries with simple regular expressions looking for href, http and .com. I haven't heard of any search engines which filter this, though some may be doing it without bothering their customers about it.
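A minimal Python sketch of such a filter follows, using the three markers mentioned above (href, http and .com). The function name and sample queries are made up for illustration, and the pattern is deliberately crude.

```python
import re

# Matches the three markers named in the text, case-insensitively.
# The dot in ".com" is escaped so it matches a literal period.
SPAM_PATTERN = re.compile(r"href|http|\.com", re.IGNORECASE)

def looks_like_spam(query):
    """Return True if the query contains href, http or .com."""
    return bool(SPAM_PATTERN.search(query))

# Made-up sample queries, for illustration only.
queries = [
    '<a href="http://example.com">cheap pills</a>',
    "emerald green widgets",
    "HTTP://spam.example/offer",
]
clean_queries = [q for q in queries if not looks_like_spam(q)]
```

Note that a pattern this simple also flags legitimate queries that happen to contain a site name ending in .com, so it may be worth reviewing a sample of the matches before deleting anything.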

Internal testing queries
For light traffic sites, any kind of automated testing, or even heavy manual testing, can change the search log significantly -- especially given how quickly the Long Tail shows up. Remove queries from testers by user ID or IP address to look at real user data.
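Filtering out tester traffic can be sketched in a few lines of Python. The user IDs, IP addresses, and entry keys ("user_id", "ip", "query") below are hypothetical placeholders; substitute whatever identifies testers in the actual log.

```python
# Hypothetical tester identities; replace with real values.
TESTER_USER_IDS = {"qa-bot", "test-user"}
TESTER_IPS = {"192.0.2.10", "192.0.2.11"}

def is_tester(entry):
    """Return True if the entry matches a known tester user ID or IP."""
    return (
        entry.get("user_id") in TESTER_USER_IDS
        or entry.get("ip") in TESTER_IPS
    )

def real_user_entries(entries):
    """Keep only the log entries that did not come from testers."""
    return [e for e in entries if not is_tester(e)]

# Made-up sample entries, for illustration only.
sample_entries = [
    {"user_id": "qa-bot", "ip": "198.51.100.7", "query": "test 123"},
    {"user_id": None, "ip": "192.0.2.10", "query": "regression check"},
    {"user_id": None, "ip": "203.0.113.4", "query": "emerald green widgets"},
]
real_entries = real_user_entries(sample_entries)
```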
4th-Mar-2008 03:22 pm - partly offline due to injury
I slipped on a stepladder and broke my left leg (tibial plateau fracture) and then chipped my right heel while on crutches. My office is not really wheelchair accessible, nor can I go down my house's steps without great effort, so I'm working remotely, part-time.

I am trying to read email every day and respond in a timely way, so if you've left a voice message or sent email that I have not answered, please try again (by email if possible). Apologies for your inconvenience.

Avi
23rd-Jan-2008 01:56 pm - where has Entopia gone?
One of my clients is interested in Entopia, so I was taking a look.

I tried to go to the web site and it was replaced by one of those placeholder spam sites which pops up several spammy windows. It seems like the kind of thing that might have viruses, worms or trojans, so I'd suggest against opening the site in IE, or really, at all on a Windows machine.

No one answered at one phone number; the other two I found were disconnected.

Casualty of the recession? Acquired by someone? It's a mystery, and I'm curious.

ETA: The Wayback Machine (archive.org) has an actual home page as of June 13, 2006 and an empty page as of July 1 of that year. I always thought they were promising more than they could deliver, so this is perhaps confirmation.
18th-Dec-2007 05:28 pm - Small updates to Search Tools reports
We've updated the following reports on search engines large and small in the last few weeks:
  • i411 has changed its name to Intelligenx and added autocategorization and multiple language support.
  • Engenium now has an OEM library and an automatic clustering module.
  • FreeFind now has wildcards for excluding URL paths from indexing, indexing of common office document file formats, relevance weight adjustments for URL paths (with wildcards), and some really nice indexing reports -- URLs extracted, server response, status, and which URLs are actually in the searchable index.
  • HomePageSearchEngine now indexes more file types.
  • Doclinx now has a web monitoring agent, with support for speech recognition, for research and competitive intelligence, and a language analyzer.
  • Boolean Search now runs natively on both PPC and Intel Mac OS X systems, includes web-based admin, spellchecking and match term highlighting in search results, template and AppleScript integration for search results formatting, standalone search server, and regular expressions in queries.
  • Crawl-it remote service is still being supported.
  • Datagold is no longer a separate search; it's part of an online archiving suite.
  • Educasoft has no indication of continuing development.
19th-Sep-2007 04:58 pm - Search Conferences Listing updated
This list covers all the search and related conferences I know about.

At the Enterprise Search Summit West I will be doing a pre-conference workshop on Critical Success Factors (how search engines work and how to make them better), giving a presentation on Tuning Search using Analytics, and moderating a panel on Good Practices for Search User Interfaces. At the Web Builder 2.0 conference, I'll be presenting on Web Site Search and the User Experience. If you are a reader of this web site, please come and say hi, and if you'd like an online presentation to your organization or company, I do those as well.

To suggest a conference for the listing, please leave a comment and I'll add it.
29th-Aug-2007 03:30 pm - Critique of the Google Custom Search Traffic Report

Edward Tufte would be disappointed in Google. The traffic reports in the Google Custom Search Business Edition are not only insufficient, but somewhat misleading.

Below is a picture from a CSBE search for a B2B site that I helped install in August 2007. The fact that it's a line chart, with no data points given, filled underneath, makes it look active. It seems as though something's happening, the traffic is making progress, or worse, losing ground. The deep dips look scary, as though the site has done something wrong.

(examples here)

Edward Tufte wrote some enlightening books on these topics, including The Visual Display of Quantitative Information, which taught those of us paying attention that how data is presented deeply affects how it is received. I highly recommend getting some of Tufte's books, from Amazon, from Powell's or from your library (using WorldCat).

Please comment on whether you agree or disagree. I haven't seen quite this problem in other search engine traffic reports, but I'm wondering what other interfaces might look like, and what you think is best. Tell me your opinions, please!

20th-Aug-2007 01:50 pm - Google Search Appliance and Mini - SearchTools Report Updated

I have updated my report on the GSA and Mini search appliances, with detail based in part on my recent experiences customizing a Google Mini. The report includes information on the pricing as far as I could find it, the terms of licensing, new features, links to informative documents, and features that are not included with the Mini appliance.

Once I update my full product review, I will have a chance to pay attention to other search engines, and that will be lovely.

16th-Aug-2007 03:03 pm - Google CSE - different results when searching more than three sites
A support document for the Google CSE (Custom Search Engine) and CSBE (Custom Search Business Edition) notes that some results may be different than those found in the same search on Google.com. It attributes this to including more than three sites in the CSE, and says that the CSE is using a subset of the Google.com index.

They recommend limiting the CSE to three sites, changing the behavior to 'Search the entire web but emphasize included sites', or adding refinements that have the same effect.

As of August 16, 2007, the support note says "We're working to bring more complete results to all Custom Search Engines."
3rd-Aug-2007 10:42 am - Google Launches Site Search Service for Business
Google's Custom Search Business Edition uses the Google web search index limited by site or sites. It provides most of the Google web search features and is very cheap: only $100 per year for up to 5,000 pages, or $500 per year for up to 50,000 pages. More here at my InfoToday article / more at the SearchTools Google Service report page.

What do you think of it?
19th-Jul-2007 04:16 pm - New Google hosted search with no advertising

Called the Google Custom Search Business Edition, this is a hosted site search, designed for small businesses with web site content, who don't want the advertising displayed on the older Custom Search Engine.

This version uses Google's existing index of the Internet, searching all the pages they know about on the specified sites including non-HTML file types, using their query language, retrieval and relevance algorithms, and searching in multiple languages and character sets. Like the web search engine, there is no way to index pages protected by access control such as passwords or ACLs.

The default interface customization is limited to a logo and colors of the results page border, title, background, text and links, but the XML results format is fairly configurable using the Google AJAX Search API. While there is no structure in place to display site advertising on search results, presumably one could do that very easily with XML results. Reports are limited to top queries and queries per day/week/month/all, but can be connected to the Google Activity Monitor site traffic analysis tool.

Note that Google will not guarantee that they'll crawl all of the pages of a particular site, update on-demand, or even update frequently. Using this service will not improve a site's position in the Google.com search results.

Pricing is $100 per year for up to 5,000 pages; $500 per year for up to 50,000 pages (both payable by credit card via Google Checkout). According to ecommerce-guide.com, it seems to go to a $15,000 per year fee for up to 1 million pages, but potential customers should contact the company. (Non-profits, universities and government agencies can use the standard Custom Search and opt out of advertising.)

This page was loaded May 10th 2008, 5:55 am GMT.