SearchTools Blog
Stop Words Must Go (WikiMedia Search Analysis) 
14th-Oct-2008 05:50 pm
searchtools.com

MediaWiki search excludes 547 words as stopwords by default. But they're perfectly good words (you can see them on the searchtools site). The list is a MySQL full-text search default, and the MediaWiki people have never changed it. Exactly like the short words in the previous rant, these words are not indexed at all, so they can never be retrieved by the search engine. Stop words include: able, about, above, according, across, actually, after... So a site search made up entirely of those words gets "No page text matches", even when there are pages containing those words.

Example at knoppix.net: I tried the seven stopwords above and got not one match.

This message is not just unhelpful, it's misleading. It doesn't even say which of the search terms are stop words, so there's no way to tell except trial and error (or looking at the list). But, contrary to the message, a search that combines an allowed word with a stopword or two, such as "surprise from behind", will match all articles containing the word "surprise", without checking that the article also includes "from" and "behind". Whoops.
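To make both failure modes concrete, here is a minimal Python sketch of an index that drops stopwords at indexing time and at query time. It is a toy model, not MySQL's or MediaWiki's actual code: the stoplist holds only the seven words quoted above plus "from" and "behind", and the page titles and texts are invented for illustration.

```python
import re
from collections import defaultdict

# Illustrative stoplist only: the seven words quoted in the post plus
# "from" and "behind". The real MySQL default list is much longer.
STOPWORDS = {"able", "about", "above", "according", "across",
             "actually", "after", "from", "behind"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(pages):
    """Map each non-stopword to the set of page titles containing it."""
    index = defaultdict(set)
    for title, body in pages.items():
        for word in tokenize(body):
            if word not in STOPWORDS:  # stopwords never reach the index
                index[word].add(title)
    return index


def search(index, query):
    """AND-search over the indexed terms.

    Returns (matching titles, query terms that were ignored as stopwords).
    """
    tokens = tokenize(query)
    terms = [t for t in tokens if t not in STOPWORDS]
    ignored = [t for t in tokens if t in STOPWORDS]
    if not terms:
        # Nothing searchable is left, so nothing can match.
        return set(), ignored
    results = set.intersection(*(index.get(t, set()) for t in terms))
    return results, ignored


# Hypothetical pages, invented for this example.
PAGES = {
    "Party": "A surprise party planned after work.",
    "Ambush": "They took him by surprise from behind.",
    "Overview": "About the project, and what comes after.",
}
INDEX = build_index(PAGES)

print(search(INDEX, "about after"))
print(search(INDEX, "surprise from behind"))
```

The first query finds nothing even though two pages contain "after" and one contains "about". The second returns both "Party" and "Ambush", although only "Ambush" contains all three words. Returning the ignored terms alongside the results is one way a search page could at least tell the user which words were dropped.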

There's a Wikimedia meta help page with the awkward title "Common words, searching for which is not possible". I find this all pretty user-hostile, and I think it stinks.

The main Wikipedia removed stopwords from search in February 2006. They don't say exactly why, though I find it blindingly obvious. But the default MediaWiki installation still uses the giant stopword list. To fix it, reconfigure MySQL, or try the procedures some nice user has posted. Reduce the stopword list to a reasonable minimum (the, a, an, and, or, not), or leave it out altogether. Or switch to Sphinx or MWSearch (Lucene), which have fewer stopwords by default and can be set to exclude none at all.
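Before settling on a reduced list, it may be worth checking whether any page titles on the wiki consist entirely of words on it, since those titles would stay unsearchable. The Python sketch below is one rough way to do that. The function name and the sample titles are hypothetical; only the six-word list comes from the suggestion above.

```python
# The "reasonable minimum" list suggested above.
MINIMAL_STOPWORDS = {"the", "a", "an", "and", "or", "not"}


def unsearchable_titles(titles, stopwords):
    """Return the titles in which every word is a stopword."""
    hits = []
    for title in titles:
        words = title.lower().split()
        if words and all(word in stopwords for word in words):
            hits.append(title)
    return hits


# Hypothetical titles, invented for this example.
titles = ["The The", "Or Not", "Knoppix FAQ"]

print(unsearchable_titles(titles, MINIMAL_STOPWORDS))
print(unsearchable_titles(titles, set()))
```

With the six-word list, the first two sample titles are flagged. With an empty list (no stopwords at all), nothing is.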

Comments 
16th-Oct-2008 05:55 pm (UTC) - Even a short stopword list has problems
Even a shorter list has serious problems. After I realized that the movie "Being There" was all stopwords, I started compiling a list. That list is posted here:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

Stopwords are a performance/scaling hack, not a relevance improvement. We've had better ways to do that for a long time.

16th-Oct-2008 08:56 pm (UTC) - Re: Even a short stopword list has problems
good point, once you get into inflections of "be" and "have", you're toast.