October 24th, 2008


Should Highlight Only Actual Match Words (MediaWiki Search)

Why MediaWiki's Site Search Stinks, Reason #6

Highlighting Match Words in Context is a Good Thing

The MediaWiki search results include the search terms, extracted from the articles (with markup), to give a sense of context. As explained in Reason #5, the surrounding text can show how the words are used within the article. Using bold or colored type to emphasize the words themselves is another good innovation, again pioneered by the information science KWIC (keyword in context) system and by Google's search results. (For more information, see Matching Search Terms In Context.)

For example, when looking up art bars, it's nice to know that the "Call to the Bar" article is about legal issues rather than the art world, and that Wikipedia does this properly.

results for search, showing art and bar in context

Substrings Are Not Match Words

A problem arises when the system does not use the same algorithm (programmatic rules) for identifying whole words in both the search query processing and the match term highlighting. In the MediaWiki default search, the retrieval engine is looking for whole words, while the highlight code uses a simple string match, so it marks parts of words. This is wrong.

For example, looking up insomnia in men on the Psychology wikia shows the substring men highlighted within the words mental, treatment and element. In this case, the search engine itself is ignoring men because it is a short word. Far from being transparent, this result formatting actually misleads users about how the search engine functions.
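To make the difference concrete, here is a minimal Python sketch of the two highlighting approaches. This is not MediaWiki's actual code (which is PHP); the function names, the sample snippet and the <b> markup are all my own assumptions, chosen only to illustrate the substring problem.

```python
import re

def highlight_substrings(text, terms):
    """Naive highlighter (hypothetical): marks a term wherever its letters
    appear, even in the middle of a longer word."""
    if not terms:
        return text
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(lambda m: f"<b>{m.group(0)}</b>", text)

def highlight_whole_words(text, terms):
    """Word-aware highlighter (hypothetical): marks a term only when it is
    bounded by non-word characters on both sides."""
    if not terms:
        return text
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"<b>{m.group(1)}</b>", text)

# Made-up snippet, not taken from any real wiki page.
snippet = "Insomnia in men may need mental health treatment."
terms = ["insomnia", "men"]

print(highlight_substrings(snippet, terms))
print(highlight_whole_words(snippet, terms))
```

The first call marks men inside mental and treatment; the second marks only the standalone word. The fix is just the pair of \b word-boundary anchors around the alternation.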

psychiatry wikia results for men treated as substrings

Wikipedia does not do this any more. They must have fixed this in the last two weeks, because it used to look like this:

wikipedia results with the letter "M" highlighted

Stopwords Are Not Match Words

For example, searching on the Tolkien Gateway wiki for the words elves well come is a problem, because the words well and come are on the stopword list from hell. So the search result shows many pages that do not conform to the search criteria, while the pages that really do match the search are buried at the bottom of the results. In this example, the word elves is treated as a substring of delves, and come as a substring of become, while the word well is marked but not actually part of the retrieval. There are many better pages of elf welcoming, so the substring highlighting is particularly misleading.

results with highlights in substrings
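One way to keep the highlighting honest is to run the query through the same filtering the retrieval engine uses, and then highlight only the terms that survive. Here is a rough Python sketch of that idea. The stopword set and the minimum word length below are placeholder assumptions for illustration, not the actual values any particular wiki or database uses, and the function names are hypothetical.

```python
import re

# Assumed values for illustration only; a real installation would take
# these from whatever the retrieval engine is actually configured with.
STOPWORDS = {"well", "come", "in"}
MIN_WORD_LEN = 4

def indexed_terms(query):
    """Return only the query words the (assumed) retrieval engine would
    really search for: long enough and not on the stopword list."""
    words = re.findall(r"\w+", query.lower())
    return [w for w in words if len(w) >= MIN_WORD_LEN and w not in STOPWORDS]

def highlight(text, query):
    """Highlight whole-word matches of the terms that were actually used
    in retrieval, and nothing else."""
    terms = indexed_terms(query)
    if not terms:
        return text
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"<b>{m.group(1)}</b>", text)

# Made-up sentence, not taken from the Tolkien Gateway wiki.
print(highlight("The elves become well as he delves.", "elves well come"))
```

With these assumptions, only elves gets marked: well is dropped as a stopword, and neither delves nor become is touched. The point is that the list of terms to highlight comes from the same rules as the search itself, so the display cannot claim matches the engine never made.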


As far as I can tell, the default function for the highlighting is $wgAdvancedSearchHighlighting (which does it wrong). There's a supplement: &$wgSearchHighlightBoundaries which uses regular expressions to find the word boundaries. That's what the main Wikipedia and MediaWiki sites are doing now.

Upgrading MediaWiki software to a newer version may solve this problem, although versions 1.13.2 and 1.14 alpha can still do it wrong. Note that including the MWSearch extension (or Lucene) will not, by itself, fix the highlighting problem. Switching to the Sphinx Search Extension fixes that, along with many of the other problems with MediaWiki Search.

Please comment with questions, clarifications, even arguments, in this blog. This is the last of the big reasons for a bit, though I have yet more screenshots that make me boggle.
