August 26th, 2008

Great overview article on Inverted Indexing for Search

I am doing a talk about going inside the black box of the search index for the Enterprise Search Summit in September in San Jose (more on that later).

While I have a lot to say about indexes, I used the opportunity to look for current research on the topic, and pretty much struck gold. Although this paper is from 2006, it is exhaustive and detailed, with both practical and theoretical information. Among its findings: inverted indexes are both significantly faster to search and easier to maintain than relational database management systems, signature files and suffix arrays. It also has a thorough annotated bibliography. Best of all, Zobel and Moffat agree with me on lowercasing all words in the index and on including stopwords, which they say "have an important role in phrase queries".
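To make the lowercasing and stopword point concrete, here is a minimal sketch of a positional inverted index in Python. It is my own toy illustration, not code from the paper: the function names (`tokenize`, `build_index`, `phrase_search`) and the sample documents are made up, and a real engine would add compression, ranking and much more. It only shows why throwing away words like "to", "be" and "the" would make some phrase queries impossible to answer.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words. Stopwords are kept."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build a positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of documents containing the words of phrase in order."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        # The phrase matches if each following term sits at the next position.
        for start in index[terms[0]][doc_id]:
            if all(start + i + 1 in later[i] for i in range(len(later))):
                hits.append(doc_id)
                break
    return hits

# Hypothetical sample documents, chosen to be heavy on stopwords.
docs = {
    1: "To be or not to be",
    2: "The Who played to a full house",
    3: "Who will be there?",
}
index = build_index(docs)

print(phrase_search(index, "to be"))    # [1]
print(phrase_search(index, "The Who"))  # [2]
```

Both queries consist entirely of words that a typical stopword list would drop. With those words removed from the index, "to be" would match nothing and "The Who" would be indistinguishable from a search for "who". Lowercasing at both index time and query time is what lets "The Who" and "the who" find the same document.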

Inverted files for text search engines
by Justin Zobel and Alistair Moffat
ACM Computing Surveys. 2006;38(2) (56 pages).
Available from:

Unfortunately, this article is firmly behind the ACM paywall, so if you or your institution don't have a subscription, you have to jump through a few hoops to get it. When you click the PDF link, you will be denied access and sent to their free registration form. After that, there's a short form where you can buy the article for $10 by credit card. I think it's worth it.

