SearchTools Blog
Great overview article on Inverted Indexing for Search 
26th-Aug-2008 12:37 pm
searchtools.com
I am doing a talk about going inside the black box of the search index for the Enterprise Search Summit in September in San Jose (more on that later).

While I have a lot to say about indexes, I used the opportunity to check around and look for current research on the topic, and pretty much struck gold. Although this paper is from 2006, it is exhaustive and detailed, with both practical and theoretical information, including finding that inverted indexes are both significantly faster to search and easier to maintain than relational database management systems, signature files and suffix arrays. It also has a thorough annotated bibliography. Best of all, Zobel and Moffat agree with me on lowercasing all words in the index and including stopwords, which they say "have an important role in phrase queries".

Inverted files for text search engines
by Justin Zobel and Alistair Moffat
ACM Computing Surveys. 2006;38(2) (56 pages).
Available from: http://doi.acm.org/10.1145/1132956.1132959

Unfortunately, this article is firmly behind the ACM firewall, so if you or your institution don't have a subscription, you have to go through a few hoops to get it. Click the PDF link, you will be denied access and have to go through their free registration form. After that, there's a little form and you can buy the article for $10 by credit card. I think it's worth it.

ETA: improved my title
Comments 
26th-Aug-2008 09:25 pm (UTC) - One free copy on-line
I found it under the "PDF" link on this page: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.8844
28th-Aug-2008 12:17 am (UTC) - Re: One free copy on-line
Cool, though I didn't really mind paying for it. What are your thoughts on the content? Any arguments?
8th-Sep-2008 09:20 pm (UTC)
The article is good, however, I do not understand why Zobel and Moffat stubbornly describe bit-oriented compression codes and say nothing about word-oriented codes. Bit-oriented codes are 2-3 times slow than VBC or other byte- and/or word-oriented compression schemes, while the compression ratio is only marginally better. This is a double surprise for everybody who knows that Moffat (and probably Zobel) has a couple of publications regarding this issue, e.g. this one.


Edited at 2008-09-08 09:24 pm (UTC)
9th-Sep-2008 01:06 am (UTC)
Huh, I didn't know that. Have you asked them? I'm slowly plowing through the Anh article, as it's more technical than my usual.

Also, I found a brand-new book that seems to cover word-oriented compression, "Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. - http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html. What do you think of this one?

9th-Sep-2008 06:18 pm (UTC)
No, I have not asked. As to the book, sorry, I have not read it.
9th-Sep-2008 01:08 am (UTC)
Also, I improved the title of this post, because it was a bit misleading. What I like about the Zobel & Moffat paper is that it's a nice introductory overview, and in fact, not too technical.
This page was loaded Jul 19th 2009, 2:16 am GMT.