Vik Singh wrote an in-depth post comparing open-source search engines. He tested the default configurations of Lucene, Zettair, Sphinx, and Xapian, with a nod to SQLite.
In the feedback section there are some interesting comments, with several experts on the various engines pointing out that it's a bit odd to throw default settings at specialized content and expect a robust comparison. Otis Gospodnetić (one of the Lucene/Solr core developers) responded in Open Source Search Benchmark, and Charlie Hull posted Xapian compared.
The first corpus is approximately a million tweets, which are tiny: they are at the "very short" edge of any spectrum of content items. The table of metrics focuses on indexing speed, memory and disc size requirements.
The other source is about 200,000 journal-article metadata items (about 300MB) from the TREC-9 filtering track. It covers indexing memory requirements, speed, and size, along with search memory requirements, time, and relevancy.
I do not understand why Vik is so focused on index size. I've found that it's much better to take more room for an enriched index, with gentle stemming, including stopwords, alternates for accented characters and ambiguous-term punctuation, position and field metadata, etc. Disk space and memory are so much cheaper these days, and a totally stripped-down index makes many search features impossible to implement.
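As a concrete illustration of the kind of enrichment I mean, here is a minimal sketch (the helper names are my own, not from any of the engines discussed) of accent folding that indexes both the original and folded forms as alternates, rather than throwing information away to save space:

```python
import unicodedata

def fold_accents(token):
    """Strip combining marks so that 'café' also indexes as 'cafe'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def index_terms(token):
    """Emit the original lowercased token plus its accent-folded
    alternate, so both exact and loose queries can match.
    Stopwords are deliberately kept rather than dropped."""
    terms = {token.lower()}
    terms.add(fold_accents(token.lower()))
    return terms
```

Each alternate costs a little disk, but it lets the engine answer both an accent-sensitive query and a sloppy one, which a stripped-down index simply cannot do.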
Erlend Strømsvik, in the comments on this post, pointed out the importance of index add/delete, and that's where indexing speed and disc footprint might be a more significant issue to me. Now we're all looking at near-real-time inserts, which may or may not be compatible with the basic architecture of a search engine.
Using tweets and medical metadata and abstracts (which are by nature intensively hand-crafted) seems a bit limited. I'd like to see a more heterogeneous corpus, including long HTML, ugly HTML, broken HTML, random office documents, etc. My default for gathering this kind of thing is the US Federal government, which has no copyright as such.
In addition to out-of-the-box, it would be very interesting to see comparisons of lightly-tuned search engines, with maybe no more than 20 or 30 configuration-line changes. It's not a matter of fairness; it's just a more valuable comparison, starting from about the same place.
This test uses the TREC corpus, which includes 63 sets of query terms and, for each article, a judgment as to whether the article is very relevant, somewhat relevant, or not relevant to that query. Including a middle value reflects the ambiguous nature of search: for a lot of queries there is no binary yes-or-no answer. Some matched items are less relevant than others, but they are not irrelevant. DCG (discounted cumulative gain) seems to be a good way to include result position when calculating the effectiveness of a search engine's relevance algorithm.
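A minimal sketch of graded DCG and its normalized form, assuming the common log2 discount (other discount variants exist) and grades like 0 = not relevant, 1 = somewhat relevant, 2 = very relevant:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each result's relevance grade is
    discounted by the log of its rank, so relevant hits near the top
    of the result list count for more than the same hits lower down."""
    return sum(rel / math.log2(rank + 2)  # rank is 0-based, hence rank + 2
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize against the ideal (best-first) ordering so scores are
    comparable across queries with different numbers of relevant docs."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0
```

A perfect ranking like `[2, 1, 0]` gets an nDCG of 1.0; putting the very-relevant article last, as in `[0, 1, 2]`, scores strictly lower, which is exactly the position sensitivity that a flat precision measure misses.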
There are a lot of excellent insights in this post, and even where we disagree it helps everyone clarify their thoughts on how to set up meaningful comparisons.
All the more reason to want the Open Relevance Project.