<?xml version='1.0' encoding='utf-8' ?>
<!--  If you are running a bot please visit this policy page outlining rules you must respect. http://www.livejournal.com/bots/  -->
<rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/' xmlns:media='http://search.yahoo.com/mrss/' xmlns:atom10='http://www.w3.org/2005/Atom'>
<channel>
  <title>SearchTools Blog</title>
  <link>http://searchtools.livejournal.com/</link>
  <description>SearchTools Blog - LiveJournal.com</description>
  <lastBuildDate>Tue, 09 Mar 2010 02:56:22 GMT</lastBuildDate>
  <generator>LiveJournal / LiveJournal.com</generator>
  <lj:journal>searchtools</lj:journal>
  <lj:journalid>1461002</lj:journalid>
  <lj:journaltype>personal</lj:journaltype>
  <atom10:link rel='hub' href='http://pubsubhubbub.appspot.com/' />
  <image>
    <url>http://l-userpic.livejournal.com/90795156/1461002</url>
    <title>SearchTools Blog</title>
    <link>http://searchtools.livejournal.com/</link>
    <width>16</width>
    <height>16</height>
  </image>

<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/91002.html</guid>
  <pubDate>Tue, 09 Mar 2010 02:56:22 GMT</pubDate>
  <title>The UK Web Archive</title>
  <link>http://searchtools.livejournal.com/91002.html</link>
  <description>The British Library and IBM are working together on the &lt;a href=&quot;http://www.webarchive.org.uk&quot;&gt;UK Web Archive&lt;/a&gt;, which will store all accessible UK web pages, providing researchers with a great data source on British academia, opinions and popular culture that may change radically or disappear without notice. &lt;br /&gt;&lt;br /&gt;IBM is providing software expertise, and using it as a testbed for text-mining Big Data, estimating that it will be 220 terabytes per year as of 2011. &lt;a href=&quot;http://www.ibm.com/software/ebusiness/jstart/bigsheets/index.html&quot;&gt;BigSheets&lt;/a&gt; (presumably a pun on Google&apos;s Bigtable) includes both open and closed source software.  They have shown various interfaces including spreadsheets, tag clouds, and multi-bubble charts.&lt;br /&gt;&lt;br /&gt;I wrote an article about it for InfoToday: &lt;a href=&quot;http://newsbreaks.infotoday.com/NewsBreaks/British-Library-and-IBM-Team-Up-on-Web-Archiving-Project-65787.asp&quot;&gt;British Library and IBM Team Up on Web Archiving Project&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;em&gt;Some of my thoughts that didn&apos;t make it into the article:&lt;br /&gt;&lt;/em&gt;&lt;br /&gt;The British Library is a Legal Depository, holding one copy of each book and book-like object that&apos;s published in the UK and Ireland.  In the past, their archive was limited to sites whose owners they managed to find and who gave permission to copy: about six thousand sites, including those of companies which no longer have an independent existence, such as the Woolworth&apos;s site, which has since been removed from the web. Obviously web search engines and the &lt;a href=&quot;http://archive.org&quot;&gt;Internet Archive&lt;/a&gt; have taken a different approach.  
But while the &lt;a href=&quot;http://www.opsi.gov.uk/acts/acts2003/ukpga_20030028_en_1&quot;&gt;UK Legal Deposit Libraries Act&lt;/a&gt; of 2003 seems to give them permission, it hasn&apos;t yet been implemented, and they&apos;ve been in legal limbo.  The announcement seems to be a way to pressure their parent Department for Culture, Media and Sport to implement the new rules as soon as possible.  For more details, see the Wired UK article: &lt;a href=&quot;http://www.wired.co.uk/news/archive/2010-03/05/archiving-britain%27s-web-the-legal-nightmare-explored.aspx&quot;&gt;Archiving Britain&apos;s web: The legal nightmare explored&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The UK Web Archive will honor the &lt;a href=&quot;http://robotstxt.org&quot;&gt;robots.txt convention&lt;/a&gt; and the &lt;a href=&quot;http://www.sims.berkeley.edu/research/conferences/aps/removal-policy.html&quot;&gt;Oakland Archive Policy&lt;/a&gt; developed by the Internet Archive and the UC Berkeley Information School.  &lt;br /&gt;&lt;br /&gt;Published responses to the Archive announcement have ranged from the positive: &lt;a href=&quot;http://technology.timesonline.co.uk/tol/news/tech_and_web/article7041527.ece&quot;&gt;British Library launches UK internet archive&lt;/a&gt; &lt;em&gt;The UK&apos;s national library has created a fascinating snapshot of the way Britons have been using the web since 2004&lt;/em&gt;  to the alarmist: &lt;a href=&quot;http://www.thestandard.com/news/2010/02/25/uk-web-archive-will-offer-just-1-websites-2011&quot;&gt;UK Web Archive will offer just 1% of websites by 2011&lt;/a&gt; to the negative: &lt;a href=&quot;http://www.theregister.co.uk/2010/02/25/british_library_web_gobble/&quot;&gt;British Library wants taxpayer to gobble the web&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Software mentioned by IBM relating to &lt;a href=&quot;http://www.ibm.com/software/ebusiness/jstart/bigsheets/index.html&quot;&gt;BigSheets&lt;/a&gt;: &lt;ul&gt;&lt;li&gt;&lt;a 
href=&quot;http://hadoop.apache.org&quot;&gt;Hadoop&lt;/a&gt; open source scalable data handling&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href=&quot;http://hadoop.apache.org/pig&quot;&gt;Pig Latin&lt;/a&gt; open source query language for Hadoop&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href=&quot;http://lucene.apache.org/nutch&quot;&gt;Nutch&lt;/a&gt; open source web crawler&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href=&quot;http://www.opencalais.com/&quot;&gt;Open Calais&lt;/a&gt; - not open source, but freely available from Thomson Reuters&lt;/li&gt;&lt;br /&gt;&lt;li&gt;IBM &lt;a href=&quot;http://www.ibm.com/software/data/infosphere/&quot;&gt;InfoSphere&lt;/a&gt; for classification&lt;/li&gt;&lt;br /&gt;&lt;li&gt;IBM &lt;a href=&quot;http://manyeyes.alphaworks.ibm.com/manyeyes/&quot;&gt;ManyEyes&lt;/a&gt; for visualization&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;There are also some completely unsourced claims that &lt;strong&gt;&apos;the average life expectancy of a website was just 44 to 75 days, and suggested that at least 10% of all were either lost or replaced by new material every six months&apos;&lt;/strong&gt;.  I have some leads on where this information came from, and it looks quite old, as in possibly from 1998.  Does anyone out there have actual research data?&lt;br /&gt;</description>
  <comments>http://searchtools.livejournal.com/91002.html</comments>
  <lj:mood> </lj:mood>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/90776.html</guid>
  <pubDate>Thu, 25 Feb 2010 20:16:27 GMT</pubDate>
  <title>Indexing web pages with Solr, like magic</title>
  <link>http://searchtools.livejournal.com/90776.html</link>
  <description>Once upon a time, not so very long ago, the only people who could get Solr to index HTML and plain text files were wizards skilled in the ways of compiler compatibility, library dependencies, and the dreaded &lt;strong&gt;make&lt;/strong&gt;.  &lt;br /&gt;&lt;br /&gt;Fear no more!  As if by magic, it&apos;s now quite simple.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Firstly, the Solr project has incorporated the &lt;a href=&quot;http://lucene.apache.org/tika/&quot;&gt;Apache Tika&lt;/a&gt; code, which can open and read text from most popular file formats, including plain text, XML, PDF, Microsoft Office formats, and even HTML.  &lt;a href=&quot;http://lucene.apache.org/tika/&quot;&gt;Tika&lt;/a&gt; can do more (it&apos;s a content analysis toolkit), but for Solr purposes, it&apos;s the opening and reading that matters.  &lt;br /&gt;&lt;br /&gt;The Solr interface for Tika is &lt;strong&gt;Solr Cell &lt;/strong&gt;(in the source code, &lt;a href=&quot;http://wiki.apache.org/solr/ExtractingRequestHandler&quot;&gt;ExtractingRequestHandler&lt;/a&gt;), and it just works.  You call Solr update with the RESTful path /extract added, give it a file name and a few parameters, and zing!  It&apos;s indexed.  If you use a corresponding schema, not only is the text indexed, but internal metadata (like the title tag) and external metadata (like file name and size) are also stored as fields which can be indexed, stored, and searched. &lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Secondly, I have written a tutorial on exactly how to use Solr Cell to index text and HTML files, using the curl command line utility.  
It walks through these steps:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;changing the example LucidWorks or Solr Tutorial schema to store full text&lt;br /&gt;&lt;li&gt;test-indexing a local XML file&lt;br /&gt;&lt;li&gt;indexing a local text file&lt;br /&gt;&lt;li&gt;indexing all text files in a folder&lt;br /&gt;&lt;li&gt;indexing a local HTML file&lt;br /&gt;&lt;li&gt;indexing a remote HTML file as a web page&lt;br /&gt;&lt;br /&gt;and has lots of suggestions on where to go next.  &lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;The tutorial is free to everyone, thanks to Lucid Imagination, who paid me to write it.  It is on their site in the &lt;a href=&quot;http://www.lucidimagination.com/solutions/documents&quot;&gt;solutions&lt;/a&gt; section as &lt;a href=&quot;http://www.lucidimagination.com/solutions/whitepapers/Indexing-Text-and-HTML-Files-with-Solr&quot;&gt;Indexing Text and HTML Files with Solr (registration required)&lt;/a&gt;.   And thanks to the Solr Cell committers, who made everything so much easier than before.</description>
  <comments>http://searchtools.livejournal.com/90776.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>1</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/90478.html</guid>
  <pubDate>Mon, 08 Feb 2010 20:13:18 GMT</pubDate>
  <title>Interfaces for Search No-Matches Pages: a Pick and a Pan</title>
  <link>http://searchtools.livejournal.com/90478.html</link>
  <description>On your site, intranet or enterprise search engine, what happens if the search engine finds no match for the search terms?   &lt;br /&gt;&lt;br /&gt;Below the cut are two different approaches, one slightly verbose and the other so terse as to be baffling.  Look at them, look at yours, look at my page on &lt;a href=&quot;http://www.searchtools.com/guide/nomatches.html&quot;&gt;good things to do with the no matches page&lt;/a&gt;, and see if there&apos;s something you can do better.&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;Good: Do This!&lt;/strong&gt; &lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.searchtools.com/analysis/images/ebert_explains.png&quot;&gt;&lt;img src=&quot;http://www.searchtools.com/analysis/images/ebert_explains_small.png&quot; alt=&quot;note on results as transcribed below&quot; border=&quot;1&quot;&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I like the way the explanation is conversational without being too informal:&lt;br /&gt;&lt;blockquote&gt;&quot;If you were searching for a movie title and found no results, it is either because Roger Ebert never reviewed the film (the archive is from 1967- present, plus classic &quot;Great Movies&quot;), or the review itself hasn&apos;t been formatted and imported into our database yet. We&apos;re working on it.  The search engine first looks for titles and names, then for any matches in the text. So, any seemingly incongruous results that appear above have your search word(s) in the body of the article.&quot;  &lt;/blockquote&gt;I would advise the site designer to make the font size larger, break the text into lines or bullet points, and put it right in the middle.  Being so small makes it look a little like boilerplate.  
&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Bad: Don&apos;t Do This!&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.searchtools.com/analysis/images/salon_is_blank.png&quot;&gt;&lt;img src=&quot;http://www.searchtools.com/analysis/images/salon_is_blank_small.png&quot; alt=&quot;nearly blank page for results&quot; border=&quot;1&quot;&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There&apos;s nothing but a big white area and the uninformative text: &quot;Please report any problems to ssearch-help@salon.com&quot;.  It&apos;s so bad as to be user-hostile.  Here are likely user responses to this interface:&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Why am I on this page?&lt;/strong&gt; It doesn&apos;t even say 0 pages found, so this is a real dead end. &lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Where is the word I searched on?&lt;/strong&gt; How can I continue if I can&apos;t even tell what I did wrong?&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Where is the search field?&lt;/strong&gt; Way up on the right, with no hints about how to search more effectively.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Did I break the search engine?  Did I break the site?&lt;/strong&gt;  The page worries me, all that blank white.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;What does &quot;report any problems&quot; mean?&lt;/strong&gt;  This really is useless. There should be several blocks of helpful text and perhaps a feedback form.&lt;br /&gt;&lt;br /&gt;Readers: if you have any good or bad examples, or &quot;before&quot; and &quot;after&quot; screenshots, link me to them, please!   I&apos;ll post the best ones, by which I mean both good helpful interfaces and really awful ones.</description>
  <comments>http://searchtools.livejournal.com/90478.html</comments>
  <category>user experience</category>
  <category>search ui</category>
  <lj:mood>amused</lj:mood>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/90353.html</guid>
  <pubDate>Tue, 17 Nov 2009 01:31:32 GMT</pubDate>
  <title>Fundamentals of Enterprise Search workshop slides</title>
  <link>http://searchtools.livejournal.com/90353.html</link>
  <description>The workshop went really well, with interesting discussions and questions.&lt;br /&gt;&lt;br /&gt;darwinco did an excellent near-real-time &lt;a href=&quot;http://www.typepad.com/services/trackback/6a0120a4c562d6970b012875a9b7&quot;&gt;blog entry&lt;/a&gt; on the first half.&lt;br /&gt;&lt;br /&gt;The Fundamentals of Search presentation is in HTML and posted at &lt;a href=&quot;http://searchtools.com/slides/kmw09/fundamentals_of_search.html&quot;&gt;searchtools.com/slides/kmw09/&lt;/a&gt;.  &lt;br /&gt;  &lt;br /&gt;This time, I forced myself to upload right away, even though the exported HTML is really ugly: another reason I hate PowerPoint.  Also, if the file has dashes in the name, it&apos;s incompatible with Safari.   That was very annoying to debug.  Is there any way to export nicely? &lt;br /&gt;&lt;br /&gt;The S5 HTML slide thing was clunky and slow, especially for really long presentations, and the CSS was too finicky for me.   I was wrongly annoyed with Apple Keynote because it didn&apos;t do XML (which seems so perfect for presentations), but between the announcement (when I asked) and the release, they seem to have used it as the base format.  Guess I&apos;ll look at Keynote again.</description>
  <comments>http://searchtools.livejournal.com/90353.html</comments>
  <lj:mood>accomplished</lj:mood>
  <lj:security>public</lj:security>
  <lj:reply-count>3</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/90010.html</guid>
  <pubDate>Sat, 14 Nov 2009 00:46:07 GMT</pubDate>
  <title>Enterprise Search Summit West: Nov. 16-19, 2009</title>
  <link>http://searchtools.livejournal.com/90010.html</link>
  <description>The &lt;a href=&quot;http://enterprisesearchsummit.com/west2009/&quot;&gt;Enterprise Search Summit&lt;/a&gt; meetings (West and East) are always great, and very productive for me.  The presentations are less vendor-brainwashing and more valuable insights and case studies.  I learn a lot from those exhibitors who have programmers and product managers for me to talk with, the group lunches, and the hallway conversations.  &lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;https://secure.infotoday.com/forms/default.aspx?form=esswest&amp;amp;priority=ESC8&quot;&gt; ESS registration&lt;/a&gt; is still open. The expo passes are free when you register online ($25 onsite). Online registration will be closing this weekend, but you can register on site at the &lt;a href=&quot;http://sanjose.org/meetings/facilities/convention.php&quot;&gt;San Jose convention center&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;My workshop, &lt;a href=&quot;http://bit.ly/ESSFES&quot;&gt;Fundamentals of Enterprise Search&lt;/a&gt;, is on Mon. Nov. 16, 9 am - 1 pm.  If you can&apos;t make that time, please contact me for possible webinars and corporate training.&lt;br /&gt;  &lt;br /&gt;If you&apos;d like to schedule a meeting, my open times are on Monday afternoon, Tuesday late morning, and Wednesday morning: please email or comment, and I will get back to you as soon as I can.&lt;br /&gt;&lt;br /&gt;Keep an eye on this blog, &lt;a href=&quot;http://twitter.com/searchtools_avi&quot;&gt;Twitter&lt;/a&gt;, or &lt;a href=&quot;http://www.linkedin.com/in/avirr&quot;&gt;LinkedIn&lt;/a&gt;: I will try to share the best and most exciting parts.</description>
  <comments>http://searchtools.livejournal.com/90010.html</comments>
  <lj:mood>excited</lj:mood>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/89756.html</guid>
  <pubDate>Fri, 13 Nov 2009 23:50:52 GMT</pubDate>
  <title>Apache Solr 1.4 and ApacheCon Meetup notes</title>
  <link>http://searchtools.livejournal.com/89756.html</link>
  <description>Solr is a leading open-source enterprise search package: this new version is built on Apache Lucene 2.9.1 and runs on a Java 1.5 VM.&lt;br /&gt;&lt;br /&gt;New features include speed improvements everywhere, easier replication for scaling search to millions, better database, HTML and Office document compatibility, dynamic clustering of search results, improvements in faceting, much faster numeric range searching and external modules for de-duplication, auto-suggest (aka autocomplete),  statistics, and more.&lt;br /&gt;&lt;br /&gt;For details, see the &lt;a href=&quot;http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.4.0/CHANGES.txt&quot;&gt;Solr 1.4 release notes&lt;/a&gt;, the information from Lucid Imagination: &lt;a href=&quot;http://www.lucidimagination.com/whitepaper/whats-new-in-solr-1-4&quot;&gt;What&apos;s New in Apache Solr 1.4&lt;/a&gt;, and Erik Hatcher&apos;s &lt;a href=&quot;http://forms.lucidimagination.com/go/lucidimagination/webinar-solr14&quot;&gt;recorded webinar on Solr 1.4&lt;/a&gt; (registration required).&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;There was also an exciting Lucene meetup at ApacheCon, and I&apos;m reproducing my live-twitters for posterity.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;informal polls: over half the people have more than a million items for searching, a couple have &amp;gt; 500 million &lt;/li&gt;&lt;br /&gt;&lt;li&gt;Erik Hatcher showing #blacklight - open source library catalog + electronic resources + user interaction&lt;/li&gt;&lt;br /&gt;&lt;li&gt;AJAX front end for Solr - &lt;a href=&quot;http://evolvingweb.github.com/ajax-solr/&quot;&gt;ajax-solr&lt;/a&gt; by evolvingweb.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;TrieRange numeric and range searching (&lt;a href=&quot;http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/&quot;&gt;now in Solr 1.4&lt;/a&gt;) - example in GIS at &lt;a 
href=&quot;http://pangea.de&quot;&gt;pangea.de&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href=&quot;http://sourceforge.net/projects/katta/&quot;&gt;Katta&lt;/a&gt; project: Lucene distributed search, shards, failover, node selection, adds IDF to relevance ranking.  It works with Hadoop, can push index updates to distributed servers every two minutes (near-real-time!) and they like the &lt;a href=&quot;http://hadoop.apache.org/zookeeper/&quot;&gt;Zookeeper&lt;/a&gt; centralized dispatcher software a lot.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href=&quot;http://code.google.com/p/zoie/&quot;&gt;Zoie&lt;/a&gt; - realtime search index management for Lucene, in-memory workspace for deletes, 15,000 index items per minute; queries: 206,000 per minute.  Developed by the LinkedIn search team &lt;/li&gt;&lt;br /&gt;&lt;li&gt;Lightning talk: ideas about stopword-like tiered indexes, store only frequently used terms in memory index, go to disk for the rest.  &lt;a href=&quot;http://issues.apache.org/jira/browse/LUCENE-1812&quot;&gt;Carmel pruning&lt;/a&gt; &lt;/li&gt;&lt;br /&gt;&lt;li&gt;Lightning talk: Lucy is the Lucene port in object-oriented C (with object overrides), it supports other languages (Perl, Ruby) for modular plug-ins, tracking Lucene fairly quickly.&lt;/li&gt; &lt;br /&gt;&lt;li&gt;Lightning talk: &lt;a href=&quot;http://typo3.com&quot;&gt;Typo3&lt;/a&gt; (open source content management system) uses Solr&lt;/li&gt;  &lt;br /&gt;&lt;li&gt;Mentioned: &lt;a href=&quot;http://devzone.zend.com/article/11024-Announcing-the-Apache-Solr-extension-in-PHP&quot;&gt;Apache Solr extension in PHP&lt;/a&gt; and &lt;a href=&quot;http://pecl.php.net/package/solr&quot;&gt;Solr PHP package&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Thanks to &lt;a href=&quot;http://lucidimagination.com&quot;&gt;Lucid Imagination&lt;/a&gt; for sponsoring &amp; &lt;a href=&quot;http://screamsorbet.com/&quot;&gt; screamsorbet&lt;/a&gt; for the unexpected treat. &lt;br /&gt;</description>
  <comments>http://searchtools.livejournal.com/89756.html</comments>
  <lj:mood>tracking</lj:mood>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/88949.html</guid>
  <pubDate>Tue, 03 Nov 2009 03:20:46 GMT</pubDate>
  <title>Looking at real-time and near-real-time search</title>
  <link>http://searchtools.livejournal.com/88949.html</link>
  <description>&lt;strong&gt;Real-time vs. near-real-time search&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;In theory, &quot;real-time&quot; search indicates that the content is indexed and searchable at the same moment as it is posted and readable.  To do this, the publishing platform (usually a database) must trigger the search engine indexer at the same time as it formats the publication text.  The search engine then adds this to their searchable index, so that someone can click the search button and find the new information at virtually the same time as the text is being published online.  &lt;br /&gt;&lt;br /&gt;In practice, I have yet to find a system that doesn&apos;t have a lag of at least 30 seconds (including Twitter&apos;s internal search), and I call that &lt;em&gt;near-real-time search&lt;/em&gt;.  I don&apos;t think that&apos;s a bad thing, just that the labels we use should be accurate.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Stock tickers are real-time&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Stock traders had near-real-time streaming data for well over a century, first on telegraph and then tickertape.  Now they have actual real-time information: because the data is tiny, the system can send hundreds of thousands of prices per second. Obviously, no human can handle that, but the systems are all automated now, and currently can place orders in 1.5 milliseconds: they&apos;re working on reducing that by tenfold (&lt;a href=&quot;http://exchanges.nyse.com/archives/2009/02/prop_speed.php&quot;&gt;Universal Trading Platform&lt;/a&gt;).  That should be the benchmark for real-time search.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Real-time publishing vs. real-time search&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;Danny Sullivan of SearchEngineLand &lt;a href=&quot;http://searchengineland.com/what-is-real-time-search-definitions-players-22172&quot;&gt;defines real-time search&lt;/a&gt;  as &quot;looking through material that literally is published in real time&quot;.  
So Twitter, Flickr and similar systems are real-time, because they require little thought, but blogs and news stories are not. Danny and I disagree on the noun there: he defines it as search of real-time content, where I define it as search with an imperceptible time lag from publication.  Breaking news, like the San Francisco Bay Bridge re-opening today, can come from anywhere: as far as I can tell, the &lt;a href=&quot;http://baybridgeinfo.org&quot;&gt;Bay Bridge Info site&lt;/a&gt; had it first.  &lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Thinking about near-real-time indexing, and date vs. relevance ranking&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;I have some more ideas about near-real-time indexing, and the specific challenges it creates for retrieval.  The Bay Bridge example is a really useful way to examine how relevance can fit into the picture.  I would very much like to hear from anyone who has experience with these problems.  Please comment or post links, enlighten me.</description>
  <comments>http://searchtools.livejournal.com/88949.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>11</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/88595.html</guid>
  <pubDate>Fri, 09 Oct 2009 02:30:56 GMT</pubDate>
  <title>open source search levels the playing field</title>
  <link>http://searchtools.livejournal.com/88595.html</link>
  <description>I wrote a &lt;a href=&quot;http://newsbreaks.infotoday.com/NewsBreaks/Lucene--and-the-Power-of-Open-Source-56497.asp&quot;&gt;NewsBreak&lt;/a&gt; for &lt;a href=&quot;http://infotoday.com&quot;&gt;infotoday.com&lt;/a&gt; about the new version of the open source search engine library &lt;a href=&quot;http://lucene.apache.org&quot;&gt;Lucene 2.9&lt;/a&gt; and associated projects.   I stepped back a bit to look at the whole thing rather than write a &lt;a href=&quot;http://searchtools.livejournal.com/88266.html&quot;&gt;feature summary&lt;/a&gt;, and I ended up with an even deeper appreciation of open source search engines in general and, in particular, the &lt;a href=&quot;http://lucene.apache.org&quot;&gt;Lucene&lt;/a&gt; family of search-related tools (language ports, &lt;a href=&quot;http://lucene.apache.org/solr/&quot;&gt;Solr&lt;/a&gt; search engine, &lt;a href=&quot;http://lucene.apache.org/nutch/&quot;&gt;Nutch&lt;/a&gt; web crawler, &lt;a href=&quot;http://lucene.apache.org/tika/&quot;&gt;Tika&lt;/a&gt; file format converter, and more).  They are as capable and powerful as many commercial enterprise search engines.&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;Open source search seems to have liberated a lot of libraries from their dependence on proprietary catalog systems, most of which did not offer relevance ranking or metadata faceting.   Even when the features are added, many libraries have no money for systems but can do amazing things as a community.  &lt;br /&gt;&lt;br /&gt;There are also ports of the core Lucene search code from Java to other programming languages, such as Ruby, Python, and C#.  And I found contributed modules for properly indexing human languages from Danish to Arabic, Russian, and Chinese.  
&lt;br /&gt;&lt;br /&gt;Startups can use open source search to scale up quickly -- Digg, LinkedIn, and Netflix all run Lucene core search, and all of these companies give back to the community.&lt;br /&gt;&lt;br /&gt;With open source, the whole is greater than the sum of its parts.&lt;br /&gt;&lt;h5&gt;Disclosure: This blog and the &lt;a href=&quot;http://www.searchtools.com&quot;&gt;SearchTools.com site&lt;/a&gt;  are free, ad-free, and not sponsored by anyone. Avi sometimes works with search vendors, but does not give them site visitor or survey personal information, or allow relationships with any vendors to change any product review or analysis. Current search vendor consulting client: LucidImagination; see also the &lt;a href=&quot;http://www.searchtools.com/about/consulting.html#disclosure&quot;&gt;list of search vendor clients&lt;/a&gt;.&lt;/h5&gt;&lt;br /&gt;&lt;br /&gt;Questions or ideas about open source search?  Please comment here.</description>
  <comments>http://searchtools.livejournal.com/88595.html</comments>
  <lj:mood>busy</lj:mood>
  <lj:security>public</lj:security>
  <lj:reply-count>5</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/88266.html</guid>
  <pubDate>Thu, 24 Sep 2009 19:07:10 GMT</pubDate>
  <title>Lucene 2.9 to be released very soon!</title>
  <link>http://searchtools.livejournal.com/88266.html</link>
  <description>&lt;p&gt;&lt;a href=&quot;http://lucene.apache.org&quot;&gt;Apache Lucene&lt;/a&gt; is the most prominent open source search engine, and powers search on a lot of really interesting sites.  The new version, 2.9, has internal improvements, re-factoring and new functionality.&lt;/p&gt;
&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;h3&gt;Lucene 2.9 Features&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;http://wiki.apache.org/lucene-java/NearRealtimeSearch&quot;&gt;&quot;Near real-time&quot; search&lt;/a&gt;&lt;/strong&gt;: new way to search the current in-memory segment  before the index has been written to disk.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;FieldCache&lt;/strong&gt; - takes advantage of the fact that most segments of the index are static and only processes the parts that change, saving time and memory.  Also improved efficiency.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;NumericField&lt;/strong&gt; and &lt;strong&gt;NumericRangeQuery&lt;/strong&gt; - (previously called TrieRange).  This improves the Lucene number indexing, and is faster for searching numbers, geo-locations, and dates, faster for sorting, and hugely faster for range searching.&lt;/li&gt;
&lt;li&gt;Faster &lt;strong&gt;wildcard&lt;/strong&gt; and prefix searching, and a reverse string filter to enable &lt;strong&gt;leading wildcards&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lucene Local&lt;/strong&gt; (Contrib / Spatial) -  can limit queries based on geographic location&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Faster searching over multiple segments&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Better and faster term vector highlighting &lt;/strong&gt;of match terms in context on results page.&lt;/li&gt;
&lt;li&gt;New &lt;strong&gt;Query Parser framework&lt;/strong&gt;, supports additional syntaxes&lt;/li&gt;
&lt;li&gt;Improvements to &lt;strong&gt;Payloads&lt;/strong&gt; (metadata about index terms)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TokenStream&lt;/strong&gt; strong typing options&lt;/li&gt;
&lt;li&gt; Improved &lt;strong&gt;transaction&lt;/strong&gt; processing&lt;/li&gt;
&lt;li&gt; Better &lt;strong&gt;Chinese, Arabic, and Persian&lt;/strong&gt; support&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Backward and Frontward Compatibility&lt;/h3&gt;
&lt;p&gt;There are significant changes in version 2.9 - described in the changes.txt file or the web site (&lt;a href=&quot;http://people.apache.org/~markrmiller/staging-area/lucene2.9changes/Changes.html&quot;&gt;change log&lt;/a&gt;).  A very few items are not backward compatible and several classes are deprecated.  &lt;/p&gt;

&lt;p&gt;All applications should re-compile against the new Lucene 2.9 JAR and test carefully. Version 3.0 will no longer support Java 1.4 or deprecated classes. &lt;/p&gt;

&lt;p&gt;As soon as Lucene 2.9 is released, &lt;a href=&quot;http://wiki.apache.org/nutch/ClusteringPlugin&quot;&gt;Carrot2 3.1.0&lt;/a&gt; will come out with bug fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solr 1.4&lt;/strong&gt; will use the Lucene 2.9 JAR and is coming soon: in a few weeks, they hope.&lt;/p&gt;

&lt;p&gt;Note: this content was extracted painfully by Avi from the Lucene site/wiki/JIRA/mailing list archive, and clarified by Grant Ingersoll&apos;s webcast sponsored by &lt;a href=&quot;http://www.lucidimagination.com/&quot;&gt;Lucid Imagination&lt;/a&gt;.  I will be happy to fix mistakes and clarify confusion: just comment or send a message.&lt;/p&gt;
</description>
  <comments>http://searchtools.livejournal.com/88266.html</comments>
  <lj:mood>encouraged</lj:mood>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/87866.html</guid>
  <pubDate>Thu, 24 Sep 2009 00:23:32 GMT</pubDate>
  <title>good introduction to search analytics article</title>
  <link>http://searchtools.livejournal.com/87866.html</link>
  <description>Very clear and well-written: &lt;a href=&quot;http://www.alistapart.com/articles/internal-site-search-analysis-simple-effective-life-altering/&quot;&gt;Internal Site Search Analysis: Simple, Effective, Life Altering!&lt;/a&gt;.  It covers both search on a site, and search from search engines leading to a site, with very useful examples and screenshots from Google Analytics.</description>
  <comments>http://searchtools.livejournal.com/87866.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/87726.html</guid>
  <pubDate>Thu, 03 Sep 2009 20:36:22 GMT</pubDate>
  <title>Enterprise Search Summit &amp; Infonortics Search Engines Meeting news</title>
  <link>http://searchtools.livejournal.com/87726.html</link>
  <description>&lt;a href=&quot;http://infotoday.com&quot;&gt;Information Today, Inc&lt;/a&gt; (producer of the &lt;a href=&quot;http://enterprisesearchsummit.com&quot;&gt;Enterprise Search Summit&lt;/a&gt;), has announced that it&apos;s acquiring  the &lt;a href=&quot;http://www.infonortics.com/searchengines/index.html&quot;&gt;Infonortics Search Engines Meeting&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The Infonortics conference was great, starting back in the late-90s.  It one of the places where old-line Information Retrieval people and new Web Search people and even site search people could mingle.  I learned a lot from everything there, including some work on entity extraction and text mining that&apos;s only now in general use, (and some that never went anywhere: 3D immersive search results visualization, anyone?)&lt;br /&gt;&lt;br /&gt;In most years, the Infonortics meeting was in Boston in April, while the Enterprise Search Summit was in New York in May: my travel budget and family could only handle one of those, and Nancy Garman let me help with ESS, so I chose that one.&lt;br /&gt;&lt;br /&gt;The &lt;a href=&quot;http://www.enterprisesearchcenter.com/Articles/ReadArticle.aspx?ArticleID=55931&quot;&gt;announcement&lt;/a&gt; doesn&apos;t have details, and I do hope the Search Engines Meeting, with its more theoretical flavor continues, or perhaps a track of it is incorporated into the Enterprise Search Summit.</description>
  <comments>http://searchtools.livejournal.com/87726.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/87513.html</guid>
  <pubDate>Fri, 24 Jul 2009 22:36:30 GMT</pubDate>
  <title>Google CSE - search and results interfaces</title>
  <link>http://searchtools.livejournal.com/87513.html</link>
  <description>&lt;p&gt;There are four kinds of Google Custom Search Engine search / results interface:
&lt;ul&gt;&lt;li&gt;Simple form, showing results on a normal Google-hosted page with minimal customization (&lt;a href=&quot;http://searchtools.com/analysis/gsce-compare-interfaces.html&quot; target=&quot;_top&quot;&gt;example&lt;/a&gt;)&lt;/li&gt; &lt;br /&gt;
&lt;li&gt;Form with links to a template page, JavaScript inserts &lt;strong&gt;iframe&lt;/strong&gt; with search results pre-formatted (&lt;a href=&quot;http://searchtools.com/analysis/gsce-compare-interfaces.html#iframe&quot; target=&quot;_top&quot;&gt;iframe example&lt;/a&gt;). Fits into site colors, design and navigation, but has minimal other results customization, including the width of the results list box.&lt;/li&gt;&lt;br /&gt;
&lt;li&gt;Custom Search Element - AJAX object draws a search form, JavaScript can draw result list anywhere (&lt;a href=&quot;http://searchtools.com/analysis/gsce-compare-interfaces.html#ajax&quot; target=&quot;_top&quot;&gt;AJAX example&lt;/a&gt;). This is the new cool programmatic toy, and &lt;a href=&quot;http://www.searchtools.com/analysis/google-cse-ajax-css.html&quot; target=&quot;_top&quot;&gt;with CSS&lt;/a&gt;, it&apos;s very flexible for customizing width, colors, sizes and styles and more.&lt;/li&gt;
&lt;br /&gt;
&lt;li&gt;XML query and result protocol (paid Site Search only) is by far the most comprehensive and flexible. See the &lt;a href=&quot;http://www.google.com/coop/docs/cse/resultsxml.html&quot; target=&quot;_top&quot;&gt;XML Protocol Reference&lt;/a&gt; for extensive documentation.&lt;/li&gt; 
&lt;/ul&gt;

If you have any questions, suggestions or corrections, please comment on this post (non-account comments allowed but screened), send email to nets 9 at searchtools.com, or contact me through the &lt;a href=&quot;http://www.searchtools.com/site/contact.html&quot; target=&quot;_top&quot;&gt;site contact form.&lt;/a&gt;
</description>
  <comments>http://searchtools.livejournal.com/87513.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>6</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/87190.html</guid>
  <pubDate>Thu, 09 Jul 2009 00:14:52 GMT</pubDate>
  <title>Netflix Recommender Prize Won (probably)</title>
  <link>http://searchtools.livejournal.com/87190.html</link>
  <description>Netflix has posted a &lt;a href=&quot;http://www.netflixprize.com/rules&quot;&gt;contest for improving its own movie recommendation system by at least 10%&lt;/a&gt;, and the prize is a million dollars. People have been working on this since 2006, and there have been several Progress prizes.  Finally, four teams merged to create the &lt;a href=&quot;http://www.research.att.com/~volinsky/netflix/bpc.html&quot;&gt;BellKor&apos;s Pragmatic Chaos Team&lt;/a&gt;, which added temporal dynamics to the recommendation weights.  This beat the current Netflix recommendation algorithm by 10.0%.&lt;br /&gt;&lt;br /&gt;A &lt;a href=&quot;http://cacm.acm.org/news/32450-award-winning-paper-reveals-key-to-netflix-prize/fulltext&quot;&gt;press release&lt;/a&gt; from the &lt;a href=&quot;http://www.sigkdd.org/kdd2009/&quot;&gt;15th ACM SIGKDD (Conference on Knowledge Discovery and Data Mining)&lt;/a&gt; explains,  &apos;While movies themselves stay the same, the humans who rate them are anything but static. As [Yehuda] Koren puts it, &quot;The way I rate movies today can be very different from how I rate them even tomorrow.&quot;&apos;&lt;br /&gt;&lt;br /&gt;However, as per the rules and a &lt;a href=&quot;http://www.netflixprize.com//community/viewtopic.php?id=1443&quot;&gt;note in the Netflix prize&lt;/a&gt; forum, the other teams have thirty days, until July 26, 2009, to submit their own solutions and be considered for the prize.  That makes sense, because the others may have been refining algorithms and testing slowly.</description>
  <comments>http://searchtools.livejournal.com/87190.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/86836.html</guid>
  <pubDate>Wed, 08 Jul 2009 00:09:32 GMT</pubDate>
  <title>thoughts on search engine comparisons</title>
  <link>http://searchtools.livejournal.com/86836.html</link>
  <description>Vik Singh wrote an in-depth post about his &lt;a href=&quot;http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/&quot;&gt;comparison of open-source search engines&lt;/a&gt;.  He  tested the default configurations for Lucene, zettair, sphinx, and Xapian, with a nod to sqlite.  &lt;br /&gt;&lt;br /&gt;In the feedback section, there are some interesting comments, and several experts on various open-source search engines pointing out that it&apos;s a bit odd to throw default settings at specialized content and expect to have a robust comparison. Otis Gospodnetić (one of the Lucene/Solr core developers) has an answer in:  &lt;a href=&quot;http://www.jroller.com/otis/entry/open_source_search_engine_benchmark&quot;&gt;Open Source Search Benchmark&lt;/a&gt; and Charlie Hull posted &lt;a href=&quot;http://www.flax.co.uk/blog/2009/07/07/xapian-compared/&quot;&gt;Xapian compared&lt;/a&gt; in response. &lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;h4&gt;Indexing Metrics&lt;/h4&gt;The first corpus is approximately a million tweets, which are tiny: they are at the &quot;very short&quot; edge of any spectrum of content items.  The table of metrics focuses on indexing speed, memory and disc size requirements.   &lt;br /&gt;&lt;br /&gt;The other source is about 200,000 journal article metadata items (about 300MB) from &lt;a href=&quot;http://trec.nist.gov/data/t9_filtering.html&quot;&gt;TREC-9 filtering track&lt;/a&gt;.  It covers indexing memory requirements, speed, and size, along with search memory requirements, time, and relevancy.   &lt;br /&gt;&lt;br /&gt;I do not understand why Vik is so focused on index size.  I&apos;ve found that it&apos;s much better to take more room for an enriched index, with gentle stemming, including stopwords, alternates for accented characters and ambiguous-term punctuation, position and field metadata, etc.    
Disk space and memory are so much cheaper these days, and a totally stripped-down index makes many search features impossible to implement.&lt;br /&gt;&lt;br /&gt;Erlend Strømsvik, in the comments to this post, pointed out the importance of index add/delete, and to me that&apos;s where indexing speed and disc footprint might be a more significant issue.  Now we&apos;re all looking at near-real-time inserts, which may or may not be compatible with the basic architecture of a search engine.&lt;br /&gt;&lt;br /&gt;Using tweets and medical metadata and abstracts (which are by nature intensively hand-crafted) seems a bit limited; I&apos;d like to see a more heterogeneous corpus, including long html, ugly html, broken html, random office documents, etc.  My default for gathering this kind of thing is the US Federal government, which has no copyright as such.  &lt;br /&gt;&lt;br /&gt;In addition to out-of-the-box settings, it would be very interesting to see comparisons for lightly-tuned search engines, with maybe no more than 20 or 30 configuration line changes.  It&apos;s not a matter of fairness, it&apos;s just a more valuable comparison, starting from about the same place.&lt;br /&gt;&lt;br /&gt;&lt;a name=&quot;cutid2&quot;&gt;&lt;/a&gt;&lt;h4&gt;Relevance Evaluation&lt;/h4&gt;This test uses the TREC corpus, which includes 63 sets of query terms and, for each article, a judgment as to whether the article is very relevant, somewhat relevant, or not relevant to that query. Including a middle value reflects the ambiguous nature of search: for a lot of queries, there is no binary yes-or-no answer.  Some matched items are less relevant than others, but they are not irrelevant.  The &lt;a href=&quot;http://en.wikipedia.org/wiki/Discounted_Cumulative_Gain&quot;&gt;DCG (discounted cumulative gain)&lt;/a&gt; seems to be a good way to include the position in the search results when calculating the effectiveness of a search engine relevance algorithm.  
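&lt;br /&gt;&lt;br /&gt;As a rough sketch of the idea (my own illustration, not code from Vik&apos;s post): each result&apos;s graded judgment is divided by the log of its rank, so relevant items count for more near the top of the list. The function names and the sample judgments (2 = very relevant, 1 = somewhat relevant, 0 = not relevant) are made up for the example, and this is the common gain / log2(rank + 1) form; other variants exist.

```python
import math


def dcg(gains):
    """Discounted cumulative gain for a ranked list of graded judgments."""
    # Rank 1 is divided by log2(2) = 1, rank 2 by log2(3), and so on,
    # so the same judgment is worth less the further down it appears.
    return sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the best possible ordering of the same judgments."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0


# Hypothetical judgments for the top four results of two engines on one query.
engine_a = [2, 1, 0, 2]   # very relevant item ranked first
engine_b = [0, 2, 2, 1]   # irrelevant item ranked first

print("engine A:", round(dcg(engine_a), 3), "normalized:", round(ndcg(engine_a), 3))
print("engine B:", round(dcg(engine_b), 3), "normalized:", round(ndcg(engine_b), 3))
```

Both made-up lists contain two very relevant items, but engine A scores higher because it puts one of them in first position. Normalizing by the ideal ordering makes scores comparable across queries with different numbers of relevant items.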
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;There are a lot of excellent insights in this post, and even where we disagree it helps everyone clarify their thoughts on how to set up meaningful comparisons.&lt;br /&gt;&lt;br /&gt;All the more reason to want the &lt;a href=&quot;http://lucene.apache.org/openrelevance/mail&quot;&gt;Open Relevance Project &lt;/a&gt; to succeed.</description>
  <comments>http://searchtools.livejournal.com/86836.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/86572.html</guid>
  <pubDate>Fri, 26 Jun 2009 03:46:17 GMT</pubDate>
  <title>Recommended Book: Search User Interfaces</title>
  <link>http://searchtools.livejournal.com/86572.html</link>
  <description>Marti Hearst, Search User Interfaces, 1st ed.    Cambridge University Press, September 2009. [Online: &lt;a href=&quot;http://www.searchuserinterfaces.com&quot;&gt;www.searchuserinterfaces.com&lt;/a&gt;]&lt;br /&gt;&lt;br /&gt;This is a clear and thoughtful discussion of many aspects of Search User Interfaces. It thoroughly synthesizes most of the recent research in the field with fluency and insight.   And it doesn&apos;t hurt that she (and the evidence) agree with me about personalization and visualization of search results, both of which I find over-hyped and low on ROI.  The best kind of academic work: rigorous and very useful at the same time.&lt;br /&gt;&lt;br /&gt;Chapters&lt;br /&gt;&lt;br /&gt;# 1: Design of Search User Interfaces&lt;br /&gt;# 2: Evaluation of Search User Interfaces&lt;br /&gt;# 3: Models of the Information Seeking Process&lt;br /&gt;# 4: Query Specification&lt;br /&gt;# 5: Presentation of Search Results&lt;br /&gt;# 6: Query Reformulation&lt;br /&gt;# 7: Supporting the Search Process&lt;br /&gt;# 8: Integrating Navigation with Search&lt;br /&gt;# 9: Personalization in Search&lt;br /&gt;# 10: Information Visualization for Search Interfaces&lt;br /&gt;# 11: Information Visualization for Text Analysis&lt;br /&gt;# 12: Emerging Trends in Search &lt;br /&gt;&lt;br /&gt;Marti Hearst is a professor at the &lt;a href=&quot;http://ischool.berkeley.edu&quot;&gt;ISchool, University of California, Berkeley&lt;/a&gt; (formerly the School of Library and Information Studies, my alma mater).   Ms. Hearst was the pioneer in demonstrating the quality of &lt;a href=&quot;http://www.searchtools.com/info/faceted-metadata.html&quot;&gt;faceted metadata search interfaces&lt;/a&gt; with her &lt;a href=&quot;http://flamenco.berkeley.edu&quot;&gt;Flamenco research&lt;/a&gt; in the early 2000s. &lt;br /&gt;&lt;br /&gt;additional keywords:  UI, UX, user experience, usability</description>
  <comments>http://searchtools.livejournal.com/86572.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/86521.html</guid>
  <pubDate>Thu, 25 Jun 2009 00:20:56 GMT</pubDate>
  <title>Google CSE AJAX API  - Using CSS on the search results</title>
  <link>http://searchtools.livejournal.com/86521.html</link>
  <description>&lt;h2 style=&quot;font-size:medium;&quot;&gt;CSS styling of AJAX search results&lt;/h2&gt;

&lt;p&gt;For customizing the fonts, colors, size, and styles of the Google Custom Search Engine (CSE), there is a many-layered hierarchy of div class names. For example, I found the right name to turn all the result item titles red (even the bold subsections).  I had to prefix the selector with the id of my search results div (cseDiv) for the browser to render it correctly:&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;font-size:small; font-family:monospace; margin-left:3em&quot;&gt;#cseDiv .gs-title * { color:#990033; }&lt;/span&gt;&lt;/p&gt;

&lt;h2 style=&quot;font-size:medium;&quot;&gt;Showing the Long URLs&lt;/h2&gt;
&lt;p align=&quot;left&quot; class=&quot;text&quot;&gt;The Google CSE has recently switched from displaying the full URL of search results, to just the host name, so &lt;span style=&quot;color:green&quot;&gt;www.searchtools.com/tools/google-service.html&lt;/span&gt; became simply &lt;span style=&quot;color:green;&quot;&gt;www.searchtools.com&lt;/span&gt;. Many people want to switch it back.&lt;/p&gt;
&lt;p align=&quot;left&quot; class=&quot;text&quot;&gt;CSS allows us to specify whether a section is visible (&lt;span style=&quot;font-size:small; font-family:monospace; margin-left::3em&quot;&gt;display:block&lt;/span&gt;) or invisible (&lt;span style=&quot;font-size:small; font-family:monospace; margin-left::3em&quot;&gt;display:none&lt;/span&gt;). The GwebResults has URLs in classes deep within the results section: take a look using the Safari 4 Web Inspector or Firebug sometime. To fix the display URL, I&apos;ve found only one selector that really works, using my style name from the results div (&lt;span style=&quot;font-size:small; font-family:monospace; margin-left::3em&quot;&gt;cseDiv&lt;/span&gt;):&lt;/p&gt;
	&lt;p style=&quot;font-size:small; font-family:monospace; margin-left:3em&quot;&gt;#cseDiv div.gs-visibleUrl.gs-visibleUrl-long { display:block;  }&lt;/p&gt;
	&lt;p style=&quot;font-size:small; font-family:monospace; margin-left:3em&quot;&gt;#cseDiv div.gs-visibleUrl.gs-visibleUrl-short { display:none;  }&lt;/p&gt;
&lt;p align=&quot;left&quot; class=&quot;text&quot;&gt;If you add that to the CSS section, you should be able to see the full path URLs in the search results.&lt;/p&gt;
&lt;p&gt;For more information, see my &lt;a href=&quot;http://www.searchtools.com/analysis/google-cse-ajax-api-analysis.html&quot;&gt;Analysis of the CSE and AJAX API&lt;/a&gt; and &lt;a href=&quot;http://www.searchtools.com/analysis/google-cse-ajax-basic-example.html&quot;&gt;Basic CSE AJAX sample code&lt;/a&gt;.  Or contact me directly, either by commenting on this post (messages will be screened) or by using the SearchTools.com &lt;a href=&quot;http://www.searchtools.com/site/contact.html&quot;&gt;contact page&lt;/a&gt;.</description>
  <comments>http://searchtools.livejournal.com/86521.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/86044.html</guid>
  <pubDate>Tue, 23 Jun 2009 03:50:55 GMT</pubDate>
  <title>Decoding the new Google Custom Search API</title>
  <link>http://searchtools.livejournal.com/86044.html</link>
  <description>&lt;p&gt;Google has released a new version of their Custom/Site Search service, and added an &amp;quot;Element&amp;quot; -- a wizard-driven JavaScript that non-technical users can copy and paste to their web sites, even blogs which do not allow uploading. Search Tools has a new &lt;a href=&quot;http://www.searchtools.com/analysis/google-cse-ajax-api-analysis.html&quot;&gt;Analysis of the CSE and AJAX API&lt;/a&gt;.  I also wrote a  &lt;a href=&quot;http://www.searchtools.com/analysis/google-cse-ajax-basic-example.html&quot;&gt;fully-commented sample code&lt;/a&gt; with a live version on the same page, because this is much harder for non-programmers to customize than the forms or even the Site Search XML interface (paid version only). I&apos;ll be doing more on customizing and functionality and display during this week.&lt;/p&gt;
				&lt;p&gt;Also coming soon, an updated version of my &lt;a href=&quot;http://www.searchtools.com/analysis/google-service-2007.html&quot;&gt;Google CSE review from 2007&lt;/a&gt;. New features include: limited on-demand indexing, Best Bets (promotions), synonyms, new interface for filters (refinements), localized to 40 languages and offering transliteration between  character sets. &lt;/p&gt;

&lt;p&gt;Please leave a comment if you have questions or feedback. Anonymous comments are screened for spam, but I do check them.&lt;/p&gt;</description>
  <comments>http://searchtools.livejournal.com/86044.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>3</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/85914.html</guid>
  <pubDate>Fri, 05 Jun 2009 20:50:17 GMT</pubDate>
  <title>Lucene/Solr meetup notes</title>
  <link>http://searchtools.livejournal.com/85914.html</link>
  <description>

&lt;h2&gt;Lucene/Solr Meetup, June 2009&lt;/h2&gt;
&lt;h3&gt;Notes by Avi Rappoport, Search Tools Consulting&lt;/h3&gt;
&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;&lt;h4&gt;Solr 1.4, Near-real-time indexing, Payload efficiency, Trierange, Query parser framework, Zevents, Xoopit, Lucid search, Stopwords are obsolete, OpenRelevance&lt;/h4&gt;

&lt;p&gt;&lt;a href=&quot;http://www.meetup.com/SFBay-Lucene-Solr-Meetup/calendar/10465433/&quot;&gt;Meetup event info&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;These are my sketchy notes; there was so much good stuff that I did not get it all.&lt;/p&gt;
&lt;p&gt;  -- Avi&lt;a href=&quot;http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos&quot;&gt;&lt;/a&gt;, June 5, 2009&lt;/p&gt;
&lt;p&gt;&lt;b&gt;Changes in upcoming Solr 1.4 (Grant Ingersoll)&lt;/b&gt;&lt;img src=&quot;http://www.searchtools.com/images/solr_FC.jpg&quot; alt=&quot;solr logo&quot; align=&quot;right&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt; a new logo (see right)		&lt;/li&gt;
	&lt;li&gt;new character filters (from Lucene 2.4) 	&lt;/li&gt;
	&lt;li&gt;faster faceting methods - FieldCache (from Lucene) 	&lt;/li&gt;
	&lt;li&gt;improved numeric range calculations (see TrieRange below) 	&lt;/li&gt;
	&lt;li&gt;Java-based replication with solr request handlers (&lt;a href=&quot;http://www.lucidimagination.com/blog/2009/05/31/solr-index-replication/&quot;&gt;see Lucid blog post&lt;/a&gt;) 	&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://wiki.apache.org/solr/StatsComponent&quot;&gt;StatsComponent&lt;/a&gt; - returns xml for each field 	&lt;/li&gt;
	&lt;li&gt;Term vector component- for proximity &amp;amp; other interesting stuff (&lt;a href=&quot;http://wiki.apache.org/solr/TermVectorComponent&quot;&gt;see Lucid blog post&lt;/a&gt;) 	&lt;/li&gt;
	&lt;li&gt;Duplicate detection during indexing - RemoveDuplicates Token 	&lt;/li&gt;
	&lt;li&gt;Better Arabic handling (from Lucene) 	&lt;/li&gt;
	&lt;li&gt;CharFilter - normalize chars before tokenizing, like a lightweight pipeline 	&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://wiki.apache.org/solr/ExtractingRequestHandler&quot;&gt;Solr Cell&lt;/a&gt; (a.k.a. Content Extracting Library, aka ExtractingRequestHandler) - wrapper around Tika 	&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://wiki.apache.org/solr/ClusteringComponent&quot;&gt;Clustering&lt;/a&gt; - grouping similar docs - jira framework for plugging modules
		&lt;ul&gt;
			&lt;li&gt;first implementation using carrot2: concerned with clustering short extracts of text from search results &lt;/li&gt;
		&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;Configure deletion policy (possibly from Lucene)&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://wiki.apache.org/solr/SolrJS&quot;&gt;SolrJS&lt;/a&gt; - JQuery parsing&lt;/li&gt;
	&lt;li&gt; &lt;a href=&quot;http://wiki.apache.org/solr/VelocityResponseWriter&quot;&gt;VelocityResponseWriter&lt;/a&gt; - hooks to Velocity templates for interface without middleware or app server&lt;/li&gt;
&lt;/ul&gt;
&lt;hr /&gt;
&lt;h4&gt;Near-real-time indexing - Jason Rutherglen &amp;amp; Jake Mannix (LinkedIn)&lt;/h4&gt;
&lt;ul&gt;
	&lt;li&gt;Historical model was batch indexing&lt;/li&gt;
	&lt;li&gt;For faster indexing, insert/delete several per second, need to worry about I/O efficiency&lt;/li&gt;
	&lt;li&gt;Relevance matters  (unlike, say, in Twitter Search)&lt;/li&gt;
	&lt;li&gt;Have contributed Lucene patches, especially for deletes and flushing
		&lt;ul&gt;
			&lt;li&gt;1313, 1483, 1231, 1292 were all that I caught&lt;/li&gt;
		&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://wiki.apache.org/lucene-java/NearRealtimeSearch&quot;&gt;NearRealTimeSearch - Lucene&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://wiki.apache.org/solr/RealtimeSearch&quot;&gt;RealTimeSearch - Solr&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;Other LinkedIn open source projects: &lt;a href=&quot;http://code.google.com/p/lucene-ext/&quot;&gt;Zoie&lt;/a&gt; (extensions to Lucene), &lt;a href=&quot;http://code.google.com/p/bobo-browse/&quot;&gt;Bobo&lt;/a&gt; (faceted/parametric search tuned for high performance), &lt;a href=&quot;http://project-voldemort.com/&quot;&gt;Voldemort&lt;/a&gt; (high performance distributed key-value storage)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Payload Efficiency - Michael Busch (IBM)&lt;/h4&gt;
&lt;ul&gt;
	&lt;li&gt;Current inverted index contains the
		&lt;ul&gt;
			&lt;li&gt;dictionary (list of words)&lt;/li&gt;
			&lt;li&gt;posting list (documents associated with each word)&lt;/li&gt;
		&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;&lt;a href=&quot;http://wiki.apache.org/lucene-java/Payload Planning&quot;&gt;Payloads&lt;/a&gt; - additional optional metadata for each term in the dictionary, for example position, as byte arrays &lt;/li&gt;
	&lt;li&gt;Current Lucene Store (document data) is slow, sequential&lt;/li&gt;
	&lt;li&gt;Payloads and column-stride fields could be much more efficient&lt;/li&gt;
	&lt;li&gt;other stuff I didn&apos;t take good notes on (sorry)&lt;/li&gt;
	&lt;li&gt;Check out the &lt;a href=&quot;http://www.jeremythomerson.com/blog/2008/11/05/apachecon-advanced-indexing-with-lucene-payloads/&quot;&gt;blog post from the ApacheCon about payloads&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;TrieRange&lt;/h4&gt;
&lt;ul&gt;
	&lt;li&gt; Supplements Lucene&apos;s basic string field type, orders of magnitude faster for range searching&lt;/li&gt;
	&lt;li&gt;came out of geo-searching&lt;/li&gt;
	&lt;li&gt;Field mapped to a numeric type: int, long, double, float&lt;/li&gt;
	&lt;li&gt;Code stores values at several layers of precision
		&lt;ul&gt;
				&lt;li&gt;example: year, month, day &lt;/li&gt;
		&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;Naive sorts also much easier.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Query Parser Framework - also IBM&lt;/h4&gt;
&lt;ul&gt;
	&lt;li&gt;Current process is hard to maintain and extend&lt;/li&gt;
	&lt;li&gt;Need to separate syntax and semantics&lt;/li&gt;
	&lt;li&gt;Can generate multiple languages&lt;/li&gt;
	&lt;li&gt;Looks like a pipeline (I may be missing something)&lt;/li&gt;
	&lt;li&gt;Text Parser - converts incoming string to a Query Node Tree
		&lt;ul&gt;
			&lt;li&gt;Iterates through the nodes before going to next parser&lt;/li&gt;
			&lt;li&gt;Can add new parsers in any order, e.g. validation, tokenization&lt;/li&gt;
		&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;Query Builder
		&lt;ul&gt;
			&lt;li&gt;Iterates through the nodes&lt;/li&gt;
			&lt;li&gt;Outputs Lucene or other query language&lt;/li&gt;
		&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;see &lt;a href=&quot;http://www.gossamer-threads.com/lists/lucene/java-dev/72468?do=post_view_flat#72468&quot;&gt;developer discussion&lt;/a&gt; &amp;amp; &lt;a href=&quot;https://issues.apache.org/jira/browse/LUCENE-1567&quot;&gt;JIRA ID 1567&lt;/a&gt; (current status is patch)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;http://zevents.com&quot;&gt;&lt;b&gt;Zevents.com&lt;/b&gt;&lt;/a&gt; - fun with deduplication and dynamic re-ranking&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://xoopit.com&quot;&gt;&lt;b&gt;Xoopit.com&lt;/b&gt;&lt;/a&gt; - indexing service in their own cloud, just starting, will provide simple config for Lucene&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://search.lucidimagination.com/&quot;&gt;&lt;b&gt;Lucid search&lt;/b&gt;&lt;/a&gt; - Erik Hatcher&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;index sources: &lt;a href=&quot;http://www.lucidimagination.com/&quot;&gt;lucidimagination.com&lt;/a&gt; site, lucene apache site, email lists&lt;/li&gt;
	&lt;li&gt;facet by source, project, issue (JIRA), author (email)&lt;/li&gt;
	&lt;li&gt;multi-select facets (checkboxes)&lt;/li&gt;
	&lt;li&gt;displays facets with 0 hits&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt; Stopwords Are Obsolete (Avi Rappoport, &lt;a href=&quot;http://www.searchtools.com/about/consulting.html&quot;&gt;Search Tools Consulting&lt;/a&gt;)&lt;/h4&gt;
&lt;ul&gt;
	&lt;li&gt;About 1/2 attendees no longer use stopwords&lt;/li&gt;
	&lt;li&gt;1/4 do use them&lt;/li&gt;
	&lt;li&gt;1/4 do both or don&apos;t know&lt;/li&gt;
	&lt;li&gt;No one has any evidence that they are useful, but they can screw up phrase queries&lt;/li&gt;
	&lt;li&gt;Indexing them does make indexes bigger&lt;/li&gt;
	&lt;li&gt;Indexing them leads to search transparency - no question that what you search is what you get&lt;/li&gt;
	&lt;li&gt;Maybe Lucene and Solr should disable stopwords by default&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;a href=&quot;http://wiki.apache.org/lucene-java/OpenRelevance&quot;&gt;OpenRelevance&lt;/a&gt;&lt;/h4&gt;
&lt;ul&gt;
	&lt;li&gt;Need some kind of standard corpus / queries / judgments for testing&lt;/li&gt;
	&lt;li&gt;Still in infant stages of development&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All in all, a great meeting full of very positive energy. &lt;/p&gt;
&lt;p&gt;&lt;hr /&gt;&lt;/p&gt;
&lt;div style=&quot;margin-left:0em; margin-right:1em;&quot;&gt;
		&lt;p style=&quot;font-size:medium; margin-left:4em; margin-right:3em&quot;&gt;Avi Rappoport of Search Tools Consulting can help you choose or fix a search engine. Please &lt;a href=&quot;http://www.searchtools.com/about/consulting-contact.html&quot;&gt;contact SearchTools&lt;/a&gt; for more information.&lt;/p&gt;
		&lt;hr style=&quot;margin-left:2em; margin-right:2em;&quot; /&gt;
			
		&lt;p style=&quot;font-size:small; margin-left:4em; margin-right:3em;&quot;&gt; 
			&lt;a rel=&quot;license&quot; href=&quot;http://creativecommons.org/licenses/by-sa/3.0/us/&quot;&gt;&lt;img src=&quot;http://creativecommons.org/images/public/somerights20.png&quot; alt=&quot;Creative Commons License&quot; border=&quot;4&quot; align=&quot;left&quot; style=&quot;border-width:4&quot; /&gt;&lt;/a&gt; 
		&amp;nbsp;This information copyright &amp;copy; 2009 &lt;a xmlns:cc=&quot;http://creativecommons.org/ns#&quot; href=&quot;http://www.searchtools.com&quot; property=&quot;cc:attributionName&quot; rel=&quot;cc:attributionURL&quot;&gt;Avi Rappoport, Search Tools Consulting&lt;/a&gt; and is licensed under a &lt;a rel=&quot;license&quot; href=&quot;http://creativecommons.org/licenses/by-sa/3.0/us/&quot;&gt;Creative Commons Attribution-Share Alike 3.0 United States License&lt;/a&gt;. Please attribute to this page&apos;s full URL. Permissions beyond the scope of this license  are available upon  &lt;a xmlns:cc=&quot;http://creativecommons.org/ns#&quot; href=&quot;http://www.searchtools.com/site/contact.html&quot; rel=&quot;cc:morePermissions&quot;&gt;request&lt;/a&gt;. &lt;/p&gt;
	&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/div&gt;

</description>
  <comments>http://searchtools.livejournal.com/85914.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/85749.html</guid>
  <pubDate>Thu, 28 May 2009 23:32:39 GMT</pubDate>
  <title>Best Practices for thumbnails in search results - article</title>
  <link>http://searchtools.livejournal.com/85749.html</link>
  <description>&lt;a href=&quot;http://www.uxmatters.com/mt/archives/2009/05/making-10000-a-pixel-optimizing-thumbnail-images-in-search-results.php&quot;&gt;Making $10,000 a Pixel: Optimizing Thumbnail Images in Search Results :: UXmatters&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This is a very detailed and helpful article about all the various issues that come up when designing product pictures for search results listings. The word &quot;optimizing&quot; is used for the UI results from local search, rather than anything to do with image compression or SEO.  It also has excellent example screenshots, and a bibliography.  Highly recommended.</description>
  <comments>http://searchtools.livejournal.com/85749.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/85390.html</guid>
  <pubDate>Thu, 28 May 2009 22:40:19 GMT</pubDate>
  <title>Twitter Search Has Big Ambitions</title>
  <link>http://searchtools.livejournal.com/85390.html</link>
  <description>My &lt;a href=&quot;http://newsbreaks.infotoday.com/NewsBreaks/Twitter-Search-Has-Big-Ambitions-53868.asp&quot;&gt;overview of the state of Twitter Search&lt;/a&gt; on infotoday.com - it doesn&apos;t even try to do relevance ranking right now, so it&apos;s not exactly a Google killer, despite the hype.&amp;nbsp; &lt;br /&gt;</description>
  <comments>http://searchtools.livejournal.com/85390.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/84737.html</guid>
  <pubDate>Tue, 21 Apr 2009 01:53:05 GMT</pubDate>
  <title>#AmazonFail: garbage in, garbage out</title>
  <link>http://searchtools.livejournal.com/84737.html</link>
  <description>My article on #amazonfail is up at &lt;a href=&quot;http://bit.ly/avirr&quot;&gt;Amazonfail: How Metadata and Sex Broke the Amazon Book Search&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;Amazon failed in a big way on Easter weekend. As the largest bookstore in the world, if a book does not appear in its lists or its search results, the book practically disappears. The event now known as #AmazonFail involves a great cast of characters-books, metadata, sex, search results, traditionally disenfranchised groups, a possible hacker, the Kindle, the absence of institutional response, and the emergence of Twitter for sharing information very quickly on a massive scale.&lt;/blockquote&gt;&lt;br /&gt;So that&apos;s what I was working on last week.&lt;br /&gt;&lt;img src=&quot;http://stools.icons.ljtoys.org.uk/mi/dot.gif&quot; border=&quot;0&quot; alt=&quot;&quot;&gt;</description>
  <comments>http://searchtools.livejournal.com/84737.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/84562.html</guid>
  <pubDate>Wed, 15 Apr 2009 21:10:43 GMT</pubDate>
  <title>Enterprise Search Summit / NY,  May 12 - 13 2009</title>
  <link>http://searchtools.livejournal.com/84562.html</link>
  <description>The Enterprise Search Summit &lt;a href=&quot;https://secure.infotoday.com/forms/default.aspx?form=ess2009&quot;&gt;early registration deadline (save $100) is this Friday&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.enterprisesearchsummit.com/&quot;&gt;ESS&lt;/a&gt; is a great conference for search people, with a lot of practical information and case studies. I&apos;ve always had a great time and learned things I didn&apos;t even know I was missing.  I wish I could be there this year.</description>
  <comments>http://searchtools.livejournal.com/84562.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/84476.html</guid>
  <pubDate>Thu, 02 Apr 2009 22:42:38 GMT</pubDate>
  <title>Two new Search / Information Retrieval textbooks</title>
  <link>http://searchtools.livejournal.com/84476.html</link>
  <description>&lt;p&gt;&lt;a href=&quot;http://www.amazon.com/gp/product/0521865719?ie=UTF8&amp;amp;tag=searchtoolscom&quot;&gt;Introduction to Information Retrieval&lt;/a&gt;  by: Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze; July 2008 from Cambridge University Press  &lt;i&gt;[disclosure: the link has my amazon affiliate code]&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;I&apos;ve been going through this book for a while, and I like it. It&apos;s an interesting way of ordering the content, but that content itself seems much more practical than previous textbooks, with helpful information about language detection, the issues of index structure and caching, classification (and evaluation thereof), machine learning for interactive search (as opposed to batch), and various algorithms for relevance ranking.  They cover practical topics like lowercasing in the index, which I agree with, and there&apos;s not much I find maddeningly wrong.  However, they postpone citations to chapter reference sections, so it is sometimes not obvious that some practical topics have no citations at all, such as whether excluding stopwords causes more harm than good (I agree that it does, but I sure would like to see the research).  And if there isn&apos;t any, I&apos;d like to know that too (and hope someone will fill the gap, soon!)&lt;/p&gt;

&lt;a name=&quot;cutid1&quot;&gt;&lt;/a&gt;Table of Contents:
&lt;ul&gt;	&lt;li&gt;Boolean retrieval&lt;/li&gt;
   	&lt;li&gt;The term vocabulary and postings lists&lt;/li&gt;
	&lt;li&gt;Dictionaries and tolerant retrieval&lt;/li&gt;
	&lt;li&gt;Index construction&lt;/li&gt;
	&lt;li&gt;Index compression&lt;/li&gt;
	&lt;li&gt;Scoring, term weighting and the vector space model&lt;/li&gt;
	&lt;li&gt;Computing scores in a complete search system&lt;/li&gt;
	&lt;li&gt;Evaluation in information retrieval&lt;/li&gt;
	&lt;li&gt;Relevance feedback and query expansion&lt;/li&gt;
	&lt;li&gt;XML retrieval&lt;/li&gt;
	&lt;li&gt;Probabilistic information retrieval&lt;/li&gt;
	&lt;li&gt;Language models for information retrieval&lt;/li&gt;
	&lt;li&gt;Text classification and Naive Bayes&lt;/li&gt;
	&lt;li&gt;Vector space classification&lt;/li&gt;
	&lt;li&gt;Support vector machines and machine learning on documents&lt;/li&gt;
	&lt;li&gt;Flat clustering&lt;/li&gt;
	&lt;li&gt;Hierarchical clustering&lt;/li&gt;
	&lt;li&gt;Matrix decompositions and latent semantic indexing&lt;/li&gt;
	&lt;li&gt;Web search basics&lt;/li&gt;
	&lt;li&gt;Web crawling and indexes&lt;/li&gt;
	&lt;li&gt;Link analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt; I just saw and ordered &lt;a href=&quot;http://www.amazon.com/gp/product/0136072240?ie=UTF8&amp;amp;tag=searchtoolscom&quot;&gt;Search Engines: Information Retrieval in Practice&lt;/a&gt; by Bruce Croft, Donald Metzler, Trevor Strohman; March 2009 from Addison-Wesley (Pearson Higher Ed) ISBN-10: 013607782X; ISBN-13: 9780136077824  &lt;i&gt;[disclosure: the link has my amazon affiliate code]&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;This looks even closer to my approach to practical information retrieval. I&apos;m all for reducing the distance between theoretical IR, which lives in algorithms and equations, and real-life search, which concentrates on handling short queries, providing useful information foraging pathways, and so on.   More when I read it. &lt;/p&gt; 
&lt;img src=&quot;http://stools.icons.ljtoys.org.uk/mi/dot.gif&quot; border=&quot;0&quot; alt=&quot;&quot;&gt;</description>
  <comments>http://searchtools.livejournal.com/84476.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/84087.html</guid>
  <pubDate>Wed, 25 Mar 2009 22:04:21 GMT</pubDate>
  <title>Openfind Enterprise Search (OES) - New SearchTools Report</title>
  <link>http://searchtools.livejournal.com/84087.html</link>
  <description>&lt;h4&gt;March 25, 2009&lt;/h4&gt;
			&lt;blockquote&gt;
				&lt;h4&gt;&lt;a href=&quot;http://www.searchtools.com/tools/openfind.html&quot;&gt;Openfind Enterprise Search (OES)&lt;/a&gt;&lt;/h4&gt;
				&lt;p&gt;Openfind is a leading enterprise search engine company in Taiwan, providing search to many government departments and corporations since 1998, scalable to over 50 million items in their standard licence. In addition to  documents, it can index text and some numeric content from relational databases, off-loading the search and spreading the server load.&lt;/p&gt;
				&lt;p&gt;OES not only handles many languages including English, Arabic, Japanese, Simplified Chinese, and Traditional Chinese, the search  interface and admin interface are also available in both versions of Chinese.&lt;/p&gt;
				&lt;p&gt;The program has a long list of useful features, including indexing by file system UNC, robot crawling, near-real-time indexing of structured XML, and full-featured ODBC database connectors. It can read text, HTML, PDF, Open Office and Microsoft Office file formats, and has an API for adding other formats. &lt;/p&gt;
				&lt;p&gt;Search features include Internet Query Operators (+, -, &amp;quot;&amp;quot;) and Boolean operators, including parentheses. Fields and metadata are searchable. Search admins can use the web interface to edit the lists of stopwords, synonyms, autocomplete items and related terms, in any character set. Relevance ranking includes a term frequency algorithm and some heuristics, the results page UI is clean and simple, and some facets, including date and file type, are automatically generated.&lt;/p&gt;
				&lt;p&gt; There is a good suite of metrics and reports, and excellent documentation. While it&apos;s not the first search engine to have localized interfaces (&lt;a href=&quot;http://www.searchtools.com/tools/ultraseek.html&quot;&gt;Ultraseek&lt;/a&gt;, &lt;a href=&quot;http://www.searchtools.com/tools/autonomy.html&quot;&gt;Autonomy&lt;/a&gt; and &lt;a href=&quot;http://www.searchtools.com/tools/google.html&quot;&gt;Google&lt;/a&gt; come to mind), &lt;a href=&quot;http://www.searchtools.com/tools/openfind.html&quot;&gt;OES&lt;/a&gt; is certainly worth a look.&lt;/p&gt;
			&lt;/blockquote&gt;
&lt;img src=&quot;http://stools.icons.ljtoys.org.uk/mi/dot.gif&quot; border=&quot;0&quot; alt=&quot;&quot;&gt;</description>
  <comments>http://searchtools.livejournal.com/84087.html</comments>
  <lj:mood>working</lj:mood>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://searchtools.livejournal.com/83857.html</guid>
  <pubDate>Tue, 10 Mar 2009 23:31:02 GMT</pubDate>
  <title>searchtools links</title>
  <link>http://searchtools.livejournal.com/83857.html</link>
  <description>Interesting stuff I found today:&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://www.qweery.nl/&quot;&gt;Qweery Search&lt;/a&gt; - based on a Dutch public search engine, has some interesting tweaks based on sales conversions, clicks and feedback. &lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://whoosh.ca/&quot;&gt;Whoosh&lt;/a&gt; - a Python search engine, originally for help systems.  Features include fielded search, fast indexing and search, pluggable API (for relevance scores, text analysis, etc.), query language and spellchecker.   Free and open-source, based on some other open-source engines.&lt;br /&gt;&lt;br /&gt;&lt;a href=&quot;http://code.google.com/p/xappy/&quot;&gt;Xappy&lt;/a&gt; - a wrapper for the Xapian open source search engine.   The elsdoerfer blog shows how to &lt;a href=&quot;http://blog.elsdoerfer.name/2008/08/13/django-xappy-searching-with-xapian/&quot;&gt;integrate with Django&lt;/a&gt;, and the Xapian blog has a &lt;a href=&quot;http://xapian.wordpress.com/2009/02/12/xapian-performance-comparision-with-whoosh/trackback//&quot;&gt;speed comparison with Whoosh&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;A low-cost commercial &lt;a href=&quot;http://www.basisoft.com/robots-txt.html&quot;&gt;robots.txt generator&lt;/a&gt; from BasiSoft (not the same as text analysis company &lt;a href=&quot;http://www.basistech.com/&quot;&gt;Basis Technology&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Doug Lenat &lt;a href=&quot;http://www.semanticuniverse.com/blogs-i-was-positively-impressed-wolfram-alpha.html&quot;&gt;was positively impressed with Wolfram Alpha&lt;/a&gt;, which he found particularly good on country and stock price information, temperatures, cities, currencies, chemical compound names and measurements (such as 10cm/year).  He says it has beautiful graphs and tables in the results page (&lt;a href=&quot;http://www.semanticuniverse.com&quot;&gt;http://www.semanticuniverse.com&lt;/a&gt;).</description>
  <comments>http://searchtools.livejournal.com/83857.html</comments>
  <lj:security>public</lj:security>
  <lj:reply-count>5</lj:reply-count>
</item>
</channel>
</rss>
