Web Search: What do we know from a single search query?
Prof. Jim Jansen summarizes how much web users reveal about themselves with each query, with a link to a conference paper on the topic. Spoiler: the example query is "ASIS&T annual meeting 2009" -- more specific than "world cup" or "twitter".
tags: search log analysis web-search research
Real-world intranets in 2010: SWOT analysis — Business Information Review
Nice overview of practical issues in enterprise / intranet information systems. It includes a useful section on Search. (SWOT is Strengths, Weaknesses, Opportunities, Threats).
tags: intranet enterprise search engines
Lucene and Solr: 2010 in Review « Sematext Blog
A nice summary of the Lucene/Solr merge in 2010 including the actual codebases, developer mailing lists, and coordinated release versions. There's a new sub-project, ManifoldCF, that manages connectors for datasources and has access control support. Mahout, Nutch, and Tika are now top-level Apache projects. Etc.
tags: open-source enterprise search engines access-control database file-systems robot connectors
Transaction-like Document Processing in AIE
Technical discussion of how to update index items in groups and thus with limited blocking of other processing. This makes access control and content changes much faster and more efficient.
tags: enterprise search engines indexing near-real-time
Internet Archive content, VUFind (Solr), and text mining « CRRA Blog
notes on creating metadata-rich searchable portals
tags: research metadata
Google Research Director Peter Norvig on Being Wrong
[question about pagerank as a stand-in for credibility]
Yeah, that's always a problem. One way we try to counter that is diversity. We haven't figured out any way to get around majority rules, so we want to show the most popular result first, but then after that, for the second one, you don't want something that's almost the same as the first. You prefer some diversity, so there's where minority views start coming in
tags: web-search relevance diverse-results
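The diversity re-ranking Norvig describes is often implemented as something like maximal marginal relevance (MMR). Here is a toy sketch with made-up scores and a made-up similarity function, not Google's actual algorithm:

```python
def mmr(candidates, similarity, relevance, lam=0.5, k=3):
    """Maximal marginal relevance: each pick trades off relevance
    against similarity to the documents already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(doc):
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy data: "a2" is nearly a duplicate of the top result "a1", so the
# less relevant but different "b" gets promoted to second place.
relevance = {"a1": 1.0, "a2": 0.95, "b": 0.6}
sim = lambda x, y: 0.9 if {x, y} == {"a1", "a2"} else 0.1
print(mmr(["a1", "a2", "b"], sim, relevance))   # ['a1', 'b', 'a2']
```

The "majority rules" result still comes first; diversity only kicks in for the later slots, exactly as in the quote.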
Posted from Diigo. The rest of my favorite links are here.
Lucid Imagination is a vendor of services and support for the open-source Lucene/Solr search engine code. Their new LucidWorks Enterprise package has three main parts:
- An easy installer/update to the current release version of Lucene/Solr. This also makes installing distributed indexes and additional search servers much easier.
- Additional features that any modern search engine should have, without each developer having to compile and configure multiple packages. These include data sources using web robot, file system, and database accessors, smart defaults in query parsing, some access control, logical query processing rules, auto-complete, spellchecking, synonyms, faceted navigation, and a clean results page design.
- Ways to configure the search (rather than learning Solr calls and config files):
- A clear RESTful API which makes calling search very easy for application programmers.
- An interactive browser admin interface for the people running search who are less sysadmins than librarians, information architects, usability experts or site producers. It's an early version and there are still glitches but it's now possible for non-programmers to get a Solr search up and going.
The LWE (LucidWorks Enterprise) package is free during development, although paid support is available; payment is required for production deployment.
All of these features round out Solr so that it can now compete with the most prominent enterprise search engines: the Google Search Appliance, SharePoint and FAST search, Vivisimo, Coveo, Exalead, Attivio, and Endeca (even Autonomy now has a browser-based admin interface). Service vendor competitors SearchBlox and Constellio have packages for their versions of Lucene/Solr, and I will be reviewing and comparing all of them in some depth.
Disclosure: I wrote a white paper for Lucid Imagination, and critiqued early versions of the LWE.
TagSoup - brute force HTML parser
TagSoup is designed as a parser, not a whole application; it isn't intended to permanently clean up bad HTML, as HTML Tidy does, only to parse it on the fly. Therefore, it does not convert presentation HTML to CSS or anything similar. It does guarantee well-structured results: tags will wind up properly nested, default attributes will appear appropriately, and so on.
tags: file-formats java xml
Speller Challenge (spellchecking algorithms for search)
The Speller Challenge - build the best speller that proposes the most plausible spelling alternatives for each search query. It uses the TREC 2008 Million Query Track for training and the Bing Test Dataset for evaluation. The first prize is $10,000 and the gratitude of the orthographically-challenged.
tags: spellchecker research
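The challenge doesn't prescribe an algorithm, but a minimal candidate generator in the spirit of Peter Norvig's well-known toy spelling corrector gives the flavor. The word frequencies here are invented; a real entry would train on query logs like the TREC Million Query Track:

```python
from collections import Counter

# Invented corpus frequencies standing in for real query-log counts.
WORDS = Counter({"search": 120, "speller": 5, "spelling": 40, "query": 80})

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Most frequent known word within one edit, else the word itself."""
    candidates = [w for w in edits1(word) | {word} if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)

print(correct("serch"))   # 'search' is one edit away
```

Winning entries will of course need context, edit-distance weighting, and multiple suggestions per query, but this is the core generate-and-rank loop.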
SearchTools links & notes on Diigo
slightly circular to link to my library, but it's much shorter.
ManifoldCF - Apache incubation (formerly Lucene Connectors Framework)
Open-source project for connecting to Documentum, FileNet, LiveLink, Meridio, JDBC, Windows fileshares, and SharePoint. Good for search engines needing to index content from these repositories, includes code for Solr indexing.
tags: open-source connectors database indexing search
Semantics in Practice - Enterprise Search Center
Definitions of semantic technology and descriptions of applications such as recommender systems, especially for encouraging longer visits to newspaper and content sites. I have yet to see evidence that natural language search is particularly useful, but enriched retrieval and relevance is always good.
IBM - A comparison of collection types in OmniFind Enterprise Edition, Version 9.1
IBM documentation, summer 2010 (via arnoldit).
Stopwords section - Search User Interfaces | Marti Hearst | Cambridge University Press 2009
This points out that ignoring stopwords is opaque to users, who expect that if they type "a", "an", or "the", the search engine will find them.
From the book: In a famous example in the early days of Web search, a searcher who typed “to be or not to be” in a search engine would be shocked to be served empty results. In 1996, a review of eight major search engines found that only AltaVista could handle the Hamlet quote; all others ignored stopwords (Peterson, 1997; Sherman, 2001).
Web search engines have since found efficient ways to index and store even stopwords, because they are so valuable once in a while. So smaller search engines should follow their lead.
tags: stopwords transparency ux ui
A classic case of system behavior that is opaque to system users is the elimination of stopwords from user queries. (Stopwords are the most common words in the language, usually what linguists call “closed-class” words in that new ones rarely enter the language. Examples from English are articles such as a, an, the and prepositions such as in, on.) In a famous example in the early days of Web search, a searcher who typed “to be or not to be” in a search engine would be shocked to be served empty results. In 1996, a review of eight major search engines found that only AltaVista could handle the Hamlet quote; all others ignored stopwords (Peterson, 1997; Sherman, 2001). (Stopword elimination is common in statistical ranking systems for which a paragraph-length query is assumed; not indexing stopwords by position results in significant savings in indexing time and disk space.) Today, this problem is solved on all the major Web search engines.
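The trade-off Hearst describes is easy to see in a toy positional index: dropping stopwords shrinks the index, but then the Hamlet quote becomes unfindable. A sketch, not any particular engine's implementation:

```python
STOPWORDS = {"to", "be", "or", "not", "the", "a", "an", "of"}

def build_index(docs, keep_stopwords=True):
    """Positional inverted index: term -> {doc_id: [positions]}."""
    idx = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            if not keep_stopwords and term in STOPWORDS:
                continue
            idx.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return idx

docs = {1: "To be or not to be that is the question"}
full = build_index(docs)
pruned = build_index(docs, keep_stopwords=False)

# The full index can answer the phrase query; the pruned one cannot.
print("to" in full)            # True
print("to" in pruned)          # False
print(len(full), len(pruned))  # 8 3 -- pruned has far fewer terms
```

The pruned index is smaller (the savings are dramatic at web scale, where stopwords dominate the postings), but any query made entirely of stopwords returns nothing.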
Autonomy's "Put FAST in the Past" Rescue Program
Microsoft will only be developing FAST for Windows in the future, and will be cutting support for Unix and Linux versions. Autonomy is aggressively marketing to those customers.
"Autonomy will match an organization's Microsoft FAST license implementation with like-for-like capability on all platforms – for 50% of the organization's original license fee for orders placed before December 31.
Autonomy will provide conceptual search and a Sharepoint connector free of charge
Autonomy's Microsoft FAST to IDOL migration tool will index an organization's data, enabling a seamless migration
Autonomy IDOL Enterprise Search can be used transparently from within Microsoft applications including Word, SharePoint, etc., providing end users with a seamless and easy transition"
tags: unix linux enterprise search engines search-vendors
Search implementation maturity level
A set of useful measures to classify the sophistication of a search implementation, and clarify what steps it would take to move up from one level to the next. I have some quibbles about the exact order of steps, but I really like the overall approach.
tags: search engines evaluation
WSDM2011 conference (2011-2-9)
Web Search and Data Mining - international ACM conference, Hong Kong, during February 9-12, 2011.
Celebros - search, navigation & analytics solution for online stores
concept-based semantic e-commerce search. I haven't tested it yet.
tags: ecommerce site search engine
Nextopia e-commerce search
Offers search for online catalogs, with images and faceted search options.
tags: site search engine ecommerce
Lucene Java 3.0.3 and 2.9.4 (bug fix releases)
Bugfixes for Lucene Java 2.x (old branch) and 3.x (new trunk).
tags: open-source search engine java
Constellio | Open Source Enterprise Search
Constellio is transitioning from a closed to open-source search engine, based on Lucene/Solr and compatible with Google Enterprise Connector Manager.
tags: open-source search-vendors file-formats connectors
Contegra Systems | Services
A systems integrator for information management, they work with several search engines including dtSearch, Exalead and FAST.
MaxxCAT - Enterprise Search Appliances
Hardware-software combination designed for easy connection to data sources, lightweight JSON API, scalability to hundreds of millions of items with fast response and high availability. These are significantly cheaper than the Google Mini and GSA, and the licenses don't time out.
tags: enterprise search appliances APIs database indexing
Fusing Enterprise Search and Social Bookmarking - MIKE2.0
More practical than most social search proposals, this treats public bookmarks as a form of metadata to be included in relevance and results display. It also has a note about the value of weak social ties in diffusing information beyond one's normal circle.
tags: social search engines enterprise folksonomy
infotoday newsbreak, october 14, 2010
by Avi Rappoport, Search Tools Consulting
In this modern age, big institutions have giant piles of data about all their operations: the question is what to do with all those bits. Extracting the right information can help avoid waste, delays, systems failures, even terrorist threats. For example, look at Toyota’s customer support and repair data: if the management had been looking, they would have noticed that something was going terribly wrong. Business intelligence (BI) means mining through all that digital data—in legacy systems, databases, and even spreadsheets—and reporting what’s going on. This generally requires creating aggregations that need server farms with big hard disks and lots of memory. But text search engine technology, using sophisticated versions of inverted indexing, can create files that are effectively shadow databases in much less space, optimized for fast retrieval. These search/BI hybrids also provide sophisticated access to the contents of text fields, making customers very happy indeed. more...
I admit to being pleased that inverted indexes turn out to be so good, as per Zobel and Moffat, "Inverted files for text search engines" (2006)
. But I'd really like to know what the limits of these BI tools are: anyone have any insights?
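One reason inverted files stay so compact, as Zobel and Moffat describe, is gap (delta) encoding of the postings lists: storing the differences between document IDs instead of the IDs themselves yields small numbers that compress very well. A minimal sketch of the idea:

```python
def delta_encode(postings):
    """Store document-ID gaps instead of absolute IDs; the small
    gaps compress far better, keeping the inverted file compact."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    """Reverse the encoding by accumulating the gaps."""
    out, total = [], 0
    for gap in gaps:
        total += gap
        out.append(total)
    return out

postings = [3, 7, 8, 20, 21]
gaps = delta_encode(postings)
print(gaps)                              # [3, 4, 1, 12, 1]
assert delta_decode(gaps) == postings    # round-trips exactly
```

In production systems the gaps are then packed with variable-byte or similar codes; that final compression step is omitted here.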
I got distracted the other day when I noticed that the new Google Docs 2010 editor is more of a word processor than the old one, and insists on showing me gigantic on-screen margins. Taking a further look, I discovered that Google had very quietly made this my default, rather than the old comfortable version. And then I made a list of things that don't work
. And a hashtag #gd2010
The new editor was in beta from April to June 15, which was clearly not long enough. Now the default is to create documents in the 2010 Docs format, and people get really confused (lots of questions in the Help forums
). Having it as a hosted service just makes it worse: Google can change anything without warning (that's my dark side of cloud services).
Thinking about how hard it is to release a big upgrade, and the specific demands it has on this organization that is used to incremental upgrades, I wrote an article: Google Docs 2010 Not Ready for Prime Time
, and a list of issues
. These should clarify the issues and I hope give tech support people some place to start.
What do you think? Comment with an LJ account, OpenID, or anonymously, I screen for spam and flames.
My InfoToday article: Zoho Search fills an important gap in the cloud business suite
Zoho Corporation offers a suite of cloud-based applications for small to medium-sized businesses. It has everything from email to HR functions, conferencing, and invoicing. The most basic functions are word processing, spreadsheets, and presentations, so it's a direct competitor to Microsoft Office Web Apps and Google Apps, neither of which has cross-application search.
In early June, Zoho rolled out a Search function, allowing users to find text across their email, word-processing, presentation, shared notebook and discussions documents, with other apps forthcoming. Essentially, this is an intranet aggregate search, including access controls. Because Zoho is hosting the content, it can do internal handoffs of indexable data with very low overhead. It uses a familiar format for search results: the title or email subject, some text (here the first line), location, author and date.
Since I wrote the article, the base interface at search.zoho.com
has improved tremendously. Read my article on InfoToday
for more details of the context and the features within Zoho Search. It's still a bit rough around the edges, but they have the basics and it's a great idea.
Federated vs. Aggregated Search Architectures
Federated search systems accept user queries, convert the query language, and send the queries to one or more remote search engines. They then display the results, sometimes in separate blocks, sometimes merged together. This approach is vital for searching external or un-owned data sources, such as national patent databases or legal archives. Federated search requires a lot of work to translate queries and deal with results. The heavy lifting is done at search time, which is good for absolutely current content and access control.
Aggregated search systems gather and index text from many different data sources. When the user sends a query, it can be handled locally. Aggregated search requires some work to get data from multiple data sources, and the ability to scale the index size nearly exponentially.
My research for this presentation indicated that each is useful in specific circumstances (I know, no surprise there). Many data sources are obviously best accessed by one or the other, but it's the corner cases that are tricky. Aspects to consider include:
- size of the content in the source,
- how often your users need that content,
- content change rate
- importance of real-time access control permissions changes
- content licensing rules
- available tools for indexing / querying
- difficulty of extracting and indexing
- quality of the internal search engine
- difficulty of sending queries and receiving results.
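To make the federated pattern concrete, here's a toy sketch with invented source adapters (no real product's API): the query is translated into each source's dialect at search time, and the results are merged by score:

```python
def federated_search(query, sources, limit=5):
    """Translate the query for each remote source, collect results,
    and merge by score at search time (the federated pattern)."""
    merged = []
    for source in sources:
        remote_query = source["translate"](query)
        for title, score in source["search"](remote_query):
            merged.append((score, source["name"], title))
    merged.sort(reverse=True)
    return merged[:limit]

# Two invented sources with different query syntaxes, returning
# canned results -- purely illustrative stand-ins for remote engines.
patents = {
    "name": "patents",
    "translate": lambda q: " AND ".join(q.split()),
    "search": lambda q: [("Patent 123", 0.9), ("Patent 456", 0.4)],
}
legal = {
    "name": "legal",
    "translate": lambda q: q.upper(),
    "search": lambda q: [("Case A v. B", 0.7)],
}

print(federated_search("fuel cell", [patents, legal]))
```

The hard parts in practice are exactly what the checklist above flags: the per-source translators, and making scores from different engines comparable enough to merge (here they're naively assumed to be on the same scale).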
Slides (with fish!) presented by Avi Rappoport at ESS, May 2010
Federated and Aggregated Search, Web View (color PDF)
Federated and Aggregated Search, Printable (grayscale 4-up PDF)
Comments? Arguments? Explanations? Please discuss below. Want an analysis of your data sources? I can do that, comment here or send me a message.
W3C¹ has just announced a Recommendation² regularizing the math and science character set for XML³, which should filter down to HTML. For example, there seem to be five ways to refer to the Greek letter epsilon (ε); having rules clarifies them nicely. It will be much easier to search for equations and formulae, which are used in everything from financial calculations to architecture.
Anything that pins down textual representation of concepts is always going to be a good thing for search. That's why search people are so enamored with Unicode
. Most modern search engines convert from old-fashioned system-specific character sets to Unicode before indexing. When the search terms get the same treatment, they will match the index terms and the search will be successful.
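Python's standard library shows the match-after-normalization point: the composed and decomposed spellings of "café" only compare equal once both sides get the same Unicode normalization, which is exactly why indexers and query processors must apply it consistently.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point, U+00E9
decomposed = "cafe\u0301"   # 'e' + U+0301 combining acute accent

print(composed == decomposed)             # False: raw strings differ
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(composed) == nfc(decomposed))   # True after normalization
```

Apply the same normalization to indexed text and to search terms, and the two spellings match; skip it on either side, and the user's query silently misses documents.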
So basically, I think this is a very good thing. Search engine indexers need to log and report trouble with character sets, because there are so many messy ones out there, and indexing files as random glyphs is a bad thing. But this will make it a bit easier.
¹ W3C is the World Wide Web Consortium, the closest thing to a governing body for the web. Their standards define the protocols, so most browsers can view most web sites.
² Recommendations are the W3C name for standards. IIRC they use different terminology because the official International Standards Organization (which can take decades to get things done), was territorial about the word "standard".
³ I'm looking at you, Microsoft, with your special XML files for Office programs.
Once upon a time, not so very long ago, the only people who could get Lucene/Solr to index HTML and plain text files were wizards skilled in the ways of compiler compatibility, library dependencies, and the dreaded make.
Fear no more! As if by magic, it's now quite simple.
- Firstly, the Solr project has incorporated the Apache Tika code, which can open and read text from most of the most popular file formats, including plain text, XML, PDF, Microsoft Office formats, and even HTML. Tika can do more (it's a content analysis toolkit), but for Solr purposes, it's the opening and reading that matters.
The Solr interface for Tika is Solr Cell (in the source code, ExtractingRequestHandler), and it just works. You call Solr update with the addition of the RESTful path /extract, give it a file name and a few parameters, and zing! it's indexed. If you use a corresponding schema, not only is the text indexed, but internal metadata (like the title tag) and external metadata (like file name and size) are also stored as fields which can be indexed, stored, and searched.
- Secondly, I have written a tutorial on exactly how to use Solr Cell to index text and HTML files, using the curl command line utility. It walks through these steps:
- changing the example LucidWorks or Solr Tutorial schema to store full text
- test-indexing a local XML file
- indexing a local text file
- indexing all text files in a folder
- indexing a local HTML file
- Indexing a remote HTML file as a web page.
and has lots of suggestions on where to go next.
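For a feel of how simple the Solr Cell call is, here's a sketch that just builds the /extract request URL rather than sending it. The localhost address is the Solr example default, and "doc1" is an invented document ID; the file itself would be POSTed as the request body (e.g. with curl --data-binary), as the tutorial walks through:

```python
from urllib.parse import urlencode

def extract_url(solr_base, doc_id, commit=True):
    """Build the Solr Cell URL for indexing a file's extracted text.
    literal.id supplies the unique key; commit=true makes the new
    document searchable immediately."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return solr_base + "/update/extract?" + urlencode(params)

url = extract_url("http://localhost:8983/solr", "doc1")
print(url)
# http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true
```

In practice you'd batch many files and commit once at the end (commit=False here) rather than paying the commit cost per document.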
The tutorial is free to everyone, thanks to Lucid Imagination who paid me to write it. It is on their site in the solutions
section as Indexing Text and HTML Files with Solr (registration required)
. And thanks to the Solr Cell committers, who made everything so much easier than before.
On your site, intranet or enterprise search engine, what happens if a search engine finds no match for the search terms?
Below the cut are two different approaches, one slightly verbose and the other so terse as to be baffling. Look at them, look at yours, look at my page on good things to do with the no matches page
, and see if there's something you can do better. (Screenshots of good and bad interfaces to deal with no matches for a search are below the cut.)
Readers: if you have any good or bad examples, or "before" and "after" screenshots, link me to them, please! I'll post the best ones, by which I mean both good helpful interfaces and really awful ones.