I'll be posting a call for presentations soon, so please think about what you might want to talk about.
Enterprise Search Summit Fall 2011
November 1-3, in Washington DC, with KMWorld, Taxonomy Boot Camp, and SharePoint Summit
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings. Extraction is very fast (milliseconds), needs only the input document (no global or site-level information required), and is usually quite accurate. Boilerpipe is a Java library written by Christian Kohlschütter and released under the Apache License 2.0. Its algorithms build on and extend concepts from the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, in New York City, NY, USA. The paper and the presentation slides are available online, and a video of the presentation is freely available on Videolectures.net (turn speaker balance to the left to improve audio quality). Commercial support is available through Kohlschütter Search Intelligence.
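To give a feel for what "shallow text features" can mean, here is a rough Python sketch that splits a page into text blocks and keeps only blocks with enough words and a low share of linked words. This is not boilerpipe's algorithm or API: the names `BlockParser` and `extract_main_text`, the list of block tags, and both thresholds are assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption, not boilerpipe's list).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br", "body",
              "h1", "h2", "h3", "h4", "h5", "h6",
              "section", "article", "header", "footer", "nav"}
SKIP_TAGS = {"script", "style"}


class BlockParser(HTMLParser):
    """Collects (word_count, linked_word_count, text) for each text block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._words = []
        self._linked = 0   # words seen inside <a> in the current block
        self._in_link = 0  # nesting depth of <a>
        self._skip = 0     # nesting depth of <script>/<style>

    def _flush(self):
        # Close the current block, if it has any text, and start a new one.
        if self._words:
            self.blocks.append((len(self._words), self._linked, " ".join(self._words)))
        self._words = []
        self._linked = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    kept = []
    for n_words, n_linked, text in parser.blocks:
        link_density = n_linked / n_words  # n_words > 0 for every stored block
        if n_words >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

Calling `extract_main_text(page_html)` on a typical article page should drop short navigation and footer blocks and return the longer paragraphs, though fixed thresholds like these will misfire on pages with short paragraphs or link-heavy prose.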
Nice overview of why internal search is often worse than web search: mainly that there's little meaningful linking within an intranet, little incentive to make a site easily searchable, and security issues with access control. The post recommends realistic expectations, not indexing low-value content, looking at third-party relevance tools, offering scope or zoned search, and tagging content.
Complex and difficult to read, though I can tell they're trying to make it easier.
Nice introduction to the new structural tags in HTML5: section, article, aside, header, hgroup, footer, and nav, and new content tags: figure, video, audio, canvas.
In the AJAX Element, Custom Search Engine only, passing the "parameter google.search.Search.FILTERED_CSE_RESULT
YDN's BOSS (Build Your Own Search Service) version 2 developer guide. This sends queries to the Yahoo/Bing search engine for web, images, and news results, with very flexible results. Some free queries for development, then price per result.
Helpful vendor blog post about the steps needed to create a development and deployment Java server system within the Microsoft Azure cloud.
Information Industry News + New Web Sites and Tools From Gary Price and Shirl Kennedy - definitely a site worth tracking.
Very short comparison of the Google Mini enterprise search engine appliance and the Funnelback appliance; Funnelback wins on price and functionality.
Jeff finds the Google Recipe View data structure lacking, as it mainly filters on ingredients rather than facets such as expertise, health, cuisine, technique, etc. It's also difficult for small sites and blogs to generate the rich snippets metadata. He thinks the system is "under cooked and lacks seasoning".
Describes a Ventana Research report: companies see analytics as a way to make more money, but only 24% are planning analytics-based changes. Budget, infrastructure, and inertia seem to block change.
Classic 1992 article talks about filtering unstructured data and relates it to the then-current understanding of information retrieval.
Uses examples of airline reservation systems and Windows browsers to show how regulations could curb Google's tendency to promote its own services. It seems pretty reasonable to me, no techno-paranoia or obvious cluelessness.
Otis Gospodnetic interviews Shay Banon of ElasticSearch, a large-scale data-grid level distributed search solution using the most modern architecture approaches. Some comparisons to Solr, but this is a completely different codebase.
My short article for InfoToday NewsBreaks discusses the functionality, interface, and uses of the Greplin personal cloud search engine. It's built on Lucene and other scalable server software, and uses the OAuth protocol for accessing accounts. Useful service, a cute company story, but they may find fierce competition in a few months.
Problems faced by the DC-x publishing/archiving/text analytics company with Oracle Text in 2009 included stability and scale, missing facets, support issues, database load, query syntax, and expense. The transition to Solr seems to have been reasonably smooth.