November 2nd, 2009

Looking at real-time and near-real-time search

Real-time vs. near-real-time search

In theory, "real-time" search means that content is indexed and searchable at the same moment it is posted and readable. To do this, the publishing platform (usually backed by a database) must trigger the search engine indexer at the same time as it formats the publication text. The search engine then adds the text to its searchable index, so that someone can click the search button and find the new information at virtually the same time as the text is being published online.
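To make the idea concrete, here is a minimal sketch of a publishing platform that indexes a post in the same call that stores it. Everything in it (the TinyPlatform class, its method names, the in-memory inverted index) is made up for illustration; a real platform would hand the text to a separate search engine, and that hand-off is where lag usually creeps in.

```python
import re
import time
from collections import defaultdict


class TinyPlatform:
    """Toy publishing platform with an in-memory inverted index (hypothetical)."""

    def __init__(self):
        self.posts = {}                # post_id -> (text, published_at)
        self.index = defaultdict(set)  # term -> set of post_ids

    def _index(self, post_id, text):
        # Split the text into lowercase word tokens and record the post under each.
        for term in re.findall(r"\w+", text.lower()):
            self.index[term].add(post_id)

    def publish(self, post_id, text):
        # Store and index in one step, so the post is searchable
        # as soon as it is readable.
        self.posts[post_id] = (text, time.time())
        self._index(post_id, text)

    def search(self, query):
        # Return ids of posts containing every query term, newest first.
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return sorted(hits, key=lambda pid: self.posts[pid][1], reverse=True)


platform = TinyPlatform()
platform.publish("post-1", "Bay Bridge re-opening today")
print(platform.search("bay bridge"))
```

Because publish() does not return until the index is updated, a search issued right afterwards already sees the post. The cost is that publishing gets slower as indexing gets more expensive, which is one reason real systems tend to index asynchronously.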

In practice, I have yet to find a system that doesn't have a lag of at least 30 seconds (including Twitter's internal search), and I call that near-real-time search. I don't think that's a bad thing, just that the labels we use should be accurate.
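If you want to check a system's lag yourself, one rough approach is to publish something unique and then poll the search until it shows up. The sketch below assumes you supply two callables of your own, publish and is_searchable (both hypothetical names), that talk to whatever system you are testing; the polling interval limits how precise the measurement can be.

```python
import time


def measure_index_lag(publish, is_searchable, timeout=60.0, poll_interval=0.05):
    """Return the seconds between publish() returning and the first
    successful is_searchable() check, or None if the timeout expires."""
    publish()
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if is_searchable():
            return time.monotonic() - start
        time.sleep(poll_interval)
    return None


# Stand-in for a real system: the post becomes searchable 0.2 seconds after publishing.
state = {}


def fake_publish():
    state["published_at"] = time.monotonic()


def fake_is_searchable():
    return time.monotonic() - state["published_at"] >= 0.2


lag = measure_index_lag(fake_publish, fake_is_searchable, timeout=5.0, poll_interval=0.01)
print(f"lag: {lag:.2f} seconds")
```

By my labels, a system where this number is imperceptible is real-time, and one where it runs to tens of seconds is near-real-time.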

Stock tickers are real-time

Stock traders had near-real-time streaming data for well over a century, first by telegraph and then by ticker tape. Now they have actual real-time information: because each price update is tiny, the system can send hundreds of thousands of prices per second. Obviously, no human can handle that, but the systems are all automated now, and can currently place orders in 1.5 milliseconds; they're working on reducing that tenfold (Universal Trading Platform). That should be the benchmark for real-time search.

Real-time publishing vs. real-time search

Danny Sullivan of SearchEngineLand defines real-time search as "looking through material that literally is published in real time". So Twitter, Flickr and similar systems are real-time, because they require little thought, but blogs and news stories are not. Danny and I disagree on the noun there: he defines it as search of real-time content, where I define it as search with an imperceptible time lag from publication. Breaking news, like the San Francisco Bay Bridge re-opening today, can come from anywhere: as far as I can tell, the Bay Bridge Info site had it first.

Thinking about near-real-time indexing, and date vs. relevance ranking

I have some more ideas about near-real-time indexing, and the specific challenges it creates for retrieval. The Bay Bridge example is a really useful way to examine how relevance can fit into the picture. I would very much like to hear from anyone who has experience with these problems. Please comment or post links, and enlighten me.