SearchTools Blog
Looking at real-time and near-real-time search 
2nd-Nov-2009 07:22 pm
searchtools.com
Real-time vs. near-real-time search

In theory, "real-time" search means that content is indexed and searchable at the same moment as it is posted and readable. To do this, the publishing platform (usually a database) must trigger the search engine indexer at the same time as it formats the publication text. The search engine then adds the text to its searchable index, so that someone can click the search button and find the new information at virtually the same time as the text is being published online.
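As a rough illustration of that trigger, here is a minimal Python sketch in which one publish call both stores the text and hands it to the indexer. The names `InMemoryIndex` and `Publisher` are hypothetical, and the index is a toy in-memory structure, not any real search engine's API.

```python
class InMemoryIndex:
    """Toy inverted index: maps each term to the set of document ids containing it."""

    def __init__(self):
        self.postings = {}

    def add(self, doc_id, text):
        # Naive whitespace tokenization; a real indexer would do much more.
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


class Publisher:
    """Hypothetical publishing platform that triggers indexing on every publish."""

    def __init__(self, index):
        self.index = index
        self.store = {}

    def publish(self, doc_id, text):
        self.store[doc_id] = text      # the post becomes readable
        self.index.add(doc_id, text)   # and searchable in the same call


index = InMemoryIndex()
publisher = Publisher(index)
publisher.publish("post-1", "Bay Bridge reopens today")
print(index.search("bridge"))
```

In a real deployment the indexer usually runs as a separate process, and the hand-off between the two steps is where the lag discussed below tends to come from.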

In practice, I have yet to find a system that doesn't have a lag of at least 30 seconds (including Twitter's internal search), and I call that near-real-time search. I don't think that's a bad thing, just that the labels we use should be accurate.

Stock tickers are real-time

Stock traders had near-real-time streaming data for well over a century, first by telegraph and then by ticker tape. Now they have actual real-time information: because the data is tiny, the system can send hundreds of thousands of prices per second. Obviously, no human can handle that, but the systems are all automated now, and currently can place orders in 1.5 milliseconds: they're working on reducing that tenfold (Universal Trading Platform). That should be the benchmark for real-time search.

Real-time publishing vs. real-time search

Danny Sullivan of SearchEngineLand defines real-time search as "looking through material that literally is published in real time". So Twitter, Flickr and similar systems are real-time, because they require little thought, but blogs and news stories are not. Danny and I disagree on the noun there: he defines it as search of real-time content, whereas I define it as search with an imperceptible time lag from publication. Breaking news, like the San Francisco Bay Bridge re-opening today, can come from anywhere: as far as I can tell, the Bay Bridge Info site had it first.

Thinking about near-real-time indexing, and date vs. relevance ranking

I have some more ideas about near-real-time indexing, and the specific challenges it creates for retrieval. The Bay Bridge example is a really useful way to examine how relevance can fit into the picture. I would very much like to hear from anyone who has experience with these problems. Please comment or post links to enlighten me.
Comments 
3rd-Nov-2009 09:17 am (UTC) - need for realtime search?
Anonymous
I believe the goal for search engine companies is not to reach real-time search (by your definition; I disagree with Danny Sullivan), since I think a 30-second lag is not an issue at all for public web content (it is for stock tickers and such).

For relevancy, I would think an authority model would help, since in this type of search (and in plain search as well) the social factor comes into play more and more. With social sites like Facebook, LinkedIn, Twitter etc., one can calculate an authority ranking for whoever posted a message and therefore give the message some ranking...

As for Danny Sullivan's definition: real-time search has nothing to do with publishing content as such. He mixes real-time publishing and real-time search.
My definitions:

- real-time search: content is added to a live search index at the same time it is saved in a/the database.

- real-time publishing: content is published less than 60 seconds after the publisher has started typing the content (I know the 60 seconds is a subjective measure).

Cheers,
Maarten Rooseboom
http://www.qweery.nl
4th-Nov-2009 12:06 am (UTC) - Re: need for realtime search?
I agree that some kind of authoritative weighting is usually a good idea, but the limits were clear during my convenient example of a real-time event: the Bay Bridge reopening. On most of the search results with relevance ranking, the first results were for other bay bridges, old information, and recent but not current status information (which was no longer true).

I'm starting to think that some kinds of queries need both the most 'relevant' and the most recent, perhaps in columns.
6th-Nov-2009 02:32 pm (UTC) - Re: need for realtime search?
Anonymous
For that reason, search engines should differentiate their results more:

Reviews/opinions: (fora/blog/user review parts of pages)
Where to buy: shops/auctions/market places
News: recent new documents, newssites, twitter
Video: youtube and such
General information: all other (although a further split is probably useful)

As you can see, the search for websites/images/news/videos is not what I would do. People tend to search for information about a product/service, then what others think of it, and, if still wanted, where they can buy it (new or second-hand).

Something I dislike is the aggregation of results, trying to put everything on the first SERP.

regards,
Maarten
17th-Nov-2009 03:59 am (UTC) - Re: need for realtime search?
All very good points!

But let's take another scenario, within the enterprise. Something happens (a merger, a new product, a corporate scandal). People go to the internal search engine, but they can't find anything because the index hasn't been updated, even if there are official statements on the event. They go to Google, but there's not much there either.

But if they go to Twitter, chances are that someone will have already mentioned it, even if it's supposed to be confidential. Employees feel like they're getting more from outside than the inside. Opinions propagate, often based on incomplete information, rumors and competitor spin, but they seem true within the echo-chamber created by re-tweets.

Better to have some near-real-time content and search results!
3rd-Nov-2009 03:34 pm (UTC)
I don't really have any experience with real-time searching. Yet I agree with you on terminology: real-time really means without a lag. It is an incredible challenge for a search engine. There is naturally a tradeoff between the speed of updates and the speed of search. For real-time data, one should almost inevitably have a smaller (but slower) updatable index and a faster static index. Then you have to merge results, which might not be trivial at all. E.g., the static index has document D, but in reality it is already deleted. More importantly, other data that affects ranking, e.g., incoming links, also changes in real time.
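The merge step described above might look roughly like this in Python. It is only a sketch under stated assumptions: each index returns `(doc_id, score)` pairs with comparable scores, deletions are tracked in a separate set of ids, and the function name `merge_results` is made up for illustration.

```python
def merge_results(static_hits, live_hits, deleted_ids, limit=10):
    """Merge hits from a static index and a small updatable index.

    static_hits, live_hits: lists of (doc_id, score) pairs.
    deleted_ids: ids deleted since the static index was built (assumed to be
    tracked separately, e.g. as tombstones).
    """
    best = {}
    for doc_id, score in static_hits:
        if doc_id not in deleted_ids:
            best[doc_id] = score
    for doc_id, score in live_hits:
        if doc_id not in deleted_ids:
            # The updatable index holds the newer version, so it wins on duplicates.
            best[doc_id] = score
    # Highest score first; ties broken by id so the order is deterministic.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


static_hits = [("D", 0.9), ("A", 0.5)]
live_hits = [("N", 0.7), ("A", 0.6)]
print(merge_results(static_hits, live_hits, deleted_ids={"D"}))
```

Document D scores highest in the static index but is dropped because it has been deleted. The sketch does nothing about the harder problem of ranking signals such as incoming links changing between index builds.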
4th-Nov-2009 12:09 am (UTC)
I think near-real-time search is a good enough goal for most things.

The merging stage is clearly vital. But the relevance ranking could be simpler, with phrase matching more important than incoming links, if an item is new.
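To make that idea concrete, here is a hedged Python sketch of a scoring function that leans on phrase matching for new items and only brings in incoming links for older ones. The weights, the one-hour `fresh_window`, and the document fields (`text`, `published`, `inlinks`) are all assumptions chosen for illustration, not a tested ranking formula.

```python
import math


def score(doc, query, now, fresh_window=3600.0):
    """Score one document for a query; times are in seconds.

    doc is a dict with hypothetical fields: text, published, inlinks.
    """
    terms = query.lower().split()
    if not terms:
        return 0.0
    text = doc["text"].lower()
    # Exact phrase match, as a simple substring test (no punctuation handling).
    phrase = 1.0 if " ".join(terms) in text else 0.0
    # Fraction of query words present anywhere in the text, in any order.
    words = set(text.split())
    overlap = sum(1 for term in terms if term in words) / len(terms)
    if now - doc["published"] <= fresh_window:
        # New item: links have had no time to accumulate, so ignore them
        # and weight the exact phrase more heavily.
        return 2.0 * phrase + overlap
    # Older item: let incoming links contribute, damped by a logarithm.
    return phrase + overlap + 0.5 * math.log1p(doc["inlinks"])


now = 10_000.0
fresh_phrase = {"text": "bay bridge reopens today", "published": 9_900.0, "inlinks": 0}
fresh_words = {"text": "bridge across the bay", "published": 9_950.0, "inlinks": 500}
print(score(fresh_phrase, "bay bridge", now) > score(fresh_words, "bay bridge", now))
```

Among new items, the post containing the exact phrase outranks the one with the same words in a different order, regardless of link counts.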
4th-Nov-2009 12:31 am (UTC)
Agree, but near-real-time is still a challenge.
PS: Just phrase matching might be ok for rare phrases, but for common phrases it will be a complete mess. Imagine some new gadget, device, OS, etc. named X is released and thousands if not millions of bloggers are posting. How do we separate trustworthy sources from "white noise"?

Edited at 2009-11-04 12:36 am (UTC)
4th-Nov-2009 09:55 pm (UTC)
Good points!

My test string was bay bridge which worked out really well: those are very frequent words both as a phrase and separated. However, it was an incredibly popular phrase within the posts. So results which came from about the same time, but had the phrase, were more pertinent than the ones with the words in a different order. Somehow we have to build this in.

Authoritative sources are another part of the puzzle. There was a very interesting discussion of both source and item value around December of last year; I have links and summaries. It's interesting that none of them expected a simple newest-first result list.
3rd-Nov-2009 03:45 pm (UTC) - Some have it already
Avi, Dieselpoint has had real-time search for a long time. My definition of "real time search" is that you get the same insert performance from a search engine that you would get from a database. That is, if you can insert a record into a SQL database and have it show up in a query in X milliseconds on given hardware, then your search engine should be able to do the same. As I said, we've been doing it for several years.
4th-Nov-2009 12:10 am (UTC) - Re: Some have it already
Hi Chris,

That's a pretty good rule of thumb, though some databases are ridiculously slow. I'm currently defining near-real-time as anything 30 seconds or faster.

Do you have any good examples with lots of updating?
4th-Nov-2009 12:31 am (UTC) - Re: Some have it already
Some databases are slow, but if you combine a slow (but real-time) database with a much faster semi-static index, the performance would be ok. Moreover, there is a new generation of column-oriented databases, which can be used to store and update inverted indices more efficiently. A new thing is http://www.infinidb.org/resources/tech-articles/69-introducing-infinidb-from-calpont
Though I have not tried it, it looks promising.