Entity extraction: not just for metadata any more 
16th-Dec-2008 03:45 pm
CiteSeer is a nifty free service that indexes and searches academic papers (mostly in CS and various information sciences). It's been doing Automated Citation Indexing for a decade, linking cited and citing papers, and those links turn out to be extremely valuable for research and area studies.
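
To make "linking cited and citing papers" concrete, here is a toy sketch in Python. It is not CiteSeer's actual code: the function name and the paper ids are made up for illustration, and it assumes the reference lists have already been extracted and matched to paper ids, which is the hard part in practice.

```python
from collections import defaultdict


def build_citation_index(references):
    """Build forward and reverse citation links.

    `references` maps a paper id to the list of paper ids it cites.
    Returns (cites, cited_by), each mapping a paper id to a set of ids.
    """
    # Forward links: paper -> papers it cites.
    cites = {paper: set(cited) for paper, cited in references.items()}

    # Reverse links: paper -> papers that cite it.
    cited_by = defaultdict(set)
    for paper, cited in cites.items():
        for target in cited:
            cited_by[target].add(paper)

    return cites, dict(cited_by)


# Hypothetical paper ids, for illustration only.
references = {
    "paper_a": ["paper_b", "paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": [],
}

cites, cited_by = build_citation_index(references)
print(sorted(cited_by["paper_c"]))  # ['paper_a', 'paper_b']
```

The reverse map is what lets a reader start from one paper and walk forward in time to everything that later cited it.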

Now, Professor Lee Giles and his students at Pennsylvania State University have rebuilt the system from scratch and are sharing it as SeerSuite (currently beta 0.1) under an open source Apache license. Even a smallish digital library can take advantage of the automated metadata extraction and citation linking, with the reliable Lucene search engine underneath.

Beyond that, they're combining technologies (from OCR to machine learning) to reverse-engineer data from PDF documents. This includes extracting captions and numbers out of tables, chemical formulae and molecular structures, mathematical equations, and 2D graphs, and storing them in various standard markup formats. All this information is not metadata; it's source data, and it's incredibly valuable for avoiding duplication, allowing reproduction of experiments, and taking that data in directions the original researchers did not expect. It brings the research into the Semantic Web, where there are tools just waiting for data like this.

I wrote a bit more about CiteSeerX and SeerSuite in InfoToday, and there's more information at the CiteSeerX site.
