SearchTools Blog (searchtools) wrote,

The UK Web Archive

The British Library and IBM are working together on the UK Web Archive, which will store all accessible UK web pages, giving researchers a great data source on British academia, opinions and popular culture, material that may otherwise change radically or disappear without notice.

IBM is providing software expertise and using the archive as a testbed for text-mining Big Data, estimating that it will amount to 220 terabytes per year as of 2011. IBM's BigSheets (presumably a pun on Google's Bigtable) includes both open and closed source software. They have shown various interfaces, including spreadsheets, tag clouds, and multi-bubble charts.
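To get a feel for what 220 terabytes per year means as a sustained load, here is a rough back-of-the-envelope sketch. It assumes decimal terabytes (10^12 bytes) and an even spread across the year; neither assumption comes from IBM or the British Library, and the function name is my own.

```python
# Rough conversion of a yearly data volume into daily and per-second rates.
# Assumptions (not from the announcement): decimal terabytes, 365-day year,
# and data arriving at a constant rate.
TERABYTE = 10**12
SECONDS_PER_DAY = 86_400


def sustained_rate(terabytes_per_year, days_per_year=365):
    """Return (bytes per day, bytes per second) for a yearly volume."""
    bytes_per_year = terabytes_per_year * TERABYTE
    bytes_per_day = bytes_per_year / days_per_year
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    return bytes_per_day, bytes_per_second


per_day, per_second = sustained_rate(220)
print(f"{per_day / 1e9:.0f} GB per day")        # about 603 GB per day
print(f"{per_second / 1e6:.1f} MB per second")  # about 7.0 MB per second
```

A real crawl would be bursty rather than constant, so these figures are only an average.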

I wrote an article about it for InfoToday: British Library and IBM Team Up on Web Archiving Project.

Some of my thoughts that didn't make it into the article:

The British Library is a legal deposit library, holding one copy of each book and book-like object that's published in the UK and Ireland. In the past, their web archive was limited to sites where they managed to find the owners and get permission to copy: about six thousand sites, including those of companies which no longer have an independent existence, such as the Woolworth's site, which has since been removed from the web. Obviously, web search engines and the Internet Archive have taken a different approach. But while the UK Legal Deposit Libraries Act of 2003 seems to give them permission, it hasn't yet been put into effect, and they've been in legal limbo. The announcement seems to be a way to pressure their parent department, the Department for Culture, Media and Sport, to implement the new rules as soon as possible. For more details, see the Wired UK article: Archiving Britain's web: The legal nightmare explored.

The UK Web Archive will honor the robots.txt convention and the Oakland Archive Policy developed by the Internet Archive and the UC Berkeley Information School.
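For anyone unfamiliar with what honoring robots.txt looks like in practice, here is a minimal sketch using Python's standard urllib.robotparser module. This is not the Archive's crawler code, which I haven't seen; the sample rules, the user-agent names and the example.co.uk URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/, and a
# made-up crawler called "example-archiver" is excluded from the whole site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: example-archiver
Disallow: /
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A generic crawler may fetch the home page but not the private area.
print(may_fetch(SAMPLE_ROBOTS_TXT, "some-crawler", "http://example.co.uk/index.html"))
print(may_fetch(SAMPLE_ROBOTS_TXT, "some-crawler", "http://example.co.uk/private/a.html"))
# The excluded archiver may not fetch anything.
print(may_fetch(SAMPLE_ROBOTS_TXT, "example-archiver", "http://example.co.uk/index.html"))
```

A polite crawler runs a check like this before every request and skips any URL for which it returns False.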

Published responses to the Archive announcement have ranged from the positive to the alarmist to the negative:

  • Positive: "British Library launches UK internet archive: The UK's national library has created a fascinating snapshot of the way Britons have been using the web since 2004"

  • Alarmist: "UK Web Archive will offer just 1% of websites by 2011"

  • Negative: "British Library wants taxpayer to gobble the web"

Software mentioned by IBM relating to BigSheets:
  • Hadoop - open source scalable data handling

  • Pig Latin - open source query language for Hadoop

  • Nutch - open source web crawler

  • Open Calais - not open source, but freely available from Thomson Reuters

  • IBM InfoSphere for classification

  • IBM ManyEyes for visualization

There are also some completely unsourced claims that 'the average life expectancy of a website was just 44 to 75 days, and suggested that at least 10% of all were either lost or replaced by new material every six months'. I have some leads on where this information came from, and it looks quite old, possibly from as far back as 1998. Does anyone out there have actual research data?
