Sunday, June 01, 2008

Virtual moonbeams: the impossible task of capturing the web

Unquestionably one of the most significant consequences of the digital age has been the wholesale transformation of the archival business. A business which was once dominated by the maintenance of intricate card catalogues and the ultimately futile battle to preserve physical media (e.g. books, newspapers, canvases, wax cylinders, photographic film, magnetic tapes) has been turned on its head by the ability to create a perfect digital copy of almost anything, which can then be stored, accessed, duplicated and distributed, all without a degradation in quality. Coupled with cheap and abundant storage, there is seemingly no longer any reason why media should be lost, as has happened so many countless times in the past.

Or is there...?

Another major consequence of the digital age has been a relentless proliferation of data and a gradual shift from static to dynamic publishing. The early boast of CD-ROM manufacturers that it was possible to store the entire Encyclopedia Britannica on a single disc seems wonderfully quaint in the light of the internet's many petabytes of data; data which isn't republished on an annual or quarterly basis, but is constantly growing and changing.

In the early days of the web it seemed almost achievable to maintain some sort of archive by crawling the web and taking snapshots. That approach led to the creation of the Internet Archive's marvellous Wayback Machine, which has archived an incredible 85 billion web pages from 1996 to the present. However, even this gargantuan effort is riddled with holes, dependent as it is on data from Alexa. Even the pages which have been crawled and archived often have missing images or other non-HTML components (see the original BBC Radio 5 Live homepage), which leads us to a new challenge facing the would-be web archivist: the move from static to dynamic web pages.
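The crawl-and-snapshot model boils down to keeping timestamped copies of each URL and serving whichever copy is nearest to a requested date. A minimal sketch (purely illustrative, and nothing like the Internet Archive's actual implementation):

```python
from datetime import datetime

class ToySnapshotStore:
    """A toy sketch of crawl-and-snapshot archiving: keep every
    timestamped copy of a URL and answer 'what did this page look
    like around time t?' (illustrative only)."""

    def __init__(self):
        self.snapshots = {}  # url -> list of (timestamp, content)

    def capture(self, url, content, when):
        # a crawler would call this each time it fetches the page
        self.snapshots.setdefault(url, []).append((when, content))

    def lookup(self, url, when):
        # return the archived copy closest in time to the requested date
        copies = self.snapshots.get(url)
        if not copies:
            return None
        return min(copies, key=lambda tc: abs((tc[0] - when).total_seconds()))[1]

store = ToySnapshotStore()
store.capture("http://www.bbc.co.uk/", "<html>old homepage</html>", datetime(1997, 1, 1))
store.capture("http://www.bbc.co.uk/", "<html>new homepage</html>", datetime(2008, 6, 1))
print(store.lookup("http://www.bbc.co.uk/", datetime(2000, 1, 1)))  # nearest copy is the 1997 one
```

The holes the Wayback Machine suffers from fall straight out of this model: any URL the crawler never visited, or any image the capture step skipped, simply isn't in the store.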

This shift can be illustrated by the recent changes to the BBC homepage. Although not actually dynamically published (BBC Programmes provides an example of genuine dynamic publishing), the BBC homepage has changed from a single page (two pages if you count the International version), with discrete updates which could be tracked and logged, to a customisable page of feeds and modules with thousands, if not tens of thousands, of possible permutations. What's more, most other major sites are much further down the dynamic, data-driven road than the BBC and are effectively just big databases, spitting out data on request (often to a variety of platforms), rather than assembling and publishing discrete webpages.

Of course, the shift towards dynamic pages isn't the only challenge to archiving the web. Other obstacles include subscription services, which either hide content behind a paid-for wall or require regular payment for media to be maintained (I've often wondered what will happen to my Flickr photos when death or bankruptcy forces me to stop paying my annual Pro account subscription). Use of the lowly robots.txt file and the nofollow attribute will also ensure a big chunk of the web isn't automatically crawled and captured.
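For completeness, this is roughly what those two exclusion mechanisms look like (the path and link below are placeholders, not taken from any real site):

```
# robots.txt: ask well-behaved crawlers to skip the entire site
User-agent: *
Disallow: /

<!-- or, link by link in the HTML, via the nofollow attribute -->
<a href="/members-only/" rel="nofollow">Members area</a>
```

Both are purely advisory, of course: they keep the polite crawlers out, but the content is still one misconfiguration away from vanishing entirely.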

The deletion of content and reuse of URLs are two other major problems. I was hoping to link to the BBC's 'Book of the Future' in my previous post on collaborative storytelling, but was alarmed to discover that not only had the site been taken down, but the URL which had been used to promote the site was now redirecting to a page on the future role of public service broadcasting.

So, how do you solve a problem like archiving the web? The two most likely solutions, to my mind, are 1) a massive, open, SETI@home-style distributed networking approach, or 2) Google does it. Whilst the former is unquestionably more ideologically appealing, the latter seems infinitely more likely. Unashamedly on a mission to "organize the world's information and make it universally accessible and useful", Google already keeps a temporary archive in the form of its Google cache and discrete archives around some of its products (e.g. News, Zeitgeist). The company did register a raft of archive-related domain names in September 2006, prompting a brief flurry of speculation, which quickly died down (after all, you can't read too much into Google's domain registrations, as this compendium demonstrates).

One service which probably isn't about to revolutionise the wholesale archiving of the web, but may just be a portent of the future, is iterasi. Originally unveiled at DEMO in January and launched as a public beta last month, iterasi is a dynamic bookmarking service which, in their own words, "makes it simple for any Web user to save the dynamically generated pages that are increasingly becoming the bulk of today's Web experience". The service works by means of a browser plug-in (IE 7 or Firefox 2, but currently PC only - although they confirmed by e-mail that they're working on a Mac-compatible version) which enables you to "notarize" any page - saving it to your iterasi account, complete with the description and tags of your choosing, from where it can be viewed, emailed and embedded. Below is a "notary" of the BBC homepage, captured at 9:07 this morning.

I think it's a fantastic service and can't wait for the Mac-compatible plug-in so I can fully integrate it with my online life. Whether it marks the start of a more nuanced approach to capturing the dynamic web, only time will tell. The smart money, as ever, is on Google.


Anonymous said...

Dan - Thanks for the good words about iterasi! The advent of dynamic content creates an entirely different universe. It's cool to think it's a part of the web we are just starting to explore.

Anonymous said...

Sean McGrath discussed this issue at last month's XTech Conference. He identified three models currently operating on the web:

1. Pages existing on the server and statically published.

2. Documents existing on the server but dynamically rendered, transforming the content in the process using, for example, CSS and JavaScript.

3. Nothing existing until you observe it: the document is composed and rendered when requested, on the client - Just In Time programmatic generation of content.

I think we have a 4th class - like /programmes - where pages are rendered JIT on the server.

But in any case it is model 3 that's causing the problems you rightly highlight. At the most extreme, the pages really can't be thought of as being part of the web - they can't be indexed by Google bots, bookmarked or linked to.
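That third model can be made concrete with a toy example (Python, purely illustrative; fetchStory() is a made-up name): strip away scripts and tags the way a non-JavaScript-executing crawler effectively does, and a client-rendered page yields nothing to index.

```python
import re

# Model 1: the full content is in the HTML the server sends
static_page = "<html><body><h1>News</h1><p>Full story text.</p></body></html>"

# Model 3: the server sends an empty shell; the content only exists
# once client-side script has run
client_rendered_page = """<html><body><div id="app"></div>
<script>document.getElementById('app').innerHTML = fetchStory();</script>
</body></html>"""

def indexable_words(html):
    # crude approximation of what a crawler that doesn't execute
    # JavaScript can see: drop script blocks, then drop all tags
    html = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    return re.sub(r"<[^>]+>", " ", html).split()

print(indexable_words(static_page))          # the story text is all there
print(indexable_words(client_rendered_page)) # nothing left to index
```

The same emptiness that defeats the indexer defeats the archivist: a snapshot of the HTML alone captures the shell, not the page the reader actually saw.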

And that's before, as you point out, you get to reusing urls. Sigh.