
Notes - Ensuring the Longevity of Digital Documents

Spring 2009

Coalition for Networked Information

Opening Keynote Speaker:  David Rosenthal

LOCKSS Program, Stanford

Clifford Lynch introduced the talk by stressing how important it is for us to engage in this kind of thinking: Rosenthal has taken a step back to think critically about our assumptions regarding digital preservation, some of which have become enshrined.  He promised that Rosenthal’s talk would change our thinking on digital preservation.

Rosenthal’s Blog -

Notes from Rosenthal’s presentation:  How are we “ensuring the longevity of digital documents?”

Unix still reads every disk it ever wrote: there has been no incompatible change to the on-disk format and no incompatible change to the API.  It’s sound engineering practice to recognize that the costs of incompatibility are higher than any benefits derived from it.

He quoted Jeff Rothenberg, who said in 1995, “Incompatibility is inevitable, a force of nature,” and asked whether this is still true in 2009.  He proposed that incompatibility is a choice, not a necessity, and that it is worth paying to ensure compatibility.  Rothenberg’s vision was important at the time: it drew a lot of attention, which in turn drew a lot of funding.  The Internet Archive was started the next year (1996).

He suggested that we look at what happened before 1995, then from 1995 to 2009, and then at the future after 2009.  He likened the story he was about to tell to Sondheim’s “Into the Woods”: happy endings sometimes come in the middle rather than at the end.  (The end is the scary piece.)

Rothenberg’s scenario involved media degradation, media obsolescence, and format obsolescence.

With format migration you would still lose some data.  With emulation, the specification for outdated hardware must be saved in a digital form that is independent of software, which was a previously unsolved problem.

He described the 1995 landscape of hardware, operating systems, and applications, which was very different from today’s.

He said the Web has been the most important change, along with academic publishing moving to the Web.  Rothenberg’s 1995 vision was for documents in off-line form.  Now, if it is worth keeping, it is kept online.  Copy-ability is intrinsic to the online medium, and no one cares what the actual medium storing the bits is.

Microsoft lost the war with users over the way new computers came loaded with the new OS and software, so that people with the old OS and software had to upgrade.  This led to an anti-trust probe, which led to a standard (ODF).

The goal of publishing is to reach as many readers as you can, so incompatibility is now self-defeating.  On a side note, Rosenthal said, “If Google doesn’t index it, then no one will read it.”

Since 1995, virtual machines and virtualization have also changed the picture.  Rothenberg was right about emulation, but preservation isn’t the reason for doing it.  Also, open source wasn’t mainstream then and now it is; it’s a basic strategy now.  A format with an open source implementation is safe, and that implementation is in effect executable preservation metadata.  Therefore, with 20/20 hindsight, we can see that documents can survive online, migration is inherent, formats are standard, and format obsolescence never happens.

Rosenthal cited “Increasing Returns and Path Dependence in the Economy” by W. Brian Arthur (1994), which explains technology adoption habits, and showed a chart on the number of competing markets.

IT markets are subject to capture, and captured markets resist change, so change slows down.  Users migrate to the winner’s format.  [Quote:  “Prediction is very difficult, especially about the future.” Niels Bohr]

The real problems were scale, cost, and intellectual property.  Rothenberg looked at the micro level, but society needs macro-level preservation.  Document-at-a-time preservation is impractical; saving by hand doesn’t work.  The lesson of Google is that there is more value in the connections than in the documents themselves.  This is another instance of Metcalfe’s law: the value of a network grows as the square of the number of nodes on it.  Google’s other lesson is that it’s very expensive.
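As a rough illustration of the Metcalfe's law point, here is a minimal Python sketch. The function names are made up for this example, and "value" is only a relative figure under the assumption that it is proportional to the square of the node count; nothing here comes from the talk itself.

```python
def metcalfe_value(nodes: int) -> int:
    """Relative network value, assuming value is proportional to nodes squared."""
    return nodes * nodes


def pairwise_links(nodes: int) -> int:
    """Number of distinct possible connections among the nodes: n(n-1)/2."""
    return nodes * (nodes - 1) // 2


if __name__ == "__main__":
    # Each tenfold increase in nodes gives a hundredfold increase in relative value.
    for n in (10, 100, 1000):
        print(n, metcalfe_value(n), pairwise_links(n))
```

The number of possible links also grows roughly as the square of the node count, which is one way to read the claim that the connections carry more value than the individual documents.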

The Internet Archive contains 2 Petabytes and is growing at 240 Terabytes/yr.  It costs 50 cents per Gigabyte per year.  Another archive, which stores academic literature, has 50 Terabytes and is growing at 5 Terabytes/yr; however, it costs $10 per Gigabyte per year.

So the question is “How much do we need to save?” – an Exabyte?   The world doesn’t have the funds to save everything.
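To get a feel for the scale, the quoted storage figures can be turned into rough annual budgets. The sketch below assumes decimal units (1 Petabyte = 1,000,000 Gigabytes), assumes cost scales linearly with size, and ignores growth; the function and constant names are invented for this example.

```python
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000
GB_PER_EB = 1_000_000_000


def annual_cost(size_gb: float, dollars_per_gb_year: float) -> float:
    """Yearly cost in dollars, assuming cost is linear in stored Gigabytes."""
    return size_gb * dollars_per_gb_year


# Figures quoted in the notes: 2 Petabytes at $0.50 and 50 Terabytes at $10.
internet_archive = annual_cost(2 * GB_PER_PB, 0.50)
academic_archive = annual_cost(50 * GB_PER_TB, 10.0)

# Hypothetical: an Exabyte stored at the cheaper of the two rates.
exabyte_at_cheap_rate = annual_cost(1 * GB_PER_EB, 0.50)

if __name__ == "__main__":
    print(f"Internet Archive: ${internet_archive:,.0f} per year")
    print(f"Academic archive: ${academic_archive:,.0f} per year")
    print(f"One Exabyte:      ${exabyte_at_cheap_rate:,.0f} per year")
```

Under these assumptions the two archives cost about $1 million and $500,000 per year, and an Exabyte at the cheaper rate would run to about $500 million per year.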

Intellectual property means that all content has a business model; Rosenthal asked us to please use Creative Commons licenses.


  • Formats – open source is safe
  • Metadata for docs – hand generated is too expensive, program generated is better.
  • Look at services not documents
  • Preservation implies static – but links would go bad, so it cannot be static.

The big problem would be preserving the world the way it is now.

Rosenthal mused on things worth preserving.

  • User generated content [think 2008 election – blogs, YouTube, etc.]
  • What about multi-player games & virtual worlds?
  • Dynamic databases & links to them.

The 2008 preservation buzzword was sustainability.  Preservation in the future will be more difficult and expensive.  Bytes are vulnerable to money supply glitches - more so than paper in the past.  Collection development is critical:  what must be kept?  We need to watch for scaling problems. 

We need to make sure that society has a digital replacement: a fixed, tamper-evident record.  Rosenthal referred to the web as Winston Smith’s dream machine [1984]: “point & click history rewriting.”
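One common way to make a record tamper-evident is a hash chain, where each entry's digest also covers the digest before it. The Python sketch below is only an illustration of that general idea, not something presented in the talk; the function names and the all-zero starting value are assumptions made for the example.

```python
import hashlib

# Assumed starting value for the chain; any fixed, agreed string would do.
GENESIS = "0" * 64


def entry_digest(prev_digest: str, content: bytes) -> str:
    """SHA-256 over the previous digest followed by this entry's content."""
    h = hashlib.sha256()
    h.update(prev_digest.encode("ascii"))
    h.update(content)
    return h.hexdigest()


def build_chain(documents):
    """Return one digest per document, each depending on all earlier ones."""
    chain = []
    prev = GENESIS
    for content in documents:
        prev = entry_digest(prev, content)
        chain.append(prev)
    return chain


def first_mismatch(documents, chain):
    """Index of the first document that no longer matches the chain, or None."""
    prev = GENESIS
    for i, (content, recorded) in enumerate(zip(documents, chain)):
        prev = entry_digest(prev, content)
        if prev != recorded:
            return i
    return None


if __name__ == "__main__":
    docs = [b"first edition", b"second edition", b"third edition"]
    chain = build_chain(docs)
    rewritten = [b"first edition", b"quietly rewritten", b"third edition"]
    print(first_mismatch(docs, chain))
    print(first_mismatch(rewritten, chain))
```

This only makes tampering evident if the recorded digests (or at least the latest one) are kept somewhere the rewriter cannot also change.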

Practical next steps:

  • Everyone needs to collect the bits – it is not hard or costly to do a “good enough” job – and use Creative Commons licenses.
  • Preserve open source repositories – they are easy & vital – no legal, tech, or scale barriers
  • Support open source renderers & emulators
  • Support research into preservation technology
    • How to preserve bits adequately and affordably – need it cheaper and more reliable
    • How to preserve this decade’s dynamic web of services – not just last decade’s static web of pages
    • How to save the dynamic web of now and the future…
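For the "collect the bits" and "preserve bits adequately" items above, one simple, inexpensive check is to keep several copies and compare their checksums. The Python sketch below is an assumption-laden illustration, not a description of any particular system's protocol: the function names are hypothetical, and it treats a strict majority of matching digests as the good copy.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of one copy's bytes."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(replicas: dict[str, bytes]) -> list[str]:
    """Return the names of copies that disagree with the strict-majority digest."""
    if not replicas:
        raise ValueError("no replicas to audit")
    digests = {name: sha256_hex(data) for name, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        # Without a strict majority there is no basis for picking the good copy.
        raise ValueError("no majority digest; cannot tell which copies are damaged")
    return sorted(name for name, digest in digests.items() if digest != winner)


if __name__ == "__main__":
    copies = {
        "site-a": b"the preserved document",
        "site-b": b"the preserved document",
        "site-c": b"the preserved d0cument",  # simulated bit damage
    }
    print(audit_replicas(copies))
```

A check like this detects damage but does not repair it; repair would mean copying the majority version back over the flagged copies.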


Market forces are driving document formats.  A questioner said she was not seeing that in all markets, e.g., CAD.

At some level it just doesn’t map; thus we need to keep the bits and the format, and expect that emulation will work.