
Toward Universal Access to All Knowledge - NOTES from a talk by Brewster Kahle

This presentation was recorded for podcast and is available at

Speaker:  Brewster Kahle

Director and Cofounder, Digital Librarian, and Chairman of the Board, Internet Archive

Notes –

Kahle gave a fabulous talk at the Western Regional Conference on the topic of universal access to all knowledge. He prefaced his remarks by saying that the line between IT and libraries is blurring as their content layers begin to merge.

He looked at what people have carved in stone. At the Boston Public Library it is “Free to All.”

His presentation covered the scope of, and the issues around, universal access to all knowledge.

Content -

Library of Congress (LOC): 26 million books. At roughly 1 MB per book, all the works in the LOC total about 26 terabytes. At current storage costs we can afford 26 TB, so making this quantity of information available is within our grasp.
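The storage estimate above is simple arithmetic, and a minimal sketch makes the assumptions explicit. The book count and the 1 MB per book come from the talk; the use of decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) and the function name are assumptions made here for illustration.

```python
# Back-of-envelope storage estimate using the figures quoted in the talk.
BOOKS_IN_LOC = 26_000_000     # books in the Library of Congress (from the notes)
BYTES_PER_BOOK = 1_000_000    # "a book is a MB" (assumed: decimal megabyte, text only)
BYTES_PER_TERABYTE = 10**12   # assumed: decimal terabyte


def total_terabytes(books: int, bytes_per_book: int) -> float:
    """Return the combined size of `books` books in decimal terabytes."""
    return books * bytes_per_book / BYTES_PER_TERABYTE


loc_terabytes = total_terabytes(BOOKS_IN_LOC, BYTES_PER_BOOK)
print(f"{loc_terabytes:.0f} TB for {BOOKS_IN_LOC:,} books")
```

The estimate scales linearly, so doubling the assumed size per book (for example, to allow for markup) doubles the total.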

He said that we need to be able to search inside books with search tabs - a very “booky” exercise to make the information accessible.

From his international work: the first lady of Egypt “loves her books” and wants to be able to share them widely, which led to a discussion of how hard it would be to make books on demand. [It turns out to cost $3 to loan a book from the Harvard Library.] He looked into creating a bookmaking machine; there are now mobile bookmaking machines in Alexandria (Egypt), Uganda, and elsewhere. The print-on-demand/binding machine costs $100K. Now netbooks, One Laptop Per Child (OLPC) machines, electronic/digital readers, computers, and specialized devices are coming around and making digital materials more accessible.

Kahle talked about the Million Book Project: in India, 600K books have been scanned to date; in China, more than 1 million are scanned now, and they are headed for 3 million, including 50K Arabic texts.

They designed a new scanner, have established 18 scanning centers in 5 countries, and are scanning 1K books a day. The cost is down to 10 cents per page, which includes maintaining the text online forever. [They have 1.3 million books now, and it costs $30 million to get a million books online: the same $3 per book as it costs to loan a book from the Harvard Library.]
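The scanning figures can be cross-checked with a short calculation. This is a sketch using only the numbers in the notes; the function names are hypothetical. Note that $30 million per million books works out to $30 per book (about 300 pages at 10 cents per page), which is ten times the $3 lending figure quoted in the bracketed note, so one of the recorded numbers may be off by a factor of ten. The sketch does not settle which.

```python
# Cross-check of the scanning costs quoted in the notes.
CENTS_PER_PAGE = 10                      # "10 cents/page" (from the notes)
DOLLARS_PER_MILLION_BOOKS = 30_000_000   # "$30 million to get a million books online"
LOAN_DOLLARS_PER_BOOK = 3                # Harvard lending cost as recorded in the notes


def dollars_per_book(total_dollars: int, books: int) -> float:
    """Average cost in dollars of putting one book online."""
    return total_dollars / books


def implied_pages_per_book(book_dollars: float, cents_per_page: int) -> float:
    """Page count implied by a per-book cost and a per-page cost."""
    return book_dollars * 100 / cents_per_page


per_book = dollars_per_book(DOLLARS_PER_MILLION_BOOKS, 1_000_000)
pages = implied_pages_per_book(per_book, CENTS_PER_PAGE)
ratio = per_book / LOAN_DOLLARS_PER_BOOK

print(f"${per_book:.2f} per book, implying {pages:.0f} pages per book")
print(f"{ratio:.0f}x the quoted lending cost")
```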

Microfiche and microfilm collections - we now have mechanisms to get them online as well and so can add newspapers/magazines soon.

What they have:

There are 8 collections – from children’s books to Arabic texts


There are 2-3 million recordings which is ‘just not that big’ (though heavily litigated) so, he asks:  How do we do this without going to jail?  They collect on the edges from those not interested in money.  For example, the Grateful Dead encouraged their fans to record their concerts and these are all available digitally now.

They offer unlimited storage, unlimited bandwidth, forever, for free to musicians and get 2-3 bands a day, 40 songs a day, and have 30K concerts and more.   Kahle suggests that it shouldn’t cost you to give something away and noted that anyplace else you get a reward for giving something away – but not on the Internet. 

Open audio essentially costs $10 per disk /$10 per hour.

They now have 200K items in over 100 collections

Moving Images

When we think of movies we think of Hollywood films, of which there are only 150-200K. However, there are about 1000 in the public domain. The more popular are the training and propaganda films from the past. They are rabidly popular, used as “kitsch” things. YouTube is now doing 99.9% of these [Lego film community, political debates, etc.].

Television (400 channels of original programming): they began simply recording onto hard drives 24x7, and now we can quote and critique. They work with the LOC and others to prioritize; for example, Sept. 11th materials are now saved and available, so we can see the different points of view from around the world.

Video costs $15 per video

It is possible to do all of these.

The ultimate issue is saving software. Many think of this as saving the software boxes rather than the software itself, but it is the software that needs to be saved.

Kahle said “Good thinking starts with good passion…” and it is clear that he and his organization are passionate about what they do.


They are capturing 4 billion pages every two months – and doing more archiving now – all to build collections.  Their mantra is that “If you don’t grab it – it might go away.” 

The Internet Archive and the Wayback Machine both have a rapid takedown process for inappropriate materials that are captured.

They have 4.5 petabytes of materials and serve 500 hits per second. They are the largest single database.

Preservation:  in 1997 they had 2TB which were moved in 1999 to tape-robots and later to discs.  He noted that there is loss and risk with these “moves”   

He noted that the great library of ancient Alexandria burned down, and only a handful of important ancient texts are left today.

Today they make digital copies.  In 2002 the Bibliotheca Alexandrina began keeping a copy of the Internet Archive.

Our history is now in digital form.

Another of their data centers is in the heart of San Francisco.  They had 2 PB in 2008.

The next-generation machine is a 20x8x8 box: a Wayback Machine holding 3 PB of materials that sits outside. Just plug in cold water, power, and computers. It has low capital costs.

[At this point Kahle noted that human error drives data loss.]

Tomorrow the technology may look like the Jedi library in Star Wars.

An independent group in Amsterdam is starting another digital repository similar to the Internet Archive.

Digital collections:

  • When it’s about Cultural interests it’s “mine”
  • Bio/science on the other hand is sharing so it’s “ours”


  • Catherine Norton – Woods Hole, is data-mining for life expectancy information
  • Luis von Ahn – Carnegie Mellon University is using people as a distributed computer
  • Stephen Wolfram – Mathematica - Read equations from papers – makes sure the computer understands
  • Greg Crane, Tufts - plotted place names from Greek literature against Google maps 

[We are indexing in new ways – but to do so we must have data in bulk]

  • Phil Resnik, University of Maryland – is translating from one book to another – looking at books in different languages
  • Larry Lessig, Stanford – needs access to library catalogs which aren’t really accessible

Open Library: one webpage for every book. We need to use Wikipedia to make a new universal library catalog (Open Library has 22 million records). The LOC has 26 million books, but only 12 million are cataloged. All of it is “Wikipediable.”

Kahle made a Pre-Announcement of an Open Repository for the Commons

  • S3 protocol (Amazon – hosted storage)
  • Zotero Commons: shared scholarly works
  • Fedora compatible
  • Free, open, value added: free OCR, indexing, sharing

Building the library

  • We’re all in the library world now (the plumbing is done, so we have to build it ourselves)
  • Give away everything in the public domain and loan everything else.
  • Encourage research use (bulk downloads)

Our library’s digital transition so far has moved

  • Local control to centralized control
  • Nonprofit to for-profit
  • Diverse to non-diverse
  • Open to closed
  • Google – has lots of terms with constraints
  • Open vs “open”
  • Sharable vs “sharable”

Kahle suggests that we build the library system we’ve dreamed of. He mentioned the Open Content Alliance and said, “Universal access to all knowledge…can be one of our greatest achievements.”

Carved in stone at the Carnegie Library in Pittsburgh – “Free to All People”


Images – please help – working with NASA (without $$) but have permission to add them to the collection.

Info swap with the Bibliotheca Alexandrina

150MB/second - If you are on Internet2 – you can keep up.  (8mb connection to the Internet – most is outbound)  [20% of the Abilene backbone]

Stay away from commercially viable works for now

Periodical literature needs to be saved especially in Biology

Currently they have published cultural data

Need services on top of data – backend bulk is the current corner data sets of the Internet Archive.  For example, rating systems - reviews, stars, downloads, are standard but they are now looking at more of the “if you liked this one you’ll like that one”

The User community is their biggest asset in finding and fixing problems.

Moving Images need work for search-ability and accessibility

Kahle graciously invited attendees to visit their downtown San Francisco location the following day after the last session.  About 30 people attended the tour, had a glass of wine, and took home an Internet Archive glass as a souvenir.

