Internet in Daily Life: Preserving the Net - 2003-03-08

October 30, 2009 1:38 PM

A feast of information is available on what are called "web pages" on the Internet. The computer search engine "Google" estimates there are now 10 billion different sites filled with these cyberpages. By many accounts, half will be gone by year's end, replaced by other websites. But public and private efforts are underway to save much of this digital material, just as libraries shelve books and keep films in storage.

One dramatic initiative to preserve America's digital heritage has just been announced by the Library of Congress, the world's largest library.

It receives thousands of donated digital items, many stored on discs called CD-ROMs. And as the nation's copyright agency, the Library is sent thousands of websites from owners wanting to copyright them. But until now it has not archived much digital information produced by outsiders. And the only digital information the Library has made available on the Internet is its own material, on the Library of Congress website.

Associate Librarian of Congress Laura Campbell says that, with a $100 million appropriation from Congress, it will now begin building a national digital library, parts of which the entire world will be able to access. "There are going to be multiple ways in which we retain material because websites are different. Some are more interactive and dynamic than others," she says. "Some have links to deeper databases. So we will retain material, both as a snapshot or for cultural heritage purposes to say, take a look at the Internet, for instance at a certain period of time. We'll also retain material by subject matter. Once you capture it, that's only the beginning. How do you migrate it to a future life?"

Landphair: "How do you even know what the technologies will be twenty years from now?"
Campbell: "You know they will change. That, you can count on. And you can count on that being a rapid process."

Other big research libraries and some corporations have been saving copies of their own websites for years. And Google offers its users what it calls its "cache" option, a stored copy of each web page as it appeared when Google last indexed it. This is useful to users when web pages they've been looking for are withdrawn from public view.

For instance, earlier this month [March 2003] when Internet users went to the website of the heavy metal band "Great White," they found that the site had been withdrawn after the band's involvement in a deadly night-club fire in Rhode Island. But a search of Google's cached pages produced several pages of the band's earlier sites and related links.

Since 1996, a nonprofit organization called The Internet Project, based in San Francisco, has been going Google one better. Each month, it "crawls," or makes digital copies of, more than one billion web pages on 15 million websites for historical reference. The Internet Project will be one of the partners providing material to the Library of Congress.

Brewster Kahle is "digital librarian" at the Internet Project. He puts together its online cyber-library of web pages called the "Way Back Machine." Unlike the Google cache, which is tapped primarily by casual Internet users, the Way Back Machine is designed for researchers, historians, and other scholars.

Kahle: "Every two months, there's a full sweep of all the publicly accessible websites, except those that have indicated they don't want themselves saved."
Landphair: "Does this take just an enormous, Univac-size computer (a room-sized, 1960s-era computer) to do this sweep?"
Kahle: "Yes. It's made up of many, many personal computers. So that's the new Univac - racks and racks of simple Linux machines. And the current Linux machines can store about one terabyte each. If you take a book, the letters in a book, it's about a megabyte. A million megabytes is a terabyte. So a million books [worth of space] is a terabyte."

Mr. Kahle says digital libraries like the Way Back Machine, which archives web pages going back five or six years, are needed so people cannot hide from, or change, the past. They provide an irrefutable record that can be studied.

Kahle: "Now that publishing has moved onto the Internet, we need the library. The average life of a web page is one hundred days. Some of the best works that are being created right now are accessible and only accessible on the Web. And if we don't actively save them, they'll be gone forever."
Landphair: "The sites and the pages that you preserve, do they include, for instance, adult, pornographic sites?"
Kahle: "Yes. We try to be inclusive of all publicly available websites."

Like many creative products, web pages are the intellectual property of their creators, who place them in public view on the Internet. There, they realize, anyone and everyone is free to look at them. And it's hardly surprising that people copy material from these sites, even though copyright warnings tell them not to.

In this vein, website creators like Philippa Gamse, who is an e-business strategist in California's Silicon Valley, ask an interesting question. "Site owners, in fact including myself, have discovered previous incarnations of their site being displayed on these archives. There could well be a lot of times when you don't want this to happen, because something has changed about your site, about your business, such that you really don't want to be publicly saying what you used to say," she says. "And occasionally, you know, it happens that something on your site turns out to be illegal, and you may not have known it when you put it up. And my concern is that if your site's being archived out there without your knowledge, then there could be things that are being archived that you really wouldn't want to be preserved."

The Internet Project's Brewster Kahle says the Way Back Machine will immediately delete from its digital archive any website whose owner does not want it to appear. As for the Library of Congress, all digital material that arrives via the copyright process and is selected for archiving will be available to researchers who visit its reading rooms in Washington. But Laura Campbell says it will seek website owners' permission before putting stored web pages on the Internet.

There's one fitting epilogue to the phenomenon of Internet archiving. Librarians learned a bitter lesson from the total destruction, over 400 years of fires and warfare, of antiquity's greatest library, in Alexandria, Egypt. The lesson is that all holdings should be backed up and shared. The Internet Project does just that. And one of the copies of its periodic scans of millions of Internet pages is specifically sent to the new library in Alexandria.

For anyone interested, the web address for the Way Back Machine digital library is www.archive.org.

Landphair report - Internet in Daily Life: Preserving the Net - feature - rm - 07mar03

Links

Way Back Machine - Internet Archive
