Net Archive Turns Back 10 Billion Pages of Time

TIMES STAFF WRITER

An Internet archive containing more text than any library in history will open its digital doors today, giving researchers and the public access to just about everything posted on the World Wide Web over the last five years.

The free archive, created by a San Francisco computer entrepreneur named Brewster Kahle, allows academics to conduct the electronic equivalent of archeological digs, rooting through reams of material illustrating the evolution of the Web and its role in American society.

The Internet Archive, informally called the Wayback Machine, holds more than 10 billion Web pages dating to 1996, including millions that had vanished as dot-coms collapsed, big companies scaled back or updated their offerings, and hobbyist Webmasters lost interest.

Researchers and academics have likened Kahle to a modern-day Andrew Carnegie, the steel baron who endowed many of the nation’s finest libraries.

“Libraries are dedicated to collecting and making available the permanent historical record,” said Diane Kresh, the Library of Congress’ director for public service collections. She said trolling the Net is as significant as gathering books or periodicals.

Want to see what the Heaven’s Gate cult page looked like before the group’s mass suicide? There it is. Want to see how Yahoo’s pages have changed since 1996? Step this way. Pages published by everyone from Fortune 500 companies to renegade porn merchants are stashed in the Internet Archive.

The five-year, multimillion-dollar project has amassed five times as much text as the Library of Congress, which helped fund the archive along with Compaq Computer Corp., the National Science Foundation and the Smithsonian Institution. The archive's data, more than 100 terabytes in all, are housed on 300 modified Hewlett-Packard desktop computers in a basement at San Francisco’s Presidio.

The effort to record Internet history has been directed and largely financed by Kahle, a 41-year-old former supercomputer technologist who sold one Web firm to America Online and another to Amazon.com.

“The opportunity of our time is to offer universal access to all of human knowledge,” Kahle said Wednesday from his office in the Presidio, a decommissioned military base near the Golden Gate Bridge. “We’re at a unique point in time to offer universal access to anyone who walks into a library in Uganda.”

The Internet Archive uses automated “bots” to scour the Web. They capture sites and return what they find to the computers at the Presidio. The archive updates every two months. Once captured, the sites are organized chronologically. Users type in a Web address, and the archive displays versions of that site since 1996.

Sites that require passwords or block bots are not captured. And if someone objects to their site being copied, the archive removes it.
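For readers who want a concrete picture of the process described above, here is a minimal sketch in Python. It is not the archive's actual software: the class SnapshotIndex, the function may_capture and the bot name "example-archive-bot" are hypothetical names chosen for illustration. The sketch assumes that a site "blocks bots" through a standard robots.txt file, and it keeps each address's captures in date order so a lookup returns the versions chronologically.

```python
from bisect import insort
from datetime import date
from urllib.robotparser import RobotFileParser


def may_capture(robots_txt, url, agent="example-archive-bot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`.

    Assumption: "blocking bots" means a robots.txt exclusion rule.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class SnapshotIndex:
    """Toy in-memory index mapping a Web address to its dated captures."""

    def __init__(self):
        self._captures = {}  # url -> list of (capture date, page content)

    def add(self, url, captured_on, content):
        # insort keeps each address's list sorted by capture date.
        insort(self._captures.setdefault(url, []), (captured_on, content))

    def versions(self, url):
        # Capture dates for one address, oldest first.
        return [captured_on for captured_on, _ in self._captures.get(url, [])]

    def remove(self, url):
        # Drop every capture of a site whose owner objects.
        self._captures.pop(url, None)


if __name__ == "__main__":
    index = SnapshotIndex()
    rules = "User-agent: *\nDisallow: /private/"
    for when in (date(2001, 3, 1), date(1996, 11, 5)):
        page = "http://example.com/index.html"
        if may_capture(rules, page):
            index.add(page, when, "<html>...</html>")
    print(index.versions("http://example.com/index.html"))
```

Typing an address into the real service corresponds roughly to the versions call here; the remove method mirrors the policy of taking a site down when its owner objects.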

As smaller, less accessible versions of the archive were being compiled, Kahle’s 30 staffers got a few complaints. After the staff explained that it wasn’t personal, that they were copying everyone’s sites, the vast majority decided they didn’t mind, Kahle said.

“Most people say, ‘You’re crazy, but go for it,’ ” Kahle said. “People want to be part of history.”

Likely users of the service, at web.archive.org, include academics, journalists and researchers.

“It will allow researchers to study the evolution of the Web in a way that is unprecedented,” said research scientist Ed Chi of the Xerox Palo Alto Research Center. He said Xerox PARC scientists already are working on new user interfaces based on what the archive showed them about how people looked for information.

Early on, “we suspect people will go look for their own pages and see if they can get copies of things that they’ve lost,” Kahle said. “We’re not exactly sure how this is going to be used. We’re looking forward to being surprised.”

Like many Internet pioneers, however, Kahle faces unfamiliar risks along with the opportunities. The Internet Archive may be a massive violation of copyright law.

“Brewster is taking an extraordinarily personal risk, because this is potentially a criminal offense,” said Lawrence Lessig, an expert on intellectual property in cyberspace at Stanford University.

Kahle doesn’t anticipate getting sued, let alone serving jail time. His plan is to post whatever he can--and keep the archive growing.

“We’re not here to test laws,” Kahle said. “We’re trying to build a world we want to live in. The world without a library is a world without a memory, and that would be tragic.”

The legal questions may take years to resolve, Kahle and Lessig said.

Consider the Industry Standard. At least some of that defunct magazine’s articles are back online through Kahle’s archive. But shareholder IDG paid more than $1 million for the Standard’s assets, including rights to those stories. An IDG spokeswoman declined to say whether the company would ask the archive to drop the articles.

Kahle said he isn’t worrying about the hypotheticals. He’s more excited about finding early www.whitehouse.gov pages from 1996 that dealt with airport safety and bioterrorism.

Even better is what’s to come.

“The woman who is going to be elected president in 2024 is in high school now, and I bet she has a home page,” Kahle said. “We have the future president’s home page!”
