Archiving the Internet: Some See Noble Experiment, Others Fear Excess of Trivia

Historians often struggle with a dearth of firsthand accounts of important events, cultural changes or people from past centuries. The study of modern times often presents the opposite problem--researchers are quickly buried in an avalanche of data, held back only by government and corporate secrecy.

How would historiography change if we could save everything--every document, snapshot, video clip, song or public utterance that anyone chose to preserve--in a form that can be quickly searched, sorted and understood?

The prospect is not farfetched. Search engines are already gearing up to scour the entire World Wide Web, something they didn’t even attempt a few months ago. The processing, digital storage and database technologies to manage archives of unprecedented scope--Web-sized data sets--are improving at an exponential pace.

To be sure, experts say that we are still far from archiving the entire Web; few even think such a venture would be desirable.

But voluminous digital archives are already eclipsing anything that preceded them. They pose profound yet puzzling implications for our collective memory.

The Internet Archive, a San Francisco-based nonprofit, holds its first-ever conference this week to open a dialogue about the challenge and majesty of building the largest informational inventory of all time. The group’s collection already totals 1.2 billion pages, and 120 million pages are added each week. At this rate, it will soon eclipse the volume of book holdings in the Library of Congress.

The collection consists of donated Web “snapshots”--static moments in Web time--captured by Alexa Internet, a search company owned by Amazon.com. It consists only of the public portions of the Internet.

The idea of an Internet archive seems almost an oxymoron, given that the estimated life span of an average Web page is three to six months.

Why create a lasting record of the Internet equivalent of bar chat?

“One of the most important communications media of our time is being born and is disappearing,” said Peter Lyman, a professor of information management at UC Berkeley and a board member of the Internet Archive. The Web’s early history is being erased, just as many television programs of the 1950s were lost when producers reused then-precious videotapes.

Brewster Kahle, co-founder of both Alexa and the Internet Archive, points out that the greatest libraries in human history have collected only a tiny fraction of what humanity has to offer. Alexa’s namesake, the Library of Alexandria, Egypt, compiled about 500,000 scrolls between about 300 BC and AD 400. Today’s Library of Congress, with some 17 million volumes, boasts only 34 times the holdings. The public portions of the Web are already vastly larger.

“We are witnessing a situation that is not any different from any other field of science,” said Bernardo Huberman, an Internet researcher at the Xerox Palo Alto Research Center.

Why do botanists try to catalog every plant on Earth? he asks. Even if the purpose of a comprehensive Internet archive can’t be articulated clearly, experts believe it will prove fertile ground for anthropologists, sociologists and historians.

“The art of archiving is the art of selective forgetting. You really don’t want to remember everything,” said Lyman. “Yet one generation’s decisions would be questioned by another generation.”

He offers this example: Newspaper ads from the 1920s were saved only because libraries archived the full publications. Decades later, scholars found the ads a much better source of insight into fashion and cultural trends of the time than articles in those papers, which viewed such matters as too commonplace to cover intensively.

With a cheap scanner you can now digitize every charming scrawl of your 2-year-old and put it up on your personal Web site. Too trivial for the Internet Archive? How do you know that little Jennifer won’t be a famous artist in 25 years whose early genius will be sought after by art students and collectors?

The beauty of being thorough is that a Web archive could be the most democratic lasting record in history. The Library of Alexandria didn’t save scrolls of a slave’s poetry.

Yet a comprehensive archive offers its own perils.

Some experts wonder whether the very creation of the mother of all databanks could distort the impact of the information that it holds--burying golden nuggets under a mountain of brass, or building false confidence.

The Web may already be leading to a distorted sense of the historical record--at least on the part of college students--said Richard Cox, a professor of archiving at the University of Pittsburgh. Students think they have the world at their fingertips on the Web, he said, “but [it offers] only a fraction of the information that’s out there on any topic or issue.”

And Lyman notes that the mere collection of certain kinds of data tends to distort its meaning by placing it out of context.

“With e-mail and chat, people’s working assumption is that they disappear the same way words disappear--they carry for 20 feet and fade out,” he said. “It’s a mistake not to let some things be ephemeral.”

And the desire to keep everything looks a lot like technological narcissism in light of the challenge of storing all those bits. Consider that books in the Library of Congress might last 500 years before crumbling to dust, while storage techniques and operating systems needed to compile an Internet archive can become obsolete in a hundredth of that time. Without continual transfer to new formats, the archive would be useless.

Still, the grand experiment of capturing the Internet as we know it today seems worth a try--a noble effort that shows the spirit of the Internet beyond today’s hype-ridden gold rush.

*

Times staff writer Charles Piller can be reached at charles.piller@latimes.com.
