Net Searchers to Index All 800 Million Pages

TIMES STAFF WRITER

Stung by criticism that search engines have fallen hopelessly behind in indexing the 800 million pages of the World Wide Web, several search companies have launched themselves on a Herculean effort to scan and review the entire expanse of cyberspace.

Excite@Home, which operates Excite, the third-most-popular search engine, announced plans Monday to scan the Web in its entirety using new technology to be deployed in the next few weeks. Excite has so far indexed only about 50 million pages.

But some critics suggest that all this effort may be nothing more than a massive waste of resources--essentially a marketing scheme that will mean little to the average user and may even be counterproductive by vastly expanding the number of irrelevant results on a search request.

“What does it mean to have another 100,000 or 200,000 links show up in a search?” asked Jakob Nielsen, co-founder of Nielsen Norman Group, a Web usability consulting firm. “It is 100% irrelevant.”

Still, the push to become the biggest search engine in cyberspace has already begun to gain momentum, driving a variety of companies into the fray.

“The whole idea of bigger is better is back with a vengeance,” said Danny Sullivan, editor of SearchEngineWatch.com, a London-based online publication dedicated to the search industry.

Norwegian search engine company Fast announced Monday that it plans to catalog all of the Web within the next year. The company also claims to be the current index champion at more than 200 million Web pages.

Inktomi Corp., which produces one of the most widely used search engines on the Internet, said that it too has begun to feel the pressure to keep up.

“We’ve seen a resurgence of the idea: big, big, big,” said Kevin Brown, director of marketing for Inktomi. “Relevance of results is still the leading issue, but we intend to grow our index substantially too.”

While part of the movement may be just an effort to gain bragging rights in a highly competitive industry, the current arms race between search engine companies touches on a Holy Grail of the Internet--cataloging the entirety of humanity’s online knowledge.

So far, the search engines have done miserably at the task. A study by scientists at the NEC Research Institute found that even the best search engine today covers no more than 16% of all Web pages.

The study, published in July in the journal Nature, raised the unsettling question of whether the Internet could actually represent a step backward in the distribution of knowledge, with more information being lost than gained because the search engines cannot keep up.

The scientists found that most search engines index less than 10% of the Web. Even by combining the efforts of all the search engines, only 42% of the Web had been indexed.

Kris Carpenter, director of search products and services for Excite, said she believes that most consumers still do not want all 800 million pages of the Web--a large percentage of which consist of vanity sites or extremely obscure data.

But she added that it has become more important to at least scan the entire Web so the search engines can make better decisions on what is important.

Most search engines use programs known as “spiders” to search out new Web sites and monitor those that already have been indexed.
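
In outline, a spider is little more than a loop that fetches a page, pulls out its links and queues the ones it has not seen before. The Python sketch below illustrates the idea; the seed URL, page limit and crawl delay are illustrative assumptions, not any search company’s actual configuration.

```python
# A minimal sketch of the kind of "spider" described above: it fetches a page,
# extracts its links, and queues unseen pages for a later visit. The page
# limit and politeness delay are illustrative assumptions only.
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=100, delay=1.0):
    """Breadth-first crawl starting from seed_url; returns the URLs seen."""
    seen = {seed_url}
    queue = deque([seed_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to load
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)  # be polite to the servers being crawled
    return seen
```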

Excite now uses fewer than 10 spiders to cover the Internet, but with its new technology, it will begin deploying dozens--each capable of covering up to 35 million pages a day.

Currently, spiders index virtually all of the pages they visit. Carpenter said the new system will distill the entire Web down to about 250 million pages that meet automated standards, such as being widely linked to by other pages, and will return search results only from that smaller list.
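
Carpenter’s description amounts to an automated filter keyed to how widely a page is linked. The Python sketch below shows one such rule, keeping only pages with a minimum number of distinct inbound links; the threshold and the input format are assumptions for illustration, not Excite’s actual criteria.

```python
# A rough sketch of the kind of automated filter described above: given a map
# of page -> outbound links gathered by the spiders, keep only pages that are
# linked to from at least `min_inbound` other pages. The threshold and input
# format are assumptions for illustration only.
from collections import Counter


def widely_linked_pages(link_graph, min_inbound=5):
    """Return the set of pages with at least min_inbound distinct inbound links."""
    inbound = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):   # count each linking page once
            if target != source:      # ignore self-links
                inbound[target] += 1
    return {page for page, count in inbound.items() if count >= min_inbound}


# Example: a tiny crawl in which only "hub.example" is linked from many pages.
graph = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example", "a.example"],
    "c.example": ["hub.example"],
    "d.example": ["hub.example"],
    "e.example": ["hub.example"],
}
print(widely_linked_pages(graph, min_inbound=5))  # {'hub.example'}
```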

“You can have an exponential increase in the amount of known content,” Carpenter said. “Visiting the whole Web is what makes the difference.”

Nielsen countered that all this talk of growing huge indexes makes no sense given the current state of search engines, which already tend to overload users with hundreds, if not thousands or tens of thousands, of useless Web sites.

“The only thing that matters is the top 10 links you get back,” he said. “Maybe you click on the next page of results, but that’s rare.”

Nielsen said the problem with automated search engines isn’t so much their reach as their inability to make even the most basic decisions about what is relevant, important and worthwhile.

Humans can do this easily, as shown by the growing popularity of directories--categorized lists of Web sites selected by human editors.

Nielsen said that search engine companies went through a “bigger-is-better” phase once before, until the Web became unwieldy. They temporarily abandoned that approach when it became clear that consumers simply wanted better results, not more of them.

Brown of Inktomi said that having huge databases can even make things worse by diluting a pool of good Web sites with hundreds of millions of obscure pages.

“Without improved relevancy, you can hurt the user,” he said. “Honestly, when you’re above 100 million pages, there’s just so much there already.”

* THURSDAYS IN CUTTING EDGE: Your Internet Guide spotlights recommended Web sites.
