Net Searchers to Index All 800 Million Pages

TIMES STAFF WRITER

Stung by criticism that search engines have fallen hopelessly behind in indexing the 800 million pages of the World Wide Web, several search companies have launched themselves on a Herculean effort to scan and review the entire expanse of cyberspace.

Excite@Home, which operates Excite, the third-most-popular search engine, announced plans Monday to scan the Web in its entirety using new technology to be deployed in the next few weeks. Excite has so far indexed only about 50 million pages.

But some critics suggest that all this effort may be nothing more than a massive waste of resources--essentially a marketing scheme that will mean little to the average user and may even be counterproductive by vastly expanding the number of irrelevant results on a search request.

“What does it mean to have another 100,000 or 200,000 links show up in a search?” asked Jakob Nielsen, co-founder of Nielsen Norman Group, a Web usability consulting firm. “It is 100% irrelevant.”

Still, the push to become the biggest search engine in cyberspace has already begun to gain momentum, driving a variety of companies into the fray.

“The whole idea of bigger is better is back with a vengeance,” said Danny Sullivan, editor of SearchEngineWatch.com, a London-based online publication dedicated to the search industry.

Norwegian search engine company Fast announced Monday that it plans to catalog all of the Web within the next year. The company also claims to be the current index champion at more than 200 million Web pages.

Inktomi Corp., which produces one of the most widely used search engines on the Internet, said that it too has begun to feel the pressure to keep up.

“We’ve seen a resurgence of the idea: big, big, big,” said Kevin Brown, director of marketing for Inktomi. “Relevance of results is still the leading issue, but we intend to grow our index substantially too.”

While part of the movement may be just an effort to gain bragging rights in a highly competitive industry, the current arms race between search engine companies touches on a Holy Grail of the Internet--cataloging the entirety of humanity’s online knowledge.

So far, the search engines have done miserably at the task. A study by scientists at the NEC Research Institute found that even the best search engine today covers no more than 16% of all Web pages.

The study, published in July in the journal Nature, raised the unsettling question of whether the Internet could actually represent a step backward in the distribution of knowledge, with more information being lost than gained because the search engines cannot keep up.

The scientists found that most search engines index less than 10% of the Web. Even by combining the efforts of all the search engines, only 42% of the Web had been indexed.

Kris Carpenter, director of search products and services for Excite, said she believes that most consumers still do not want all 800 million pages of the Web--a large percentage of which consist of vanity sites or extremely obscure data.

But she added that it has become more important to at least scan the entire Web so the search engines can make better decisions on what is important.

Most search engines use programs known as “spiders” to search out new Web sites and monitor those that already have been indexed.
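
In outline, a spider is little more than a loop that fetches a page, pulls out its links and queues the ones it has not seen before. The Python sketch below illustrates the idea; the seed URL, page limit and crawl delay are illustrative assumptions, not any search company’s actual configuration.

```python
# A minimal sketch of the kind of "spider" described above: it fetches a page,
# extracts its links, and queues unseen pages for a later visit. The page
# limit and politeness delay are illustrative assumptions only.
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=100, delay=1.0):
    """Breadth-first crawl starting from seed_url; returns the URLs seen."""
    seen = {seed_url}
    queue = deque([seed_url])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to load
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)  # be polite to the servers being crawled
    return seen
```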

Excite now uses fewer than 10 spiders to cover the Internet, but with its new technology, it will begin deploying dozens--each capable of covering up to 35 million pages a day.

Currently, spiders index virtually all of the pages they visit. Carpenter said the new system will distill the entire Web down to about 250 million pages that meet automated standards, such as being widely linked to by other pages, and will return search results only from that smaller list.
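
Carpenter’s description amounts to an automated filter keyed to how widely a page is linked. The Python sketch below shows one such rule, keeping only pages with a minimum number of distinct inbound links; the threshold and the input format are assumptions for illustration, not Excite’s actual criteria.

```python
# A rough sketch of the kind of automated filter described above: given a map
# of page -> outbound links gathered by the spiders, keep only pages that are
# linked to from at least `min_inbound` other pages. The threshold and input
# format are assumptions for illustration only.
from collections import Counter


def widely_linked_pages(link_graph, min_inbound=5):
    """Return the set of pages with at least min_inbound distinct inbound links."""
    inbound = Counter()
    for source, targets in link_graph.items():
        for target in set(targets):   # count each linking page once
            if target != source:      # ignore self-links
                inbound[target] += 1
    return {page for page, count in inbound.items() if count >= min_inbound}


# Example: a tiny crawl in which only "hub.example" is linked from many pages.
graph = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example", "a.example"],
    "c.example": ["hub.example"],
    "d.example": ["hub.example"],
    "e.example": ["hub.example"],
}
print(widely_linked_pages(graph, min_inbound=5))  # {'hub.example'}
```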

“You can have an exponential increase in the amount of known content,” Carpenter said. “Visiting the whole Web is what makes the difference.”

Nielsen countered that all this talk of growing huge indexes makes no sense given the current state of search engines, which already tend to overload users with hundreds, if not thousands or tens of thousands, of useless Web sites.

“The only thing that matters is the top 10 links you get back,” he said. “Maybe you click on the next page of results, but that’s rare.”

Nielsen said the problem with automated search engines isn’t so much their reach as their inability to make even the most basic decisions about what is relevant, important and worthwhile.

Humans can do this easily, as shown by the growing popularity of directories--categorized lists of Web sites selected by human editors.

Nielsen said that search engine companies went through a “bigger-is-better” phase once before, until the Web became unwieldy. They temporarily abandoned that approach when it became clear that consumers simply wanted better results, not more of them.

Brown of Inktomi said that having huge databases can even make things worse by diluting a pool of good Web sites with hundreds of millions of obscure pages.

“Without improved relevancy, you can hurt the user,” he said. “Honestly, when you’re above 100 million pages, there’s just so much there already.”

* THURSDAYS IN CUTTING EDGE: Your Internet Guide spotlights recommended Web sites.
