Most of Web Beyond Scope of Search Sites
If searching the World Wide Web for that one nugget of information already seems like a bad trip into a quagmire of data, Internet researchers have bad news for you--the situation is only getting worse.
Even the most comprehensive search engine today is aware of no more than 16% of the estimated 800 million pages on the Web, according to a study to be published today in the scientific journal Nature. Moreover, the gap between what is posted on the Web and what is retrievable by the search engines is widening fast.
“The amount of information being indexed [by commonly used search engines] is increasing, but it’s not increasing as fast as the amount of information that’s being put on the Web,” said Steve Lawrence, a researcher at NEC Research Institute in Princeton, N.J., and one of the study’s authors.
The findings, which are generally undisputed by the search engine companies themselves, raise the specter that the Internet may lead to a backward step in the distribution of knowledge amid a technological revolution: The breakneck pace at which information is added to the Web may mean that more information is lost to easy public view than made available.
The study also underscores a little-understood feature of the Internet. While many users believe that Web pages are automatically available to the search programs employed by such sites as Yahoo, Excite and AltaVista, the truth is that finding, identifying and categorizing new Web pages requires a great expenditure of time, money and technology.
Lawrence and his co-author, fellow NEC researcher C. Lee Giles, found that most of the major search engines index less than 10% of the Web. Even by combining all the major search engines, only 42% of the Web has been indexed, they found.
The rest of the Web--trillions of bytes of data ranging from scientific papers to family photo albums--exists in a kind of black hole of information, impenetrable to surfers unless they have the exact address of a given site. Even the pages that do get indexed take an average of six months to be discovered by the search engines, Lawrence and Giles found.
Quality, Not Quantity, Some Say
The coverage figures mark a striking decline from those found in a similar study conducted by the same researchers just a year and a half ago.
At that time, they estimated the number of Web pages at about 320 million. The most thorough search engine in that study, HotBot, covered about a third of all Web pages. Combined, the six leading search engines they surveyed covered about 60% of the Web.
But the best-performing search engine in the latest study, Northern Light, covered only 16% of the Web, and the 11 search sites surveyed reached only 42% combined.
While Web surfers often complain about retrieving too much information from search engines, failing to capture the full scope of the Web would mean surrendering one of the most powerful aspects of the digital revolution--the ability to seek out and share diverse sources of information across the globe, said Oren Etzioni, chief technology officer of the multi-service Web site Go2Net and a professor of computer science at the University of Washington.
Etzioni said the mushrooming size of the Web’s audience makes the gulf between what is on the Web and what is retrievable increasingly important.
“There is a real price to be paid if you are not comprehensive,” he said. “There may be something that is important to only 1% of the people. Well, you’re talking about maybe 100,000 people.”
Lawrence and Giles estimated the number of Web pages by using special software that searches systematically through 2,500 random Web servers--the computers that hold Web pages. They calculated the average number of pages on each server and extrapolated to the 2.8 million servers on the Internet.
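The arithmetic behind that extrapolation is easy to reproduce. The Python sketch below works backward from the article's own totals to an assumed average of about 289 pages per server; it is an illustration of the method, not the study's actual sample data.

```python
# Minimal sketch of the sampling extrapolation described above. The
# average of roughly 289 pages per server is inferred from the article's
# totals (800 million pages across 2.8 million servers), not taken from
# the study's raw sample of 2,500 servers.

TOTAL_SERVERS = 2_800_000       # estimated Web servers on the Internet
AVG_PAGES_PER_SERVER = 289      # assumed average from the random sample

estimated_pages = AVG_PAGES_PER_SERVER * TOTAL_SERVERS
print(f"Estimated Web size: {estimated_pages:,} pages")  # ~809 million
```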
By using 1,050 search queries posed by employees of the NEC Research Institute, a research lab owned by the Japanese electronics company NEC, they were able to estimate the coverage of all the search engines, ranging from 16% for Northern Light--a relatively obscure service that ranks 16th in popularity among similar sites--to 2.5% for Lycos, the fourth-most-popular search engine.
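One plausible reading of that query-based method is to pool every page any engine returns for the test queries and measure what share of the pool each engine found. The sketch below illustrates the idea with made-up engine names and result sets; it is not the study's actual procedure or data.

```python
# Illustrative relative-coverage estimate from overlapping query results.
# Engine names and result URLs are hypothetical placeholders.

results_by_engine = {
    "engine_a": {"url1", "url2", "url3", "url5"},
    "engine_b": {"url2", "url4"},
    "engine_c": {"url1", "url4", "url6"},
}

# Pool of every distinct page any engine returned for the query set.
pooled = set().union(*results_by_engine.values())

for engine, pages in sorted(results_by_engine.items()):
    share = len(pages) / len(pooled)
    print(f"{engine}: found {share:.0%} of the pooled results")
```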
For search engine companies, the findings of the report were no surprise.
Kris Carpenter, director of search products and services for Excite, the third-most-popular search engine, said her company purposely ignores a large part of the Web not so much because of weak technology as because of a lack of consumer interest.
“Most consumers are overwhelmed with just the information that is out there,” she said. “It’s hard to fathom the hundreds of millions of pages. How do you get your head around that?”
Carpenter said millions of pages, such as individual messages on Web bulletin boards, make little sense to index.
Kevin Brown, director of marketing for Inktomi, whose search engine is used by the popular search sites HotBot, Snap and Yahoo, said that search companies have long been aware that they are indexing less and less of the Web. But he argued that users are seeking quality information, not merely quantity.
“There is a point of diminishing returns,” he said. “If you want to find the best Thai food and there are 14,000 results, the question isn’t how many returns you got, but what are the top 10.”
In fact, Brown said, the technology already exists to find all 800 million Web pages, although indexing that much would be costly.
The Future of Search Engines
Inktomi, like most search engines, uses a method called “crawling” in which a program goes out onto the Internet and follows all the links on a known Web page to find new pages. The words on that new page are then indexed so that the page can be found when a user launches a search.
The crawling process helps the search engine compile an index made up of the most popular sites. This method ensures that high-traffic pages, such as those of the White House or CNN, never go undiscovered.
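In outline, the crawl loop just described can be written in a few dozen lines: fetch a page, index its words, pull out its links, and queue any links not yet seen. The Python sketch below is a bare-bones illustration of that loop with a placeholder seed URL; it leaves out the politeness rules, robots.txt checks, markup stripping and large-scale storage a production crawler such as Inktomi's would need.

```python
# Bare-bones sketch of a crawl-and-index loop, as described above.
# The seed URL is a placeholder; real crawlers add politeness delays,
# robots.txt checks, HTML stripping and far more scalable storage.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href targets of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=10):
    index = {}                          # word -> set of URLs containing it
    queue, seen = deque([seed]), {seed}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                    # skip pages that cannot be fetched
        fetched += 1
        # Index every word on the page so a later query can find it.
        for word in html.split():
            index.setdefault(word.lower(), set()).add(url)
        # Follow the page's links to discover pages not yet seen.
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index


# Example: index = crawl("http://example.com")  # placeholder seed URL
```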
Crawling can unearth an enormous number of new pages. Inktomi, for example, can record about 20 million pages a day, meaning that it could find all 800 million pages of the Web in less than two months.
But storing, searching and delivering that amount of information would require a daunting volume of computer storage and high-speed connections to the Internet.
Brown added that anyone who wants to be found can be found, since most of the search engines allow people to submit their Web pages for manual inclusion in a search index. Commercial Web sites can also pay for prominent placement on some indexes.
Excite’s Carpenter said the future of search engines lies not in bigger indexes but in more specialized ones in which everything on a given subject, such as baseball, could be indexed and displayed.
“You may be covering a huge percentage of the Web, but you’re presenting it in smaller slices,” she said. “Lumping everything into one big, be-everything index would be incredibly overwhelming.”
Lawrence also believes that indexing technologies will eventually enable the search engines to start gaining on the proliferating data.
NEC, for example, has been developing a so-called “meta-search engine” named Inquirus, which submits each query to all the major engines, then combines and lists their results.
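The article does not describe Inquirus’s internals, but the general meta-search pattern is straightforward: send the same query to several engines, then merge and de-duplicate whatever they return. The sketch below illustrates that pattern with hypothetical stand-in engine functions; it is not NEC’s implementation.

```python
# Generic meta-search pattern: fan a query out to several engines and
# merge their results. The per-engine functions below are hypothetical
# stand-ins; a real system would call each engine's actual interface.

def search_engine_a(query):
    return ["http://example.com/a1", "http://example.com/shared"]

def search_engine_b(query):
    return ["http://example.com/shared", "http://example.com/b1"]

ENGINES = [search_engine_a, search_engine_b]

def meta_search(query):
    merged, seen = [], set()
    for engine in ENGINES:
        for url in engine(query):
            if url not in seen:         # drop duplicates across engines
                seen.add(url)
                merged.append(url)
    return merged

print(meta_search("web search coverage"))
# ['http://example.com/a1', 'http://example.com/shared', 'http://example.com/b1']
```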
“I’m pretty optimistic that over a period of years the trend will reverse,” he said. But he added, “The next 10 to 20 years could be really rough.”