The Cutting Edge / COMPUTING / TECHNOLOGY / INNOVATION : New Ways to Find Needle in Data Haystack : Information: Novel software is making the database search faster, more efficient.


In the summer of 1991, while using an on-line legal database to research cases on judicial review, Yale Law School student Daniel Egger joined the ranks of frustrated database searchers. He spent hours each day in front of a computer terminal, yet his searches were retrieving dozens of irrelevant documents while skipping some important cases altogether.

"I thought there's got to be a better way," Egger recalls. "What lawyers are really trying to do is find a line of cases that develop a particular legal idea over time. I realized I could do that with mathematical modeling techniques."

That brainstorm led to V-Search, a program designed to comb databases more efficiently by using the relationships between documents to help find the ones that are useful.

V-Search is one of a handful of new products that use concepts rather than "keywords" to help database users find information. While traditional database search tools use simplistic "true or false" principles to determine whether a keyword or specific phrase is contained in a document, the new breed of programs uses statistical analysis to identify key concepts and find the linkages between related documents.

These new programs are emerging in the nick of time. The quantity of information stored in electronic databases is growing by leaps and bounds, and doctors, lawyers, journalists, financial professionals and many others are increasingly dependent on them. And yet it is often impossible to find what amounts to a needle of information in an immense haystack of data.

"The basic problem is information overload," said Steven Fingerhood, general partner of SLF Partners, a San Francisco venture capital firm that has invested in Egger's Durham, N.C.-based company, Libertech. "Everyone agrees it's one of the major problems that has to be solved for effective use of electronic information."

The new "search engines," as they are known, are a quantum leap ahead of those that concentrate on locating keywords, said Vinod Khosla, a partner in the venture capital firm Kleiner Perkins Caufield & Byers, which has invested in a Palo Alto-based search-engine company called Architext. Comparing the two is "like calling a car a bicycle," Khosla said. "Both of them get you from one place to another," but the car is far more advanced.

The market for software packages that help people access and sort through databases was $748 million last year, and is projected to grow to $960 million next year, according to International Data Corp., a market research firm based in Framingham, Mass. Fortune 1000 companies ranked "improved access to data" as their second-most important concern in a recent survey conducted by IDC.

The vendors of keyword-based search engines certainly aim to keep their share of the booming business. They are designing sophisticated interfaces that allow users to query databases in plain English, as well as providing pre-programmed thesauruses, so that if someone is looking for documents about the New York Stock Exchange, the search engine will also retrieve documents that refer to the exchange as the NYSE.
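The thesaurus approach described above can be sketched in a few lines of Python. This is only an illustration, not any vendor's code: the `THESAURUS` table, the function names and the sample documents are all invented for the example, and the matching is plain substring search.

```python
# Hypothetical synonym table; a commercial product would ship a far larger one.
THESAURUS = {
    "new york stock exchange": {"nyse"},
    "nyse": {"new york stock exchange"},
}

# Made-up sample documents for the illustration.
DOCS = [
    "Trading on the New York Stock Exchange was heavy.",
    "The NYSE closed higher on Friday.",
    "Farm prices fell in the Midwest.",
]


def expand_query(query):
    """Return the lowercased query plus any synonyms listed for it."""
    key = query.lower()
    return {key} | THESAURUS.get(key, set())


def keyword_search(query, documents):
    """Return the indexes of documents containing the query or a synonym."""
    terms = expand_query(query)
    return [
        index
        for index, text in enumerate(documents)
        if any(term in text.lower() for term in terms)
    ]


if __name__ == "__main__":
    # Finds both the spelled-out and the abbreviated reference.
    print(keyword_search("New York Stock Exchange", DOCS))
```

The search is still a "true or false" test on each document; the thesaurus only widens the set of phrases being tested.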

But the new engines promise a breakthrough in speed and accuracy. First, they read through each of the files in a database. By counting the number of times each of the words appears in the documents, and by noticing the other words that appear nearby, the search engines can discern the key concepts in a document. Then, using statistical analysis, the search engines compare the concepts in each of the documents to find ones that are closely related to each other.
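A rough sketch of those steps, in Python, might look like the following. The article does not say which statistics the vendors use, so cosine similarity over word counts is an assumption chosen because it is a common textbook measure; the function names, the two-word window and the sample sentences are likewise invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(text):
    """Count how many times each word appears in a document."""
    return Counter(tokenize(text))


def cooccurrences(text, window=2):
    """Count pairs of different words that appear within `window` words of each other."""
    words = tokenize(text)
    pairs = Counter()
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


def cosine_similarity(a, b):
    """Compare two word-count tables; 0.0 means no shared words."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    piracy = term_counts("Software piracy violates copyright law")
    copyright_doc = term_counts("Copyright law protects software")
    farming = term_counts("Farm prices fell sharply")
    # The first two documents share vocabulary; the third shares none.
    print(cosine_similarity(piracy, copyright_doc))
    print(cosine_similarity(piracy, farming))
```

Documents that score high against each other can then be treated as related even when neither contains the exact phrase a user typed.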

"The main thing is to look for relationships between words and groups of words," explains Graham Spencer, senior scientist at Architext, which has developed a search engine similar to the one that powers V-Search. "It helps us pin down what a human thinks of when he thinks of a concept."

For example, when searching a database for information about intellectual property, a concept-based search engine will retrieve documents about piracy because the two concepts are closely linked, Spencer said. A typical search engine would only find piracy documents if they made an explicit reference to intellectual property.

That strategy works particularly well for databases of documents--like legal cases and technical journal articles--that refer to earlier documents.

"Our system (V-Search) analyzes the network of explicit links among documents and finds groups of closely related documents," Egger said. V-Search, which will be unveiled officially today at the Folio Infobase 95 conference in San Diego, has already been licensed to several major legal publishers, and Egger hopes to expand beyond the legal database market by the end of the year.
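The idea of grouping documents by their explicit links can be illustrated with a simple Python sketch. It is not V-Search's method, which the article does not detail; it merely treats each citation as a two-way link and collects every set of documents connected through such links. The case names and the `group_by_links` function are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical citation table: each case maps to the earlier cases it cites.
CITATIONS = {
    "Case C": ["Case A", "Case B"],
    "Case B": ["Case A"],
    "Case E": ["Case D"],
    "Case F": [],
}


def group_by_links(citations):
    """Group documents that are connected, directly or indirectly, by citations."""
    neighbors = defaultdict(set)
    for doc, cited in citations.items():
        neighbors[doc]  # ensure documents that cite nothing still appear
        for other in cited:
            neighbors[doc].add(other)
            neighbors[other].add(doc)

    seen = set()
    groups = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Breadth-first walk over everything reachable from this document.
        seen.add(start)
        queue = deque([start])
        group = []
        while queue:
            doc = queue.popleft()
            group.append(doc)
            for other in sorted(neighbors[doc]):
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        groups.append(sorted(group))
    return groups


if __name__ == "__main__":
    for group in group_by_links(CITATIONS):
        print(group)
```

A production system would presumably weigh how many links two documents share rather than simply whether a chain of links exists, but the sketch shows why citation-heavy collections such as case law suit this kind of analysis.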

Open Data Corp., a software company in Lexington, Mass., another entry in the concept-based searching field, is aiming at large corporate databases with FindOut! 2.0. The program reads through text and data files and enables users to find documents that are closely related even if they are classified by the company in entirely separate ways, said Julie McNamara, Open Data's vice president of product management. The product is scheduled for release in April.

Architext, for its part, has made eight deals to license its searching software to media companies, information providers, CD-ROM vendors and Internet-related companies, president Joe Kraus said.

He and others believe that concept-based search engines will truly distinguish themselves when it comes to finding information on the World Wide Web, the ever-expanding collection of easily accessible information on the Internet computer network.

"The Internet is a big, big play for us," Kraus said. "There is lots of information out there that needs to be searched and sorted."

Existing programs such as Mosaic help Net surfers navigate around the Internet, but only if they know where they want to go, said Marie Landis, an Internet consultant in Evanston, Ill. Concept-based search engines will help people find information on the World Wide Web even if they are not exactly sure what they are looking for.

"That's a hot topic right now, being able to know where the information is," Landis said. "I think we're going to see more of these search engines to help people find stuff among the vast information that's out there. That's what the Internet desperately needs."

Copyright © 2019, Los Angeles Times