'Speech Engines' Getting Better at Listening

L.A. Times Archives

Jan. 15, 2001 12 AM PT

From Associated Press

NEW YORK — In the movies, computers are always good listeners. In real life, they hear only what they want to hear.

Much as people would like to speak with their machines, to browse the Internet by voice rather than keystroke, recent strides in speech recognition technology hardly provide the ease and spontaneity of a free-flowing dialogue between humans.

Instead, the machines monopolize the discussion.

While the latest speech engines can recognize spoken words with better than 90% accuracy, a vast improvement from only five years ago, the machines dictate the specific words and phrases that users can say. They ignore any commands that stray.

Still, even with a scripted dialogue, the allure of “voice browsing” is strong, especially for those trying to stay connected on mobile phones and hand-held computers with tiny keypads and screens. For drivers, the attraction is even greater.

In less than three months, more than 200,000 America Online members have signed up for AOLbyPhone, one of several new “voice portals” whose recorded voices read small nuggets of online information to callers in response to a set of spoken commands.

Palm computer users are also showing interest.

In a survey of Palm users, “about 36% acknowledged that they used the Palm and the cell phone as they were driving,” said Tom O’Gara, chief executive of MobileAria, a voice portal for cars due to be launched in June by Palm and Delphi Automotive Systems.

“We suspect the number is higher,” he added.

Although speech technology players such as IBM, Nuance Communications and SpeechWorks International are working to develop more conversational systems, none expect a major breakthrough any time soon.

Comprehension a Stumbling Block

“Natural language understanding is one of those Holy Grail areas,” said Bill DeStefanis, senior director of product management for Lernout & Hauspie, a leader in dictation software and text-to-speech technologies, which are used to make computers read aloud.

The main obstacle for natural speech technology involves simple brute power.

While huge gains in processing speeds have helped computers tackle the listening part of the equation and understand specific words, it requires a lot more firepower to make that machine comprehend the countless combinations of words used to express thoughts.

Such constraints are less of a hindrance with computer dictation software, in which the main goal is to recognize spoken words and transform them into type.

“You can try to anticipate how many ways somebody would recite a request and hard-code them into the software, but inevitably you’re going to miss some,” said DeStefanis at L&H, which is currently pitching a new speech system for navigating hand-held computers--a much more manageable task than trying to master the entire dictionary. “If you limit the domain that you have to understand, you can increase accuracy.”

That’s why the most effective use of voice-activation technology has been with automated telephone systems that provide customer service for businesses like airlines and credit card companies. Because they are usually designed for a specific purpose, those speech engines can be customized to understand a more select vocabulary, even if those words are spoken in different combinations.
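As a rough illustration of the limited-domain idea, the sketch below matches a transcribed utterance against a small set of approved commands. The command names and keywords are hypothetical, invented for this example and not taken from any vendor's system, and the matching is plain keyword lookup on text rather than actual speech recognition.

```python
import re
from typing import Optional

# Hypothetical command set for an airline phone line (illustrative only).
# Each command is triggered when all of its keywords appear in the utterance.
GRAMMAR = {
    "flight_status": {"flight", "status"},
    "book_ticket": {"book", "ticket"},
    "baggage_claim": {"lost", "bag"},
}


def match_command(utterance: str) -> Optional[str]:
    """Return the first command whose keywords all occur in the utterance.

    Word order is ignored, so the same select vocabulary is understood
    in different combinations. Anything outside the domain returns None.
    """
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    for command, keywords in GRAMMAR.items():
        if keywords <= words:  # every keyword was spoken
            return command
    return None


if __name__ == "__main__":
    print(match_command("What is the status of my flight"))
    print(match_command("I need a pizza"))
```

Because the vocabulary is so small, a request that strays from it is simply ignored, which mirrors the trade-off described above: higher accuracy inside the domain, no understanding outside it.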

“Context is very important,” said Steve Ehrlich, vice president of marketing at Nuance, which has designed systems for the brokers Charles Schwab and Fidelity Investments. “At Schwab, the need for a speech system that can recognize ‘I need a pizza’ is not crucial.”

The approach is similar at most of the new telephone-based voice portals, which also include HeyAnita, BeVocal, Tellme and Virtual Advisor, a driver-oriented service just launched in certain markets by OnStar.

Simplicity Yields Best Results

While the exact focus and presentation vary, all the services keep it simple, limiting their scope to specific matters such as news, weather, e-mail, driving directions and movie listings.

“The coolest thing is being able to get my e-mail read to me, like I’m the queen,” said Jen Bekman, 31, an AOLbyPhone user in New York who develops Web content for streaming video.

Users may struggle until they grow familiar with the “approved” commands for navigating the various menus, but the speech recognition on these services is rather impressive.

Tellme and BeVocal, for example, sparkled when put to the test with a series of potential tongue-twisters delivered by cell phone, ably identifying the spoken names of esoteric New York towns like Mamaroneck and Wurtsboro.

Michael Lambert, a 26-year-old librarian and BeVocal user in Foster City, Calif., occasionally struggled with the voice recognition at first, but says the accuracy has improved to the point where there are no problems.

“In the beginning, I used to wonder what’s going on. I’m from South Carolina, so I wondered if it was an accent thing,” said Lambert, who finds the service especially useful in his car. “I just moved here in May, so it’s very useful to just pick up a cell phone in the car and get directions.”

By contrast, most attempts to voice-navigate the full-blown Internet on a personal computer are fraught with frustrations. Because each Web site is designed differently, it’s difficult to pack that much flexibility into a single program on a PC.

A more realistic way to “voice-enable” the Internet, said Ira Brodsky, an industry analyst for Datacomm Research, might be to customize each Web site with the appropriate vocabulary.

“All you would have to do with a typical PC is have a microphone,” Brodsky said. “It also puts the Web site in a position to use personalization technology and come to recognize your voice. It may identify your voice and remember what you did the last time you were there.”

The most prominent voice browsers for PCs, including Conversay and Ivan, employ many of the same basic techniques to roam Web sites, numbering the various links on a Web page so a person can “click” the number orally.
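The link-numbering technique can be sketched in a few lines. The example below is a minimal illustration using Python's standard HTML parser, not a description of how Conversay or Ivan actually work; the function names and the sample page are assumptions made for the example, and the spoken number is assumed to arrive already transcribed as a word.

```python
from html.parser import HTMLParser


class LinkNumberer(HTMLParser):
    """Collect each <a href=...> on a page and assign it a number in order."""

    def __init__(self):
        super().__init__()
        self.links = {}      # number -> (link text, href)
        self._href = None    # href of the link currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            number = len(self.links) + 1
            self.links[number] = ("".join(self._text).strip(), self._href)
            self._href = None


# Hypothetical spoken vocabulary: only the digits one through nine.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9}


def number_links(html):
    """Return a dict mapping link numbers to (text, href) pairs."""
    parser = LinkNumberer()
    parser.feed(html)
    parser.close()
    return parser.links


def follow_spoken_number(links, spoken):
    """Return the href for a spoken number word, or None if unrecognized."""
    entry = links.get(NUMBER_WORDS.get(spoken.strip().lower()))
    return entry[1] if entry else None


if __name__ == "__main__":
    page = '<p><a href="/news">News</a> and <a href="/weather">Weather</a></p>'
    links = number_links(page)
    print(links)
    print(follow_spoken_number(links, "two"))
```

Reducing every page to a short list of numbers keeps the vocabulary the recognizer must handle tiny, regardless of how each Web site is designed.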

But the makers of Ivan, One Voice Technologies, also decided to take a crack at natural language, weaving some “artificial intelligence” into its browser so it can recognize concepts in addition to specific commands.

However, all those bells and whistles make Ivan a bloated, sluggish program, a problem Conversay avoids. Either way, both frequently make navigation mistakes.

Despite the current focus on speech-activated systems, the industry isn’t shying from more ambitious projects.

Nuance, for example, recently partnered with Ask Jeeves, the Internet search engine that invites users to type in requests with whole sentences rather than a series of keywords.

“We’ll be trying to leverage the natural language data that they capture on their Web sites with all these people typing in questions, and build ‘speech models’ based on that data,” said Ehrlich at Nuance.

On the Net:

Datacomm Research report: https://www.datacommresearch.com/voicepr.html

Cahners In-Stat Group report: https://www.instat.com/pr/2000/wp0006sp--pr.htm