Gains in Translation: Software Aims to Cut Through the Babble Better

By David Colker

May 15, 2005 12 AM PT

Times Staff Writer

In the 23rd century, people will use a gizmo that looks suspiciously like a flashlight to communicate with alien species.

That’s in the world of “Star Trek.” In the here and now, a small Marina del Rey company is working on technology that could lead to a “universal translator” for the real world.

Language Weaver Inc. was founded in 2002 by two USC computer scientists who developed methods to teach computers to translate by force-feeding them huge volumes of text. Early investors included the CIA.

Though portable translation devices are only a long-range goal, the company’s technology is already being used, along with that of a voice-recognition software firm, to do on-the-fly translations of television programs.

If successful, such projects could help the company win a piece of the growing global translation business. According to a 2002 survey by market research firm ABI Research, revenue worldwide was $7.9 billion that year and will reach $11.5 billion by 2007.

Despite decades of development, computerized translation accounts for only a small fraction of that market. IDC, a research firm that specializes in technology matters, estimates that computer language-translation sales will be $187 million this year.

Language Weaver uses a broadcast from the Arabic-language television network Al Jazeera to demonstrate its technology. As a news show from the network is displayed on one part of a computer screen, Arabic text derived from a BBN Technologies voice-recognition program runs down one side. Next to it appears Language Weaver’s English text.

The end result is not something that would make human translators fear for their jobs: “Research conducted by parents Muslims in Iraq and changing realities of initial options,” reads the English translation, “on the ground that it would boycott the elections today to start new staff....”

“This is still an early stage,” said Language Weaver Chief Executive Bryce Benjamin.

Indeed, computers will not be able to match human translators for many years, experts say.

“Bringing quality to computerized translations over a broad range of materials is akin to achieving artificial intelligence,” said Robert Frederking, senior systems scientist at Carnegie Mellon University’s Language Technologies Institute in Pittsburgh. “Language is intrinsically tied to what we do as humans. To solve the big problems in language, you have to solve A.I.”

But Language Weaver and its competitors -- such as Paris-based Systran, which provides multi-language translation technology for Google Inc., Time Warner Inc.’s America Online and Yahoo Inc. -- have reached a level of accuracy that is good enough for many practical uses. This is particularly true when working with text, without the added variable that voice recognition introduces.

Benjamin said most of the company’s customers are intelligence, Department of Defense and law enforcement agencies. Aside from the National Virtual Translation Center -- a joint effort of the CIA, the FBI and other federal agencies -- he declined to name specific entities.

In intelligence work, “the tremendous amount of material that has to be examined far outstrips the number of translators available,” Benjamin said. “If you can use computer software to get at least some clues in your own language, you’re way ahead.... It doesn’t need to be perfect.”

Although Language Weaver’s primary customers are governmental, the company, which has about 30 employees, also offers products to businesses to translate Arabic, Chinese, French and Spanish into English and vice versa. Its products also can translate Hindi and Somali one way, into English.

Language Weaver’s software runs from $20,000 to $100,000 per language, depending on the obscurity of the tongue -- far more than traditional rule-based translation programs cost.

“The edge we have is the high rate of accuracy on text,” said Kevin Knight, one of the company’s founders.

The birth of computer translation is generally set at Jan. 8, 1954, when news reporters were invited to watch a huge IBM Corp. computer turn simple Russian sentences, typed in by an operator, into English. Glowing accounts appeared the next day, and one of the heads of the project, Leon Dostert (Gen. Dwight D. Eisenhower’s interpreter during World War II), was quoted as saying that within five years, electronic translation would be far enough advanced to handle complex passages.

Although the prediction was off, the method used would be the mainstay of computer translation for decades. It involved the arduous task of feeding into a computer a translation dictionary -- words and their equivalents in another language -- along with grammar and other rules used for sorting out syntax.

This rule-based method got vastly more sophisticated over time but was plagued by the fact that language rules seem made to be broken and by multiple meanings for many words.
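
For readers who want to see the idea in miniature, the dictionary-plus-rules approach might look something like the sketch below. Everything in it is an assumption made for illustration: the six-word vocabulary, the part-of-speech tags and the single reordering rule are hypothetical and are not drawn from any company's product.

```python
# Toy rule-based translator: a bilingual dictionary plus one syntax rule.
# The vocabulary, tags and rule are illustrative assumptions only.
DICTIONARY = {
    "un": ("a", "DET"),
    "conflicto": ("conflict", "NOUN"),
    "sangriento": ("bloody", "ADJ"),
    "afecta": ("affects", "VERB"),
    "la": ("the", "DET"),
    "ayuda": ("aid", "NOUN"),
    "humanitaria": ("humanitarian", "ADJ"),
}


def translate_rule_based(sentence):
    """Look up each word, then apply one reordering rule."""
    # Unknown words pass through unchanged with the tag "UNK".
    tagged = [DICTIONARY.get(w, (w, "UNK")) for w in sentence.lower().split()]
    output = []
    i = 0
    while i < len(tagged):
        is_noun_adj = (
            i + 1 < len(tagged)
            and tagged[i][1] == "NOUN"
            and tagged[i + 1][1] == "ADJ"
        )
        if is_noun_adj:
            # Rule: Spanish NOUN ADJ becomes English ADJ NOUN.
            output.extend([tagged[i + 1][0], tagged[i][0]])
            i += 2
        else:
            output.append(tagged[i][0])
            i += 1
    return " ".join(output)


print(translate_rule_based("un conflicto sangriento afecta la ayuda humanitaria"))
```

Each entry here carries exactly one English equivalent. A word with several meanings would need extra rules to choose among them, which is the weakness the rule-based method ran into.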

In the 1980s, IBM worked on an approach under which the computer didn’t bother with dictionaries or rules but instead analyzed texts and their human translations. Then, when presented with a text to translate, the computer would search its database for how each word had been used in context and make a choice based on a statistical analysis.

“It’s not really a translation in the traditional sense,” said Benjamin, whose firm licensed IBM’s technology and has further developed it. “What we create is a probability forecast based on looking at millions of pairs of translations for a word and then selecting the one with the highest probability of being right.”
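
The selection step Benjamin describes can be sketched in a few lines. This is a simplified illustration, not Language Weaver's or IBM's software: it assumes the word pairs have already been matched up by hand, the sample data is invented, and it ignores surrounding context, which the article says the real method takes into account.

```python
from collections import Counter, defaultdict

# Hypothetical word pairs taken from human translations: (Spanish, English).
# A real system would draw on millions of these; the data here is invented.
ALIGNED_PAIRS = [
    ("banco", "bank"), ("banco", "bank"), ("banco", "bench"),
    ("ayuda", "aid"), ("ayuda", "help"), ("ayuda", "help"),
]


def train(pairs):
    """Count how often each source word was translated as each target word."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    return counts


def translation_probabilities(counts, word):
    """Turn the raw counts for one word into relative frequencies."""
    seen = counts.get(word, Counter())
    total = sum(seen.values())
    return {target: n / total for target, n in seen.items()}


def best_translation(counts, word):
    """Pick the candidate with the highest estimated probability."""
    probs = translation_probabilities(counts, word)
    if not probs:
        return word  # never seen in training: leave the word untranslated
    return max(probs, key=probs.get)


model = train(ALIGNED_PAIRS)
print(translation_probabilities(model, "banco"))
print(best_translation(model, "banco"))
```

No dictionary or grammar rule appears anywhere in the sketch. The only knowledge it has is what the counted translations supply, which is why the volume of already translated text matters so much to this method.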

IBM secured patents for what became known as the statistical method, but the approach remained painfully slow. More important, finding and digitizing enough already translated materials was a herculean task.

Then came the Web.

“Now we have mountains of new texts and their translations every day,” Frederking of Carnegie Mellon said.

An example of an online language mother lode is the BBC World Service, which presents news pages in 43 languages. And much of the content of the United Nations site is in the six official languages of the institution: Arabic, Chinese, English, French, Russian and Spanish.

Knight and Daniel Marcu, the founders of Language Weaver, found ways to improve on IBM’s technology. The two took out their own patents, and Language Weaver -- which hopes someday to go public -- says it is the only company offering purely statistical translation products.

In the academic community, arguments rage over the relative merits of rule-based and statistical approaches. Many researchers believe that rule-based translations give higher-quality results if they can be tailored to a particular industry or topic. The statistical method is seen by many as superior when dealing with a general text.

Most major researchers say that eventually, the best language software will employ a hybrid of rule-based and statistical techniques.

Systran, which was founded in 1968 and now offers products for more than 40 languages, is the best known of the rule-based providers. CEO Dimitris Sabatakakis said it was integrating statistical methods among its tools, but he said they were an extension of what the company was already doing.

“What is new is the availability of electronic content,” Sabatakakis said. “We now have crawlers on the Web, finding all the different ways to use a verb or a noun. But it is not a breakthrough.”

Benjamin said Language Weaver would play to its strengths by continuing to work on languages that have received little attention from older computer translation companies. He offered no prediction as to when the technology would lead to a hand-held unit that could translate speech into any language.

But “it’s what everyone is shooting for -- the ‘Star Trek’ universal translator,” Benjamin said. “That’s when we change the world.”

(BEGIN TEXT OF INFOBOX)

Word processing

Translation software by Language Weaver handled a sample text in Spanish less accurately than did a Web version of software by rival Systran. But both made mistakes -- stumbling, for example, over the name Nada Doumani. The sample is the introduction to an article on the website of the International Committee of the Red Cross.

* Original text in Spanish:

Hace dos años que un sangriento conflicto afecta la región de Darfur, en Sudán. Hasta ahora, no se ha podido aportar una solución duradera, y la población sigue dependiendo, en gran medida, de la protección y la ayuda que prestan las organizaciones humanitarias. Nada Doumani, colaboradora del CICR, visitó Darfur en marzo y comparte sus reflexiones con nosotros.

* Human translation by The Times:

For two years, a bloody conflict has affected the Darfur region in Sudan. Until now, a lasting solution has not been found, and the population continues to depend, to a great degree, on the protection and help offered by humanitarian organizations. Nada Doumani, an ICRC worker, visited Darfur in March and shares her reflections with us.

* Computer translation by Language Weaver:

Two years ago that a bloody conflict affects Darfur region in Sudan. until now, there has been able to make a lasting solution, and the population remains dependent, in large measure of protection and assistance to provide humanitarian organizations. Nothing ICRC, collaborators of ICRC, visited Darfur in march and shared their thoughts with us.

* Computer translation by Systran:

For two years a bloody conflict has been affecting the region of Darfur, in Sudan. Until now, it has not been possible to contribute a lasting solution, and the population continues depending, to a great extent, of the protection and the aid that the humanitarian organizations lend. Nothing Doumani, collaborator of the CICR, visited Darfur in March and shares its reflections with us.

Sources: International Committee of the Red Cross (www.icrc.org), Language Weaver Inc., Systran (www.systransoft.com). Times translation by staff writer Jennifer Delson.
