Beyond Anthony Bourdain AI voice deepfakes: The CGI of audio

By Matt Pearce, Staff Writer

July 26, 2021 5 AM PT

The most important thing about a documentary deepfaking Anthony Bourdain’s voice isn’t that it happened, but that it happened and almost nobody noticed.

Director Morgan Neville faced skepticism and outright revulsion on social media this month when it was revealed he used artificial intelligence to create a model of Bourdain’s voice for 45 seconds of narration in the new documentary “Roadrunner,” about the life and 2018 death by suicide of the beloved chef and journalist.

Bourdain’s voice was one of his trademarks, known to fans the world over from his TV travelogues “Parts Unknown” and “No Reservations.” Fans also loved how authentic he seemed, always able to level with the viewer. Faking his voice, to some, was a step too far.

“In the end I understood this technique was boundary-pushing,” Neville said. “But isn’t that Bourdain?”

Yet the boundaries have already been pushed far beyond Bourdain’s legacy or the mere confines of documentarian ethics. The voice imitation revolution is already here, and artists, technologists and companies in several industries who use the new tech are grappling with the big question of what happens when you separate speech from the speaker.

Need a synthetic voice that can read text for the visually impaired? A human voice actor can’t preread every possible sentence in the world but an AI-built voice could cope. Have a video game that’s been in interminable production for years and want to avoid hauling in voice actors for rerecording every time there’s a script change? Tweak their dialogue in production.

“There are endless possibilities. We believe this is the CGI of audio,” says Zeena Qureshi, co-founder and CEO of Sonantic, a start-up formed in 2018. “We made the first AI that can cry last year. We made the first AI that can shout early this year.”

Potential commercial uses abound. Sonantic’s website touts AI voices with “STUNNING REALISM, CAPTIVATING EMOTION” that can “deliver compelling, lifelike performances for games and films with fully expressive AI-generated voices.” It also promises to “reduce production timelines from months to minutes by rapidly transforming scripts into audio.”

Another synthetic-voice company, VocaliD, pitches to potential corporate clients that “the volume and speed with which written content must be transformed into brand-consistent sound bytes cannot be met by traditional voice talent or generic text-to-speech.” It imagines scenarios in which “you need audible content fast, but your voice talent isn’t available. Don’t miss the mark by delivering flat or generic sounding content that doesn’t foster the connection you’ve built with your audience.”

A third company, Resemble AI, offers services like voice “cloning” and has short clips of synthetic speech from former President Barack Obama and actors Morgan Freeman and Jon Hamm (though potential clients are forbidden from cloning the voices of celebrities without permission). The company says a voice clone can start to be built if it has 50 sentences from a real speaker to synthesize.

Like any form of automation, the promises are simple: Robots can do more work faster, and the money and time saved can be used on something else. And like any form of automation, there can be big downsides for the humans whose work and paychecks are getting augmented or replaced.

“Voice actors, their voice is intellectual property, it’s their own, or at least that’s the idea,” says David Rosenthal, CEO and coach at the Global Voice Acting Academy. But he warns against companies forcing performers to sign agreements that allow those companies to synthesize performers’ voices and use them for whatever they want in perpetuity. “They can’t say, ‘Oh, now I own this voice because you did a job for me.’ ... With AI, it’s unfortunately kind of like the Wild, Wild West here.”

Some voice acting advocates have been watching the lawsuit filed by Canadian voice-over performer Bev Standing, who alleged that TikTok took her voice for a text-to-speech feature for users’ videos without notice or compensation.

“It’s unmistakable, because it’s my voice. It’s the voice I’m pretty much talking to you now with,” Standing says in a phone interview. “How it got to TikTok, I don’t know.”

Her lawsuit, filed in federal court in New York, says she had originally performed the voice-over work for a Scottish company. She says she did not have a contract permitting the sale of her voice to another company.

“Although the voice and likeness are Plaintiff, the TikTok user is able to determine what words are spoken in Plaintiff’s voice and some videos depicting Plaintiff’s voice have involved foul and offensive language,” Standing’s lawsuit says.

Attorneys for TikTok parent company ByteDance Ltd. have signaled they will argue the lawsuit should be dismissed for a variety of technical reasons. “TikTok is a free platform,” the attorneys wrote in a letter to the judge requesting a hearing, saying that the text-to-speech function “makes videos more accessible to disabled users.”

They also argued that Standing’s voice just isn’t recognizable enough to argue her likeness was stolen: “Plaintiff has not sufficiently alleged that her voice is recognizable to the public as associated with her, and a cursory review of any TikTok video using [text-to-speech] shows that (as she alleges) it is a neutral, generic ‘female computer generated voice.’”

Further, the defense wrote, “Given that the words spoken in Plaintiff’s alleged ‘female computer voice’ in any video are selected by the user ... it is implausible that the public would conclude that Plaintiff personally endorses every video employing the [text-to-speech] function.” A TikTok spokesperson declined to comment.

After Standing’s lawsuit started getting media coverage, “I got a number of emails and direct messages through my website, through social media, with really negative comments about why I won’t let TikTok just use the voice: ‘It’s just a voice,’ I should ‘rot in hell.’ A lot of other really nasty comments,” Standing says. “I really am a small business. It’s just me. I’d love the unions to step up and get more involved in this procedure.” Standing says her work with the Scottish company was not protected by a collective-bargaining agreement.

Some companies have preemptively reached out to SAG-AFTRA, which represents some voice-over performers, about setting up fair compensation systems for performers whose voices will be re-created with AI. At other companies, however, “A lot of nonunion performers signed away rights that they had no idea about,” says Katie Watson, the union’s national director for voice-over contracts.

SAG-AFTRA contracts for “very low-budget productions” require that digital re-creations of performers can’t be used “without coming to us for our consent,” says Danielle Van Lier, an assistant general counsel for the union who focuses on intellectual property and contracts.

The union says protections are needed. Many of the earliest uses of such deepfake technology were “nonconsensual and exploitative in nature” and often targeted female members.

“The AI-generated voices, in particular, can be used to put words into our members’, including our broadcast journalists’, mouths and make them say things they never said,” the union said in a statement. “Whether done for entertainment purposes, commercial purposes, or malicious purposes, this practice is potentially harmful. At best, it denies them the right and ability to control their image; at the other end, it is exploitative and may cause actual harm to their reputation, their earning potential, or worse, to the individual themselves.”

There are also potential benefits of AI voice technology for performers. The longest actors’ strike in SAG-AFTRA’s history came against video game companies in 2016-17; performers were protesting, among other issues, voice-over work for games that required screaming and grunting that can damage vocal cords and livelihoods. Digitally augmenting or otherwise replicating performers’ voices to achieve those effects could theoretically lessen the strain.

“In gaming, there’s a lot of shouting, so their AI can shout for them 24/7 without them even losing their voice,” says Qureshi, who notes that her company predominantly works with gaming studios and only synthesizes the voices of actors, who record and “train” their own AI models. “We’re not replacing actors. We augment how they work.”

Qureshi says that her company has implemented a compensation system for performers whose voices have been synthesized so that “every time their voice gets used, they get a profit share.” She also raises the prospect that the actors’ AI voices could essentially allow performers to do multiple projects at once. “If they want to work in theater, they want to work somewhere very niche, [and] their AI can work for them on the side doing voice work for games and films and things like that.”

As a voice coach, Rosenthal thinks that AI technology has not quite advanced to the point where it could trick a listener over a long period of time with a fully synthetic voice. “Anything to do with animation, cartoons, video games, that kind of stuff, there’s a certain kind of physicality involved in those particular genres, not just in a human vocal ability, but also a physicality that AI has not mastered at all yet,” Rosenthal says. “Imagine fighting while you’re talking, punching or receiving a punch.”

But he sees the writing on the wall, as the technology will only get better, and more tempting to use. “AI is really cheap, right? Costs less than having to pay a person,” Rosenthal says. “Ultimately it’s our job to educate those people as to the value of the human voice.”
