
Illustration: Sarah Grillo/Axios
Big Tech, top university labs and the U.S. military are pouring effort and money into detecting deepfake videos, AI-edited clips that can make it look like someone is saying something they never uttered. But video's forgotten step-sibling, deepfake audio, has attracted considerably less attention, despite a comparable potential for harm.
What's happening: With video deepfakes, defenders are playing the cat to a fast-scurrying mouse: AI-generated video is getting quite good. The technology to create audio fakes, by contrast, is not as advanced — but experts say that's soon to change.
- "In a couple years, having a voice [that mimics] an individual and can speak any words we want it to speak — this will probably be a reality," Siwei Lyu, director of SUNY Albany's machine learning lab, tells Axios.
- "But we have a rare opportunity before the problem is a reality when we can grow the forensic technology alongside the synthesis technology," says Lyu, who participates in DARPA's Media Forensics program.
Why it matters: Experts worry that easily produced but convincing AI impersonations could turn society on its head: letting fake news run rampant, empowering criminals, and giving political opponents and foreign provocateurs tools to sow electoral chaos.
- In the U.S., fake audio is most likely to supercharge political mayhem, spam calls and white-collar crime.
- But in places where fake news is already spreading disastrously on Telegram and WhatsApp (think India or Brazil), a persuasive tape of a leader saying something incendiary is especially perilous, says Sam Gregory of Witness, a human-rights nonprofit.
There are two main ways to use AI to forge audio:
- Modulation, which changes the quality of a voice to make it sound like someone else — from male to female, or British to American, for example. Boston-area startup Modulate.ai does this, as have researchers from China's Baidu.
- Synthesis, in which AI speaks any phrase typed into a box with a specific voice, like Trump's (a bare-bones sketch of this approach follows the list below). Montreal's Lyrebird can do this, as can Adobe's yet-unreleased VoCo, which can also rearrange, add or subtract words in an existing recording to make it sound completely different.
- Listen to an AI voice impersonating Ellen DeGeneres.
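To give a sense of how accessible the synthesis route already is, here is a bare-bones sketch using Coqui TTS, one freely available open-source text-to-speech toolkit. The pretrained model name and output file are illustrative choices and have nothing to do with Lyrebird or VoCo.

```python
# Minimal text-to-speech sketch with the open-source Coqui TTS library.
# The pretrained model and output path are illustrative, not the tools
# mentioned in this story.
from TTS.api import TTS

# Load a pretrained English speech-synthesis model (downloads on first use).
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Render an arbitrary typed phrase as a WAV file in that model's voice.
tts.tts_to_file(
    text="Any phrase typed into a box, spoken in a voice that isn't mine.",
    file_path="synthetic_phrase.wav",
)
```

Mimicking a specific person, rather than a stock voice, takes an additional training or fine-tuning step on recordings of that person — the capability Lyu expects to mature within a couple of years.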
Detecting audio deepfakes requires training a computer to listen for inaudible hints that the voice couldn't have come from an actual person. Lyu and UC Berkeley's Hany Farid are researching automated ways to do this (a simplified training sketch follows the bullets below).
- Google recently made a vast dataset of its own synthetic speech available to researchers who are working on deepfake detection. This trove of training data can help AI systems find and recognize the hallmarks of fake voices.
- For an international competition, 49 teams submitted deepfake detectors trained with Google's contribution, plus voices from 19 other sources in various languages. The top entrants were highly accurate, said competition co-organizer Junichi Yamagishi, a researcher at Japan's National Institute of Informatics. The best system only made mistakes 0.22% of the time, he tells Axios.
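In spirit, these detectors are classifiers trained on labeled real and synthetic speech. Below is a deliberately simplified sketch, assuming a small local corpus of labeled WAV files; the MFCC features and logistic-regression model are illustrative stand-ins, not what the competition entrants or Pindrop actually use.

```python
# Toy deepfake-audio detector: train a classifier on spectral features
# extracted from labeled clips. File paths, MFCC features and logistic
# regression are illustrative assumptions, not a production system.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def clip_features(path):
    """Summarize a clip as the mean and spread of its MFCCs."""
    audio, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labeled corpus: 0 = genuine speech, 1 = synthetic speech.
real_paths = ["real/clip_001.wav", "real/clip_002.wav"]
fake_paths = ["fake/clip_001.wav", "fake/clip_002.wav"]

X = np.array([clip_features(p) for p in real_paths + fake_paths])
y = np.array([0] * len(real_paths) + [1] * len(fake_paths))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
detector = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", detector.score(X_test, y_test))
```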
Pindrop, an Atlanta company that sells voice authentication to big banks and insurance companies, is also developing defenses, worried that the next wave of attacks on its clients will involve deepfake audio.
- One key to detecting fakes, according to the company: sounds that seem normal, but that people aren't physically capable of making.
- An example from Pindrop CEO Vijay Balasubramaniyan: If you say "Hello, Paul," your mouth can only shift from the "o" to "Paul" at a certain speed. Spoken too fast, "the only way to say this is with a 7-foot-tall neck," Balasubramaniyan says. (A toy version of this timing check appears below.)
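Balasubramaniyan's example boils down to a timing test: given phoneme-level timestamps (in practice produced by a forced aligner), transitions faster than a human vocal tract can manage are a red flag. The timestamps and threshold in this sketch are invented for illustration; Pindrop's production systems are far more elaborate.

```python
# Toy version of the "physically impossible transition" check described above.
# Phoneme timestamps would normally come from a forced aligner; the timings
# and the minimum-gap threshold here are made-up values for illustration.

# (phoneme, start_seconds, end_seconds) for the utterance "Hello, Paul"
phonemes = [
    ("HH", 0.00, 0.06),
    ("EH", 0.06, 0.12),
    ("L",  0.12, 0.18),
    ("OW", 0.18, 0.31),
    ("P",  0.31, 0.33),   # impossibly brief before the next vowel begins
    ("AO", 0.33, 0.40),
    ("L",  0.40, 0.48),
]

MIN_ONSET_GAP_S = 0.03  # assumed lower bound on how fast the mouth can reshape

def suspicious_transitions(segments, min_gap=MIN_ONSET_GAP_S):
    """Flag consecutive phonemes whose onsets arrive faster than a mouth can move."""
    flags = []
    for (p1, s1, _), (p2, s2, _) in zip(segments, segments[1:]):
        gap = s2 - s1  # time between the start of one phoneme and the next
        if gap < min_gap:
            flags.append((p1, p2, round(gap, 3)))
    return flags

print(suspicious_transitions(phonemes))  # [('P', 'AO', 0.02)]
```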
The bottom line: If deepfake detectors can get out ahead of the spread of fake audio, they could contain the potential fallout. And, unlike with video, it looks like the defenders could actually keep up with the forgers.
Go deeper: Audio deepfakes are getting better — but they haven't made it yet