Apr 3, 2019 - Technology

Defending against audio deepfakes before it's too late

Illustration of a mouth glitching

Illustration: Sarah Grillo/Axios

Big Tech, top university labs and the U.S. military are pouring effort and money into detecting deepfake videos — AI-edited clips that can make it look like someone is saying something they never uttered. But video's forgotten step-sibling, deepfake audio, has attracted considerably less attention despite a comparable potential for harm.

What's happening: With video deepfakes, defenders are playing the cat to a fast-scurrying mouse: AI-generated video is getting quite good. The technology to create audio fakes, by contrast, is not as advanced — but experts say that's soon to change.

  • "In a couple years, having a voice [that mimics] an individual and can speak any words we want it to speak — this will probably be a reality," Siwei Lyu, director of SUNY Albany's machine learning lab, tells Axios.
  • "But we have a rare opportunity before the problem is a reality when we can grow the forensic technology alongside the synthesis technology," says Lyu, who participates in DARPA's Media Forensics program.

Why it matters: Experts worry that easily faked but convincing AI impersonations could turn society on its head — letting fake news run rampant, empowering criminals, and giving political opponents and foreign provocateurs tools to sow electoral chaos.

  • In the U.S., fake audio is most likely to supercharge political mayhem, spam calls and white-collar crime.
  • But in places where fake news is already spreading disastrously on Telegram and WhatsApp (think India or Brazil), a persuasive tape of a leader saying something incendiary is especially perilous, says Sam Gregory of Witness, a human-rights nonprofit.

There are two main ways to use AI to forge audio: synthesizing speech from typed text in a target's voice, and converting one person's recorded speech so that it sounds like it came from someone else.

Detecting audio deepfakes requires training a computer to listen for inaudible hints that the voice couldn't have come from an actual person. Lyu and UC Berkeley's Hany Farid are researching automated ways to do this.

  • Google recently made a vast dataset of its own synthetic speech available to researchers who are working on deepfake detection. This trove of training data can help AI systems find and recognize the hallmarks of fake voices.
  • For an international competition, 49 teams submitted deepfake detectors trained with Google's contribution, plus voices from 19 other sources in various languages. The top entrants were highly accurate, said competition co-organizer Junichi Yamagishi, a researcher at Japan's National Institute of Informatics. The best system only made mistakes 0.22% of the time, he tells Axios.
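
For a rough sense of what that kind of detector looks like, here is a minimal sketch of the general recipe — summarize each clip with spectral features, then train a binary classifier on labeled genuine and synthetic examples. The file paths, feature choices and hyperparameters below are illustrative assumptions, not details of Lyu's, Farid's or the competition entrants' actual systems.

```python
# Minimal sketch of a deepfake-audio detector: spectral features plus a
# binary classifier trained on labeled genuine and synthetic clips.
# File paths, feature choices and hyperparameters are illustrative
# assumptions, not details of any system named in this story.
import numpy as np
import librosa  # audio loading and feature extraction
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def clip_features(path, sr=16000, n_mfcc=20):
    """Summarize one clip as the mean and variance of its MFCCs."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])

# Hypothetical corpus: paths to genuine recordings and synthetic ones.
real_paths = ["data/real/clip_0001.wav"]   # ...many more genuine clips
fake_paths = ["data/fake/clip_0001.wav"]   # ...many more synthetic clips

X = np.array([clip_features(p) for p in real_paths + fake_paths])
y = np.array([0] * len(real_paths) + [1] * len(fake_paths))  # 1 = fake

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

Competitive systems use far richer features and larger models, but the basic structure — learn from labeled real and synthetic speech — is exactly what a training trove like Google's makes possible.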

Pindrop, an Atlanta company that sells voice authentication to big banks and insurance companies, is also developing defenses, worried that the next wave of attacks on its clients will involve deepfake audio.

  • One key to detecting fakes, according to the company: sounds that seem normal, but that people aren't physically capable of making.
  • An example from Pindrop CEO Vijay Balasubramaniyan: If you say "Hello, Paul," your mouth can only shift from the "o" to "Paul" at a certain speed. Spoken too fast, "the only way to say this is with a 7-foot-tall neck," Balasubramaniyan says.
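
To make that plausibility idea concrete, here is a toy check along the same lines: given phoneme timings for a clip, it flags transitions that arrive faster than a mouth could manage. The 40-millisecond threshold and the sample alignment are made-up assumptions for illustration, not Pindrop's actual model or data.

```python
# Toy version of a physical-plausibility check on phoneme timing.
# The 40 ms minimum and the sample alignment are made-up assumptions,
# not Pindrop's actual thresholds or data.
MIN_TRANSITION_SEC = 0.040

def implausible_transitions(phonemes, min_gap=MIN_TRANSITION_SEC):
    """phonemes: ordered list of (label, onset_seconds).
    Flags consecutive phonemes whose onsets arrive closer together
    than a human vocal tract could plausibly manage."""
    flagged = []
    for (a, t_a), (b, t_b) in zip(phonemes, phonemes[1:]):
        if t_b - t_a < min_gap:
            flagged.append((a, b, round(t_b - t_a, 3)))
    return flagged

# "Hello, Paul" with an impossibly fast jump from the vowel in "Hello"
# to the "P" in "Paul" (a made-up alignment).
sample = [("HH", 0.00), ("AH", 0.05), ("L", 0.12), ("OW", 0.20),
          ("P", 0.22), ("AO", 0.30), ("L", 0.38)]
print(implausible_transitions(sample))  # [('OW', 'P', 0.02)]
```

A real pipeline would estimate those phoneme timings automatically from the audio rather than taking them as input; the point of the check is the same either way — sounds that seem normal but that no human mouth could produce.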

The bottom line: If deepfake detectors can get out ahead of the spread of fake audio, they could contain the potential fallout. And, unlike with video, it looks like the defenders could actually keep up with the forgers.

Go deeper: Audio deepfakes are getting better — but they haven't made it yet
