Sep 21, 2021 - Technology

Giving AI-generated voices human-like emotion

Illustration: Aïda Amer/Axios

A startup is developing AI voices that can be edited to present different emotional intonations.

Why it matters: The voiceover industry — including everything from video games to audiobooks — stands to be disrupted if tech companies can deliver AI-generated text-to-speech voices that can truly mimic human speech.

What's happening: Bay Area-based AMAI is coming out of stealth with a $600,000 funding round from the VC firms Joint Journey and NRG Ventures, Axios can report first.

  • AMAI, which makes AI-based text-to-voice engines, will use the funding to enter the U.S. market, says co-founder and CEO Pavel Osokin.

How it works: AMAI says its AI-generated voices are virtually indistinguishable from human speech and can be adjusted through a web editor that requires no coding, letting users "select different emotions like happiness, anger or more," says Osokin.

  • He notes that just three percent of written books end up recorded as audiobooks, but that AMAI's technology will allow authors and publishers to create audio versions cheaply, "without needing to hire a voiceover actor."

What they're saying: "Voice is going to be the next internet revolution playing out over the next 10 to 20 years," says Osokin.

The catch: Based on its demos, AMAI's AI voices still have a mechanical tinge, and while different emotions do come through, they're not yet at the level of a professional voice actor.

Yes, but: As the deep learning technology that undergirds AI voice generation improves, so will the quality of the product, until it becomes virtually impossible to distinguish machine-generated speech from an actual human speaking. That shift is already occurring with large AI text-generation models like GPT-3.