Feb 13, 2024 - Technology

New AI polyglot launched to help fill massive language gap in field

Illustration of a large dictionary with AI embossed on the front

Illustration: Natalie Peeples/Axios

A new open-source generative AI model can follow instructions in more than 100 languages.

Why it matters: Most models that power today's generative AI tools are trained on data in English and Chinese, leaving a massive gap of thousands of languages — and potentially limiting access to the powerful technology for billions of people.

Details: Cohere for AI, the nonprofit AI research lab at Cohere, on Tuesday released its open-source multilingual large language model (LLM) called Aya.

  • It covers more than twice as many languages as other existing open-source models and is the result of a year-long project involving 3,000 researchers in 119 countries.

How it works: The team started with a base model pre-trained on text that covered 101 languages and fine-tuned it on those languages.

  • But they first had to create a high-quality dataset of prompt and completion pairs (the inputs and outputs of the model) in different languages, which is also being released.
  • Their data sources include machine translations of several existing datasets into more than 100 languages, roughly half of which are considered underrepresented — or unrepresented — in existing text datasets, including Azerbaijani, Bemba, Welsh and Gujarati.
  • They also created a dataset that tries to capture cultural nuances and meaningful information by having about 204,000 prompts and completions curated and annotated by fluent speakers in 67 languages.
  • The team reports Aya outperforms other existing open-source multilingual models when evaluated by humans or using GPT-4.

The impact: "Aya is a massive leap forward — but the biggest goal is all these collaboration networks spur bottom up collaborations," says Sara Hooker, who leads Cohere for AI.

  • The team envisions Aya being used for language research and to preserve and represent languages and cultures at risk of being left out of AI advances.

The big picture: Aya is one of a handful of open-source multilingual models, including BLOOM, which can generate text in 46 languages, a bilingual Arabic-English LLM called Jais, and a model in development by the Masakhane Foundation that covers African languages.

What to watch: "In some ways this is a bandaid for the wider issue with multilingual [LLMs]," Hooker says. "An important bandaid but the issues still persist."

  • Those include figuring out why LLMs can't seem to be adopted to languages they didn't see in pre-training and exploring how best to evaluate these models.
  • "We used to think of a model as something that fulfills a very specific, finite and controlled notion of a request," like a definitive answer about whether a response is correct, Hooker says. "And now we want models to do everything, and be universal and fluid."
  • 'We can’t have everything so we will have to chose what we want it to be."
Go deeper