A new open-source generative AI model can follow instructions in more than 100 languages.

Why it matters: Most models that power today's generative AI tools are trained on data in English and Chinese, leaving a gap of thousands of languages and potentially limiting access to the technology for billions of people.

Details: Cohere for AI, the nonprofit AI research lab at Cohere, on Tuesday released its open-source multilingual large language model (LLM) called Aya.

It covers more than twice as many languages as other existing open-source models and is the result of a year-long project involving 3,000 researchers in 119 countries.

How it works: The team started with a base model pre-trained on text covering 101 languages, then fine-tuned it to follow instructions in those languages.

But first they had to create a high-quality dataset of prompt-and-completion pairs (the inputs and outputs of the model) in different languages. That dataset is also being released.

Their data sources include machine translations of several existing datasets into more than 100 languages, roughly half of which are considered underrepresented — or unrepresented — in existing text datasets, including Azerbaijani, Bemba, Welsh and Gujarati.

They also created a dataset of about 204,000 prompts and completions, curated and annotated by fluent speakers of 67 languages, that tries to capture cultural nuances and meaningful information.

The team reports that Aya outperforms other existing open-source multilingual models in evaluations by both humans and GPT-4.

The impact: "Aya is a massive leap forward — but the biggest goal is all these collaboration networks spur bottom up collaborations," says Sara Hooker, who leads Cohere for AI.

The team envisions Aya being used for language research and to preserve and represent languages and cultures at risk of being left out of AI advances.

The big picture: Aya is one of a handful of open-source multilingual models, including BLOOM, which can generate text in 46 languages, a bilingual Arabic-English LLM called Jais, and a model in development by the Masakhane Foundation that covers African languages.

What to watch: "In some ways this is a bandaid for the wider issue with multilingual [LLMs]," Hooker says. "An important bandaid but the issues still persist."