Sep 8, 2023 - Technology

AI's language gap


Illustration: Natalie Peeples/Axios

AI's first language is English — a bias that researchers are racing to counter before it gets permanently baked into the new technology.

Why it matters: Most of today's generative AI tools are built on large language models (LLMs) trained on texts and data in English and Chinese, leaving the 6 billion native speakers of the world's more than 7,000 other languages at risk of being left out as the technology reframes work, business, education, art and more.

  • "The languages that these models serve are going to be defaulted to as the easiest way to generate new information," says Sara Hooker, who leads Cohere for AI, Cohere's nonprofit AI research lab.
  • "And it's going to become much easier for that information to be amplified than it is for languages that these models don't serve."

How it works: Most of the data used to train the foundational models fueling the current wave of AI — for example, OpenAI's GPT models and Meta's Llama versions — is in English, and the AI tools they support perform best when asked questions in English.

  • GPT-4, the latest LLM from OpenAI, excels at English, Spanish, Italian, Indonesian and other Latin alphabet-based languages, but it struggles with Thai, Punjabi and other languages based on different alphabets. Baidu's Ernie Bot is best with Chinese, which it was trained on.
  • When it released its updated LLM model in July, Meta cautioned that because most of the training data for the model is in English, it "may not be suitable for use in other languages."
  • ChatGPT can translate prompts and responses into English well, but it often fumbles translating English into other languages. Languages such as French and Chinese, which are known as "high resource" languages and are well-represented in training data, are translated into English far better than Javanese and other "low resource" languages.
  • ChatGPT can also make up words, struggle with syntax and generate gibberish in many underrepresented languages, Andrew Deck writes for Rest of World, which tested the abilities of the free version of the chatbot released late last year. A newer version shows slight improvement with some languages for simple prompts but continues to struggle with more complicated requests, Deck notes.

What's happening: Some developers are trying to overcome these linguistic shortcomings by focusing on building multilingual large language models, while others are putting their efforts into tuning models to a particular language.

  • The Aya Project at Cohere is an open science project to build an AI model tuned with instructions in 100 languages (rather than focusing on a foundational model trained on unstructured text). The project, which plans to release its model early next year, follows other open-source models, including BLOOM, which can generate text in 46 languages.
  • Inception, a UAE-based company, last week released Jais, a bilingual Arabic-English LLM. The Masakhane Foundation is working on AI systems that capture African languages.

Clibrain, a startup based in Madrid, in July released LINCE Zero, an LLM tuned to Spanish.

  • "Spanish is not one Spanish," the company's CEO, Elena González-Blanco, tells Axios.
  • Clibrain is focused on capturing the nuances of the language, which has numerous dialects and variations spoken in 20 countries around the world.
  • Current demand for a Spanish language model is coming from the legal and communications sectors, González-Blanco says. The Barcelona Supercomputing Center has also released a Spanish LLM.

Yes, but: Building a model for every language isn't realistic, says Mona Diab, director of the Language Technologies Institute at Carnegie Mellon University. She's a proponent of multilingual models and sees promise in using them to capture families of languages.

  • For example, a model trained on Arabic from Tunisia, Egypt and Saudi Arabia, but not from Qatar, may still be able to respond to a prompt in the Qatari Arabic dialect.
  • "You can make up for the lack of pre-training data by knowing something about the language family," she says.

The big picture: AI's language bias reflects a broader lack of cultural awareness in AI systems.

  • Even for the languages that are included in training data, internet access tends to be concentrated in the upper echelons of those societies, so the data doesn't reflect the entire culture, Diab says.
  • The data has to be deliberately curated, taking into account the histories, politics and media landscapes of cultures, she adds.

Zoom in: Diab and her collaborators are currently comparing the views output by different kinds of language models — English-trained models that can translate, multilingual models and language-specific ones — to responses from multiple global surveys conducted on the ground in the Arab world.

  • "We're finding that the multilingual models are best," she says, but it depends on the question.
  • When it comes to politics, "most people feel more free to express themselves in English rather than Arabic," and models with English training perform better than a predominantly Arabic model. Questions about social topics, by contrast, lean toward the language-tuned systems.

The big question: Whose value systems and worldviews are imposed on these AI models as people try to flag potentially harmful speech?

  • "We hold everybody up to a certain yardstick, which is very Western-oriented, and maybe some of these things don't necessarily hold in different cultures in different settings," Diab says. "Barring human rights issues, are we being cognizant of what is of value to this community or that community and how that manifests itself in the way a language model responds to a certain query?"

What to watch: Researchers working on the Aya Project are now red-teaming eight of the languages in the model for safety, biases and other risks.

  • Native speakers are annotating the responses for toxicity, unsafe uses, bad financial advice and other issues.
  • "All the red-teaming to date for these major model launches has been done primarily in English," Hooker says, adding that how to assess safety is an open research question.
  • "We report risk in just one language, but we're deploying technology all over the world."

Go deeper: Social scientists look to AI models to study human behavior (Axios)
