This is AI's brain on AI

Illustration: Shoshana Gordon/Axios
Data to train AI models increasingly comes from other AI models in the form of synthetic data, which can fill in chatbots' knowledge gaps but also destabilize them.
The big picture: As AI models expand in size, their need for data becomes insatiable. But high-quality, human-made data is costly, and growing restrictions on the text, images and other kinds of data freely available on the web are driving the technology's developers toward machine-produced alternatives.
State of play: AI-generated data has been used for years to supplement scarce proprietary or private data in fields such as medical imaging and computer vision.
- But chatbots are trained on public data collected from across the internet that is increasingly being restricted — while at the same time, the web is expected to be flooded with AI-generated content.
Those constraints and the decreasing cost of generating synthetic data are spurring companies to use AI-generated data to help train their models.
- Meta, Google, Anthropic and others are using synthetic data — alongside human-generated data — to help train the AI models that power their chatbots.
- Google DeepMind's new AlphaGeometry 2 system, which can solve math Olympiad problems, was trained from scratch on synthetic data.
New research illustrates the potential effects of AI-generated data on the answers AI can give us.
- In one scenario that's extreme but plausible given the state of the web, researchers trained a generative AI model largely on AI-generated data. The model eventually became incoherent, a failure they call "model collapse" in a paper published Wednesday in Nature.
- The team fine-tuned a large language model using a dataset from Wikipedia, generated data from the model, then fed that data back in to fine-tune it again. They repeated the cycle, feeding each new model data generated by the previous one (a simplified version of the loop is sketched after this list).
- They found the training data becomes more polluted with each generation, eventually causing the model to respond with gibberish.
- For example, it was prompted with text about medieval architecture and after nine generations was outputting text about jackrabbits.
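For readers who want the mechanics, here is a minimal sketch of that recursive loop using the Hugging Face transformers library. The model name, dataset slice and training settings are illustrative stand-ins, not the authors' exact configuration.

```python
# Sketch of recursive training on model-generated data ("model collapse" setup).
# Settings are illustrative, not the Nature paper's exact configuration.
from datasets import Dataset, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "facebook/opt-125m"
tok = AutoTokenizer.from_pretrained(MODEL)

def fine_tune(model, texts):
    """Fine-tune the model for one epoch on a list of raw text strings."""
    ds = Dataset.from_dict({"text": texts}).map(
        lambda batch: tok(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ckpt", num_train_epochs=1,
                               per_device_train_batch_size=8, report_to=[]),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False))
    trainer.train()
    return model

def generate_corpus(model, prompts, n_new=64):
    """Sample a synthetic training corpus from the current model."""
    out = []
    for p in prompts:
        ids = tok(p[:200], return_tensors="pt").input_ids.to(model.device)
        gen = model.generate(ids, max_new_tokens=n_new, do_sample=True,
                             pad_token_id=tok.eos_token_id)
        out.append(tok.decode(gen[0], skip_special_tokens=True))
    return out

# Generation 0 trains on real Wikipedia-style text; every later
# generation trains only on text its predecessor produced.
texts = [t for t in load_dataset("wikitext", "wikitext-2-raw-v1",
                                 split="train[:1%]")["text"] if t.strip()]
model = AutoModelForCausalLM.from_pretrained(MODEL)
for generation in range(9):
    model = fine_tune(model, texts)
    texts = generate_corpus(model, texts[:200])  # synthetic replaces real
```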
How it works: The model starts to lose information about data that appears less often in the training set and eventually collapses under the accumulating errors, the team writes. The toy simulation below mimics that dynamic.
- The AI responded with "things that have no resemblance to reality," Ilia Shumailov, a co-author who worked on the paper while at the University of Oxford, told Axios.
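A toy simulation, separate from the paper's actual experiment, shows the same dynamic with simple statistics: each generation fits a bell curve to a finite sample drawn from the previous fit, and rare "tail" events steadily vanish.

```python
# Toy illustration of the collapse mechanism, not the paper's experiment:
# each "generation" fits a normal distribution to a finite sample drawn
# from the previous generation's fit. Estimation errors compound, the
# fitted spread tends to shrink, and rare tail events disappear first.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                           # generation 0: the real distribution
for generation in range(1, 101):
    sample = rng.normal(mu, sigma, size=100)   # finite "training data"
    mu, sigma = sample.mean(), sample.std()    # the next model is just this fit
    if generation % 25 == 0:
        tail = (np.abs(rng.normal(mu, sigma, 100_000)) > 3).mean()
        print(f"gen {generation:3d}: sigma={sigma:.3f}  P(|x| > 3)={tail:.5f}")
```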
Between the lines: Training with synthetic data carries particular risks for information from underrepresented groups of people or languages that appear infrequently in a dataset, Shumailov says.
- In another recent paper, he and other researchers tracked how data shifts across generations of models trained on synthetic data and found the shifts can erode fairness, even in datasets that were initially unbiased.
- It's likely "going to be harder to build models and harder to build fair models because the majority of the problems that we will experience are going to be experienced by minority data," Shumailov says.
Yes, but: AI-generated data can also be a powerful tool to address limitations in data.
- New research shows how it can be tailored to specific needs or questions and then used to steer models' responses to produce less harmful speech, represent more languages or provide other desired output.
- A team from Cohere for AI, Cohere's nonprofit AI research lab, recently reported using targeted sampling of AI-generated data to reduce a model's toxic responses by up to 40% (a simplified sketch follows this list).
- Shumailov and his colleagues performed "algorithmic reparation" by curating training data to improve fairness in models.
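The published recipes differ in their details, but targeted sampling can look roughly like the sketch below. Here `generate` and `score_toxicity` are hypothetical stand-ins for any teacher model and any classifier that maps text to a 0-1 toxicity score; this is not Cohere for AI's exact method.

```python
# Simplified sketch of targeted sampling of synthetic data, not Cohere for
# AI's exact method. `generate` and `score_toxicity` are hypothetical
# stand-ins passed in by the caller.

def curate(generate, score_toxicity, prompts, k=8, threshold=0.1):
    """Sample k candidate completions per prompt from a teacher model and
    keep only the least toxic one, and only if it clears the threshold."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = min(candidates, key=score_toxicity)
        if score_toxicity(best) < threshold:
            kept.append({"prompt": prompt, "completion": best})
    return kept  # becomes fine-tuning data that steers the student model
```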
By molding and sculpting data in different ways, researchers might be able to achieve their goals with a smaller model because it is trained on a dataset with a specific objective in mind, says Sara Hooker, who leads Cohere for AI.
- Instead of learning from synthetic data produced by one "teacher" model, AI can be trained on data strategically sampled from a community of specialized teachers, she says. That can help avoid "collapse" because the synthetic data comes from multiple sources, as in the sketch below.
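A rough sketch of that multi-teacher idea, with the teacher names and routing as illustrative assumptions:

```python
# Sketch of pooling synthetic data from several specialized "teacher"
# models rather than one general teacher. Domain names are illustrative.
import random

def build_corpus(teachers, prompts_by_domain):
    """teachers maps a domain name to a callable model; each domain's
    prompts go to its specialist, and the outputs are mixed together."""
    corpus = []
    for domain, prompts in prompts_by_domain.items():
        specialist = teachers[domain]            # e.g. a code, math or legal model
        corpus.extend(specialist(p) for p in prompts)
    random.shuffle(corpus)                       # blend sources before training
    return corpus
```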
The big questions are whether synthetic data can represent the breadth of humanity and its experience, and whether it can be used to surpass the best model out there, Hooker says.
- "That's the crux of the discussion within the research community and it is very far from decided."
The intrigue: When 10% of the original human-generated data was retained, the model's performance didn't suffer, the team reports in the Nature paper.
- Such data could be given more weight in training a model to protect it from collapsing, but it is currently difficult to tell real data from synthetic data, Shumailov says.
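One way to implement that safeguard, assuming real and synthetic text can be told apart, is to build every generation's training mix around a fixed slice of human data. The 10% share echoes the paper's experiment; the function and numbers here are assumptions.

```python
# Sketch of the mitigation the paper points to: retain a slice of real,
# human-written text in each generation's training mix. The 10% share
# echoes the paper's experiment; everything else is illustrative.
import random

def training_mix(real_texts, synthetic_texts, total=10_000, real_share=0.10):
    """Build a training set in which `real_share` of examples are real."""
    n_real = int(total * real_share)
    batch = (random.sample(real_texts, k=min(n_real, len(real_texts)))
             + random.sample(synthetic_texts,
                             k=min(total - n_real, len(synthetic_texts))))
    random.shuffle(batch)
    return batch
```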
The bottom line: AI-generated data is "an amazingly useful technology, but if you use it indiscriminately, it's going to run into problems," Vyas Sekar, a professor of electrical and computer engineering at Carnegie Mellon University, told Axios.
- "If used well, it can lead to really good outcomes," says Sekar, who is also co-founder and chief technology officer of Rockfish, a company that helps customers combine human- and AI-generated data for their specific needs.
- "There's value for both real data and generative data in any use case."
