Axios Science

July 25, 2024
Welcome back to Axios Science. This week's newsletter is 1,672 words, about a 6½-minute read.
- Send your feedback and ideas to me at [email protected].
1 big thing: AI's brain on AI
The data that trains AI models increasingly comes from other AI models in the form of synthetic data, which can fill in chatbots' knowledge gaps but also destabilize them.
The big picture: As AI models expand in size, their need for data becomes insatiable — but high-quality human-made data is costly, and growing limitations on the text, images and other kinds of data freely available on the web are driving the technology's developers toward machine-produced alternatives.
- At the same time, the web is expected to be flooded with AI-generated content.
- Those constraints and the decreasing cost of generating synthetic data are spurring companies like Meta, Google, Anthropic and others to use AI-generated data to help train their models. (See below.)
New research illustrates the potential effects of AI-generated data on the answers AI can give us.
- In one scenario that's extreme yet plausible, given the state of the web, researchers trained a generative AI model largely on AI-generated data. The model eventually became incoherent, in what they called a case of "model collapse" in a paper published yesterday in Nature.
- The team fine-tuned a large language model using a dataset from Wikipedia, generated data from the AI model and then fed it back into the model to fine-tune it again. They did this repeatedly, feeding each new model data generated by the previous one.
- They found the training data became polluted over the generations, eventually causing the model to respond with gibberish.
- For example, it was prompted with text about medieval architecture and after nine generations was outputting text about jackrabbits.
How it works: The model starts to lose information about data that appears less often in the training set and eventually collapses under the errors that accumulate with each generation, the team writes.
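The recursive loop the researchers describe — each generation trained only on the previous generation's output — can be sketched in miniature. This toy simulation (not the paper's actual experiment; the category names and sample size are illustrative) shows how a rare category's share of the data erodes, and once it drops out of a generation's sample it can never return:

```python
# Toy sketch of recursive training: each "model" learns its category
# frequencies only from samples drawn from the previous model's output.
import random
from collections import Counter

random.seed(0)

# Generation 0: "human" data with a common and a rare category.
dist = {"common": 0.95, "rare": 0.05}
SAMPLE_SIZE = 20  # small samples exaggerate the loss of rare events

for generation in range(50):
    # Train the next generation only on output sampled from the previous one.
    samples = random.choices(list(dist), weights=dist.values(), k=SAMPLE_SIZE)
    counts = Counter(samples)
    dist = {k: counts.get(k, 0) / SAMPLE_SIZE for k in dist}

print(dist)  # the rare category's share typically collapses toward 0
```

Once "rare" is sampled zero times in any generation, its weight becomes 0 and it is gone for good — a minimal picture of how low-frequency information disappears first.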
Between the lines: Training with synthetic data carries particular risks for information from underrepresented groups of people or languages that don't appear often in a dataset, Ilia Shumailov, a co-author of the paper who was at the University of Oxford when it was written, told Axios.
- In another recent paper, he and other researchers tracked shifts in data over generations of models trained on synthetic data and found they could lead to a loss of fairness — even in datasets that were initially unbiased, they report.
- It's likely "going to be harder to build models and harder to build fair models because the majority of the problems that we will experience are going to be experienced by minority data," Shumailov says.
Yes, but: AI-generated data can also be a powerful tool to address limitations in data.
- New research shows how it can be tailored to specific needs or questions and then used to steer models' responses to produce less harmful speech, represent more languages or provide other desired output.
- A team from Cohere for AI, Cohere's nonprofit AI research lab, recently reported being able to use targeted sampling of AI-generated data to reduce toxic responses from a model by up to 40%.
- Shumailov and his colleagues performed "algorithmic reparation" by curating training data to improve fairness in models.
The big questions are whether synthetic data can represent the breadth of humanity and its experience, and whether it can be used to surpass the best model out there, says Sara Hooker, who leads Cohere for AI.
- "That's the crux of the discussion within the research community and it is very far from decided."
The intrigue: When 10% of the original human-generated data was retained, the model's performance didn't suffer, the team reports in the Nature paper.
- Such data could be given more weight in training a model to protect it from collapsing, but it is currently difficult to tell real data from synthetic data, Shumailov says.
The bottom line: AI-generated data is "an amazingly useful technology, but if you use it indiscriminately, it's going to run into problems," Vyas Sekar, a professor of electrical and computer engineering at Carnegie Mellon University, told Axios.
2. A new AI math whiz
Two AI systems from Google DeepMind together solved four of the six problems in this year's International Mathematical Olympiad — on par with silver medalists in the annual world math championship for high school students.
Why it matters: The ability to solve a range of math problems in step-by-step proofs is considered a "grand challenge" in machine learning and has been beyond the reach of current state-of-the-art AI systems.
How it works: AlphaProof teaches itself by trial and error — without human intervention — in what's known as reinforcement learning.
- The team first fine-tuned Google's Gemini model to translate 1 million mathematics problem statements from English into a programming language called Lean, Thomas Hubert, a research engineer at DeepMind, said in a press briefing.
- The problems, which ranged in difficulty, were then given to AlphaProof so it could generate potential solutions that it then checked against possible proof steps.
- Those that worked were then fed back into the model, which improved as it attempted more problems.
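The generate-check-reinforce loop described above can be sketched with a toy stand-in for the formal verifier. This is not DeepMind's system — the problems, the random "generator" and the verifier here are invented for illustration, with a trivial arithmetic check playing the role Lean plays in AlphaProof:

```python
# Toy sketch of a generate-verify-reinforce loop: guess candidate
# solutions, keep only the ones a checker verifies, and reuse them.
import random

random.seed(1)

def verify(problem, candidate):
    # Stand-in for a formal proof checker (Lean, in AlphaProof's case):
    # here a "correct solution" is just the sum of the numbers.
    return candidate == sum(problem)

def generate(problem, knowledge):
    # Stand-in for the model: reuse a verified solution if one exists,
    # otherwise make a fresh guess.
    if problem in knowledge:
        return knowledge[problem]
    return random.randint(0, 20)

knowledge = {}  # verified solutions fed back in, like training signal
problems = [(2, 3), (4, 4), (7, 1)]

for _ in range(500):  # trial and error over many attempts
    for problem in problems:
        candidate = generate(problem, knowledge)
        if verify(problem, candidate):
            knowledge[problem] = candidate  # reinforce what worked

print(len(knowledge))  # grows as verified attempts accumulate
```

The key design point mirrored here is that the verifier, not a human, supplies the learning signal — only candidates that check out are kept.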
AlphaProof solved three of this year's math olympiad problems — two algebra problems and one in number theory.
- One was solved in minutes and the others took up to three days. (Students have two 4.5-hour sessions to submit answers.)
- It couldn't crack two combinatorics problems.
The other member of the AI team, AlphaGeometry 2, solved the competition's geometry problem in 19 seconds.
- There is very little data available to train math-focused AI models, so the DeepMind team used synthetic data generated by AI itself to train AlphaGeometry 2.
- The system can solve 83% of the geometry problems from math olympiads over the past 25 years, compared with the 53% its predecessor could solve, the company said.
Overall, the AI systems scored 28 out of 42 possible points — putting them in silver-medal territory and one point shy of the gold-medal threshold, the company said.
3. NSF launches new research security centers
The National Science Foundation this week announced it is investing $67 million over five years in centers around the U.S. that will give researchers tools to identify foreign interference in their work.
Why it matters: The U.S. research community is trying to balance the risks of interference and the rewards of international collaboration.
How it works: Safeguarding the Entire Community of the U.S. Research Ecosystem (SECURE), mandated under the 2022 CHIPS and Science Act, will be led by the University of Washington.
- The center at the university "will share information and reports on research security risks [and] provide training on research security to the science and engineering community," NSF said in a press release.
- It will also "serve as a bridge between the research community and government funding agencies to strengthen cooperation on addressing security concerns," the statement said.
Four additional regional centers around the country will be managed by five universities: Northeastern University, Emory University, University of Missouri, the University of Texas at San Antonio and Texas A&M University.
- Texas A&M will also lead an effort to provide risk modeling and analytical data about the scope and scale of research security threats.
- NSF and others publish some information about cases but there isn't comprehensive data about the issue.
Go deeper: A new world for science research security
4. Worthy of your time
A new element on the periodic table might be within reach (Emily Conover — Science News)
Earth likely just had its hottest two days in thousands of years (Andrew Freedman — Axios)
It's not just us: Other animals change their social habits in old age (Tim Vernimmen — Knowable)
5. Something wondrous
Deep on the ocean floor, polymetallic nodules may be producing "dark oxygen," researchers reported this week.
Why it matters: Researchers are trying to understand the potential environmental impacts of plans to try to mine the nodules, which contain cobalt, nickel and other valuable metals that power batteries and solar panels.
- The findings suggest nodules may have a previously unknown environmental role and, if confirmed, raise the "urgent question" of how mining activities may influence the production of "dark oxygen," researchers led by Andrew Sweetman of the Scottish Association for Marine Science wrote in Nature Geoscience.
What they found: Sweetman and his team made the discovery while studying the seabed in the Clarion-Clipperton zone, a region of the Pacific Ocean that is a potential deep-sea mining site.
- During several expeditions to different sites, they measured the concentration of oxygen in the water nearly 14,000 feet below the surface and kept seeing the oxygen levels increase over several days. The same observation was made with a different instrument.
- The rise in oxygen was surprising: The levels would be expected to fall over time in water captured in an instrument chamber as the gas is consumed by some organisms but presumably not produced because photosynthesis doesn't occur at that dark depth.
- The rise wasn't observed in sites without the polymetallic nodules.
How it works: When the team analyzed the nodules themselves, they found the nodules carried a high electric charge that they hypothesize could catalyze the splitting of water into hydrogen and oxygen in a "geo-battery."
What they're saying: "We need to rethink how to mine these materials, so that we do not deplete the oxygen source for deep-sea life," chemist Franz Geiger of Northwestern University said in a press release. Geiger is a co-author of the new paper.
- Mining could remove nodules or bury them in seafloor sediment, disrupting the still-to-be-confirmed geochemical process, with implications for deep-sea organisms.
Yes, but: The process needs to be further investigated, the researchers wrote. And any alternative explanations need to be ruled out.
The intrigue: Oxygen is widely thought to be produced only through photosynthesis, and many scientists think the origin of life on Earth is tied to photosynthetic cyanobacteria found in oceans, lakes and other ecosystems.
- The discovery of potentially another source of oxygen deep in the ocean, where photosynthesis isn't possible, could then have implications for understanding how life originated on Earth, Sweetman said.
"It's a really important finding because it shows there are new processes that we haven't discovered yet," said Lisa Levin, a biological oceanographer at the Scripps Institution of Oceanography who wasn't involved in the research.
Big thanks to managing editor Scott Rosenberg, Shoshana Gordon on the Axios Visuals team and to copy editor Carolyn DiPaolo.
Sign up for Axios Science