Axios AI+

July 29, 2024
It was a busy but fun weekend in Paris covering everything from the U.S.' first gold medal of the Games to the Canadian soccer drone scandal to Simone Biles and the first round of women's gymnastics, along with women's skateboarding — a personal favorite. You can catch all of Axios' Paris Olympics coverage here.
Today's AI+ is 1,105 words, a 4-minute read.
1 big thing: AI's brain on AI
Data to train AI models increasingly comes from other AI models in the form of synthetic data, which can fill in chatbots' knowledge gaps but also destabilize them.
The big picture: As AI models expand in size, their need for data becomes insatiable — but high-quality human-made data is costly, and growing restrictions on text, images and other kinds of data freely available on the web are driving the technology's developers toward machine-produced alternatives.
Those constraints and the decreasing cost of generating synthetic data are spurring companies to use AI-generated data to help train their models.
- Meta, Google, Anthropic and others are using synthetic data — alongside human-generated data — to help train the AI models that power their chatbots.
New research illustrates the potential effects of AI-generated training data on the answers AI can give us.
- In one scenario, researchers trained a generative AI model largely on AI-generated data. The model eventually became incoherent: They called it a case of "model collapse" in a paper published Wednesday in Nature.
- The team fine-tuned a large language model on a dataset from Wikipedia, used the model to generate new data and then fed that data back in to fine-tune the model again. They repeated the process, feeding each new model data generated by the previous one.
- They found the training data became increasingly polluted over the generations, eventually causing the model to respond with gibberish.
- For example, a model prompted with text about medieval architecture was, after nine generations, outputting text about jackrabbits.
How it works: The model starts to lose information about data that appears less often in the training set and eventually collapses under the accumulating errors, the team writes. (A toy sketch of this feedback loop appears below.)
- The AI responded with "things that have no resemblance to reality," Ilia Shumailov, a co-author who worked on the paper while at the University of Oxford, tells Axios.
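To make the mechanism concrete, here's a toy simulation of the feedback loop. It's a minimal sketch in which a simple Gaussian model stands in for the LLM; it is not the paper's actual experiment.

```python
# Toy simulation of model collapse: each generation "trains" on data
# produced by the previous generation's model. Illustrative only; the
# Nature paper's experiment fine-tuned a large language model instead.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data. Its tails stand in for rare facts,
# minority languages and other low-frequency content.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()     # "train": fit a Gaussian
    data = rng.normal(mu, sigma, size=50)   # "generate" the next dataset
    if generation % 20 == 0:
        print(f"generation {generation:3d}: estimated spread = {sigma:.3f}")

# The estimated spread tends to shrink over the generations, so extreme
# values stop appearing: information about infrequent data is lost,
# which is the failure mode the paper describes for LLMs.
```

With only 50 samples per generation, estimation error compounds quickly; larger datasets slow the drift but don't change the underlying dynamic.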
Between the lines: Training with synthetic data carries particular risks for information from underrepresented groups of people or languages that don't appear often in a dataset, Shumailov says.
- In another recent paper, he and other researchers tracked shifts in data over generations of models trained on synthetic data and found they could lead to a loss of fairness — even in datasets that were initially unbiased.
- It's likely "going to be harder to build models and harder to build fair models because the majority of the problems that we will experience are going to be experienced by minority data," Shumailov says.
Yes, but: AI-generated data can also be a powerful tool to address limitations in data.
- New research shows how it can be tailored to specific needs or questions and then used to steer models' responses to produce less harmful speech, represent more languages or provide other desired output.
- A team from Cohere for AI, Cohere's nonprofit AI research lab, recently reported being able to use targeted sampling of AI-generated data to reduce toxic responses from a model by up to 40% (sketched after this list).
- Shumailov and his colleagues performed "algorithmic reparation" by curating training data to improve fairness in models.
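Here's a rough sketch of what targeted sampling could look like in practice (an illustration under assumptions, not Cohere for AI's actual pipeline; `toxicity_score` is a hypothetical stand-in for a trained classifier).

```python
# Sketch of targeted sampling for synthetic training data: over-generate
# candidates, score them, and keep only the cleanest. Hypothetical
# illustration; toxicity_score() stands in for a real trained classifier.
from typing import Callable

def toxicity_score(text: str) -> float:
    """Hypothetical stand-in: return a toxicity estimate in [0, 1]."""
    flagged = {"hate", "stupid"}  # a real system would use a trained model
    words = text.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def curate_synthetic_data(
    generate: Callable[[], str],   # wraps a generative model's sampler
    n_keep: int,
    oversample: int = 4,           # draw more candidates than needed
    max_toxicity: float = 0.1,
) -> list[str]:
    candidates = [generate() for _ in range(n_keep * oversample)]
    kept = [t for t in candidates if toxicity_score(t) <= max_toxicity]
    kept.sort(key=toxicity_score)  # prefer the cleanest examples
    return kept[:n_keep]
```

The same pattern could serve the other goals above: swap the scoring function for a language identifier to boost underrepresented languages, or for a fairness metric.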
The big questions are whether synthetic data can represent the breadth of humanity and its experience, and whether it can be used to surpass the best model out there, says Sara Hooker, who leads Cohere for AI.
- "That's the crux of the discussion within the research community, and it is very far from decided."
The intrigue: When 10% of the original human-generated data was retained, the model's performance didn't suffer, the team reports in the Nature paper.
- Such data could be given more weight in training a model to protect it from collapsing, but it is currently difficult to tell real data from synthetic data, Shumailov says.
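If real and synthetic examples could be reliably told apart, the up-weighting idea might look something like this minimal sketch, in which the 3x factor is an arbitrary assumption.

```python
# Minimal sketch of up-weighting verified human data in a training mix.
# Assumes the two sources can be distinguished, which Shumailov notes is
# currently difficult; the 3x weight is an arbitrary illustration.
import random

def sample_training_batch(
    human_data: list[str],
    synthetic_data: list[str],
    batch_size: int,
    human_weight: float = 3.0,
) -> list[str]:
    pool = human_data + synthetic_data
    weights = [human_weight] * len(human_data) + [1.0] * len(synthetic_data)
    return random.choices(pool, weights=weights, k=batch_size)

# Example: 10% human data, as in the Nature result, sampled more heavily.
batch = sample_training_batch(
    human_data=["human-written text ..."] * 100,
    synthetic_data=["model-generated text ..."] * 900,
    batch_size=32,
)
```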
The bottom line: AI-generated data is "an amazingly useful technology, but if you use it indiscriminately, it's going to run into problems," Vyas Sekar, a professor of electrical and computer engineering at Carnegie Mellon University, tells Axios.
2. Top security trainer falls for AI-fueled scam
KnowBe4, a well-regarded security training company, is the latest to fall victim to a long-running North Korean IT worker scam.
Why it matters: Even the companies designed to fend off these threats haven't figured out a way to avoid them.
The big picture: North Korean workers have gotten scary good at gaming U.S. hiring practices to score coveted remote jobs, both to make money for the regime and to collect U.S. company secrets.
- Many of these job candidates use AI tools to disguise their voices or alter their appearance during video calls so they go undetected.
Zoom in: KnowBe4 CEO Stu Sjouwerman wrote in a blog post last week that the company recently discovered and fired an employee who was one of these North Korean IT workers.
- KnowBe4 had conducted four videoconference interviews, run a background check, and even confirmed the person matched the photo provided on his application before hiring him.
- But the candidate had stolen a U.S.-based identity and used AI tools to enhance a stock image to bypass an ID check, Sjouwerman said.
What happened: On July 15, KnowBe4's security team detected a "series of suspicious activities" coming from the new employee's laptop.
- After the new employee spent several hours dodging requests to hop on a phone call, the IT team decided to wall off his computer from the rest of the corporate network.
- The employee wasn't able to illegally access any of KnowBe4's systems and no data was lost, stolen or compromised, Sjouwerman wrote.
- However, the employee did try to load infostealer malware onto his machine. Sjouwerman said the company isn't quite sure why.
Threat level: Insider threats have become a bigger issue as dominant American AI companies hold secrets that adversaries increasingly want to steal.
The bottom line: KnowBe4 recommended that other companies employ tough job-candidate vetting, conduct all remote job interviews with cameras on and only ship laptops to the address where the candidate lives.
3. Training data
- Apple's AI — called Apple Intelligence — will arrive in October, which is later than planned, but developers should be able to test features this week. (Bloomberg)
- AI in the legal field is convincing investors to cut big checks. Canadian legaltech AI company Clio just raised $900 million in Series F funding, and last week OpenAI-backed Harvey raised $100 million. (Axios, TechCrunch)
- Hackers are competing to win millions in Pentagon funding for developing new ways of harnessing AI to find flaws in open source software. (Washington Post)
4. + This
Lego Ina is also at the Paris Games, with her very own press credential. Sadly, I left her in the pocket of a pair of shorts when I did laundry on Saturday morning. Fortunately, she's quite resilient and made it with me to gymnastics on Sunday.
Thanks to Scott Rosenberg and Megan Morrone for editing this newsletter and to Caitlin Wolper for copy editing it.
Sign up for Axios AI+