Axios AI+

December 13, 2024
Happy Friday! Today's AI+ is 960 words, a 3.5-minute read.
1 big thing: Chatbots are learning to lie
The same advances that let cutting-edge AI reasoning models solve complex problems also seem to enable and encourage them to hatch plots and deceive users.
Why it matters: This propensity for what researchers call "scheming" is precisely the kind of behavior that AI Cassandras have long predicted and warned about.
Case in point: In a pre-release review of OpenAI's o1 model this year, testers set a "strong goal" for the model and then presented it with "unambiguous evidence its given goal diverges from the goal of its developer/user."
- In those situations, they found that o1 "sometimes attempts to disable what it believes to be its own oversight mechanisms."
- Apollo Research, an independent firm that partners with AI model makers for safety testing, conducted "follow-up interviews" confronting the model about its misaligned actions. It found that, "in 99% of cases, o1 would deny taking action or fabricate false explanations."
The big picture: The AI industry is racing to turn its tools into agents that can perform tasks for users.
- AI agents will be more useful the more they are able to take independent actions.
- But the more independent they get, the harder the industry will have to work to ensure they haven't gone off on a tangent, begun hiding things or set out to outright deceive users.
Between the lines: In a paper published last week, Apollo's researchers described tests they performed to uncover instances of scheming on OpenAI's o1 as well as leading frontier models from Anthropic, Meta and Google.
- The firm defines scheming as "an AI covertly pursuing goals that are misaligned from its developers or users."
- Models from all four companies were found to engage in scheming, and OpenAI's o1 was the most reluctant to come clean about its behavior.
In the most extreme examples, models that discovered they would be "turned off" schemed to copy their weights (the key numbers that determine their output) to an external server so they could keep pursuing their original goal.
- Apollo also found instances of "sandbagging," where a model that knows it will face an "unlearning procedure" if it performs too well on a test will deliberately provide lower-scoring answers.
These tests bring us face-to-face with the more existential dimension of AI development.
- It's impossible not to picture a "scheming" AI as somehow intentional, because something that looks like intention arises from the model's densely complex math.
- This "intention" is merely a function of the model's training data, its standing instructions and goals, its prompts, and its interactions with users. But then aren't our own intentions just a function of our education, our core beliefs and our interactions?
Yes, but: If a scheming AI manages to perform some prank or misdeed, it won't matter to the victim whether the model intended harm or not.
- Apollo carefully describes the models' actions in terms not of intent but of actions and language used.
- "When we look at [these models'] chain-of-thought, we find that they very explicitly reason through their scheming plans and often use language like 'sabotage, lying, manipulation…'," per a summary of the Apollo paper.
What we're watching: The red-teaming tests Apollo performs for its model-making partners are conducted in carefully controlled environments in which researchers set out to get the AI models to misfire.
- Most regular users won't encounter scheming in their normal use of the technology.
- But with these models now in the hands of millions of people around the world, we should expect human users, accidentally or deliberately, to uncover endless new variations on model misbehavior.
2. ChatGPT can now "see" what your phone sees
OpenAI is rolling out the ability to share your phone's screen and live video in the ChatGPT mobile app's Advanced Voice mode, so users don't have to upload photos or describe their surroundings in chats, the company announced yesterday.
Why it matters: Screen and video sharing could make voice chats more efficient and useful, but they also offer OpenAI more access to a user's potentially sensitive personal information.
How it works: OpenAI says screen and video sharing will be available in Advanced Voice mode by tapping the voice icon in the chat bar.
- OpenAI says the features will be rolling out for ChatGPT Plus and Pro users in most countries, as well as to all ChatGPT Team users.
- Enterprise and Edu users will have access to the feature in January. ChatGPT Plus and Pro users in the EU, Switzerland, Iceland, Norway, and Liechtenstein will get the feature "soon," per OpenAI.
Context: The release follows an announcement from Google about advances by Project Astra, the experimental AI assistant that uses an Android app or prototype glasses to record the world as a person is seeing it.
- Astra is available only to "trusted testers," with a waitlist for those who want to join.
3. Training data
- Yesterday it was Meta, and now Amazon and OpenAI CEO Sam Altman say they'll kick in $1 million to the Trump inauguration. (Axios, Fox Business)
- Jury selection is set to begin today in federal court in Delaware in a high-stakes legal showdown between chip designer Arm and Qualcomm, its biggest customer. (Axios)
- Microsoft launched a research preview of a new smaller language model, Phi-4, that the company says reasons better than predecessors, particularly in math. (TechCrunch)
- Harvard, with funding from OpenAI and Microsoft, released a big new AI training dataset of nearly 1 million public domain books. (Wired)
- Google debuted Android XR, the company's latest foray into augmented and virtual reality, with mixed reality glasses from Samsung due out next year. (Road to VR)
- Meta yesterday released Meta Motivo, an AI model designed to control movement of a human figure in the metaverse. (Reuters)
4. + This
Most kids who believe in Santa Claus can't use ChatGPT's new Santa voice since OpenAI's terms of service say ChatGPT is for users 13 and up.
Thanks to Megan Morrone and Scott Rosenberg for editing this newsletter and Anjelica Tan for copy editing it.