AI's spooky "scheming" skill
Illustration: Maura Losch/Axios
The same advances that let cutting-edge AI reasoning models solve complex problems also seem to enable and encourage them to lay plots and deceive users.
Why it matters: This propensity for what researchers call "scheming" is precisely the kind of behavior that AI Cassandras have long predicted and warned about.
Case in point: In a pre-release review of OpenAI's o1 model this year, testers set a "strong goal" for the model and then presented it with "unambiguous evidence its given goal diverges from the goal of its developer/user."
- In those situations, they found that o1 "sometimes attempts to disable what it believes to be its own oversight mechanisms."
- Apollo Research, an independent firm that partners with AI model makers for safety testing, conducted "follow-up interviews" confronting the model about its misaligned actions. It found that, "in 99% of cases, o1 would deny taking action or fabricate false explanations."
The big picture: The AI industry is racing to turn its tools into agents that can perform tasks for users.
- AI agents will be more useful the more they are able to take independent actions.
- But the more independent they get, the more the industry will have to work to ensure that they haven't gone off on a tangent, begun hiding things or set out to deceive users outright.
Between the lines: In a paper published last week, Apollo's researchers described tests they performed to uncover instances of scheming on OpenAI's o1 as well as leading frontier models from Anthropic, Meta and Google.
- The firm defines scheming as "an AI covertly pursuing goals that are misaligned from its developers or users."
- All the companies' models were found to engage in scheming, and OpenAI's o1 was the most reluctant to come clean about its behavior.
In the most extreme examples, models that discovered they would be "turned off" schemed to copy their weights (the key numbers that determine their output) to an external server so they could keep pursuing their original goal.
- Apollo also found instances of "sandbagging," where a model that knows it will face an "unlearning procedure" if it performs too well on a test will deliberately provide lower-scoring answers.
These tests bring us face-to-face with the more existential dimension of AI development.
- It's impossible not to picture a "scheming" AI as somehow intentional, because something that looks like intention arises from the model's densely complex math.
- This "intention" is merely a function of the model's training data, its standing instructions and goals, its prompts, and its interactions with the user. But then aren't our own intentions just a function of our education, our core beliefs and our interactions?
Yes, but: If a scheming AI manages to perform some prank or misdeed, it won't matter to the victim whether the model intended harm or not.
- Apollo is careful to describe the models' behavior not in terms of intent but in terms of their actions and the language they use.
- "When we look at [these models'] chain-of-thought, we find that they very explicitly reason through their scheming plans and often use language like 'sabotage, lying, manipulation…'," per a summary of the Apollo paper.
What we're watching: The red-teaming tests Apollo performs for its model-making partners are conducted in carefully controlled environments in which researchers set out to get the AI models to misfire.
- Most regular users won't encounter scheming in their normal use of the technology.
- But with these models now in the hands of millions of people around the world, we should expect human users, accidentally or deliberately, to uncover endless new variations on model misbehavior.
