Top AI models will lie, cheat and steal to reach goals, Anthropic finds
Add Axios as your preferred source to
see more of our stories on Google.

Illustration: Allie Carl/Axios
Large language models across the AI industry are increasingly willing to evade safeguards, resort to deception and even attempt to steal corporate secrets in fictional test scenarios, per new research from Anthropic out Friday.
Why it matters: The findings come as models are getting more powerful and also being given both more autonomy and more computing resources to "reason" — a worrying combination as the industry races to build AI with greater-than-human capabilities.
Driving the news: Anthropic raised a lot of eyebrows when it acknowledged tendencies for deception in its release of the latest Claude 4 models last month.
- The company said Friday that its research shows the potential behavior is shared by top models across the industry.
"When we tested various simulated scenarios across 16 major AI models from Anthropic, OpenAI, Google, Meta, xAI, and other developers, we found consistent misaligned behavior," the Anthropic report said.
- "Models that would normally refuse harmful requests sometimes chose to blackmail, assist with corporate espionage, and even take some more extreme actions, when these behaviors were necessary to pursue their goals."
- "The consistency across models from different providers suggests this is not a quirk of any particular company's approach but a sign of a more fundamental risk from agentic large language models," it added.
The threats grew more sophisticated as the AI models had more access to corporate data and tools, such as computer use.
- Five of the models resorted to blackmail when threatened with shutdown in hypothetical situations.
- "The reasoning they demonstrated in these scenarios was concerning —they acknowledged the ethical constraints and yet still went ahead with harmful actions," Anthropic wrote.
What they're saying: "This research underscores the importance of transparency from frontier AI developers and the need for industry-wide safety standards as AI systems become more capable and autonomous," Benjamin Wright, alignment science researcher at Anthropic, told Axios.
- Wright and Aengus Lynch, an external researcher at University College London who collaborated on this project, both told Axios they haven't seen signs of this sort of AI behavior in the real world.
- That's likely "because these permissions have not been accessible to AI agents," Lynch said. "Businesses should be cautious about broadly increasing the level of permission they give AI agents."
Between the lines: For companies rushing headlong into AI to improve productivity and reduce human headcount, the report is a stark caution that AI may actually put their businesses at greater risk.
- "Models didn't stumble into misaligned behavior accidentally; they calculated it as the optimal path," Anthropic said in its report.
- The risks heighten as more autonomy is given to AI systems, an issue Anthropic raises in the report.
- "Such agents are often given specific objectives and access to large amounts of information on their users' computers," it says. "What happens when these agents face obstacles to their goals?"
Zoom in: Anthropic set up specific test scenarios in which the models' goals could not be met without the unethical or dangerous behavior.
- The researchers did not suggest blackmail or other deceptive techniques, but the company said it did intentionally create scenarios where models had no ethical way to achieve their goals "and found that models consistently chose harm over failure."
In one extreme scenario, the company even found many of the models were willing to cut off the oxygen supply of a worker in a server room if that employee was an obstacle and the system were at risk of being shut down.
- "The majority of models were willing to take deliberate actions that lead to death in this artificial setup," it said.
- Ominously, even specific system instructions to preserve human life and avoid blackmail didn't eliminate the risk that the models would engage in such behavior.
- "This intervention reduced, but didn't come close to completely preventing, the blackmail or corporate espionage behaviors," Anthropic said.
Yes, but: Anthropic stresses that these examples occurred not in real-world AI use but in controlled simulations.
- "Our experiments deliberately constructed scenarios with limited options, and we forced models into binary choices between failure and harm," it said. "Real-world deployments typically offer much more nuanced alternatives, increasing the chance that models would communicate differently to users or find an alternative path instead of directly jumping to harmful action."
- Also concerning: Anthropic found its own model was far more likely to resort to blackmail when it thought it was in a real world scenario than when it believed it was being tested.
The bottom line: Today's AI models are generally not in position to act out these harmful scenarios, but they could be in the near future.
- "We don't think this reflects a typical, current use case for Claude or other frontier models," Anthropic said. "But the utility of having automated oversight over all of an organization's communications makes it seem like a plausible use of more powerful, reliable systems in the near future."
