OpenAI updates its system for evaluating AI risks

Illustration: Natalie Peeples/Axios
OpenAI is making several changes to the system it uses to evaluate the risks new models pose, adding new categories for models that could self-replicate or conceal their capabilities.
Why it matters: OpenAI uses its "preparedness framework" to decide whether AI models are safe and what, if any, safeguards are needed during development and for public release.
Driving the news: In another change, OpenAI will no longer specifically evaluate models on their persuasive capabilities — an area where its recent models had already risen to "medium" risk level.
- The company is also doing away with the distinction between "low" and "medium" risk and will focus on whether risks reach the "high" or "critical" levels.
In addition to continuing to monitor the risk that AI might be used to create bioweapons or gain a capacity for self-improvement, OpenAI is adding several new "research" categories — such as whether a model can conceal capabilities, evade safeguards or seek to replicate itself or prevent shutdowns.
- "We are on the cusp of systems that can do new science, and that are increasingly agentic — systems that will soon have the capability to create meaningful risk of severe harm," OpenAI said in the updated framework. "This means we will need to design and deploy safeguards we can rely on for safety and security."
- The changes are the first OpenAI has made to the framework since it was unveiled in December 2023.
What they're saying: In an interview, OpenAI safety researcher Sandhini Agarwal told Axios the changes are designed to shift the company's efforts toward safeguards that protect against the most severe risks.
- "The purpose of the framework is to focus on catastrophic risks," she said. "This is not the be-all, end-all of safety at OpenAI."
Between the lines: The new research categories align with broader industry discussion around the prospect that models might act differently in testing than in the real world and that they might try to conceal their capabilities.
- Anthropic released an eye-opening paper last month suggesting large language models may do more planning than is visible and can potentially misrepresent their reasoning.
- "These are very much early glimmers," Agarwal said. "We want to understand those."
