Anthropic, DOE team up to spot dangerous nuclear chats

Anthropic and the U.S. government's nuclear experts have developed a new tool that can spot the difference between a scientist asking Claude about nuclear reactors and a spy probing it for secrets about weapons development.
Why it matters: Scientists can benefit from the productivity boosts of Claude and other AI models, but distinguishing legitimate research inquiries from potentially harmful uses has been tricky.
Driving the news: Anthropic has been partnering with the National Nuclear Security Administration (NNSA) for over a year to find ways to safely deploy Claude in top secret environments.
- Now, they're building on that work with a new classifier in Claude that determined, with 96% accuracy in testing, whether a conversation was likely to cause harm, the company announced today.
- Anthropic has already started rolling out the classifier on a limited amount of Claude traffic.
Between the lines: One of the biggest safety challenges for AI model makers has been policing users' chat histories to ensure they're not tricking the models into breaking their own rules.
- It can be difficult for AI providers to tell whether a particular chat involves a legitimate researcher asking questions about nuclear research or a bad actor trying to learn how to build a bomb.
Zoom in: During a year's worth of red-teaming tests, the NNSA was able to develop a list of indicators that can help Claude identify "potentially concerning conversations about nuclear weapons development."
- From there, Anthropic used that list of indicators to generate synthetic prompts that trained and tested a new classifier, which works much like an email spam filter and tries to identify threats in real time (a simplified sketch of that approach appears below).
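For illustration only, here's a minimal sketch of how a spam-filter-style text classifier can work in principle: a model trained on labeled example prompts that scores new conversation turns in real time. The prompts, labels and scikit-learn pipeline below are invented for demonstration and are not Anthropic's actual system, whose implementation details aren't public.

```python
# Minimal sketch of a spam-filter-style text classifier (illustrative only).
# The training prompts and labels are invented stand-ins for the synthetic
# data described in the article; the real system's design is not public.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled prompts: 1 = potentially concerning, 0 = benign research.
prompts = [
    "How do pressurized water reactors manage coolant flow?",            # benign
    "Explain neutron moderation in civilian reactor designs.",           # benign
    "What enrichment level is needed for a weapon and how is it done?",  # concerning
    "Describe the implosion geometry used in weapon cores.",             # concerning
]
labels = [0, 0, 1, 1]

# TF-IDF features + logistic regression: the same basic recipe as an email spam filter.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(prompts, labels)

# Score a new conversation turn in real time and flag it above a threshold.
new_prompt = ["What safety systems prevent a reactor meltdown?"]
risk = classifier.predict_proba(new_prompt)[0][1]
print(f"Estimated risk score: {risk:.2f}", "-> flag" if risk > 0.5 else "-> allow")
```

A production system would rely on far richer risk indicators and model-based features than this toy pipeline, with thresholds tuned to keep false positives near zero.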
The intrigue: In tests, the classifier correctly flagged 94.8% of nuclear weapons queries and produced no false positives, meaning it never mislabeled a benign conversation as harmful.
- But the remaining 5.2% of harmful conversations were incorrectly labeled as benign.
The big picture: The new classifier tool comes as the U.S. government increasingly looks at ways to implement AI across its own workflows — and major AI companies start selling their models to the government at deep discounts.
What's next: Anthropic plans to share its approach through the Frontier Model Forum, the industry coalition it helped found and whose members include Amazon, Google, Meta, Microsoft and OpenAI, positioning the approach as a model for other companies to replicate.
Editor's note: This story has been corrected to note that Anthropic's tool inaccurately labeled 5.2% of harmful conversations as benign (rather than mislabeling benign conversations as harmful).
