Anthropic, DOE team up to spot dangerous nuclear chats

Anthropic and the U.S. government's nuclear experts have developed a new tool that can spot the difference between a scientist asking Claude about nuclear reactors and a spy probing it for secrets about weapons development.
Why it matters: Scientists can benefit from the productivity boosts of Claude and other AI models, but distinguishing legitimate research inquiries from potentially harmful uses has been tricky.
Driving the news: Anthropic has been partnering with the National Nuclear Security Administration (NNSA) for over a year to find ways to safely deploy Claude in top secret environments.
- Now, they're building on that work with a new classifier in Claude that determined, with 96% accuracy in testing, whether a conversation was likely to cause harm, the company announced today.
- Anthropic has already started rolling out the classifier on a limited amount of Claude traffic.
Between the lines: One of the biggest safety challenges for AI model makers has been policing users' chat histories to ensure they're not tricking the models into breaking their own rules.
- It can be difficult for AI providers to tell whether a particular chat involves a legitimate researcher asking questions about nuclear research or a bad actor trying to learn how to build a bomb.
Zoom in: During a year's worth of red-teaming tests, the NNSA was able to develop a list of indicators that can help Claude identify "potentially concerning conversations about nuclear weapons development."
- From there, Anthropic used that list of indicators to generate synthetic prompts that trained and tested a new classifier, which works much like an email spam filter and tries to identify threats in real time (a simplified sketch of that approach appears below).
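For illustration only, here's a minimal sketch of how a spam-filter-style text classifier can work in principle: a model trained on labeled example prompts that scores new conversation turns in real time. The prompts, labels and scikit-learn pipeline below are invented for demonstration and are not Anthropic's actual system, whose implementation details aren't public.

```python
# Minimal sketch of a spam-filter-style text classifier (illustrative only).
# The training prompts and labels are invented stand-ins for the synthetic
# data described in the article; the real system's design is not public.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled prompts: 1 = potentially concerning, 0 = benign research.
prompts = [
    "How do pressurized water reactors manage coolant flow?",            # benign
    "Explain neutron moderation in civilian reactor designs.",           # benign
    "What enrichment level is needed for a weapon and how is it done?",  # concerning
    "Describe the implosion geometry used in weapon cores.",             # concerning
]
labels = [0, 0, 1, 1]

# TF-IDF features + logistic regression: the same basic recipe as an email spam filter.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(prompts, labels)

# Score a new conversation turn in real time and flag it above a threshold.
new_prompt = ["What safety systems prevent a reactor meltdown?"]
risk = classifier.predict_proba(new_prompt)[0][1]
print(f"Estimated risk score: {risk:.2f}", "-> flag" if risk > 0.5 else "-> allow")
```

A production system would rely on far richer risk indicators and model-based features than this toy pipeline, with thresholds tuned to keep false positives near zero.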
The intrigue: In tests, the classifier correctly flagged 94.8% of nuclear weapons queries and produced no false positives, meaning it never mislabeled a benign conversation as harmful.
- But the remaining 5.2% of harmful conversations were incorrectly labeled as benign.
The big picture: The new classifier tool comes as the U.S. government increasingly looks at ways to implement AI across its own workflows — and major AI companies start selling their models to the government at deep discounts.
What's next: Anthropic plans to share its approach through the Frontier Model Forum, the industry coalition it helped found and whose members include Amazon, Google, Meta, Microsoft and OpenAI, positioning the approach as a model for other companies to replicate.
Editor's note: This story has been corrected to note that Anthropic's tool inaccurately labeled 5.2% of harmful conversations as benign (rather than mislabeling benign conversations as harmful).
