Anthropic's new weapon to detect abuse

Illustration: Lindsey Bailey/Axios
Anthropic's new automated analysis tool offers fresh insight into how the company weeds out malicious users trying to manipulate its Claude chatbot.
Why it matters: Distinguishing adversaries' queries from run-of-the-mill user inputs is the biggest challenge model operators face in their quest to identify and stop emerging threats.
Driving the news: Anthropic released details last week about its new Clio tool, which studies what users ask Claude in much the same way Google tracks search trends.
- The tool can help Anthropic assess how everyday users are relying on Claude — and it can detect new threat actors trying to use the chatbot to do their bidding.
- Anthropic even used the tool to monitor queries about elections around the world in 2024.
Zoom in: Clio extracts "facets" from each conversation with Claude, such as metadata about the conversation topic or the number of back-and-forths someone has with the chatbot.
- Similar conversations are then grouped by theme or topic, and each cluster is given a descriptive title and summary.
- Clusters are then organized into a hierarchy that Anthropic's human analysts can use to explore patterns and potential abuses; a simplified sketch of that kind of pipeline follows below.
- For example, a cluster that's named "generate misleading content for campaign fundraising emails" would get analysts' attention, the company wrote in a blog post.
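For readers who want a concrete picture of what such a pipeline can look like, here is a minimal, hypothetical sketch in Python. It is not Anthropic's code: the article does not describe Clio's internals, so the regex-based scrubbing, the TF-IDF-plus-k-means clustering, and the keyword-style cluster titles here are simple stand-ins for whatever Clio actually uses.

```python
# Hypothetical, simplified Clio-style pipeline (illustrative only, not Anthropic's code).
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for user conversations with a chatbot.
conversations = [
    "Help me draft a fundraising email for my local campaign",
    "Write a persuasive fundraising email for a political campaign",
    "Explain how photosynthesis works for my biology homework",
    "Summarize the process of photosynthesis in simple terms",
    "Contact me at alice@example.com about election ad copy for our campaign",
    "Draft targeted campaign ads and reply to bob@example.com",
]


def scrub_personal_details(text: str) -> str:
    """Crude stand-in for removing personal details (here, just email addresses)."""
    return re.sub(r"\S+@\S+", "[EMAIL]", text)


def extract_facets(text: str) -> dict:
    """Pull lightweight 'facets' (metadata) from one conversation."""
    return {
        "length_words": len(text.split()),
        "summary": scrub_personal_details(text),  # cluster on scrubbed summaries, not raw text
    }


facets = [extract_facets(c) for c in conversations]
summaries = [f["summary"] for f in facets]

# Group similar conversations: TF-IDF vectors plus k-means stand in for whatever
# embedding and clustering approach a production system would actually use.
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(summaries)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)

# Give each cluster a rough descriptive title (its top TF-IDF terms), so an analyst
# can scan titles and sizes instead of reading raw conversations.
terms = vectorizer.get_feature_names_out()
for cluster_id in range(kmeans.n_clusters):
    top_term_ids = kmeans.cluster_centers_[cluster_id].argsort()[::-1][:3]
    title = ", ".join(terms[i] for i in top_term_ids)
    size = int((kmeans.labels_ == cluster_id).sum())
    print(f"Cluster {cluster_id} ({size} conversations): {title}")
```

The design point mirrors the article: analysts review short cluster titles and sizes rather than raw conversations, so a label like the campaign-fundraising example above is what would catch their eye.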
The intrigue: Clio anonymizes and aggregates all of the data it ingests, and it is instructed to remove any personal details from the conversations before clustering them.
Between the lines: Anthropic dubs this a "bottom-up" approach.
- Typically, trust and safety teams across the industry set up tools aimed at flagging specific keywords or predicting malicious use cases. Anthropic considers this a "top-down" approach.
- Clio was able to identify malicious use cases that Anthropic's top-down approach missed, the company said.
What we're watching: Anthropic is hoping to see other model makers adopt similar tools to help weed out abuse on their platforms.
