Anthropic's new weapon to detect abuse

Illustration: Lindsey Bailey/Axios
Anthropic's new automated analysis tool offers fresh insight into how the company weeds out malicious users trying to manipulate its Claude chatbot.
Why it matters: Distinguishing adversaries' queries from run-of-the-mill user inputs is the biggest challenge model operators face in their quest to identify and stop emerging threats.
Driving the news: Anthropic released details last week about its new Clio tool, which studies what users ask Claude in much the same way Google tracks search trends.
- The tool can help Anthropic assess how everyday users are relying on Claude — and it can detect new threat actors trying to use the chatbot to do their bidding.
- Anthropic even used the tool to monitor queries about elections around the world in 2024.
Zoom in: Clio extracts "facets" from each conversation with Claude, such as metadata about the conversation topic or the number of back-and-forths someone has with the chatbot.
- Similar conversations are then grouped by theme or topic, and each cluster is given a descriptive title and summary.
- Clusters are then organized into a hierarchy that Anthropic's human analysts can use to explore patterns and potential abuses; a simplified sketch of that kind of pipeline follows below.
- For example, a cluster that's named "generate misleading content for campaign fundraising emails" would get analysts' attention, the company wrote in a blog post.
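For readers who want a concrete picture of what such a pipeline can look like, here is a minimal, hypothetical sketch in Python. It is not Anthropic's code: the article does not describe Clio's internals, so the regex-based scrubbing, the TF-IDF-plus-k-means clustering, and the keyword-style cluster titles here are simple stand-ins for whatever Clio actually uses.

```python
# Hypothetical, simplified Clio-style pipeline (illustrative only, not Anthropic's code).
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for user conversations with a chatbot.
conversations = [
    "Help me draft a fundraising email for my local campaign",
    "Write a persuasive fundraising email for a political campaign",
    "Explain how photosynthesis works for my biology homework",
    "Summarize the process of photosynthesis in simple terms",
    "Contact me at alice@example.com about election ad copy for our campaign",
    "Draft targeted campaign ads and reply to bob@example.com",
]


def scrub_personal_details(text: str) -> str:
    """Crude stand-in for removing personal details (here, just email addresses)."""
    return re.sub(r"\S+@\S+", "[EMAIL]", text)


def extract_facets(text: str) -> dict:
    """Pull lightweight 'facets' (metadata) from one conversation."""
    return {
        "length_words": len(text.split()),
        "summary": scrub_personal_details(text),  # cluster on scrubbed summaries, not raw text
    }


facets = [extract_facets(c) for c in conversations]
summaries = [f["summary"] for f in facets]

# Group similar conversations: TF-IDF vectors plus k-means stand in for whatever
# embedding and clustering approach a production system would actually use.
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(summaries)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)

# Give each cluster a rough descriptive title (its top TF-IDF terms), so an analyst
# can scan titles and sizes instead of reading raw conversations.
terms = vectorizer.get_feature_names_out()
for cluster_id in range(kmeans.n_clusters):
    top_term_ids = kmeans.cluster_centers_[cluster_id].argsort()[::-1][:3]
    title = ", ".join(terms[i] for i in top_term_ids)
    size = int((kmeans.labels_ == cluster_id).sum())
    print(f"Cluster {cluster_id} ({size} conversations): {title}")
```

The design point mirrors the article: analysts review short cluster titles and sizes rather than raw conversations, so a label like the campaign-fundraising example above is what would catch their eye.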
The intrigue: Clio anonymizes and aggregates all of the data it ingests, and it is instructed to remove any personal details from the conversations before clustering them.
Between the lines: Anthropic dubs this a "bottom-up" approach.
- Typically, trust and safety teams across the industry set up tools aimed at flagging specific keywords or predicting malicious use cases. Anthropic considers this a "top-down" approach.
- Clio was able to identify malicious use cases that Anthropic's top-down approach missed, the company said.
What we're watching: Anthropic is hoping to see other model makers adopt similar tools to help weed out abuse on their platforms.
