Exclusive: Anthropic wants to pay hackers to find model flaws

Illustration: Sarah Grillo/Axios
Anthropic is testing a new program to pay well-intentioned hackers who find flaws in its model output review systems, the company first shared with Axios.
Why it matters: No tech company currently has a formalized process for paying independent security researchers who discover safety flaws in its chatbot's outputs.
- These payouts, which happen through what's known as a bug bounty program, are a common practice in the cybersecurity industry.
- Bug bounty programs help companies find bugs they might otherwise have missed and give hackers an incentive to report their findings rather than exploit them in malicious attacks.
Zoom in: Anthropic announced Thursday that in partnership with HackerOne it will start testing an expansion of its invite-only bug bounty program to receive findings of successful universal jailbreak attacks.
- Invited participants will be given access to Anthropic's yet-to-be-released AI safety system to test for ways to bypass its rules and output filters.
- But unlike most jailbreaking exercises, Anthropic is only accepting reports of model flaws that are repeatable and can consistently elicit a variety of bad outputs.
- Those flaws would also need to have the potential to "have far-reaching consequences across a variety of harmful, unethical or dangerous areas," according to a blog post shared with Axios.
- A flaw that just shows that a model could spit out credit card information in one specific, non-repeatable instance wouldn't be accepted, for example.
What they're saying: "In the past, developer bug bounty programs have not focused on jailbreaks and many of them specifically exclude jailbreaks," Michael Sellitto, head of global affairs at Anthropic, told Axios. "There's a broad recognition that all currently deployed models are jailbreakable to some extent."
The intrigue: Successful discoveries are eligible for rewards of up to $15,000.
Between the lines: Anthropic is one of several companies that have signed onto the White House's voluntary AI safety commitments, which include a pledge to facilitate third-party vulnerability reporting of AI systems.
What's next: Experienced security researchers interested in the program can submit an application form by Aug. 16. Chosen applicants will be notified in the fall.
- Anthropic eventually plans to expand the program more broadly after refining the process based on initial submissions.
Go deeper: The search for a new way to report AI model flaws
Editor's note: This story has been updated to add that the program is in partnership with HackerOne.
