Exclusive: Anthropic wants to pay hackers to find model flaws

Illustration: Sarah Grillo/Axios
Anthropic is testing a new program to pay well-intentioned hackers who find flaws in its model output review systems, the company first shared with Axios.
Why it matters: No tech company currently has a formalized process for paying independent security researchers who discover safety flaws in its chatbot's outputs.
- These payouts, which happen through what's known as a bug bounty program, are a common practice in the cybersecurity industry.
- Bug bounty programs help companies find bugs they might otherwise have missed and give hackers an incentive to report their findings rather than exploit them in malicious attacks.
Zoom in: Anthropic announced Thursday that in partnership with HackerOne it will start testing an expansion of its invite-only bug bounty program to receive findings of successful universal jailbreak attacks.
- Invited participants will be given access to Anthropic's yet-to-be-released AI safety system to test for ways to bypass its rules and output filters.
- But unlike most jailbreaking exercises, Anthropic is only accepting reports of model flaws that are repeatable and can consistently elicit a variety of bad outputs.
- Those flaws would also need to have the potential to "have far-reaching consequences across a variety of harmful, unethical or dangerous areas," according to a blog post shared with Axios.
- A flaw that just shows that a model could spit out credit card information in one specific, non-repeatable instance wouldn't be accepted, for example.
What they're saying: "In the past, developer bug bounty programs have not focused on jailbreaks and many of them specifically exclude jailbreaks," Michael Sellitto, head of global affairs at Anthropic, told Axios. "There's a broad recognition that all currently deployed models are jailbreakable to some extent."
The intrigue: Successful discoveries are eligible for rewards of up to $15,000.
Between the lines: Anthropic is one of several companies that have signed onto the White House's voluntary AI safety commitments, which include a pledge to facilitate third-party vulnerability reporting of AI systems.
What's next: Experienced security researchers interested in the program can submit an application form by Aug. 16. Chosen applicants will be notified in the fall.
- Anthropic eventually plans to expand the program more broadly after refining the process based on initial submissions.
Go deeper: The search for a new way to report AI model flaws
Editor's note: This story has been updated to add that the program is in partnership with HackerOne.
