The search for a new way to report AI model flaws
Illustration: Shoshana Gordon/Axios
AI security researchers will spend their week in Las Vegas figuring out how to report security flaws to AI model operators — and what kinds of flaws warrant a report.
Why it matters: Technology companies don't currently have a standardized way to accept findings from well-intentioned hackers who find flaws in their AI products.
- Even the reports they do receive often describe non-scalable problems, such as an obscure sequence of prompts someone discovered that tricks a model into sharing data it shouldn't.
The big picture: Reporting security flaws to companies is a common practice within the cybersecurity industry.
- Doing so helps firms find bugs that they might not otherwise have discovered on their own — bugs that could have provided an entry point for malicious hackers.
- Good-faith hackers who find these flaws and report them to companies are often rewarded with money for their discovery.
Between the lines: Reporting bugs to AI companies requires a new approach, Sven Cattell, founder of DEF CON's AI Village, told Axios.
- AI models are susceptible to different security flaws than other software and hardware are.
- A standard piece of software could have flaws in its code or network configurations that create an entry point for hackers, whereas AI models can also have problems with the ways they interpret and answer queries.
- Common AI model flaws include generating answers that accidentally include sensitive personal data, racial biases or confidential corporate information.
The intrigue: So far, AI security testing has been a cat-and-mouse game.
- People test large language models to see if they will produce inaccurate information, and when a model does, the hackers report that specific instance to the model operator.
- But those one-off reports don't get to the foundational reason why someone was able to break a model, Cattell said.
Driving the news: The annual Black Hat and DEF CON conferences, two of the largest cybersecurity gatherings, kick off in Las Vegas on Tuesday.
- During DEF CON this weekend, the AI Village will host its second annual generative AI red teaming exercise, which will focus on how security researchers should submit information about new bugs to companies.
How it works: The AI Village is working with the Allen Institute for Artificial Intelligence to test its open-source large language model.
- Once participants find flaws in the model's outputs, they'll be asked to submit a report to the Allen Institute explaining why they believe the flaw was able to slip past the model's existing guardrails.
- The Allen Institute will then review those reports to either accept or reject the findings — and also to figure out what is and isn't working about the current process of reporting security bugs.
- "Doing disclosure for AI systems is tricky and hard. No one's actually done it before," Cattell said. "We're trying to figure this out together."
Flashback: This year's AI Village exercise is more advanced than the open-testing spree the organizers hosted at last year's conference.
- Last year's exercise focused on basic jailbreaking: Roughly 2,200 participants flooded the village, testing eight LLMs to see if they could get them to break their own rules.
- The White House co-sponsored the event.
What we're watching: Cattell is eager to see if this week's large-scale experiment will help answer lingering questions about what security researchers can borrow from existing bug reporting processes — and which parts of the process need to be reinvented for the AI age.
- "The reason why we did the [Generative Red Team exercise] last year was to start getting the conversation going," Cattell said.
- "Now, we can actually talk about how disclosure should really work and how [the] community can help make sure these models are more aligned."
