Battling AI's error problem: Experts craft BS detector

Illustration: Gabriella Turrisi/Axios
A new algorithm, along with a dose of humility, might help generative AI mitigate one of its persistent problems: confident but inaccurate answers.
Why it matters: AI errors are especially risky if people overly rely on chatbots and other tools for medical advice, legal precedents or other high-stakes information.
- A new Wired investigation found AI-powered search engine Perplexity churns out inaccurate answers.
The big picture: Today's AI models make several kinds of mistakes — some of which may be harder to solve than others, says Sebastian Farquhar, a senior research fellow in the computer science department at the University of Oxford.
- But all these errors are often lumped together as "hallucinations" — a term Farquhar and others argue has become useless because it encompasses so many different categories.
Driving the news: Farquhar and his Oxford colleagues this week reported in Nature a new method for detecting confabulations — "arbitrary and incorrect answers." The method addresses "the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words," the team writes.
- The method involves asking a chatbot the same question several times — e.g., "Where is the Eiffel Tower?"
- A separate large language model (LLM) then groups the chatbot's responses — "It's Paris," "Paris," "France's capital Paris," "Rome," "It's Rome," "Berlin" — by their meaning.
- Finally, they calculate the "semantic entropy" across those groups — a measure of how varied the answers are in meaning. If every response means the same thing (Paris), entropy is low; if the responses split across meanings — Paris, Rome and Berlin — entropy is high and the model is likely confabulating.
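The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: in the actual method a second LLM does the meaning-grouping via entailment checks, whereas here `toy_meaning` is a hypothetical stand-in that just matches city names.

```python
import math
from collections import Counter

def semantic_entropy(responses, meaning_of):
    """Shannon entropy over meaning clusters of sampled answers.

    `meaning_of` maps each raw answer to a meaning-cluster label.
    All answers in one cluster -> 0 bits (consistent answers);
    answers spread across clusters -> high entropy (likely confabulation).
    """
    clusters = Counter(meaning_of(r) for r in responses)
    total = sum(clusters.values())
    return -sum((n / total) * math.log2(n / total) for n in clusters.values())

# Hypothetical stand-in for the LLM-based grouping: label by city name.
def toy_meaning(answer):
    for city in ("Paris", "Rome", "Berlin"):
        if city.lower() in answer.lower():
            return city
    return answer

consistent = ["It's Paris", "Paris", "France's capital Paris"]
mixed = ["It's Paris", "Rome", "Berlin"]

print(semantic_entropy(consistent, toy_meaning))  # 0.0 — one meaning, low entropy
print(semantic_entropy(mixed, toy_meaning))       # ~1.58 bits — three meanings, high entropy
```

A detector would flag an answer as a likely confabulation when this entropy exceeds some threshold tuned on validation data.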
What they found: The approach can determine whether an answer is a confabulation about 79% of the time — compared to 69% for a detection measure that assesses similarity based on the words in a response, and similar performance by two other methods.
Yes, but: It will only detect inconsistent errors — not those produced when a model is trained on biased or erroneous data.
- It also requires about five to 10 times as much computing power as a typical chatbot interaction.
- "For some applications, that would be a problem, and for some applications, that's totally worth it," says Farquhar, who is also now a senior research scientist at Google DeepMind.
What they're saying: "Developing approaches to detect confabulations is a big step in the right direction, but we still need to be cautious before accepting outputs as correct," Jenn Wortman Vaughan, a senior principal researcher at Microsoft Research, told Axios in an email.
- "We're never going to be able to develop LLMs that are perfectly accurate, so we need to find ways to convey to users what mistakes might look like and help them set their expectations appropriately."
Vaughan and other researchers are looking at ways to have AI systems communicate the uncertainty in their answers — to get them, in effect, to be more humble.
- But "figuring out the right notion of uncertainty to convey — and how to compute it — is a huge" open question, she says, adding that it will likely depend on the application.
- In a new paper, Vaughan and her colleagues look at how people perceive a model's expression of uncertainty when a fictional "LLM-infused" search engine answered a medical question. (For example, "Can an adult who has not had chickenpox get shingles?")
- Participants were shown the AI response and asked to report how confident they were in it. They then answered the question themselves and said how confident they were in their own answer.
They found that people who were shown AI answers with first-person expressions of uncertainty — "I'm not sure, but..." — were less confident in the AI's responses and agreed with its answers less often than participants who saw no expression of uncertainty. (More general expressions — "There is uncertainty but it seems..." — had a similar but statistically insignificant effect.)
- That suggests "natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters," they write.
- The researchers note their study has several limitations: participants had only a single interaction with the system, the study didn't look at more complex tasks — like writing an article — and it didn't explore cultural or language differences.
- Conveying uncertainty needs to center on the needs of the users, Vaughan says. "How do we empower them to make the best choices about how much to rely on the system and what information to trust? We can't answer these types of questions with technical solutions alone."
Between the lines: The most advanced chatbots from OpenAI, Meta, Google and others "hallucinate" at rates between 2.5% and 5% when summarizing a document.
- Some of the errors produced in earlier versions don't occur in the latest ones, but "it's sort of moving the problem," Farquhar says.
- And while giving an algorithm extra training data can make it "more accurate on things that you know you care about," people may want to ask much more of AI, stretching it beyond the data it is trained on and opening up the possibility for errors and fabrications, he says.
"In some contexts, hallucination is a factuality problem," Farquhar says.
- "In other contexts, it is creativity and finding new ways of expressing imaginary ideas. If you're trying to generate fiction, for example, the same thing that's causing the hallucinations might be genuinely something you want."
