Axios Science

June 20, 2024
Thanks for reading Axios Science. This week's newsletter is 1,764 words, about a 6½-minute read.
- Send your feedback and ideas to me at [email protected].
1 big thing: Cutting through the BS of AI
A new algorithm, along with a dose of humility, might help generative AI mitigate one of its persistent problems: confident but inaccurate answers.
Why it matters: AI errors are especially risky if people rely too heavily on chatbots and other tools for medical advice, legal precedents or other high-stakes information.
- A new Wired investigation found AI-powered search engine Perplexity churns out inaccurate answers.
The big picture: Today's AI models make several kinds of mistakes — some of which may be harder to solve than others, says Sebastian Farquhar, a senior research fellow in the computer science department at the University of Oxford.
- But all these errors are often lumped together as "hallucinations" — a term Farquhar (and others) argue has become useless because it encompasses so many different categories.
Driving the news: Farquhar and his Oxford colleagues this week reported a new method for detecting confabulations, cases where a model arbitrarily generates incorrect answers. The approach addresses "the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words," the team writes in Nature.
- The method involves asking a chatbot a question several times — e.g., "Where is the Eiffel Tower?"
- A separate large language model (LLM) grouped the chatbot's responses — "It's Paris," "Paris," "France's capital Paris," "Rome," "It's Rome," "Berlin" — based on their meaning.
- Then they calculated the "semantic entropy" across those groups — a measure of how much the answers spread across clusters with different meanings; high entropy flags a likely confabulation (see the sketch below). A third LLM checked the accuracy of the responses.
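For intuition, here's a minimal Python sketch of the entropy step. It's an illustration, not the team's code: it assumes a separate LLM has already grouped the sampled answers into meaning clusters, and it estimates cluster probabilities from simple answer counts (the paper's full method can also weight clusters using the model's token probabilities).

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels: list[int]) -> float:
    """Shannon entropy over the meaning clusters of sampled answers.

    cluster_labels holds one cluster ID per sampled response, as assigned
    by a separate LLM that judges whether two answers mean the same thing.
    """
    total = len(cluster_labels)
    counts = Counter(cluster_labels)
    # Low entropy: the model keeps giving the same answer in different words.
    # High entropy: the answers disagree in meaning, a sign of confabulation.
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Six sampled answers to "Where is the Eiffel Tower?" grouped by meaning:
# "It's Paris" / "Paris" / "France's capital Paris" -> cluster 0
# "Rome" / "It's Rome" -> cluster 1, "Berlin" -> cluster 2
print(semantic_entropy([0, 0, 0, 1, 1, 2]))   # ~1.01 nats: inconsistent, flag it
print(semantic_entropy([0, 0, 0, 0, 0, 0]))   # 0.0: consistent meaning
```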
What they found: The approach can determine whether an answer is correct about 79% of the time — compared with 69% for a detection method that assesses similarity based on the literal words in a response. Two other existing methods performed about as well as that baseline.
- Yes, but: It will only detect inconsistent errors — not those produced if a model is trained on biased or erroneous data.
- It also requires about five to 10 times as much computing power as a typical chatbot interaction.
- "For some applications, that would be a problem, and for some applications, that's totally worth it," Farquhar says.
What they're saying: "Developing approaches to detect confabulations is a big step in the right direction, but we still need to be cautious before accepting outputs as correct," Jenn Wortman Vaughan, a senior principal researcher at Microsoft Research, told me in an email.
- "We're never going to be able to develop LLMs that are perfectly accurate, so we need to find ways to convey to users what mistakes might look like and help them set their expectations appropriately."
2. Part II: A more humble AI
Vaughan and other researchers are looking at ways to have AI systems communicate the uncertainty in their answers — to get them, in effect, to be more humble.
- But, "figuring out the right notion of uncertainty to convey — and how to compute it — is a huge" open question, she says adding it will likely depend on the application.
- In a new paper, Vaughan and her colleagues look at how people perceive a model's expressions of uncertainty when a fictional "LLM-infused" search engine answers a medical question. (For example, "Can an adult who has not had chickenpox get shingles?")
- Participants were shown the AI response and asked to report how confident they were in it. They then answered the question themselves and said how confident they were in their own answer.
They found that people who were shown AI answers with first-person expressions of uncertainty — "I'm not sure, but..." — were less confident in the AI's responses and agreed with its answers less often than participants who saw no expression of uncertainty. (More general expressions — "There is uncertainty but it seems..." — had a similar but statistically insignificant effect.)
- That suggests "natural language expressions of uncertainty may be an effective approach for reducing overreliance on LLMs, but that the precise language used matters," they write. (A toy sketch of such hedging follows this list.)
- The researchers note their study has several limitations: participants had only a single interaction with the system, the tasks didn't include more complex ones — like writing an article — and the study didn't explore cultural or language differences.
- Conveying uncertainty needs to center on the needs of the users, Vaughan says. "How do we empower them to make the best choices about how much to rely on the system and what information to trust? We can't answer these types of questions with technical solutions alone."
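To make that concrete, here's a hypothetical sketch (not Vaughan's system) of how an application might wrap answers in the kind of first-person hedges the study tested, assuming an upstream confidence score is available; low semantic entropy could be one source. The thresholds and phrasings are placeholders.

```python
def hedge_answer(answer: str, confidence: float) -> str:
    """Prepend a first-person expression of uncertainty to an LLM answer.

    confidence is a hypothetical score in [0, 1] from an upstream
    estimator; the cutoffs below are arbitrary placeholders.
    """
    if confidence >= 0.9:
        return answer  # confident enough to answer plainly
    if confidence >= 0.5:
        return f"I'm not sure, but... {answer}"  # the study's first-person hedge
    return f"I don't know for certain. My best guess: {answer}"

print(hedge_answer("The Eiffel Tower is in Paris.", 0.6))
# -> I'm not sure, but... The Eiffel Tower is in Paris.
```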
Between the lines: The most advanced chatbots from OpenAI, Meta, Google and others "hallucinate" at rates between 2.5% and 5% when summarizing a document.
- Some of the errors produced in earlier versions don't occur in the latest ones, but "it's sort of moving the problem," Farquhar says.
- And while giving an algorithm extra training data can make it "more accurate on things that you know you care about," people may want to ask much more of AI, stretching it beyond the data it is trained on and opening up the possibility for errors and fabrications, he says.
"In some contexts, hallucination is a factuality problem," Farquhar says.
- "In other contexts, it is creativity and finding new ways of expressing imaginary ideas. If you're trying to generate fiction, for example, the same thing that's causing the hallucinations might be genuinely something you want."
3. U.S. faces mounting criticism over bird flu response
A growing number of high-profile public health experts are raising alarms over what they say are lackluster efforts to track and contain the spread of bird flu across U.S. dairy farms, Axios' Tina Reed writes.
Why it matters: If this is a test of whether the U.S. is better prepared to respond to a pandemic threat after COVID-19, we're not getting high marks.
The big picture: It's been nearly three months since the H5N1 bird flu virus was found to have spilled over to cows, but experts say there still isn't a reliable picture of how widely the virus is spreading.
- Bird flu has been detected in 92 dairy cattle herds across 12 states, according to the Centers for Disease Control and Prevention. Mild infections have been confirmed in three U.S. dairy farm workers.
- Officials say the risk to the public remains low, but a big worry is whether the virus mutates in a way that allows it to easily spread between humans.
Zoom in: Officials have been testing cows and farm workers for the virus, as well as conducting wastewater surveillance to get a better picture of where it's circulating.
- But there's little doubt among experts that the U.S. has been missing cases in cows and humans. Farmers have been reluctant to participate in surveillance, and only 45 dairy farm workers in the U.S. have been tested as of June 13.
- Doctors say limited availability of bird flu tests could also make it difficult to detect potential cases among patients who show up in their offices.
Amesh Adalja, a senior scholar at the Johns Hopkins Center for Health Security, told Axios there's not even enough information to know whether cases among cows are trending up or down.
- Michael Osterholm, director of the University of Minnesota's infectious disease research center, said it's also important to figure out how long it takes for infected herds to clear the virus to get an idea of the risk window for workers.
- The CDC recently traced the spillover from birds to cows to a single event in late 2023. But additional data from farms could show whether there have been other spillover events, Osterholm said.
What they're saying: "We failed — through two administrations — to develop and implement an effective surveillance strategy with COVID, and we are repeating the same mistakes," Jerome Adams, surgeon general under former President Trump, told Politico.
The other side: There are key differences between the initial COVID and H5N1 responses, said CDC principal deputy director Nirav Shah.
- Scientists already have two decades of research on this bird flu strain, there are medications that work, and there's an on-the-shelf vaccine that can be manufactured quickly, he said.
- "That puts us in a different position," Shah told Axios.
The bottom line: "The fact that we're having this much of a problem with this one really doesn't bode well for the next one," Adalja said, adding, "There's always going to be a next one."
4. Worthy of your time
Why some people seem immune to catching COVID (Sonali Roy — New Scientist)
Pain may take different pathways in men and women (Claire Yuan — Science News)
The koala paradox (Katherine Wu — The Atlantic)
A massive black hole may be "waking up" in a nearby galaxy (Sharmila Kuthunur — Space.com)
5. Something wondrous
NASA's Voyager 1 is back "conducting normal science operations" for the first time since a technical glitch some seven months ago sidelined the spacecraft, space agency officials announced.
Why it matters: The spacecraft that launched in 1977 has collected key scientific data, and at more than 15 billion miles from Earth, it's the human-made object farthest from our planet, Axios' Rebecca Falconer writes.
- Voyager 1 and Voyager 2 (which launched the same year) "are the only spacecraft to directly sample interstellar space, which is the region outside the heliosphere — the protective bubble of magnetic fields and solar wind created by the Sun," NASA said in a statement announcing the fix.
- Between them, the twin probes explored the outer planets Jupiter, Saturn, Uranus and Neptune before starting their voyages toward interstellar space.
Driving the news: NASA announced last year that Voyager 1 was experiencing problems with its flight data system.
- "The spacecraft is receiving and executing commands sent from Earth but not returning useable data," the Jet Propulsion Laboratory, which manages many of NASA's robotic missions, said after the issue emerged in November.
The big picture: In April, the mission team partially resolved the issue by prompting the spacecraft to begin returning engineering data, including information about its health and status.
- They completed the next step of the repair last month, beaming a command telling the spacecraft to resume sending science data; two of the four science instruments returned to their normal operating modes.
- "Two other instruments required some additional work, but now, all four are returning usable science data," according to NASA's statement last week.
What we're watching: "While Voyager 1 is back to conducting science, additional minor work is needed to clean up the effects of the issue," NASA said.
- "Among other tasks, engineers will resynchronize timekeeping software in the spacecraft's three onboard computers so they can execute commands at the right time.
- "The team will also perform maintenance on the digital tape recorder, which records some data for the plasma wave instrument that is sent to Earth twice per year."
Big thanks to managing editor Scott Rosenberg and to copy editor Carolyn DiPaolo.
Sign up for Axios Science