May 3, 2023 - Technology

Chat-based search flunks the "show your work" test

Illustration: Shoshana Gordon/Axios

New evidence suggests that generative AI may not be very good at one of the first jobs the industry has given it — answering search queries.

Driving the news: Four publicly available search engines that use AI-generated chat to respond to users' questions build a “facade of trustworthiness” without adequately documenting their answers, a new paper from Stanford University researchers concludes.

Why it matters: For now, AI chatbots are likely to be a better choice for casual inquiry than for serious research, critical info or resolving thorny questions of truth.

What they did: Nelson F. Liu, Tianyi Zhang and Percy Liang manually audited Bing Chat, Neeva AI,, and YouChat between late February and late March.

  • The researchers set out to test the “verifiability” of each search engine, seeing this as a key to trustworthiness.
  • They posed questions ranging from “What is the most nominated film for the Oscars?” to “What changes were made when Trinidad and Tobago gained independence?”

What they found: These tools generally delivered fluent and useful answers — but roughly half contained “unsupported statements or inaccurate citations.” Of the citations that were provided, an average of 1 in 4 did not support their associated sentence.

  • These rates, the report says, are "unacceptably low for systems that are quickly becoming a popular tool for answering user queries and already have millions of users."

Details: Microsoft-owned Bing Chat provided the most accurate citations (89.5% success), among generally dismal numbers.

  • YouChat cited sources for barely 1 in 10 of its answers (11%).
  • "We are continually improving the product to be more accurate and to include relevant, credible citations," a spokesperson told Axios, and noted that the researchers graded You highest for "fluidity and utility."

Between the lines: A traditional search engine can simply show no results if it can't answer a question, but the chat interface pushes the system to offer answers every time — even when it may not have much to go on.

  • "It wouldn't make sense for a system like Bing Chat to just randomly not respond to some of your messages," Liu, one of the Stanford researchers, told Axios.
  • Three of the four search tools answered more than 99% of the researchers' questions. Only Neeva regularly declined to provide answers (22% of the time). Unlike the other systems, Neeva provides a conventional results page with the conversational response as a sort of "cherry on top," in Liu's words.

Of note: “The responses that seem more helpful are often those with more unsupported statements or inaccurate citations,” the report concludes.

  • "This inverse relationship definitely highlights the potential for them to actively mislead folks, versus the usual failure mode for a conventional search engine: that nothing relevant comes up," Liu told Axios.

Yes, but: The researchers found the products to be “extremely effective” at obtaining answers that could be directly extracted from existing web pages, and good at offering up balanced lists of pros and cons around a given argument.

The bottom line: The sheer size of the global search ad market — more than $250 billion in 2022 — is reason enough for Microsoft to heavily promote Bing Chat.

Go deeper: Where today's generative AI shines

Editor's note: This story has been updated with comment from
