Aug 29, 2023 - Technology

ChatGPT plays doctor with 72% success


Illustration: Maura Losch/Axios

As AI capabilities advance in the complex medical scenarios doctors face daily, the technology remains controversial in medical communities.

The big picture: Doctors are grappling with questions about what counts as an acceptable success rate for AI-supported diagnosis and whether AI's reliability under controlled research conditions will hold up in the real world.

Driving the news: A new study from Mass General Brigham researchers testing ChatGPT's performance on textbook-drawn case studies found the AI bot achieved 72% accuracy in overall clinical decision making, ranging from identifying possible diagnoses to making final diagnoses and care decisions.

Why it matters: AI could ultimately improve both the efficiency and the accuracy of diagnosis as U.S. healthcare grows more expensive and complicated, with individuals living longer and the overall population aging.

Details: The Mass General Brigham study is among the first to assess the capacity of large language models across the full scope of clinical care, rather than a single task.

  • The study "comprehensively assesses decision support via ChatGPT from the very beginning of working with a patient through the entire care scenario" including post-diagnosis care management, the report's co-author Marc Succi, executive director at Mass General Brigham's innovation incubator, told Axios.
  • ChatGPT got the final diagnosis right 77% of the time. But in cases requiring "differential diagnosis" — an understanding of all the possible conditions a given set of symptoms might indicate — the bot's success rate dropped to 60%.
  • A second study across 171 hospitals in the U.S. and the Netherlands found that a machine learning model called ELDER-ICU succeeded at identifying the illness severity of older adults admitted to intensive care units, meaning it "can assist clinicians in identification of geriatric ICU patients who need greater or earlier attention."

Be smart: While AI has outperformed medical professionals at some specific tasks, such as cancer detection from medical imaging, many studies of AI's possible medical uses have yet to translate into real-world practice, and some critics argue that AI studies aren't grounded in real clinical needs.

Of note: AI tests in a research setting come with no risk of malpractice lawsuits, unlike humans operating alone or with the assistance of AI in real clinical settings.

What they're saying: Succi, while encouraged by the Mass General Brigham study, told Axios there's more work to do to "bridge the gap from a useful machine learning model to actual use in clinical practice."

  • The value of AI assistance to doctors is clearest "in the early stages of patient care when little presenting information (is available) and a list of possible diagnoses is needed," Succi said.
  • "Large language models need to be improved in differential diagnosis before they're ready for prime time," Succi said, adding that researchers should also look at how to apply AI to hospital tasks that do not require final diagnosis, such as emergency room triage.
  • Succi said that ChatGPT is starting to exhibit the capabilities of a newly graduated doctor. But since there are "no real benchmarks" for success rates among doctors at different levels of seniority, he added, judging whether AI is adding value to a doctor's work will remain complicated.

What's next: To allow ChatGPT or comparable AI models to be deployed in hospitals, Succi said more benchmark research and regulatory guidance are needed, and diagnostic success rates need to rise to between 80% and 90%.
