LLMs and Diagnostic Reasoning: A Randomized Clinical Vignette Study [pdf]

trott · on Oct 2, 2024

TLDR:

Physicians scored 73.7. Physicians armed with GPT-4 scored 76.3. But GPT-4 alone scored 89.2.

The authors think it's unlikely that the materials are in the GPT-4 training data, because the cases have never been publicly released.

panabee · on Oct 9, 2024

thanks for sharing.

the implications are fascinating, if the findings are generalizable and reproducible.

the study suggests LLMs may already be materially superior to experts in a critical field like medicine, and that users who are inexpert with LLMs hold them back.

given the author affiliations, it's also likely that the tested physicians are in the top tier -- suggesting an even greater disparity between LLMs and doctors in less advanced areas.

panabee · on Oct 10, 2024

a doctor friend highlighted two key limitations: only six cases were evaluated per physician, and half of the physicians were still residents.