Background
Large language models (LLMs) offer considerable potential to support clinical decision making. However, their use in medicine is limited by the risk of generating fluent but incorrect outputs (hallucinations) that compromise patient safety. Traditional approaches to detecting LLM uncertainty fail to capture meaning-level inconsistencies. Semantic entropy has been proposed as a novel uncertainty metric that quantifies variation in meaning across generated responses; however, its utility in clinical applications remains untested. We aimed to compare the ability of semantic entropy with that of perplexity to detect inaccuracies in LLM-generated answers to questions on obstetrics and gynaecology.

Methods
We assessed semantic entropy as a tool for quantifying uncertainty in LLM-generated content using a validated dataset of question–answer pairs derived from the UK Royal College of Obstetricians and Gynaecologists membership (MRCOG) examinations. Responses were generated with GPT-4o (OpenAI, San Francisco, CA, USA); each question was entered into the LLM ten times. Semantic entropy was computed from the diversity of meaning-based response clusters and compared against perplexity using the area under the receiver operating characteristic curve (AUROC) and accuracy (defined as the proportion of model-generated responses classified as correct). Clinical correctness was independently adjudicated by certified obstetricians and gynaecologists for a subset of 105 questions.

Findings
Of 1824 MRCOG question–answer pairs, we excluded 98 questions that were incompatible with short-answer formatting and 56 that required interpretation of images or tables, leaving 1670 questions in the final analysis. Semantic entropy discriminated incorrect responses better than perplexity (AUROC 0·76 [95% CI 0·75–0·78] vs 0·62 [0·60–0·65]). In expert validation, semantic entropy achieved near-perfect discrimination (0·97 [0·91–1·00]), whereas perplexity performed at chance level (0·57 [0·405–0·68]). Semantic entropy consistently outperformed perplexity across question types, examination parts, and response lengths. Semantic clustering, which is required to calculate semantic entropy, succeeded in 30% of questions but provided informative uncertainty signals even when incomplete. Discrete semantic entropy, a simplified variant requiring no model internals, yielded similar performance.

Interpretation
Semantic entropy offers a robust and scalable method to identify uncertain or potentially misleading outputs from LLMs. By flagging unreliable responses and enabling human oversight, semantic entropy could support safer artificial intelligence-assisted clinical decision making. These findings support semantic entropy as a practical safeguard for deploying LLMs in clinical environments and establish a methodological framework for future uncertainty estimation in high-risk medical domains.

Funding
UK Medical Research Council.
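As a rough illustration of the discrete semantic entropy variant mentioned in the Findings (the simplified form that requires no access to model internals such as token log-probabilities), the sketch below clusters repeated model answers by meaning and computes entropy over the cluster frequencies. This is a minimal sketch under assumptions: the `same_meaning` equivalence check, the greedy clustering, the function names, and the example data are illustrative, not the authors' implementation; a real pipeline would typically judge meaning equivalence with bidirectional entailment rather than string matching.

```python
from math import log
from typing import Callable, List


def discrete_semantic_entropy(
    answers: List[str],
    same_meaning: Callable[[str, str], bool],
) -> float:
    """Entropy over meaning-based clusters of repeated LLM answers.

    answers      : the N responses sampled for one question (e.g. N = 10).
    same_meaning : a pairwise meaning-equivalence check, e.g. bidirectional
                   entailment judged by an NLI model or a second LLM.
    """
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):  # greedy assignment to the first matching cluster
                cluster.append(ans)
                break
        else:
            clusters.append([ans])             # no match: start a new meaning cluster

    n = len(answers)
    # Cluster probabilities are estimated from frequencies alone, so no
    # token-level probabilities from the model are needed (the "discrete" variant).
    probs = [len(c) / n for c in clusters]
    return -sum(p * log(p) for p in probs)


if __name__ == "__main__":
    # Hypothetical example: exact string matching stands in for the
    # meaning-based comparison used in practice.
    samples = ["misoprostol", "misoprostol", "oxytocin", "misoprostol",
               "oxytocin", "misoprostol", "misoprostol", "carboprost",
               "misoprostol", "misoprostol"]
    h = discrete_semantic_entropy(samples, lambda a, b: a.strip().lower() == b.strip().lower())
    print(f"discrete semantic entropy: {h:.3f}")  # higher values -> flag for human review
```

In a deployment along the lines described in the Interpretation, answers whose semantic entropy exceeds a chosen threshold would be routed to a clinician for review rather than presented directly.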