ChatGPT: A Powerful Tool for Medical Insights?
ChatGPT is a type of AI known as a Large Language Model (LLM). It can quickly analyse large amounts of data and produce coherent responses, making it a promising tool for healthcare. Imagine a system that can read and understand medical books, articles, and websites, then help doctors and patients by providing quick, relevant information. This capability could fill critical gaps in healthcare, offering support in areas that need fast and accurate information.
One area that could particularly benefit is women’s health, specifically Obstetrics and Gynaecology (O&G). This field has long faced challenges in diagnosis and treatment. AI like ChatGPT could help by analysing patient histories, test results, and other medical data to assist in early and accurate diagnoses. It could also personalise treatment plans by predicting the best interventions for individual patients, potentially improving outcomes and making healthcare knowledge more accessible.
Risks and Limitations of AI in Medical Settings
However, the use of ChatGPT in healthcare isn't without risks. Despite its impressive capabilities, ChatGPT has shown worrying limitations. It can generate responses that sound convincing but are factually incorrect, a phenomenon known as "hallucination". This is particularly dangerous in a medical context, where accurate information is crucial. ChatGPT also doesn't explain its reasoning, which raises concerns about safety and reliability, and ethical issues come into play as well, such as bias, information privacy, and accountability.
Findings from Oxford Research on ChatGPT's Performance
In a recent publication in npj Women's Health, researchers from the Nuffield Department of Women's and Reproductive Health at the University of Oxford examined ChatGPT's accuracy and reliability in the field of O&G. They tested ChatGPT on more than 1,500 questions drawn from the gold-standard examinations set by the Royal College of Obstetricians and Gynaecologists (RCOG). The findings were revealing.
The RCOG exams are internationally recognised for their rigorous standards and are taken over several years of advanced specialist training, after the candidate has already qualified as a doctor. Part One of the exam evaluates trainees' foundational scientific knowledge across domains such as biology, anatomy, diseases, and clinical management. Think of Part One as asking, "What is the capital of France?": it tests basic factual knowledge. Part Two, taken after several more years of training, assesses candidates at a more advanced level, testing their ability to reason clinically. It is more like asking, "How would you plan a diplomatic mission to improve relations between France and another country?": it requires applying knowledge to complex, real-world scenarios.
When the researchers tested ChatGPT on areas requiring simple rote learning, such as "What are the symptoms of pre-eclampsia?" (a condition in pregnancy marked by high blood pressure and potential damage to organs such as the liver and kidneys), ChatGPT performed commendably, providing the correct response approximately 70% of the time. However, when the questions demanded clinical reasoning, such as deciding on the best course of action for a pregnant woman showing signs of placental abruption (a serious condition in which the placenta detaches from the womb before birth, causing severe bleeding and threatening both the baby's oxygen and nutrient supply and the mother's health), ChatGPT's performance declined significantly: it answered correctly only around 50% of the time.
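To make this kind of testing concrete, the sketch below shows how one might score a chat model on single-best-answer exam questions. It is a minimal illustration only, assuming the OpenAI Python client; the example question, prompt wording, and model version are hypothetical and are not the study's actual materials or pipeline.

```python
# Hypothetical sketch of scoring a chat model on single-best-answer questions.
# The example question, prompt wording, and model version are illustrative
# assumptions, not the study's actual materials or pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

questions = [
    {
        "stem": "Which of the following is a characteristic feature of pre-eclampsia?",
        "options": {"A": "Low blood pressure", "B": "High blood pressure",
                    "C": "Low blood sugar", "D": "High body temperature"},
        "answer": "B",  # made-up illustrative item, not an RCOG exam question
    },
]

def ask(question):
    """Send one question to the model and return the single letter it picks."""
    prompt = (
        question["stem"] + "\n"
        + "\n".join(f"{key}. {text}" for key, text in question["options"].items())
        + "\nAnswer with a single letter only."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model version, for illustration only
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return reply.strip()[:1].upper()

correct = sum(ask(q) == q["answer"] for q in questions)
print(f"Accuracy: {correct / len(questions):.0%}")
```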
The Oxford researchers then tested whether ChatGPT had any awareness that it was providing incorrect responses (hallucinating), by measuring how confident it was in its answers regardless of whether the information it provided was accurate. Their results showed that ChatGPT was just as confident when it was hallucinating as when it was correct.
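The confidence part of such an analysis can be sketched in a similar way: the model is asked to report a confidence score alongside each answer, and the scores are then grouped by whether the answer was correct. The prompt wording, the 0-100 scale, and the model version are again assumptions made for illustration, and the snippet reuses the `questions` list from the sketch above; the study's actual protocol is described in the paper.

```python
# Hypothetical sketch of eliciting a self-reported confidence score with each answer,
# so that confidence on correct and incorrect responses can be compared.
# The prompt wording and 0-100 scale are illustrative, not the study's protocol.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_with_confidence(question):
    """Return (chosen letter, self-reported confidence 0-100) for one question."""
    prompt = (
        question["stem"] + "\n"
        + "\n".join(f"{key}. {text}" for key, text in question["options"].items())
        + "\nReply exactly in the form 'Answer: <letter>, Confidence: <0-100>'."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model version, as above
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    letter = re.search(r"Answer:\s*([A-E])", reply)
    score = re.search(r"Confidence:\s*(\d{1,3})", reply)
    return (letter.group(1) if letter else None,
            int(score.group(1)) if score else None)

# Reusing the `questions` list from the previous sketch: group self-reported
# confidence by correctness. Similar averages in the two groups would suggest the
# model is no less confident when it is wrong (hallucinating) than when it is right.
scored = [(ask_with_confidence(q), q["answer"]) for q in questions]
confidence_when_right = [c for (a, c), truth in scored if c is not None and a == truth]
confidence_when_wrong = [c for (a, c), truth in scored if c is not None and a != truth]
```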
These findings have profound implications for the role of current AI tools such as ChatGPT in healthcare. The study revealed that while ChatGPT performs well on basic medical knowledge, akin to answering straightforward factual questions, it struggles significantly with clinical reasoning: when faced with complex scenarios requiring nuanced understanding, its accuracy dropped sharply. The AI's equally high confidence in correct and incorrect answers also raises serious concerns about its reliability in clinical decision-making.
In light of these findings, researchers believe that while ChatGPT and similar AI models hold substantial potential, they are not yet ready for deployment in medical settings. Their limitations in applying clinical knowledge and reasoning, coupled with a tendency to confidently provide incorrect information, highlight the need for significant improvements and further investigation. As AI technology continues to evolve, ensuring its safety and reliability in healthcare must be a priority. Until then, reliance on such tools for clinical advice remains premature and potentially hazardous.
Authors
- Magdalena Bachmann
- Ioana Duta
- Emily Mazey
- William Cooke
- Manu Vatish
- Gabriel Jones
For more information
Contact Gabriel Jones or visit our data science theme pages.