Background: Lasting scars such as keloids and hypertrophic scars adversely affect a patient's quality of life. However, these scars are frequently underdiagnosed because of the complexity of the current diagnostic criteria and classification systems. This study aimed to explore the application of Large Language Models (LLMs) such as ChatGPT in diagnosing scar conditions and to propose a more accessible and straightforward diagnostic approach.
Methods: In this study, five artificial intelligence (AI) chatbots, including ChatGPT-4 (GPT-4), Bing Chat (Precise, Balanced, and Creative modes), and Bard, were evaluated for their ability to interpret clinical scar images using a standardized set of prompts. Thirty mock images of various scar types were analyzed, and each chatbot was queried five times to assess the diagnostic accuracy.
Results: GPT-4 had a significantly higher accuracy rate in diagnosing scars than Bing Chat. The overall accuracy rates of GPT-4 and Bing Chat were 36.0% and 22.0%, respectively (P = 0.027), with GPT-4 showing better performance in terms of specificity for keloids (0.6 vs. 0.006) and hypertrophic scars (0.72 vs. 0.0) than Bing Chat.
Conclusions: Although currently available LLMs show potential for use in scar diagnostics, the current technology is still under development and is not yet sufficient for clinical application standards, highlighting the need for further advancements in AI for more accurate medical diagnostics.
Level of evidence iv: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online instructions to authors www.springer.com/00266 .
Keywords: Artificial intelligence; ChatGPT; Hypertrophic scars; Keloids; Large language models.
© 2024. Springer Science+Business Media, LLC, part of Springer Nature and International Society of Aesthetic Plastic Surgery.