The Potential of Chat-Based Artificial Intelligence Models in Differentiating Between Keloid and Hypertrophic Scars: A Pilot Study

Makoto Shiraishi; Shimpei Miyamoto; Hakuba Takeishi; Daichi Kurita; Kiichi Furuse; Jun Ohba; Yuta Moriwaki; Kou Fujisawa; Mutsumi Okazaki

doi:10.1007/s00266-024-04380-9

Aesthetic Plast Surg. 2024 Sep 25. doi: 10.1007/s00266-024-04380-9. Online ahead of print.

Authors

Makoto Shiraishi¹, Shimpei Miyamoto², Hakuba Takeishi², Daichi Kurita², Kiichi Furuse², Jun Ohba², Yuta Moriwaki², Kou Fujisawa², Mutsumi Okazaki²

Affiliations

¹ Department of Plastic and Reconstructive Surgery, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan. shiraishi-kyf@umin.ac.jp.
² Department of Plastic and Reconstructive Surgery, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.

PMID: 39322838

Abstract

Background: Lasting scars such as keloids and hypertrophic scars adversely affect a patient's quality of life. However, these scars are frequently underdiagnosed because the current diagnostic criteria and classification systems are complex. This study aimed to explore the application of large language models (LLMs) such as ChatGPT in diagnosing scar conditions and to propose a more accessible and straightforward diagnostic approach.

Methods: Five artificial intelligence (AI) chatbots, namely ChatGPT-4 (GPT-4), Bing Chat (Precise, Balanced, and Creative modes), and Bard, were evaluated for their ability to interpret clinical scar images using a standardized set of prompts. Thirty mock images of various scar types were analyzed, and each chatbot was queried five times to assess diagnostic accuracy.

Results: GPT-4 had a significantly higher accuracy rate in diagnosing scars than Bing Chat. The overall accuracy rates of GPT-4 and Bing Chat were 36.0% and 22.0%, respectively (P = 0.027), with GPT-4 showing better performance in terms of specificity for keloids (0.6 vs. 0.006) and hypertrophic scars (0.72 vs. 0.0) than Bing Chat.
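For readers less familiar with the two metrics reported above, the following minimal sketch shows how accuracy and specificity are conventionally computed from response counts. All counts here are hypothetical and chosen only for illustration; they are not taken from the study, and the assumption that each image-query pair is scored as one response (30 images x 5 queries = 150) is ours, not a detail confirmed by the abstract.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of all responses whose diagnosis matched the reference label."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of cases without the condition that were correctly not labeled with it."""
    negatives = true_negatives + false_positives
    if negatives <= 0:
        raise ValueError("need at least one negative case")
    return true_negatives / negatives


# Assumption: one scored response per image-query pair.
responses_per_chatbot = 30 * 5

# Hypothetical counts, for illustration only (not study data).
hypothetical_correct = 54
hypothetical_true_negatives = 18
hypothetical_false_positives = 7

print(f"accuracy    = {accuracy(hypothetical_correct, responses_per_chatbot):.1%}")
print(f"specificity = {specificity(hypothetical_true_negatives, hypothetical_false_positives):.2f}")
```

A specificity near zero means that almost every case without the condition was nevertheless assigned that diagnosis, which is how very low values of this metric are usually read.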

Conclusions: Although currently available LLMs show potential for use in scar diagnostics, the technology is still under development and does not yet meet the standards required for clinical application, highlighting the need for further advancements in AI for more accurate medical diagnostics.

Level of Evidence IV: This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online instructions to authors www.springer.com/00266.

Keywords: Artificial intelligence; ChatGPT; Hypertrophic scars; Keloids; Large language models.