Evaluating multimodal AI in medical diagnostics

Robert Kaczmarczyk; Theresa Isabelle Wilhelm; Ron Martin; Jonas Roos

doi:10.1038/s41746-024-01208-3

NPJ Digit Med. 2024 Aug 7;7(1):205. doi: 10.1038/s41746-024-01208-3.

Authors

Robert Kaczmarczyk¹, Theresa Isabelle Wilhelm², Ron Martin³, Jonas Roos⁴

Affiliations

¹ Department of Dermatology and Allergy, School of Medicine, Technical University of Munich, Munich, Germany.
² Eye Center, Faculty of Medicine, Albert-Ludwigs-University of Freiburg, Freiburg, Germany. theresa.wilhelm@uniklinik-freiburg.de.
³ Clinic of Plastic, Hand and Aesthetic Surgery, Burn Center, BG Clinic Bergmannstrost, Halle (Saale), Germany.
⁴ Department of Orthopedics and Trauma Surgery, University Hospital of Bonn, Bonn, Germany.

Abstract

This study evaluates the accuracy and responsiveness of multimodal AI models in answering NEJM Image Challenge questions and compares them with human collective intelligence, underscoring both AI's potential and its current limitations in clinical diagnostics. Anthropic's Claude 3 family demonstrated the highest accuracy among the evaluated AI models, surpassing the average human accuracy, while collective human decision-making outperformed all AI models. GPT-4 Vision Preview exhibited selectivity: it responded more often to easier questions, to questions with smaller images, and to longer questions.
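
The gap between average human accuracy and collective human decision-making can be illustrated with a small sketch. It assumes, as one plausible reading not confirmed by the abstract, that the average is the share of all individual votes that were correct and that the collective answer is the most-voted option per question. The function names and the vote counts are hypothetical and invented for illustration; they are not data from the study.

```python
def mean_individual_accuracy(vote_counts, correct):
    """Share of all individual votes that picked the correct option."""
    total = sum(sum(counts.values()) for counts in vote_counts)
    hits = sum(counts.get(key, 0) for counts, key in zip(vote_counts, correct))
    return hits / total


def majority_vote_accuracy(vote_counts, correct):
    """Share of questions where the most-voted option is the correct one."""
    hits = sum(
        max(counts, key=counts.get) == key
        for counts, key in zip(vote_counts, correct)
    )
    return hits / len(correct)


# Toy vote counts per answer option (hypothetical, not from the study).
votes = [
    {"A": 60, "B": 25, "C": 15},
    {"A": 30, "B": 45, "C": 25},
    {"A": 40, "B": 35, "C": 25},
]
answers = ["A", "B", "C"]

individual = mean_individual_accuracy(votes, answers)
collective = majority_vote_accuracy(votes, answers)
print(f"mean individual accuracy: {individual:.3f}")
print(f"majority-vote accuracy:   {collective:.3f}")
```

In this toy case the majority vote is right on two of three questions even though fewer than half of the individual votes are correct, which is the kind of effect that lets a collective outperform its average member.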