Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study

Junhyuk Seo; Dasol Choi; Taerim Kim; Won Chul Cha; Minha Kim; Haanju Yoo; Namkee Oh; YongJin Yi; Kye Hwa Lee; Edward Choi

doi:10.2196/58329

J Med Internet Res. 2024 Nov 20:26:e58329. doi: 10.2196/58329.

Authors

Junhyuk Seo^#¹ ², Dasol Choi^#¹, Taerim Kim¹ ³, Won Chul Cha¹ ³, Minha Kim³, Haanju Yoo⁴, Namkee Oh⁵, YongJin Yi⁶, Kye Hwa Lee⁷, Edward Choi⁸

Affiliations

¹ Department of Digital Health, Samsung Advanced Institute of Health Sciences and Technology (SAIHST), Sungkyunkwan University, Seoul, Republic of Korea.
² Department of Nursing, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea.
³ Department of Emergency Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea.
⁴ NAVER Digital Healthcare Lab, Seongnam, Republic of Korea.
⁵ Department of Surgery, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea.
⁶ Department of Internal Medicine, College of Medicine, Dankook University, Cheonan, Republic of Korea.
⁷ Department of Information Medicine, Asan Medical Center and University of Ulsan College of Medicine, Seoul, Republic of Korea.
⁸ Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea.

^# Contributed equally.

PMID: 39566044
DOI: 10.2196/58329

Abstract

Background: The advancement of large language models (LLMs) offers significant opportunities for health care, particularly in the generation of medical documentation. However, challenges related to ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application.

Objective: This study aimed to develop and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, with the goal of enhancing artificial intelligence integration in health care documentation.

Methods: We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach. First, for the clinical evaluation, 4 medical professionals rated the records on a 5-point Likert scale across 5 criteria: appropriateness, accuracy, structure/format, conciseness, and clinical validity. Second, for the quantitative evaluation, we developed a framework to categorize and count errors in the LLM outputs, identifying 7 key error types. Statistical methods, including Pearson correlation and intraclass correlation coefficients (ICC), were used to assess consistency and agreement among evaluators.
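As an illustration of how interrater agreement on Likert ratings can be quantified, the following is a minimal sketch of an intraclass correlation coefficient. It assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1); the abstract does not state which ICC form the study used. The function name `icc2_1` and the ratings matrix are hypothetical and are not study data.

```python
# Minimal sketch of ICC(2,1): two-way random effects, absolute agreement,
# single rater. ASSUMPTION: the abstract does not say which ICC form was used.

def icc2_1(ratings):
    """ratings: list of rows, one row per record, one column per rater."""
    n = len(ratings)         # number of rated records (subjects)
    k = len(ratings[0])      # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for a two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between records
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical 5-point Likert scores: 5 records rated by 4 evaluators.
toy_ratings = [
    [4, 4, 5, 4],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 1, 1],
]
print(f"ICC(2,1) on toy data: {icc2_1(toy_ratings):.3f}")
```

Values near 1 indicate that evaluators rank and score the records almost identically, and values near 0 indicate that most of the variance comes from rater disagreement rather than from differences between records.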

Results: The clinical evaluation demonstrated strong interrater reliability, with ICC values ranging from 0.653 to 0.887 (P<.001), and a test-retest reliability Pearson correlation coefficient of 0.776 (P<.001). Quantitative analysis revealed that invalid generation errors were the most common, constituting 35.38% of total errors, while structural malformation errors had the most significant negative impact on the clinical evaluation score (Pearson r=-0.654; P<.001). A strong negative correlation was found between the number of quantitative errors and clinical evaluation scores (Pearson r=-0.633; P<.001), indicating that higher error rates corresponded to lower clinical acceptability.
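The two quantitative summaries reported above, the share of each error type and the correlation between error counts and clinical scores, can be sketched as follows. This is an illustrative sketch only: the helper names `error_shares` and `pearson_r`, the grouping of the remaining error types into one bucket, and all numbers are hypothetical and do not reproduce the study's data or code.

```python
from math import sqrt

def error_shares(counts):
    """Return each error type's percentage of all counted errors."""
    total = sum(counts.values())
    return {name: 100.0 * c / total for name, c in counts.items()}

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical error tallies; only two of the 7 error types are named in the
# abstract, so the rest are grouped into a single placeholder bucket.
toy_counts = {
    "invalid generation": 7,
    "structural malformation": 3,
    "other error types (grouped)": 10,
}
for name, pct in error_shares(toy_counts).items():
    print(f"{name}: {pct:.2f}% of errors")

# Hypothetical per-record data: total error count and mean clinical score.
toy_errors = [0, 1, 2, 4, 5, 7]
toy_scores = [4.8, 4.5, 4.1, 3.2, 3.0, 2.1]
print(f"Pearson r on toy data: {pearson_r(toy_errors, toy_scores):.3f}")
```

A negative coefficient in such an analysis means that records with more counted errors tend to receive lower clinical scores, which is the direction of the relationship reported above.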

Conclusions: Our research provides robust support for the reliability and clinical acceptability of the proposed evaluation framework. It underscores the framework's potential to mitigate clinical burdens and foster the responsible integration of artificial intelligence technologies in health care, suggesting a promising direction for future research and practical applications in the field.

Keywords: artificial intelligence; clinical evaluation; emergency department; health care documentation; large language models; medical record accuracy.

©Junhyuk Seo, Dasol Choi, Taerim Kim, Won Chul Cha, Minha Kim, Haanju Yoo, Namkee Oh, YongJin Yi, Kye Hwa Lee, Edward Choi. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 20.11.2024.

MeSH terms

Artificial Intelligence
Documentation* / methods
Documentation* / standards
Documentation* / statistics & numerical data
Electronic Health Records / standards
Emergency Service, Hospital
Humans
Reproducibility of Results
Republic of Korea