Prompt matters: evaluation of large language model chatbot responses related to Peyronie's disease

Christopher J Warren; Victoria S Edmonds; Nicolette G Payne; Sandeep Voletti; Sarah Y Wu; JennaKay Colquitt; Hossein Sadeghi-Nejad; Nahid Punjani

doi:10.1093/sexmed/qfae055

Sex Med. 2024 Sep 9;12(4):qfae055. doi: 10.1093/sexmed/qfae055. eCollection 2024 Aug.

Authors

Christopher J Warren¹, Victoria S Edmonds¹, Nicolette G Payne¹, Sandeep Voletti², Sarah Y Wu², JennaKay Colquitt², Hossein Sadeghi-Nejad³, Nahid Punjani¹

Affiliations

¹ Department of Urology, Mayo Clinic Arizona, Phoenix, AZ 85054, United States.
² Mayo Clinic Alix School of Medicine, Scottsdale, AZ 85259, United States.
³ Department of Urology, New York University, New York, NY 10016, United States.

Abstract

Introduction: Despite direct access to clinicians through the electronic health record, patients are increasingly turning to the internet for health information, especially for sensitive urologic conditions such as Peyronie's disease (PD). Large language model (LLM) chatbots are a form of artificial intelligence that relies on user prompts to mimic conversation, and they have shown remarkable capabilities. Their conversational nature gives them the potential to answer patient questions related to PD; however, the accuracy, comprehensiveness, and readability of LLM responses related to PD remain unknown.

Aims: To assess the quality and readability of information generated from 4 LLMs with searches related to PD; to see if users could improve responses; and to assess the accuracy, completeness, and readability of responses to artificial preoperative patient questions sent through the electronic health record prior to undergoing PD surgery.

Methods: The National Institutes of Health's frequently asked questions related to PD were entered into 4 LLMs, unprompted and prompted. The responses were evaluated for overall quality by the previously validated DISCERN questionnaire. Accuracy and completeness of LLM responses to 11 presurgical patient messages were evaluated with previously accepted Likert scales. All evaluations were performed by 3 independent reviewers in October 2023, and all reviews were repeated in April 2024. Descriptive statistics and analysis were performed.

Results: Without prompting, the quality of information was moderate across all LLMs but improved to high quality with prompting. LLMs were accurate and complete, with an average score of 5.5 of 6.0 (SD, 0.8) and 2.8 of 3.0 (SD, 0.4), respectively. The average Flesch-Kincaid reading level was grade 12.9 (SD, 2.1). Chatbots were unable to communicate at a grade 8 reading level when prompted, and their citations were appropriate only 42.5% of the time.
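For readers unfamiliar with the readability metric reported above, the following is a minimal sketch of the standard Flesch-Kincaid grade-level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The abstract does not state which tool the authors used to compute it, so the function names (`flesch_kincaid_grade`, `count_syllables`, `grade_of_text`) are hypothetical, and the vowel-group syllable counter is a rough assumption that will differ from dedicated readability software.

```python
import re


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(vowel_groups))


def grade_of_text(text: str) -> float:
    """Estimate the grade level of a passage using the heuristics above."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_kincaid_grade(len(words), len(sentences), syllables)
```

A score near 8 corresponds to the grade 8 target mentioned in the results, while the reported average of 12.9 corresponds to roughly the end of secondary school.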

Conclusion: LLMs may become a valuable tool for patient education for PD, but they currently rely on clinical context and appropriate prompting by humans to be useful. Unfortunately, the reading level required to understand their responses remains higher than that of the average patient, and their citations cannot be trusted. However, given their increasing uptake and accessibility, patients and physicians should be educated on how to interact with these LLMs to elicit the most appropriate responses. In the future, LLMs may reduce burnout by helping physicians respond to patient messages.

Keywords: Peyronie’s disease; artificial intelligence; chatbot; large language model; patient education.