Evaluating the comprehension and accuracy of ChatGPT's responses to diabetes-related questions in Urdu compared to English

Seyreen Faisal; Tafiya Erum Kamran; Rimsha Khalid; Zaira Haider; Yusra Siddiqui; Nadia Saeed; Sunaina Imran; Romaan Faisal; Misbah Jabeen

doi:10.1177/20552076241289730

Digit Health. 2024 Oct 17;10:20552076241289730. doi: 10.1177/20552076241289730. eCollection 2024 Jan-Dec.

Authors

Seyreen Faisal¹, Tafiya Erum Kamran¹, Rimsha Khalid¹, Zaira Haider¹, Yusra Siddiqui¹, Nadia Saeed², Sunaina Imran¹, Romaan Faisal³, Misbah Jabeen⁴

Affiliations

¹ Shifa College of Medicine, Shifa Tameer-e-Millat University, Islamabad, Pakistan.
² Department of Internal Medicine, Shifa College of Medicine, Shifa Tameer-e-Millat University, Islamabad, Pakistan.
³ Islamabad Medical and Dental College, Shaheed Zulfiqar Ali Bhutto Medical University, Islamabad, Pakistan.
⁴ Department of Endocrinology, Shifa International Hospital, Islamabad, Pakistan.

Abstract

Introduction: Patients with diabetes require healthcare and information that are accurate and extensive. Large language models (LLMs) like ChatGPT have the potential to provide such comprehensive information. This study aimed to determine (a) the comprehensiveness of ChatGPT's responses in Urdu to diabetes-related questions and (b) the accuracy of ChatGPT's Urdu responses compared with its English responses.

Methods: A cross-sectional observational study was conducted. Two reviewers experienced in internal medicine and endocrinology graded ChatGPT's Urdu and English responses to 53 questions on diabetes knowledge, lifestyle, and prevention. A senior reviewer resolved discrepancies. Urdu responses were assessed for comprehensiveness and accuracy, and their accuracy was then compared with that of the corresponding English responses.

Results: Of the 53 Urdu responses, only two (3.8%) were graded as comprehensive and five (9.4%) as correct but inadequate. The largest share, 25 of 53 (47.2%), were graded as mixed, containing both correct and incorrect or outdated information. On the scale comparing the accuracy of Urdu and English responses, no Urdu response (0.0%) was judged more accurate than its English counterpart, and an overwhelming majority, 49 of 53 (92.5%), were judged less accurate.
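
The reported percentages follow directly from the counts out of 53 responses. As a minimal sketch for checking that arithmetic, the snippet below recomputes them; the dictionary labels and the `percent` helper are illustrative names, not part of the study, and only the counts stated in the abstract are included.

```python
# Recompute the percentages reported in the Results from the stated counts.
# Labels and helper names are illustrative assumptions; counts come from the abstract.
TOTAL = 53  # number of graded responses

reported_counts = {
    "comprehensive": 2,
    "correct but inadequate": 5,
    "mixed correct and incorrect/outdated": 25,
    "Urdu more accurate than English": 0,
    "Urdu less accurate than English": 49,
}


def percent(count, total=TOTAL):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


percentages = {label: percent(n) for label, n in reported_counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {reported_counts[label]}/{TOTAL} = {pct}%")
```

Categories whose counts are not given in the abstract are deliberately left out rather than inferred.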

Conclusion: Although ChatGPT's ability to retrieve information about diabetes is impressive, it should be used only as an adjunct rather than as a sole source of information. Further work is needed to optimize Urdu responses in medical contexts so that the model can approach its potential.

Keywords: Artificial intelligence; chronic disease management; health communication; patient education as a topic; telemedicine.