Analysis of 2019 Ohio Disease Intervention Specialist (DIS) Records for Syphilis Cases Using Clustering Algorithms

Payal Chakraborty; Xia Ning; Mary McNeill; David M Kline; Abigail B Shoben; William C Miller; Abigail Norris Turner

doi:10.1097/OLQ.0000000000002091

Sex Transm Dis. 2024 Oct 31. doi: 10.1097/OLQ.0000000000002091. Online ahead of print.

Authors

Payal Chakraborty, Xia Ning, Mary McNeill¹, David M Kline², Abigail B Shoben³, William C Miller⁴, Abigail Norris Turner

Affiliations

¹ Ohio Department of Health, Columbus, OH, USA.
² Division of Public Health Sciences, Department of Biostatistics and Data Science, Wake Forest School of Medicine, Winston-Salem, NC, USA.
³ Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA.
⁴ Department of Epidemiology, Gillings School of Public Health, University of North Carolina Chapel Hill, Chapel Hill, NC, USA.

PMID: 39481010

Abstract

Background: Developments in natural language processing (NLP) and unsupervised machine learning methodologies (e.g., clustering) have given researchers new tools to analyze both structured and unstructured health data. We applied these methods to 2019 Ohio disease intervention specialist (DIS) syphilis records to determine whether they can uncover patterns of co-occurrence among individual characteristics, risk factors, and clinical characteristics of syphilis that have not yet been reported in the literature.

Methods: The 2019 DIS syphilis records (n=1,996) contain both structured data (categorical and numerical variables) and unstructured notes. In the structured data, we examined case demographics, syphilis risk factors, and clinical characteristics of syphilis. For the unstructured text, we applied TF-IDF (term frequency multiplied by inverse document frequency) weights, a common way to convert text into numerical representations. We performed agglomerative clustering with cosine similarity using the CLUTO software.
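The text-processing steps described in the Methods can be illustrated with a minimal sketch. This is not the CLUTO implementation used in the study: it is a small pure-Python stand-in, assuming whitespace tokenization, a plain log(N/df) inverse document frequency, and average-linkage merging, none of which are specified in the abstract. The notes are invented placeholder strings, and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Weight = term frequency * inverse document frequency (assumed form).
        vec = {t: n * math.log(n_docs / df[t]) for t, n in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def agglomerative(vectors, n_clusters):
    """Repeatedly merge the two clusters with the highest average similarity."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters

# Invented example notes, not taken from any DIS record.
notes = [
    "partner notified and treated",
    "partner treated after notification call",
    "unable to locate patient phone disconnected",
    "patient phone disconnected unable to reach",
]
clusters = agglomerative(tfidf_vectors(notes), n_clusters=2)
print(clusters)
```

On these four placeholder notes, the two partner-notification notes and the two unable-to-contact notes end up in separate clusters, because notes in different groups share no terms and so have a cosine similarity of zero.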

Results: The cluster analysis yielded six clusters of syphilis cases based on patterns in the structured and unstructured data. The average internal similarities were much higher than the average external similarities, indicating that the clusters were well-formed. The factors underlying three of the clusters related to patterns of missing data. The factors underlying the other three clusters were sexual behaviors and partnerships. Notably, one of these three consisted of individuals who reported oral sex with male or anonymous partners while intoxicated, and one consisted mainly of males who have sex with females.
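The comparison of internal and external similarity can be sketched as follows. The abstract does not define these quantities, so this example assumes the usual reading: internal similarity is the mean pairwise cosine similarity among members of a cluster, and external similarity is the mean cosine similarity between members and all non-members. The vectors, labels, and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def internal_external(vectors, labels):
    """Return {cluster: (internal similarity, external similarity)}."""
    result = {}
    for c in sorted(set(labels)):
        inside = [v for v, l in zip(vectors, labels) if l == c]
        outside = [v for v, l in zip(vectors, labels) if l != c]
        within = [cosine(inside[i], inside[j])
                  for i in range(len(inside))
                  for j in range(i + 1, len(inside))]
        across = [cosine(u, v) for u in inside for v in outside]
        # Assumed conventions: a singleton cluster gets internal similarity
        # 1.0, and a cluster with no outsiders gets external similarity 0.0.
        isim = sum(within) / len(within) if within else 1.0
        esim = sum(across) / len(across) if across else 0.0
        result[c] = (isim, esim)
    return result

# Invented two-dimensional vectors standing in for case representations.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]
scores = internal_external(vectors, labels)
for cluster, (isim, esim) in scores.items():
    print(cluster, round(isim, 3), round(esim, 3))
```

A cluster is well separated in this sense when its internal similarity is close to 1 and clearly exceeds its external similarity, which is the pattern the Results describe.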

Conclusions: Our analysis produced clusters that were well-formed mathematically but did not reveal epidemiological information about syphilis risk factors or transmission that was not already known.