Background: The Populations Underrepresented in Mental illness Association Studies (PUMAS) project is attempting to remediate the historical underrepresentation of African and Latin American populations in psychiatric genetics through large-scale genetic association studies of individuals diagnosed with a serious mental illness [SMI, including schizophrenia (SCZ), schizoaffective disorder (SZA), bipolar disorder (BP), and severe major depressive disorder (MDD)] and matched controls. Given growing evidence indicating substantial symptomatic and genetic overlap between these diagnoses, we sought to enable transdiagnostic genetic analyses of PUMAS data by conducting phenotype alignment and harmonization for 89,320 participants (48,165 cases and 41,155 controls) from four cohorts, each of which used different ascertainment and assessment methods: PAISA n=9,105; PUMAS-LATAM n=14,638; NGAP n=42,953; and GPC n=22,624. As we describe here, these efforts have yielded harmonized datasets enabling us to analyze PUMAS genetic variation data at three levels: SMI overall, diagnoses, and individual symptoms.
Methods: In aligning item-level phenotypes obtained from 14 different clinical instruments, we accounted for the content, branching structure, and time frame of each phenotype; standardized diagnoses; and selected 19 core SMI item-level phenotypes for analyses. We evaluated the harmonization in PUMAS cases using multiple correspondence analysis (MCA), co-occurrence analyses, and item-level endorsement.
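As an illustration of the co-occurrence step only, the following minimal sketch counts how often pairs of item-level phenotypes are endorsed by the same case. It is not the PUMAS pipeline: the function name `cooccurrence_counts`, the item labels, and the toy records are hypothetical, and the sketch assumes each case has already been reduced to the set of harmonized items it endorses.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count, for every unordered pair of items, the cases endorsing both.

    `records` is an iterable of sets; each set holds the (hypothetical)
    harmonized item labels endorsed by one case.
    """
    counts = Counter()
    for endorsed in records:
        # Sort so each pair is always stored in the same (a, b) order.
        for a, b in combinations(sorted(endorsed), 2):
            counts[(a, b)] += 1
    return counts


# Toy data, invented for illustration; not PUMAS data.
cases = [
    {"hallucinations", "delusions"},
    {"delusions", "depressed_mood"},
    {"hallucinations", "delusions", "depressed_mood"},
]
pair_counts = cooccurrence_counts(cases)
```

Dividing each pair count by the number of cases assessed on both items would turn these counts into co-occurrence rates that are comparable across cohorts of different sizes.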
Outcomes: We mapped >6,895 item-level phenotypes in the aggregated PUMAS data, in which SCZ (44.97%) and severe BP (BP-I, 31.53%) were the most common diagnoses. Twelve of the 19 core item-level phenotypes occurred at frequencies of >10% across all diagnoses, indicating their potential utility for transdiagnostic genetic analyses. MCA of the 14 phenotypes that were present in all cohorts revealed consistency across cohorts and placed MDD and SCZ into separate clusters, while other diagnoses showed no significant phenotypic clustering.
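The >10% criterion can be made concrete with a small sketch that computes the endorsement frequency of each item within each diagnosis and keeps the items exceeding the threshold in every diagnosis. The function names, the row layout `(diagnosis, item, endorsed)`, and the toy rows are assumptions made for illustration and do not reproduce the study's data or code.

```python
from collections import defaultdict


def endorsement_by_diagnosis(rows):
    """Return {item: {diagnosis: fraction of assessed cases endorsing it}}.

    `rows` is an iterable of (diagnosis, item, endorsed) tuples, where
    `endorsed` is 1 if the case endorsed the item and 0 otherwise.
    """
    totals = defaultdict(lambda: defaultdict(int))
    positives = defaultdict(lambda: defaultdict(int))
    for diagnosis, item, endorsed in rows:
        totals[item][diagnosis] += 1
        positives[item][diagnosis] += endorsed
    return {
        item: {dx: positives[item][dx] / n for dx, n in by_dx.items()}
        for item, by_dx in totals.items()
    }


def transdiagnostic_items(freqs, threshold=0.10):
    """Items whose endorsement exceeds `threshold` in every diagnosis seen."""
    return sorted(
        item
        for item, by_dx in freqs.items()
        if all(f > threshold for f in by_dx.values())
    )


# Toy rows, invented for illustration; not PUMAS data.
rows = [
    ("SCZ", "delusions", 1), ("SCZ", "delusions", 1),
    ("MDD", "delusions", 1), ("MDD", "delusions", 0),
    ("SCZ", "mania", 0), ("SCZ", "mania", 0),
    ("MDD", "mania", 1), ("MDD", "mania", 0),
]
freqs = endorsement_by_diagnosis(rows)
shared = transdiagnostic_items(freqs)
```

Note that this sketch only checks the diagnoses in which an item was actually assessed; an item never assessed in a given diagnosis would need separate handling.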
Interpretation: Our alignment strategy effectively aggregated extensive phenotypic data obtained using diverse assessment tools. The MCA yielded dimensional scores, which we will use for genetic analyses along with the item-level phenotypes. After successful harmonization, residual phenotypic heterogeneity between cohorts reflects differences in the branching structure of diagnostic instruments, recruitment strategies, and symptom interpretation (due to cultural variation).