psHarmonize: Facilitating reproducible large-scale pre-statistical data harmonization and documentation in R

Patterns (N Y). 2024 Jun 14;5(8):101003. doi: 10.1016/j.patter.2024.101003. eCollection 2024 Aug 9.

Abstract

Combining pertinent data from multiple studies can increase the robustness of epidemiological investigations. Effective "pre-statistical" data harmonization is paramount to the streamlined conduct of collective, multi-study analysis. Harmonizing data and documenting decisions about the transformation of variables to a common set of categorical values and measurement scales are time-consuming and error-prone tasks, particularly when many studies with large numbers of variables are involved. The psHarmonize R package facilitates harmonization by combining multiple datasets, applying data transformation functions, and creating long and wide harmonized datasets. The user provides transformation instructions in a "harmonization sheet" that lists dataset names, variable names, and coding instructions, and that centrally tracks all harmonization decisions. The package performs the harmonization, generates error logs as necessary, and creates summary reports of the harmonized data. psHarmonize is poised to serve as a central feature of data preparation for the joint analysis of multiple studies.
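
To illustrate the sheet-driven approach described above, the following is a minimal base-R sketch and not the psHarmonize API. The dataset names, column names (`study`, `source_var`, `target_var`, `instruction`), and the helper `apply_row` are hypothetical and chosen only for this example. Each row of the sheet names a source dataset and variable, the harmonized variable it maps to, and an R expression in `x` that performs the recoding.

```r
# Two hypothetical study datasets that record body weight differently.
study_a <- data.frame(id = 1:3, wt_kg = c(70, 82.5, 60))
study_b <- data.frame(id = 1:2, weight_lb = c(154, 200))
datasets <- list(study_a = study_a, study_b = study_b)

# Hypothetical harmonization sheet: one row per study and target variable.
# `instruction` is an R expression in which `x` is the source variable.
sheet <- data.frame(
  study       = c("study_a", "study_b"),
  source_var  = c("wt_kg", "weight_lb"),
  target_var  = c("weight_kg", "weight_kg"),
  instruction = c("x", "x * 0.45359237"),
  stringsAsFactors = FALSE
)

# Apply one sheet row: look up the dataset, evaluate the instruction,
# and return the result in long format. If the instruction fails,
# the values are set to NA instead of stopping the whole run.
apply_row <- function(i) {
  row <- sheet[i, ]
  d <- datasets[[row$study]]
  x <- d[[row$source_var]]
  value <- tryCatch(
    eval(parse(text = row$instruction), list(x = x)),
    error = function(e) rep(NA_real_, length(x))
  )
  data.frame(study = row$study, id = d$id,
             variable = row$target_var, value = value)
}

# Long harmonized dataset: one row per study, participant, and variable.
harmonized_long <- do.call(rbind, lapply(seq_len(nrow(sheet)), apply_row))

# Wide harmonized dataset: one row per study and participant.
harmonized_wide <- reshape(harmonized_long,
                           idvar = c("study", "id"),
                           timevar = "variable",
                           direction = "wide")
```

Because every transformation lives in the sheet rather than in scattered scripts, the sheet itself serves as the record of harmonization decisions in this sketch.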

Keywords: R package; data harmonization; data integration; data management; data pooling.