Computationally Efficient Assembly of Pseudomonas aeruginosa Gene Expression Compendia

Georgia Doing; Alexandra J Lee; Samuel L Neff; Taylor Reiter; Jacob D Holt; Bruce A Stanton; Casey S Greene; Deborah A Hogan

doi:10.1128/msystems.00341-22

mSystems. 2023 Feb 23;8(1):e0034122. doi: 10.1128/msystems.00341-22. Epub 2022 Dec 21.

Authors

Georgia Doing¹, Alexandra J Lee², Samuel L Neff¹, Taylor Reiter³, Jacob D Holt¹, Bruce A Stanton¹, Casey S Greene³ ⁴, Deborah A Hogan¹

Affiliations

¹ Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA.
² Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
³ Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, Colorado, USA.
⁴ Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

Abstract

Thousands of Pseudomonas aeruginosa RNA sequencing (RNA-seq) gene expression profiles are publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). In this work, the transcriptional profiles from hundreds of studies performed by over 75 research groups were reanalyzed in aggregate to create a powerful tool for hypothesis generation and testing. Raw sequence data were uniformly processed using the Salmon pseudoaligner, and this read mapping method was validated by comparison to a direct alignment method. We developed filtering criteria to exclude samples with aberrant levels of housekeeping gene expression or an unexpected number of genes with no reported values and normalized the filtered compendia using the ratio-of-medians method. The filtering and normalization steps greatly improved gene expression correlations for genes within the same operon or regulon across the 2,333 samples. Since the RNA-seq data were generated using diverse strains, we report the effects of mapping samples to noncognate reference genomes by separately analyzing all samples mapped to cDNA reference genomes for strains PAO1 and PA14, two divergent strains that were used to generate most of the samples. Finally, we developed an algorithm to incorporate new data as they are deposited into the SRA. Our processing and quality control methods provide a scalable framework for taking advantage of the troves of biological information hibernating in the depths of microbial gene expression data and yield useful tools for P. aeruginosa RNA-seq data to be leveraged for diverse research goals.

IMPORTANCE Pseudomonas aeruginosa is a causative agent of a wide range of infections, including chronic infections associated with cystic fibrosis. These P. aeruginosa infections are difficult to treat and often have negative outcomes. To aid in the study of this problematic pathogen, we mapped, filtered for quality, and normalized thousands of P. aeruginosa RNA-seq gene expression profiles that were publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The resulting compendia facilitate analyses across experiments, strains, and conditions. Ultimately, the workflow that we present could be applied to analyses of other microbial species.
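The filter-then-normalize order described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The function names, the threshold arguments (`hk_bounds`, `zero_range`), and the genes-by-samples array layout are all hypothetical. The abstract names a "ratio-of-medians" method without giving its details, so the normalization shown here is the widely used median-of-ratios size-factor calculation, which may differ from the published procedure.

```python
import numpy as np


def filter_samples(counts, housekeeping_idx, hk_bounds, zero_range):
    """Return a boolean mask of samples (columns) that pass both filters.

    counts           : genes x samples array of expression values
    housekeeping_idx : row indices of housekeeping genes (hypothetical choice)
    hk_bounds        : (low, high) accepted median housekeeping expression
    zero_range       : (low, high) accepted number of genes with a zero value
    """
    counts = np.asarray(counts, dtype=float)
    hk_median = np.median(counts[housekeeping_idx, :], axis=0)
    zero_genes = (counts == 0).sum(axis=0)
    hk_ok = (hk_median >= hk_bounds[0]) & (hk_median <= hk_bounds[1])
    zero_ok = (zero_genes >= zero_range[0]) & (zero_genes <= zero_range[1])
    return hk_ok & zero_ok


def median_of_ratios_normalize(counts):
    """Scale each sample by a median-of-ratios size factor.

    Assumption: this stands in for the normalization named in the abstract.
    Only genes with nonzero values in every sample contribute to the factors.
    """
    counts = np.asarray(counts, dtype=float)
    expressed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[expressed])
    # Per-gene log geometric mean across samples serves as the reference.
    log_reference = log_counts.mean(axis=1, keepdims=True)
    size_factors = np.exp(np.median(log_counts - log_reference, axis=0))
    return counts / size_factors, size_factors
```

Filtering comes first so that low-quality samples do not distort the per-gene reference used to compute the size factors, for example `kept = counts[:, filter_samples(counts, hk, (10, 200), (0, 1))]` followed by `median_of_ratios_normalize(kept)`.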

Keywords: Pseudomonas aeruginosa; RNA-seq; compendium; gene expression; strains; transcriptome.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Cystic Fibrosis* / complications
Humans
Pseudomonas aeruginosa* / genetics
RNA
Transcriptome

Substances

RNA

