Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox

Jakob Wirbel; Konrad Zych; Morgan Essex; Nicolai Karcher; Ece Kartal; Guillem Salazar; Peer Bork; Shinichi Sunagawa; Georg Zeller

doi:10.1186/s13059-021-02306-1

Genome Biol. 2021 Mar 30;22(1):93. doi: 10.1186/s13059-021-02306-1.

Authors

Jakob Wirbel¹, Konrad Zych¹,², Morgan Essex¹,³, Nicolai Karcher¹,⁴, Ece Kartal¹, Guillem Salazar⁵, Peer Bork¹,⁶,⁷,⁸, Shinichi Sunagawa⁵, Georg Zeller⁹

Affiliations

¹ Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), 69117, Heidelberg, Germany.
² Present Address: Clinical Microbiomics A/S, Ole Maaløes Vej 3, 2200, København, Denmark.
³ Present Address: Experimental and Clinical Research Center (ECRC) of the Max Delbrück Center for Molecular Medicine and Charité University Hospital, 13125, Berlin, Germany.
⁴ Department CIBIO, University of Trento, 38123, Trento, Italy.
⁵ Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093, Zürich, Switzerland.
⁶ Molecular Medicine Partnership Unit, Heidelberg, Germany.
⁷ Max Delbrück Centre for Molecular Medicine, 13125, Berlin, Germany.
⁸ Department of Bioinformatics, Biocenter, University of Würzburg, 97074, Würzburg, Germany.
⁹ Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), 69117, Heidelberg, Germany. zeller@embl.de.

Abstract

The human microbiome is increasingly mined for diagnostic and therapeutic biomarkers using machine learning (ML). However, metagenomics-specific software is scarce, and overoptimistic evaluation and limited cross-study generalization are prevailing issues. To address these, we developed SIAMCAT, a versatile R toolbox for ML-based comparative metagenomics. We demonstrate its capabilities in a meta-analysis of fecal metagenomic studies (10,803 samples). When naively transferred across studies, ML models lost accuracy and disease specificity, which could, however, be resolved by a novel training set augmentation strategy. This reveals some biomarkers to be disease-specific, with others shared across multiple conditions. SIAMCAT is freely available from siamcat.embl.de.

Keywords: Machine learning; Meta-analysis; Microbiome data analysis; Microbiome-wide association studies (MWAS); Statistical modeling.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology / methods*
Confounding Factors, Epidemiologic
Crohn Disease / etiology
Databases, Genetic
Gastrointestinal Microbiome
Humans
Machine Learning*
Meta-Analysis as Topic
Metagenome*
Metagenomics / methods*
Microbiota*
Models, Statistical
ROC Curve
Software*
Workflow