Impact of reference design on estimating SARS-CoV-2 lineage abundances from wastewater sequencing data

Gigascience. 2024 Jan 2:13:giae051. doi: 10.1093/gigascience/giae051.

Abstract

Background: Sequencing of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) RNA from wastewater samples has emerged as a valuable tool for detecting the presence and relative abundances of SARS-CoV-2 variants in a community. By analyzing the viral genetic material present in wastewater, researchers and public health authorities can gain early insights into the spread of virus lineages and emerging mutations. Constructing reference datasets from known SARS-CoV-2 lineages and their mutation profiles has become state-of-the-art for assigning viral lineages and their relative abundances from wastewater sequencing data. However, selecting reference sequences or mutations directly affects the predictive power.

Results: Here, we show the impact of a mutation- and sequence-based reference reconstruction for SARS-CoV-2 abundance estimation. We benchmark 3 datasets: (i) synthetic "spike-in" mixtures; (ii) German wastewater samples from early 2021, mainly comprising Alpha; and (iii) samples obtained from wastewater at an international airport in Germany from the end of 2021, including first signals of Omicron. The 2 approaches differ in sublineage detection, with the marker mutation-based method, in particular, being challenged by the increasing number of mutations and lineages. However, the estimations of both approaches depend on selecting representative references and optimized parameter settings. By performing parameter escalation experiments, we demonstrate the effects of reference size and alternative allele frequency cutoffs for abundance estimation. We show how different parameter settings can lead to different results for our test datasets and illustrate the effects of virus lineage composition of wastewater samples and references.
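As a rough illustration of why an alternative allele frequency (AF) cutoff can change an abundance estimate, the following minimal sketch takes the median AF over each lineage's marker mutations and treats AFs below the cutoff as absent. It is not the method benchmarked in the study: the lineage names, mutation names, AF values, and the function estimate_abundances are all hypothetical, and the median is only one simple way to summarize marker frequencies.

```python
from statistics import median

# Hypothetical marker-mutation reference: lineage -> characteristic mutations.
REFERENCE = {
    "lineage_A": {"mut_a1", "mut_a2", "mut_a3"},
    "lineage_B": {"mut_b1", "mut_b2", "mut_b3"},
}

# Hypothetical alternative allele frequencies called from one wastewater sample.
OBSERVED_AF = {
    "mut_a1": 0.62, "mut_a2": 0.58, "mut_a3": 0.60,
    "mut_b1": 0.04, "mut_b2": 0.35, "mut_b3": 0.01,
}


def estimate_abundances(reference, observed_af, min_af=0.0):
    """Return the median marker AF per lineage.

    Mutations missing from the sample, or with an AF below min_af,
    count as 0.0 (assumed absent).
    """
    estimates = {}
    for lineage, markers in reference.items():
        afs = []
        for mutation in sorted(markers):
            af = observed_af.get(mutation, 0.0)
            afs.append(af if af >= min_af else 0.0)
        estimates[lineage] = median(afs)
    return estimates


if __name__ == "__main__":
    # Compare the estimates without a cutoff and with a 5% cutoff.
    for cutoff in (0.0, 0.05):
        print(cutoff, estimate_abundances(REFERENCE, OBSERVED_AF, min_af=cutoff))
```

In this toy input, the low-frequency lineage keeps a small nonzero estimate without a cutoff and drops to zero once the cutoff is raised to 5%, while the dominant lineage is unaffected. Adding or removing markers from REFERENCE shifts the medians in the same way, which is the kind of sensitivity to reference design that the abstract describes.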

Conclusions: Our study highlights current computational challenges, focusing on the general reference design, which directly impacts abundance allocations. We illustrate advantages and disadvantages that may be relevant for further developments in the wastewater community and in the context of defining robust quality metrics.

Keywords: SARS-CoV-2; abundance estimation; next-generation sequencing; benchmark; sewage; wastewater.

MeSH terms

  • COVID-19* / epidemiology
  • COVID-19* / virology
  • Genome, Viral
  • Humans
  • Mutation*
  • RNA, Viral / genetics
  • SARS-CoV-2* / genetics
  • SARS-CoV-2* / isolation & purification
  • Wastewater* / virology

Substances

  • Wastewater
  • RNA, Viral

Supplementary concepts

  • SARS-CoV-2 variants