VarBen: Generating in Silico Reference Data Sets for Clinical Next-Generation Sequencing Bioinformatics Pipeline Evaluation

J Mol Diagn. 2021 Mar;23(3):285-299. doi: 10.1016/j.jmoldx.2020.11.010. Epub 2020 Dec 18.

Abstract

Next-generation sequencing is increasingly being adopted as a valuable method for the detection of somatic variants in clinical oncology. However, it is still challenging to reach a satisfactory level of robustness and standardization in clinical practice when using the currently available bioinformatics pipelines to detect variants from raw sequencing data. Moreover, appropriate reference data sets are lacking for clinical bioinformatics pipeline development, validation, and proficiency testing. Herein, we developed the Variant Benchmark tool (VarBen), an open-source software for variant simulation to generate customized reference data sets by directly editing the original sequencing reads. VarBen can introduce a variety of variants, including single-nucleotide variants, small insertions and deletions, and large structural variants, into targeted, exome, or whole-genome sequencing data, and can handle sequencing data from both the Illumina and Ion Torrent sequencing platforms. To demonstrate the feasibility and robustness of VarBen, we performed variant simulation on different sequencing data sets and compared the simulated variants with real-world data. The validation study showed that the simulated data are highly comparable to real-world data and that VarBen is a reliable tool for variant simulation. In addition, our collaborative study of somatic variant calling in 20 laboratories emphasizes the need for laboratories to evaluate their bioinformatics pipelines with customized reference data sets. VarBen may help users develop and validate their bioinformatics pipelines using locally generated sequencing data.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computational Biology / methods*
  • Computational Biology / standards
  • Genetic Association Studies / methods*
  • Genetic Association Studies / standards
  • Genetic Predisposition to Disease*
  • Genetic Variation*
  • Genome-Wide Association Study / methods
  • Genome-Wide Association Study / standards
  • High-Throughput Nucleotide Sequencing*
  • Humans
  • INDEL Mutation
  • Mutation
  • Polymorphism, Single Nucleotide
  • Reproducibility of Results
  • Software*