The 16S rRNA gene has been used extensively as a molecular marker to explore evolutionary relationships and to profile microbial composition across various environments. Despite its convenience and prevalence, it has limitations: variable copy numbers, intragenomic heterogeneity, and low taxonomic resolution all bias estimates of microbial diversity. Here, analysis of 24,248 complete prokaryotic genomes indicated that the 16S rRNA gene copy number ranged from 1 to 37 in bacteria and from 1 to 5 in archaea. Intragenomic heterogeneity was observed in 60% of prokaryotic genomes, in most cases at a level below 1%. The overestimation of microbial diversity caused by intragenomic variation and the underestimation introduced by interspecific conservation were calculated for both full-length and partial 16S rRNA genes. At the 100% clustering threshold, microbial diversity could be overestimated by as much as 156.5% when using the full-length gene. Analyses based on the V4 to V5 region introduced the lowest overestimation rate (4.4%) but exhibited slightly lower species resolution than other variable regions under the 97% threshold. For each variable region, an appropriate threshold, rather than the canonical 97%, was proposed to minimize the risk of both splitting a single genome into multiple clusters and lumping different species into the same cluster. This study updates the 16S rRNA gene copy number and intragenomic variation information for the currently available prokaryotic genomes and quantifies the biases in estimating prokaryotic diversity, providing a reference for choosing amplified regions and clustering thresholds in microbial community surveys.
IMPORTANCE Microbial diversity is typically analyzed using marker gene-based methods, of which 16S rRNA gene sequencing is the most widely used. However, obtaining an accurate estimate of microbial diversity remains a challenge because of the intragenomic variation and low taxonomic resolution of 16S rRNA genes. A comprehensive examination of this bias across the ever-increasing number of prokaryotic genomes highlights the importance of choosing sequencing regions and clustering thresholds according to the specific research objectives.
Keywords: 16S rRNA gene; interspecific conservation; intragenomic heterogeneity; microbial diversity.
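The two opposing biases described above, splitting one genome into several clusters and lumping different species into one cluster, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the 20-nt sequences are invented toy data, `identity` is a simple mismatch count over pre-aligned sequences of equal length, and `greedy_cluster` is a generic greedy procedure standing in for whatever clustering method a given survey uses. It is not the procedure used in this study.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first cluster whose representative it matches
    at >= threshold; otherwise open a new cluster with it as representative."""
    reps, labels = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                labels.append(i)
                break
        else:
            reps.append(s)
            labels.append(len(reps) - 1)
    return labels


def diversity_bias(genomes, threshold):
    """Return (n_clusters, n_split_genomes, n_lumped_clusters) for a dict that
    maps a genome name to the list of its 16S rRNA gene copies."""
    owners = [g for g, copies in genomes.items() for _ in copies]
    seqs = [c for copies in genomes.values() for c in copies]
    labels = greedy_cluster(seqs, threshold)

    clusters_per_genome, genomes_per_cluster = {}, {}
    for g, label in zip(owners, labels):
        clusters_per_genome.setdefault(g, set()).add(label)
        genomes_per_cluster.setdefault(label, set()).add(g)

    split = sum(len(v) > 1 for v in clusters_per_genome.values())
    lumped = sum(len(v) > 1 for v in genomes_per_cluster.values())
    return len(genomes_per_cluster), split, lumped


# Hypothetical toy data: genome_A carries two copies that differ at one position
# (95% identity); genome_B's single copy differs from genome_A's first copy at
# two positions (90% identity).
toy_genomes = {
    "genome_A": ["ACGTACGTACGTACGTACGT", "TCGTACGTACGTACGTACGT"],
    "genome_B": ["ACGTACGTACATACGTACGA"],
}

for t in (1.0, 0.95, 0.90):
    n_clusters, n_split, n_lumped = diversity_bias(toy_genomes, t)
    print(f"threshold={t:.2f}: clusters={n_clusters}, "
          f"split genomes={n_split}, lumped clusters={n_lumped}")
```

With these toy sequences, the strictest threshold yields three clusters from two genomes because genome_A is split, the intermediate threshold recovers two clusters, and the loosest threshold merges both genomes into a single cluster. This is the trade-off that motivates choosing a threshold per region rather than defaulting to one value.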