Classification accuracy of machine learning algorithms for Chinese local cattle breeds using genomic markers

Yi Chuan. 2024 Jul;46(7):530-539. doi: 10.16288/j.yczz.24-059.

Abstract

Accurate breed classification is required for the conservation and utilization of farm animal genetic resources. Traditional classification methods rely mainly on phenotypic characterization. However, highly similar breeds are difficult to distinguish because phenotypic traits are hard to quantify. Machine learning algorithms show unique advantages in breed classification using genomic information. To identify the most suitable classification method for Chinese cattle breeds, this study used genomic SNP data from 213 individuals across seven Chinese local breeds and compared the classification accuracies of three SNP (feature) selection methods (ranking and screening by FST value, mRMR, and Relief-F) and three machine learning algorithms (Random Forest (RF), Support Vector Machine (SVM), and Naive Bayes (NB)). The results showed that: 1) when more than 1500 SNPs were selected with the FST method, or more than 1000 SNPs with the mRMR algorithm, the SVM classifier achieved a classification accuracy above 99.47%; 2) the most effective algorithm was SVM, followed by NB, while the best SNP selection methods were FST and mRMR, followed by Relief-F; 3) breed misclassification occurred mostly between breeds with high similarity. This study demonstrates that machine learning classification models combined with genomic data are an effective approach for classifying local cattle breeds, providing a technical basis for the rapid and accurate classification of cattle breeds in China.
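
As an illustration of the FST ranking step mentioned in the abstract, the following is a minimal sketch in plain Python. It assumes biallelic SNPs coded as 0/1/2 allele counts and uses the simple Wright-style estimator FST = (H_T - H_S) / H_T with breed weights proportional to sample size. The abstract does not state which FST estimator the authors used, so the estimator, the function names (allele_freq, fst_per_snp, top_snps), and the toy data are all assumptions made for illustration only.

```python
from collections import defaultdict


def allele_freq(genotypes):
    """Frequency of the counted allele from 0/1/2 genotype codes."""
    return sum(genotypes) / (2 * len(genotypes))


def fst_per_snp(genotypes, breeds):
    """Return one FST value per SNP (assumed estimator: (H_T - H_S) / H_T).

    genotypes: list of individuals, each a list of 0/1/2 allele counts.
    breeds: list of breed labels, one per individual.
    """
    # Group individuals by breed label.
    by_breed = defaultdict(list)
    for row, breed in zip(genotypes, breeds):
        by_breed[breed].append(row)

    n_snps = len(genotypes[0])
    n_total = len(genotypes)
    scores = []
    for j in range(n_snps):
        h_s = 0.0    # size-weighted mean within-breed expected heterozygosity
        p_bar = 0.0  # size-weighted mean allele frequency across breeds
        for rows in by_breed.values():
            p = allele_freq([r[j] for r in rows])
            w = len(rows) / n_total
            h_s += w * 2 * p * (1 - p)
            p_bar += w * p
        h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity of the pooled sample
        # Monomorphic SNPs carry no information; score them as 0.
        scores.append((h_t - h_s) / h_t if h_t > 0 else 0.0)
    return scores


def top_snps(scores, k):
    """Indices of the k SNPs with the highest FST, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Toy data (hypothetical): 4 animals, 2 breeds, 3 SNPs.
toy_genotypes = [
    [2, 1, 0],
    [2, 1, 1],
    [0, 1, 1],
    [0, 1, 0],
]
toy_breeds = ["A", "A", "B", "B"]

toy_scores = fst_per_snp(toy_genotypes, toy_breeds)
selected = top_snps(toy_scores, 1)
```

In this toy data the first SNP is fixed for different alleles in the two breeds, so it receives the highest score and is the one selected. In a full pipeline, the genotype columns chosen this way would then be passed to a classifier such as an SVM and evaluated by cross-validation.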

Keywords: FST; breed classification; feature selection; machine learning; support vector machine.

MeSH terms

  • Algorithms*
  • Animals
  • Breeding
  • Cattle / genetics
  • China
  • Genetic Markers / genetics
  • Genome / genetics
  • Genomics / methods
  • Machine Learning*
  • Polymorphism, Single Nucleotide*
  • Support Vector Machine

Substances

  • Genetic Markers