
Systematic analyses of AISNPs screening and classification algorithms based on genome-wide data for forensic biogeographic ancestry inference


Bibliographic Details
Published in:Forensic science international 2024-04, Vol.357, p.111975-111975, Article 111975
Main Authors: Cai, Meiming, Lei, Fanzhang, Chen, Man, Lan, Qiong, Wu, Xiaolian, Mao, Chen, Shi, Meisen, Zhu, Bofeng
Format: Article
Language:English
Description
Summary:Identifying the biogeographic ancestral origin of a biological sample left at a crime scene can provide important evidence in judicial cases, as well as clues for narrowing down suspects. Ancestry informative single nucleotide polymorphisms (AISNPs) have become one of the most important classes of genetic markers in recent years for screening ancestry information loci and analyzing population genetic background and structure, owing to their large number and wide distribution in the human genome. In this study, based on data from 26 populations in the 1000 Genomes Project Phase 3, a Random Forest classification model was constructed with a one-vs-rest classification strategy for embedded feature selection, in order to obtain a panel with a small number of efficient AISNPs. The research aim was to clarify the differentiation of population genetic structures among continents and among subregions of East Asia. ADMIXTURE results showed that, based on the 58 AISNPs selected by the machine learning algorithm, the 26 populations involved in the study could be categorized into six intercontinental ancestry components: North East Asia, South East Asia, Africa, Europe, South Asia, and America. In total, 24 continental-specific AISNPs and 34 East Asian-specific AISNPs were obtained and used to construct ancestry prediction models with the XGBoost algorithm, yielding Matthews correlation coefficients of 0.94 and 0.89 and accuracies of 0.94 and 0.92, respectively. The machine learning models constructed using population-specific AISNPs were able to accurately predict the ancestral origins of continental and intra-East Asian populations. In summary, screening a set of high-performing AISNPs for inferring biogeographical ancestry using embedded feature selection has potential application in creating a layered inference system that accurately differentiates populations from the intercontinental level down to local subpopulations.
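The summary reports both accuracy and the Matthews correlation coefficient (MCC) for the multi-class ancestry predictions. As a rough illustration of how the two metrics differ, the sketch below computes them with the standard multi-class generalization of MCC, using only the Python standard library. The function names and the six toy labels are made up for this example and are not taken from the study; the population codes are merely 1000 Genomes style placeholders.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient.

    MCC = (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where c = correct predictions, s = total samples,
    t_k = times class k truly occurs, p_k = times class k is predicted.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    numerator = c * s - sum(true_counts[k] * pred_counts[k] for k in labels)
    denom = sqrt(
        (s * s - sum(v * v for v in pred_counts.values()))
        * (s * s - sum(v * v for v in true_counts.values()))
    )
    # Convention: return 0.0 when the denominator vanishes (e.g. one class only).
    return numerator / denom if denom else 0.0


# Hypothetical toy labels, for illustration only (not data from the study).
toy_true = ["AFR", "AFR", "EUR", "EUR", "EAS", "EAS"]
toy_pred = ["AFR", "AFR", "EUR", "EAS", "EAS", "EAS"]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_mcc = multiclass_mcc(toy_true, toy_pred)
print(f"accuracy={toy_accuracy:.3f}  MCC={toy_mcc:.3f}")
```

Unlike accuracy, MCC uses the per-class counts of both true and predicted labels, which is presumably why it is reported alongside accuracy here.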
• Embedded feature selection was performed for AISNPs screening using one-vs-rest classification strategy based on RF model.
• The selected 34 AISNPs could improve ancestry classifications among five East Asian populations.
• ADMIXTURE results showed an optimal K value of 6, indicating that the 26 studied populations were categorized into six ancestral components.
• The accuracies of XGBoost model in distinguishing among the five continental and intra-EAS populations were 0.94 and 0.92, respectively.
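The record gives no implementation details for the one-vs-rest embedded feature selection, so the following is only a minimal sketch of the general idea, assuming scikit-learn and NumPy are available. It fits one Random Forest per population (that population versus all others) and keeps the SNPs with the highest impurity-based importance. The simulated genotypes, the helper names `simulate_genotypes` and `select_snps_one_vs_rest`, the `top_k` cutoff, and all hyperparameters are assumptions chosen for illustration, not the authors' settings. The downstream XGBoost classifier is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def simulate_genotypes(rng, n_per_pop=200, n_snps=20, n_pops=3):
    """Toy genotypes (0/1/2 allele counts); SNP k is informative for population k.

    Purely synthetic: allele frequencies are invented for this sketch.
    """
    freqs = np.full((n_pops, n_snps), 0.5)  # uninformative SNPs by default
    for k in range(n_pops):
        freqs[:, k] = 0.05  # allele is rare in every population...
        freqs[k, k] = 0.95  # ...except in population k
    X = np.vstack(
        [rng.binomial(2, freqs[k], size=(n_per_pop, n_snps)) for k in range(n_pops)]
    )
    y = np.repeat(np.arange(n_pops), n_per_pop)
    return X, y


def select_snps_one_vs_rest(X, y, top_k=1, seed=0):
    """For each population, fit a forest on 'this population vs. the rest'
    and keep the top_k SNPs ranked by feature importance."""
    selected = set()
    for pop in np.unique(y):
        forest = RandomForestClassifier(n_estimators=200, random_state=seed)
        forest.fit(X, (y == pop).astype(int))  # binary one-vs-rest target
        ranked = np.argsort(forest.feature_importances_)[::-1]
        selected.update(int(i) for i in ranked[:top_k])
    return sorted(selected)


rng = np.random.default_rng(0)
X, y = simulate_genotypes(rng)
panel = select_snps_one_vs_rest(X, y)
print("Selected SNP indices:", panel)
```

The union of the per-population selections forms the candidate panel. In practice the cutoff would need to be tuned against held-out classification performance rather than fixed in advance.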
ISSN:0379-0738
1872-6283
DOI:10.1016/j.forsciint.2024.111975