XGBoost as a reliable machine learning tool for predicting ancestry using autosomal STR profiles - Proof of method
Published in: | Forensic science international : genetics 2025-03, Vol.76, p.103183, Article 103183 |
---|---|
Format: | Article |
Language: | English |
Summary: | The aim of this study was to test the validity of a predictive model of ancestry affiliation based on Short Tandem Repeat (STR) profiles. Allele frequencies of 29 genetic markers from the Promega website for four distinct population groups (African Americans, Asians, Caucasians, Hispanic Americans) were used to generate 360,000 profiles (90,000 profiles per group), which were then used to train and test a range of machine learning algorithms in order to identify the best-performing model for ancestry prediction. The chosen models (Decision Trees, Support Vector Machines, XGBoost, among others) were implemented in Python, and their performance was compared. The XGBoost model outperformed the others, reaching an accuracy of 94.24 % across all four classes, 99.06 % when differentiating among the Asian, African American, and Caucasian subsamples, and 98.57 % when differentiating among the African American group, the Asian group, and a mixed group combining Caucasians and Hispanics. Evaluating the impact of training set size showed that model accuracy peaked at 94 % with 90,000 profiles per category but fell to 83 % when the number of profiles per category was reduced to 500, with precision suffering most when distinguishing between the Caucasian and Hispanic subgroups. The study further investigated the impact of the number of markers on model accuracy, finding that 21 markers commonly available in commercial amplification kits yielded an accuracy of 96.3 % for African Americans, Asians, and Caucasians, and 88.28 % for all four groups combined. 
These findings underscore the potential of STR-based models in forensic analysis and hint at the broader applicability of machine learning in genetic ancestry determination, with implications for enhancing the precision and reliability of forensic investigations, particularly in heterogeneous environments where ancestral background can be a crucial piece of information.
• XGBoost achieved 94.28 % accuracy in ancestry prediction for four population groups.
• Model accuracy was highest (94 %) with 90,000 profiles, but dropped to 83 % with 500 profiles.
• Using 21 genetic markers gave 96.3 % accuracy for three groups and 88.28 % for all four groups.
• Near-perfect accuracy was seen for African-American and Asian profiles, with challenges for others.
• Kits with more loci (e.g., Investigator 24plex) led to higher model accuracy. |
---|---|
ISSN: | 1872-4973 1878-0326 |
DOI: | 10.1016/j.fsigen.2024.103183 |
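
The summary describes generating synthetic STR profiles from published allele frequencies before training classifiers. The record does not state how the profiles were drawn, so the following is only a minimal sketch under an assumed scheme: two alleles per locus are sampled independently in proportion to their frequencies (a Hardy-Weinberg style assumption). The group names, the frequency values, and the helper names `simulate_profile` and `simulate_dataset` are hypothetical and are not taken from the study or from the Promega data.

```python
import random

# Hypothetical allele frequencies for two loci in two groups.
# Illustrative values only, NOT the Promega frequencies used in the study.
FREQS = {
    "group_A": {
        "D3S1358": {14: 0.10, 15: 0.30, 16: 0.35, 17: 0.25},
        "TH01": {6: 0.20, 7: 0.40, 9: 0.25, 9.3: 0.15},
    },
    "group_B": {
        "D3S1358": {14: 0.05, 15: 0.40, 16: 0.30, 17: 0.25},
        "TH01": {6: 0.10, 7: 0.25, 9: 0.45, 9.3: 0.20},
    },
}


def simulate_profile(locus_freqs, rng):
    """Draw a genotype (two alleles) per locus.

    Assumption: the two alleles are sampled independently, each with
    probability equal to its population frequency.
    """
    profile = {}
    for locus, freqs in locus_freqs.items():
        alleles = list(freqs)
        weights = list(freqs.values())
        a, b = rng.choices(alleles, weights=weights, k=2)
        # Store the genotype in sorted order so (15, 16) == (16, 15).
        profile[locus] = tuple(sorted((a, b)))
    return profile


def simulate_dataset(freqs_by_group, n_per_group, seed=0):
    """Return (profiles, labels) with n_per_group profiles for each group."""
    rng = random.Random(seed)
    profiles, labels = [], []
    for group, locus_freqs in freqs_by_group.items():
        for _ in range(n_per_group):
            profiles.append(simulate_profile(locus_freqs, rng))
            labels.append(group)
    return profiles, labels


profiles, labels = simulate_dataset(FREQS, n_per_group=1000)
print(len(profiles), "profiles; first:", profiles[0], labels[0])
```

Profiles produced this way could then be flattened into numeric feature vectors (for example, two allele columns per locus) and passed to a classifier such as `xgboost.XGBClassifier`; how the study encoded its features is not stated in this record.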