
A sequence-based prediction of Kruppel-like factors proteins using XGBoost and optimized features


Bibliographic Details
Published in: Gene 2021-06, Vol.787, p.145643-145643, Article 145643
Main Authors: Le, Nguyen Quoc Khanh; Do, Duyen Thi; Nguyen, Trinh-Trung-Duong; Le, Quynh Anh
Format: Article
Language: English
Description
Summary:
• A novel predictor for accurately identifying KLF proteins from sequence information.
• Datasets were manually collected and verified from well-known public sources.
• Different sequence features were extracted and passed through feature selection to find the optimal set.
• Among the machine learning models compared, XGBoost performs best.
• A basis for further research aimed at discovering new KLF proteins.

Krüppel-like factors (KLF) are a group of conserved zinc finger-containing transcription factors involved in various physiological and biological processes, including cell proliferation, differentiation, development, and apoptosis. Bioinformatics methods such as sequence similarity searches, multiple sequence alignment, phylogenetic reconstruction, and gene synteny analysis have been proposed to broaden our knowledge of KLF proteins. In this study, we propose a novel computational approach that applies machine learning to features calculated from primary sequences. Our XGBoost-based model identifies KLF proteins efficiently, with an accuracy of 96.4% and a Matthews correlation coefficient (MCC) of 0.704. It also shows promising performance when tested on an independent dataset. Our model could therefore serve as a useful tool for identifying new KLF proteins and provide necessary information for biologists and researchers working on KLF proteins. Our machine learning source code and datasets are freely available at https://github.com/khanhlee/KLF-XGB.
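As a rough illustration of the two ideas the summary relies on (features calculated from a primary sequence, and evaluation by accuracy and MCC), the following is a minimal sketch. The summary does not say which sequence features were used, so amino acid composition is an assumption chosen purely for illustration; the function names, the toy sequence, and the confusion-matrix counts are all hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids, in a fixed order so that feature
# vectors from different sequences line up column by column.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in `sequence`.

    Assumption: this is only one simple example of a sequence-derived
    feature; the paper's actual feature set is not given in the summary.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical toy sequence and toy counts, for demonstration only.
features = amino_acid_composition("MACDEHKLC")
toy_accuracy = accuracy(tp=8, tn=90, fp=1, fn=1)
toy_mcc = mcc(tp=8, tn=90, fp=1, fn=1)
print(len(features), toy_accuracy, round(toy_mcc, 3))
```

A vector like `features` is the kind of fixed-length input a gradient-boosted classifier such as XGBoost can be trained on. Reporting MCC alongside accuracy is a common choice because MCC accounts for all four confusion-matrix cells, so the two metrics can differ noticeably when one class is much rarer than the other.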
ISSN:0378-1119
1879-0038
DOI:10.1016/j.gene.2021.145643