A comparative analysis of machine learning classifiers for predicting protein-binding nucleotides in RNA sequences

Bibliographic Details
Published in:Computational and structural biotechnology journal 2022-01, Vol.20, p.3195-3207
Main Authors: Agarwal, Ankita, Singh, Kunal, Kant, Shri, Bahadur, Ranjit Prasad
Format: Article
Language:English
Description
Summary:
•RNAs are master players in various cellular and biological processes, and RNA-protein interactions are vital for the proper functioning of cellular machineries.
•Knowledge of binding sites is crucial to decipher their functional implications.
•RNA NC-triplet and NC-quartet features could give reasonably high performance.
•The RF model outperformed other machine learning classifiers with 85% accuracy and 0.93 AUC, and performed better than a few existing methods.
•An online webserver “Nucpred” is developed with the trained model and is freely accessible to the scientific community.

RNA-protein interactions play vital roles in driving the cellular machineries. Despite their significant involvement in several biological processes, the underlying molecular mechanism of RNA-protein interactions is still elusive. This may be due to the experimental difficulties in solving co-crystallized RNA-protein complexes. The inherent flexibility of RNA molecules to adopt different conformations makes them functionally diverse. Their interactions with proteins have implications in RNA disease biology. Thus, the study of binding interfaces can provide mechanistic insight into molecular functioning and into the aberrations caused by altered interactions. Moreover, high-throughput sequencing technologies have generated far more sequence data than the available structural data on RNA-protein complexes. In such a scenario, efficient computational algorithms are required to identify the protein-binding interfaces of RNA in the absence of known structures. We have investigated several machine learning classifiers and various features derived from nucleotide sequences to identify protein-binding nucleotides in RNA. We achieve the best performance with random forest models based on nucleotide-triplet and nucleotide-quartet features. An overall accuracy of 84.8%, sensitivity of 83.2%, specificity of 86.1%, MCC of 0.70 and AUC of 0.93 are achieved.
We have further implemented the developed models in a user-friendly webserver “Nucpred”, which is freely accessible at “http://www.csb.iitkgp.ac.in/applications/Nucpred/index”.
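
For readers unfamiliar with the measures quoted in the summary, the sketch below shows how accuracy, sensitivity, specificity and MCC are conventionally derived from the confusion matrix of a binary classifier (here, binding versus non-binding nucleotides). It is only an illustration: the function name `binary_metrics` and the example counts are hypothetical and are not taken from the article or from Nucpred, and AUC is left out because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: binding nucleotides predicted as binding / non-binding.
    tn/fp: non-binding nucleotides predicted as non-binding / binding.
    Assumes each class has at least one example.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen for illustration only (not the article's data).
    for name, value in binary_metrics(tp=80, fp=10, tn=90, fn=20).items():
        print(f"{name}: {value:.3f}")
```

MCC is commonly reported alongside accuracy because it stays informative when one class is much larger than the other.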
ISSN:2001-0370
DOI:10.1016/j.csbj.2022.06.036