Non-coding RNA identification with pseudo RNA sequences and feature representation learning

Bibliographic Details
Published in: Computers in Biology and Medicine, 2023-10, Vol. 165, Article 107355
Main Authors: Chen, Xian-gan; Yang, Xiaofei; Li, Chenhong; Lin, Xianguang; Zhang, Wen
Format: Article
Language: English
Description
Summary: Distinguishing non-coding RNAs (ncRNAs) from coding RNAs is an important task in bioinformatics. Although many methods have been proposed for this task, further improving the accuracy of ncRNA identification remains highly challenging. In this paper, we propose CPPFLPS, a coding potential predictor that uses feature representation learning based on pseudo RNA sequences. The pseudo RNA sequences are generated by simulating RNA sequence mutations and serve as new samples for data augmentation. Six string operations simulating RNA sequence mutations are considered: base replacement, base insertion, base deletion, subsequence reversion, subsequence repetition and subsequence deletion. In the feature representation learning framework, different types of pseudo RNA sequences are added to the training set to form new training sets, which are used to train baseline classifiers and thereby obtain baseline models. The labels output by these baseline models are used as feature vectors to represent RNA sequences, and the feature vectors retained after feature selection are used to train a predictive model for distinguishing ncRNAs from coding RNAs. Our method outperforms existing state-of-the-art methods. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPPFLPS.

• Data augmentation in the data space is used to alleviate the local data imbalances in RNA sequences with sORFs.
• The pseudo RNA sequences are used as new samples for data augmentation.
• Six string operations simulating RNA sequence mutations are considered in the pseudo RNA sequence generation process.
• A novel feature representation learning framework is established to achieve further improved ncRNA identification.
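
The six string operations named in the summary can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository linked above for that). The function names, the `ACGU` alphabet, the single-mutation-per-sequence policy, the random choice of positions, and the `max_len` span limit are all assumptions made for illustration, since the summary does not specify them.

```python
import random

BASES = "ACGU"  # assumed RNA alphabet; the record does not state one


def _random_span(seq, rng, max_len):
    """Hypothetical helper: pick a random [start, end) span of length 2..max_len."""
    if len(seq) < 3:
        raise ValueError("sequence must have at least 3 bases")
    length = rng.randint(2, min(max_len, len(seq) - 1))
    start = rng.randrange(len(seq) - length + 1)
    return start, start + length


def base_replacement(seq, rng):
    """Replace one base with a different base."""
    i = rng.randrange(len(seq))
    new = rng.choice([b for b in BASES if b != seq[i]])
    return seq[:i] + new + seq[i + 1:]


def base_insertion(seq, rng):
    """Insert one random base at a random position."""
    i = rng.randrange(len(seq) + 1)
    return seq[:i] + rng.choice(BASES) + seq[i:]


def base_deletion(seq, rng):
    """Delete one base at a random position."""
    i = rng.randrange(len(seq))
    return seq[:i] + seq[i + 1:]


def subsequence_reversion(seq, rng, max_len=10):
    """Reverse a short subsequence in place."""
    s, e = _random_span(seq, rng, max_len)
    return seq[:s] + seq[s:e][::-1] + seq[e:]


def subsequence_repetition(seq, rng, max_len=10):
    """Duplicate a short subsequence immediately after itself."""
    s, e = _random_span(seq, rng, max_len)
    return seq[:e] + seq[s:e] + seq[e:]


def subsequence_deletion(seq, rng, max_len=10):
    """Remove a short subsequence."""
    s, e = _random_span(seq, rng, max_len)
    return seq[:s] + seq[e:]


OPERATIONS = {
    "base_replacement": base_replacement,
    "base_insertion": base_insertion,
    "base_deletion": base_deletion,
    "subsequence_reversion": subsequence_reversion,
    "subsequence_repetition": subsequence_repetition,
    "subsequence_deletion": subsequence_deletion,
}


def make_pseudo_sequences(seq, seed=0):
    """Return one pseudo sequence per operation type for a given input sequence."""
    rng = random.Random(seed)
    return {name: op(seq, rng) for name, op in OPERATIONS.items()}
```

Calling `make_pseudo_sequences("AUGGCUACGUUAGC")` returns a dictionary with one mutated copy of the input per operation type. In the framework described in the summary, each type of pseudo sequence would be added to the training set separately to train its own baseline classifier; how many pseudo sequences are generated per original sequence is not stated in this record.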
ISSN: 0010-4825, 1879-0534
DOI: 10.1016/j.compbiomed.2023.107355