Feature extraction approaches for biological sequences: a comparative study of mathematical features

As a consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and...

Bibliographic Details
Published in: Briefings in Bioinformatics 2021-09, Vol.22 (5)
Main Authors: Bonidia, Robson P, Sampaio, Lucas D H, Domingues, Douglas S, Paschoal, Alexandre R, Lopes, Fabrício M, de Carvalho, André C P L F, Sanches, Danilo S
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c354t-3c0cf97b1b9bef988796b7e8d6fb67c815f8402b4642a4d259b31bd78d480d7b3
cites cdi_FETCH-LOGICAL-c354t-3c0cf97b1b9bef988796b7e8d6fb67c815f8402b4642a4d259b31bd78d480d7b3
container_end_page
container_issue 5
container_start_page
container_title Briefings in Bioinformatics
container_volume 22
creator Bonidia, Robson P
Sampaio, Lucas D H
Domingues, Douglas S
Paschoal, Alexandre R
Lopes, Fabrício M
de Carvalho, André C P L F
Sanches, Danilo S
description As a consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks). As a case study, we analyze long non-coding RNA sequences. Moreover, we separated this work into three studies. First, we assessed our proposal with the most addressed problem in our review, e.g. lncRNA and mRNA; second, we also validate the mathematical features in different classification problems, to predict the class of lncRNA, e.g. circular RNA sequences; third, we analyze its robustness in scenarios with imbalanced data. The experimental results demonstrated three main contributions: first, an in-depth study of several mathematical features; second, a new feature extraction pipeline; and third, its high performance and robustness for distinct RNA sequence classification. Availability: https://github.com/Bonidia/FeatureExtraction_BiologicalSequences.
doi_str_mv 10.1093/bib/bbab011
format article
fullrecord <record><control><sourceid>proquest_COVID</sourceid><recordid>TN_cdi_proquest_miscellaneous_2489602845</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2489275796</sourcerecordid><originalsourceid>FETCH-LOGICAL-c354t-3c0cf97b1b9bef988796b7e8d6fb67c815f8402b4642a4d259b31bd78d480d7b3</originalsourceid><addsrcrecordid>eNpdkM9LwzAYhoMobk5P3iXgRZC6pEma1JsMp8LAi55Lkiauo21mkor7781-6MHL932Hh_d7eQC4xOgOo5JMVaOmSkmFMD4CY0w5zyhi9Hh7FzxjtCAjcBbCCqEccYFPwYgQJliJ0RjUcyPj4A0039FLHRvXQ7leeyf10gRonYeqca37aLRsYTCfg-m1CfdQQu26tfQyNl8GhjjUG-gs7GRcmjR2uN1nh3NwYmUbzMVhT8D7_PFt9pwtXp9eZg-LTBNGY0Y00rbkCqtSGVsKwctCcSPqwqqCa4GZFRTlihY0l7TOWakIVjUXNRWo5opMwM0-N_VPRUOsuiZo07ayN24IVU5FWaBcUJbQ63_oyg2-T-12VM5Zep6o2z2lvQvBG1utfdNJv6kwqrbyqyS_OshP9NUhc1Cdqf_YX9vkB2tmgdc</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2489275796</pqid></control><display><type>article</type><title>Feature extraction approaches for biological sequences: a comparative study of mathematical features</title><source>Coronavirus Research Database</source><creator>Bonidia, Robson P ; Sampaio, Lucas D H ; Domingues, Douglas S ; Paschoal, Alexandre R ; Lopes, Fabrício M ; de Carvalho, André C P L F ; Sanches, Danilo S</creator><creatorcontrib>Bonidia, Robson P ; Sampaio, Lucas D H ; Domingues, Douglas S ; Paschoal, Alexandre R ; Lopes, Fabrício M ; de Carvalho, André C P L F ; Sanches, Danilo S</creatorcontrib><description>As consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. 
This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks). As a case study, we analyze long non-coding RNA sequences. Moreover, we separated this work into three studies. First, we assessed our proposal with the most addressed problem in our review, e.g. lncRNA and mRNA; second, we also validate the mathematical features in different classification problems, to predict the class of lncRNA, e.g. circular RNAs sequences; third, we analyze its robustness in scenarios with imbalanced data. The experimental results demonstrated three main contributions: first, an in-depth study of several mathematical features; second, a new feature extraction pipeline; and third, its high performance and robustness for distinct RNA sequence classification. Availability:  https://github.com/Bonidia/FeatureExtraction_BiologicalSequences.</description><identifier>ISSN: 1467-5463</identifier><identifier>EISSN: 1477-4054</identifier><identifier>DOI: 10.1093/bib/bbab011</identifier><identifier>PMID: 33585910</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><ispartof>Briefings in Bioinformatics, 2021-09, Vol.22 (5)</ispartof><rights>The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.</rights><rights>2021. 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the associated terms available at https://academic.oup.com/journals/pages/coronavirus .</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c354t-3c0cf97b1b9bef988796b7e8d6fb67c815f8402b4642a4d259b31bd78d480d7b3</citedby><cites>FETCH-LOGICAL-c354t-3c0cf97b1b9bef988796b7e8d6fb67c815f8402b4642a4d259b31bd78d480d7b3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/2489275796?pq-origsite=primo$$EHTML$$P50$$Gproquest$$H</linktohtml><link.rule.ids>314,780,784,27924,27925,38516,43895</link.rule.ids><linktorsrc>$$Uhttps://www.proquest.com/docview/2489275796?pq-origsite=primo$$EView_record_in_ProQuest$$FView_record_in_$$GProQuest</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/33585910$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Bonidia, Robson P</creatorcontrib><creatorcontrib>Sampaio, Lucas D H</creatorcontrib><creatorcontrib>Domingues, Douglas S</creatorcontrib><creatorcontrib>Paschoal, Alexandre R</creatorcontrib><creatorcontrib>Lopes, Fabrício M</creatorcontrib><creatorcontrib>de Carvalho, André C P L F</creatorcontrib><creatorcontrib>Sanches, Danilo S</creatorcontrib><title>Feature extraction approaches for biological sequences: a comparative study of mathematical features</title><title>Briefings in Bioinformatics</title><addtitle>Brief Bioinform</addtitle><description>As consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. 
Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks). As a case study, we analyze long non-coding RNA sequences. Moreover, we separated this work into three studies. First, we assessed our proposal with the most addressed problem in our review, e.g. lncRNA and mRNA; second, we also validate the mathematical features in different classification problems, to predict the class of lncRNA, e.g. circular RNAs sequences; third, we analyze its robustness in scenarios with imbalanced data. The experimental results demonstrated three main contributions: first, an in-depth study of several mathematical features; second, a new feature extraction pipeline; and third, its high performance and robustness for distinct RNA sequence classification. 
Availability:  https://github.com/Bonidia/FeatureExtraction_BiologicalSequences.</description><issn>1467-5463</issn><issn>1477-4054</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>COVID</sourceid><recordid>eNpdkM9LwzAYhoMobk5P3iXgRZC6pEma1JsMp8LAi55Lkiauo21mkor7781-6MHL932Hh_d7eQC4xOgOo5JMVaOmSkmFMD4CY0w5zyhi9Hh7FzxjtCAjcBbCCqEccYFPwYgQJliJ0RjUcyPj4A0039FLHRvXQ7leeyf10gRonYeqca37aLRsYTCfg-m1CfdQQu26tfQyNl8GhjjUG-gs7GRcmjR2uN1nh3NwYmUbzMVhT8D7_PFt9pwtXp9eZg-LTBNGY0Y00rbkCqtSGVsKwctCcSPqwqqCa4GZFRTlihY0l7TOWakIVjUXNRWo5opMwM0-N_VPRUOsuiZo07ayN24IVU5FWaBcUJbQ63_oyg2-T-12VM5Zep6o2z2lvQvBG1utfdNJv6kwqrbyqyS_OshP9NUhc1Cdqf_YX9vkB2tmgdc</recordid><startdate>20210902</startdate><enddate>20210902</enddate><creator>Bonidia, Robson P</creator><creator>Sampaio, Lucas D H</creator><creator>Domingues, Douglas S</creator><creator>Paschoal, Alexandre R</creator><creator>Lopes, Fabrício M</creator><creator>de Carvalho, André C P L F</creator><creator>Sanches, Danilo S</creator><general>Oxford University Press</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>COVID</scope><scope>7X8</scope></search><sort><creationdate>20210902</creationdate><title>Feature extraction approaches for biological sequences: a comparative study of mathematical features</title><author>Bonidia, Robson P ; Sampaio, Lucas D H ; Domingues, Douglas S ; Paschoal, Alexandre R ; Lopes, Fabrício M ; de Carvalho, André C P L F ; Sanches, Danilo S</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c354t-3c0cf97b1b9bef988796b7e8d6fb67c815f8402b4642a4d259b31bd78d480d7b3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Bonidia, Robson P</creatorcontrib><creatorcontrib>Sampaio, Lucas D 
H</creatorcontrib><creatorcontrib>Domingues, Douglas S</creatorcontrib><creatorcontrib>Paschoal, Alexandre R</creatorcontrib><creatorcontrib>Lopes, Fabrício M</creatorcontrib><creatorcontrib>de Carvalho, André C P L F</creatorcontrib><creatorcontrib>Sanches, Danilo S</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Coronavirus Research Database</collection><collection>MEDLINE - Academic</collection><jtitle>Briefings in Bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Bonidia, Robson P</au><au>Sampaio, Lucas D H</au><au>Domingues, Douglas S</au><au>Paschoal, Alexandre R</au><au>Lopes, Fabrício M</au><au>de Carvalho, André C P L F</au><au>Sanches, Danilo S</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Feature extraction approaches for biological sequences: a comparative study of mathematical features</atitle><jtitle>Briefings in Bioinformatics</jtitle><addtitle>Brief Bioinform</addtitle><date>2021-09-02</date><risdate>2021</risdate><volume>22</volume><issue>5</issue><issn>1467-5463</issn><eissn>1477-4054</eissn><abstract>As consequence of the various genomic sequencing projects, an increasing volume of biological sequence data is being produced. Although machine learning algorithms have been successfully applied to a large number of genomic sequence-related problems, the results are largely affected by the type and number of features extracted. This effect has motivated new algorithms and pipeline proposals, mainly involving feature extraction problems, in which extracting significant discriminatory information from a biological set is challenging. Considering this, our work proposes a new study of feature extraction approaches based on mathematical features (numerical mapping with Fourier, entropy and complex networks). As a case study, we analyze long non-coding RNA sequences. 
Moreover, we separated this work into three studies. First, we assessed our proposal with the most addressed problem in our review, e.g. lncRNA and mRNA; second, we also validate the mathematical features in different classification problems, to predict the class of lncRNA, e.g. circular RNAs sequences; third, we analyze its robustness in scenarios with imbalanced data. The experimental results demonstrated three main contributions: first, an in-depth study of several mathematical features; second, a new feature extraction pipeline; and third, its high performance and robustness for distinct RNA sequence classification. Availability:  https://github.com/Bonidia/FeatureExtraction_BiologicalSequences.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>33585910</pmid><doi>10.1093/bib/bbab011</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1467-5463
ispartof Briefings in Bioinformatics, 2021-09, Vol.22 (5)
issn 1467-5463
1477-4054
language eng
recordid cdi_proquest_miscellaneous_2489602845
source Coronavirus Research Database
title Feature extraction approaches for biological sequences: a comparative study of mathematical features
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T02%3A42%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_COVID&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20extraction%20approaches%20for%20biological%20sequences:%20a%20comparative%20study%20of%20mathematical%20features&rft.jtitle=Briefings%20in%20Bioinformatics&rft.au=Bonidia,%20Robson%20P&rft.date=2021-09-02&rft.volume=22&rft.issue=5&rft.issn=1467-5463&rft.eissn=1477-4054&rft_id=info:doi/10.1093/bib/bbab011&rft_dat=%3Cproquest_COVID%3E2489275796%3C/proquest_COVID%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c354t-3c0cf97b1b9bef988796b7e8d6fb67c815f8402b4642a4d259b31bd78d480d7b3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2489275796&rft_id=info:pmid/33585910&rfr_iscdi=true