CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts

Bibliographic Details
Published in: BMC genomics 2015-03, Vol.16 (1), p.170-170, Article 170
Main Authors: Testa, Alison C, Hane, James K, Ellwood, Simon R, Oliver, Richard P
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c636t-2087de8ab18785951a424c47cba00de0da33d1727b1b3b3483f8e82e4eb9adf23
cites cdi_FETCH-LOGICAL-c636t-2087de8ab18785951a424c47cba00de0da33d1727b1b3b3483f8e82e4eb9adf23
container_end_page 170
container_issue 1
container_start_page 170
container_title BMC genomics
container_volume 16
creator Testa, Alison C
Hane, James K
Ellwood, Simon R
Oliver, Richard P
description The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available ( https://sourceforge.net/projects/codingquarry/ ), and suitable for incorporation into genome annotation pipelines.
doi_str_mv 10.1186/s12864-015-1344-4
format article
pmid 25887563
publisher England: BioMed Central Ltd
date 2015-03-11
rights COPYRIGHT 2015 BioMed Central Ltd.; Testa et al.; licensee BioMed Central. 2015
peer_review peer_reviewed
open_access free_for_read
linktohtml https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4363200/
linktopdf https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4363200/pdf/
fulltext fulltext
identifier ISSN: 1471-2164
ispartof BMC genomics, 2015-03, Vol.16 (1), p.170-170, Article 170
issn 1471-2164
1471-2164
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_4363200
source Open Access: PubMed Central; Publicly Available Content (ProQuest)
subjects Gene Expression Profiling
Genes, Fungal
Genome, Fungal
Markov Chains
Models, Genetic
Molecular Sequence Annotation - methods
Saccharomyces cerevisiae - genetics
Schizosaccharomyces - genetics
Sequence Analysis, RNA
Software
title CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-26T02%3A45%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=CodingQuarry:%20highly%20accurate%20hidden%20Markov%20model%20gene%20prediction%20in%20fungal%20genomes%20using%20RNA-seq%20transcripts&rft.jtitle=BMC%20genomics&rft.au=Testa,%20Alison%20C&rft.date=2015-03-11&rft.volume=16&rft.issue=1&rft.spage=170&rft.epage=170&rft.pages=170-170&rft.artnum=170&rft.issn=1471-2164&rft.eissn=1471-2164&rft_id=info:doi/10.1186/s12864-015-1344-4&rft_dat=%3Cgale_pubme%3EA541365399%3C/gale_pubme%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c636t-2087de8ab18785951a424c47cba00de0da33d1727b1b3b3483f8e82e4eb9adf23%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1674691315&rft_id=info:pmid/25887563&rft_galeid=A541365399&rfr_iscdi=true