
Quantitative assessment of protein function prediction from metagenomics shotgun sequences

To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred spe...

Bibliographic Details
Published in: Proceedings of the National Academy of Sciences - PNAS 2007-08, Vol.104 (35), p.13913-13918
Main Authors: Harrington, E.D, Singh, A.H, Doerks, T, Letunic, I, von Mering, C, Jensen, L.J, Raes, J, Bork, P
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c554t-5f7fcefc091477a54f8eacac2e675ce63dadf2f1bc115d4ca24b33acdf0a9dd83
cites cdi_FETCH-LOGICAL-c554t-5f7fcefc091477a54f8eacac2e675ce63dadf2f1bc115d4ca24b33acdf0a9dd83
container_end_page 13918
container_issue 35
container_start_page 13913
container_title Proceedings of the National Academy of Sciences - PNAS
container_volume 104
creator Harrington, E.D
Singh, A.H
Doerks, T
Letunic, I
von Mering, C
Jensen, L.J
Raes, J
Bork, P
description To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.
doi_str_mv 10.1073/pnas.0702636104
format article
pmid 17717083
publisher United States: National Academy of Sciences
publication_date 2007-08-28
rights Copyright 2007 The National Academy of Sciences of the United States of America
peer_reviewed true
open_access free_for_read
fulltext fulltext
identifier ISSN: 0027-8424
ispartof Proceedings of the National Academy of Sciences - PNAS, 2007-08, Vol.104 (35), p.13913-13918
issn 0027-8424
1091-6490
language eng
recordid cdi_pnas_primary_104_35_13913
source Open Access: PubMed Central; JSTOR Archival Journals and Primary Sources Collection
subjects Animals
Biochemistry
Biofilms
Biological Sciences
Biosynthesis
Cogs
Databases, Factual
Datasets
Fatty acids
Genes
Genetic Variation
Genome
Genome, Bacterial
Genomes
Genomic Library
Genomics
Hemoglobin
Metagenomics
Models, Genetic
Open Reading Frames
Proteins
Proteins - genetics
Proteins - metabolism
Sea water
Sequence Homology, Amino Acid
title Quantitative assessment of protein function prediction from metagenomics shotgun sequences
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T20%3A18%3A43IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-jstor_pnas_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Quantitative%20assessment%20of%20protein%20function%20prediction%20from%20metagenomics%20shotgun%20sequences&rft.jtitle=Proceedings%20of%20the%20National%20Academy%20of%20Sciences%20-%20PNAS&rft.au=Harrington,%20E.D&rft.date=2007-08-28&rft.volume=104&rft.issue=35&rft.spage=13913&rft.epage=13918&rft.pages=13913-13918&rft.issn=0027-8424&rft.eissn=1091-6490&rft_id=info:doi/10.1073/pnas.0702636104&rft_dat=%3Cjstor_pnas_%3E25436596%3C/jstor_pnas_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c554t-5f7fcefc091477a54f8eacac2e675ce63dadf2f1bc115d4ca24b33acdf0a9dd83%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=201408051&rft_id=info:pmid/17717083&rft_jstor_id=25436596&rfr_iscdi=true