
Modeling protein evolution with several amino acid replacement matrices depending on site rates

Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different...

Bibliographic Details
Published in: Molecular biology and evolution 2012-10, Vol.29 (10), p.2921-2936
Main Authors: Le, Si Quang; Dang, Cuong Cao; Gascuel, Olivier
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c396t-58493f66f9eb19a1ca430497dca3024f70f2c59c570307fe07353940f1d411553
cites cdi_FETCH-LOGICAL-c396t-58493f66f9eb19a1ca430497dca3024f70f2c59c570307fe07353940f1d411553
container_end_page 2936
container_issue 10
container_start_page 2921
container_title Molecular biology and evolution
container_volume 29
creator Le, Si Quang
Dang, Cuong Cao
Gascuel, Olivier
description Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from http://www.atgc-montpellier.fr/models/lg4x.
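The mixture structure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy two-state alphabet in place of the 20 amino acids, a pair of sequences in place of a tree, and made-up weights, rates, and equilibrium frequencies. The names `transition_prob`, `site_likelihood`, `LG4M_LIKE`, and `LG4X_LIKE` are hypothetical. Each category carries its own substitution process (here reduced to its own equilibrium frequencies) and its own rate; the LG4M-like setting uses four equally weighted categories, while the LG4X-like setting lets weights and rates vary freely, assumed here to be normalised to a mean rate of 1.

```python
import math


def transition_prob(pi, a, b, t):
    """P(a -> b after time t) for a two-state reversible model.

    pi holds the equilibrium frequencies; the process is scaled so that
    one substitution is expected per unit time at equilibrium.
    """
    mu = 1.0 / (2.0 * pi[0] * pi[1])
    decay = math.exp(-mu * t)
    same = 1.0 if a == b else 0.0
    return pi[b] + (same - pi[b]) * decay


def site_likelihood(a, b, t, categories):
    """Mixture likelihood of states a and b at one site of two sequences
    separated by distance t.

    categories is a list of (weight, rate, pi) tuples: every category has
    its own frequencies, not only its own rate.
    """
    total = 0.0
    for weight, rate, pi in categories:
        total += weight * pi[a] * transition_prob(pi, a, b, rate * t)
    return total


# Toy stand-in for an LG4M-like model: four equally weighted categories.
# Rates are illustrative placeholders with mean 1, not gamma quantiles.
LG4M_LIKE = [
    (0.25, 0.1, (0.7, 0.3)),
    (0.25, 0.5, (0.6, 0.4)),
    (0.25, 1.0, (0.5, 0.5)),
    (0.25, 2.4, (0.4, 0.6)),
]

# Toy stand-in for an LG4X-like model: weights and rates are both free
# (illustrative values, chosen so that the weighted mean rate is 1).
LG4X_LIKE = [
    (0.4, 0.2, (0.7, 0.3)),
    (0.3, 0.8, (0.6, 0.4)),
    (0.2, 1.5, (0.5, 0.5)),
    (0.1, 3.8, (0.4, 0.6)),
]

if __name__ == "__main__":
    for name, model in (("LG4M-like", LG4M_LIKE), ("LG4X-like", LG4X_LIKE)):
        print(name, math.log(site_likelihood(0, 1, 0.3, model)))
```

In a standard gamma model the third element of every tuple would be identical and only the rate would change between categories; letting it differ per category is the point the abstract makes about LG4M and LG4X.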
doi_str_mv 10.1093/molbev/mss112
format article
identifier PMID: 22491036
publisher United States: Oxford University Press
rights Copyright Oxford Publishing Limited(England) Oct 2012
Distributed under a Creative Commons Attribution 4.0 International License
orcid https://orcid.org/0000-0002-3715-210X
https://orcid.org/0000-0002-9412-9723
fulltext fulltext
identifier ISSN: 0737-4038
ispartof Molecular biology and evolution, 2012-10, Vol.29 (10), p.2921-2936
issn 0737-4038
1537-1719
language eng
recordid cdi_hal_primary_oai_HAL_lirmm_00715443v1
source NCBI_PubMed Central (free); Open Access: Oxford University Press Open Journals; Free Full-Text Journals in Chemistry
subjects Algorithms
Amino Acid Substitution - genetics
Amino acids
Biodiversity
Bioinformatics
Computer Science
Databases, Protein
Evolution & development
Evolution, Molecular
Heterogeneity
Life Sciences
Likelihood Functions
Models, Genetic
Molecular biology
Mutation Rate
Populations and Evolution
Proteins
Proteins - genetics
Quantitative Methods
Time Factors
title Modeling protein evolution with several amino acid replacement matrices depending on site rates
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-14T13%3A58%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_hal_p&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Modeling%20protein%20evolution%20with%20several%20amino%20acid%20replacement%20matrices%20depending%20on%20site%20rates&rft.jtitle=Molecular%20biology%20and%20evolution&rft.au=Le,%20Si%20Quang&rft.date=2012-10-01&rft.volume=29&rft.issue=10&rft.spage=2921&rft.epage=2936&rft.pages=2921-2936&rft.issn=0737-4038&rft.eissn=1537-1719&rft_id=info:doi/10.1093/molbev/mss112&rft_dat=%3Cproquest_hal_p%3E1080882993%3C/proquest_hal_p%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c396t-58493f66f9eb19a1ca430497dca3024f70f2c59c570307fe07353940f1d411553%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1080790325&rft_id=info:pmid/22491036&rfr_iscdi=true