A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins
Published in: | Computers & Chemistry, 2000-01, Vol. 24 (1), pp. 71-94 |
---|---|
Main Authors: | , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Different local regions of natural amino acid or nucleotide sequences show remarkable heterogeneity in residue composition, reflecting diversity in evolutionary history and physicochemical constraints. Compositional complexity measures are helpful for describing and understanding this variegation. Motivated by some open problems in comparative genomics and protein folding, we have developed a new 'global' compositional complexity measure, $G_1$, which overcomes a crucial limitation of earlier methods. The 'local' measures used in previous research resemble entropy functions and are inherently dependent on an underlying probability distribution. Local measures cannot rigorously compare complexity across sequences of substantially different size, because real sequences show very irregular heterogeneity and do not have the necessary ergodicity in scaling and asymptotic properties. $G_1$ is a member of a new class of scale-independent, distribution-independent complexity functions. For a sequence $S$ of length $L$ on an $N$-letter alphabet, $G_1$ is derived from ratios in the integer partition lattice, $P_{L,N}$, of $L$ with $N$ parts, where the elements of $P_{L,N}$ are the state vectors of $S$, $(n_1, n_2, \ldots, n_N)$, ranked by an order principle. We present theorems and proofs relating to the metric properties of $G_1$ and its relationship to other state-vector-dependent compositional complexity functions, together with a fully-efficient $O(L)$ algorithm to compute $G_1$. The distributions of $G_1$ were calculated for the entire sets of translated proteins encoded by extensively sequenced genomes. The results establish the existence of a clear evolutionary principle, common to bacteria, archaea and eukaryotes, that the proteins encoded by more extreme AT-rich and GC-rich genomes have generally lower compositional complexity than those of more typical organisms. |
---|---|
ISSN: | 0097-8485 |
DOI: | 10.1016/S0097-8485(99)00048-0 |
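
The summary refers to the state vector of a sequence and to entropy-like 'local' measures. As a rough illustration only, the sketch below counts residues in a single O(L) pass to build a state vector and then computes a Shannon entropy from it. It does not compute the paper's global measure, whose definition is not given in this record. The function names `state_vector` and `shannon_entropy` are hypothetical, and sorting the counts in descending order is an assumption made here so that the vector reads as an integer partition of L into N parts (zeros allowed).

```python
from collections import Counter
from math import log2


def state_vector(seq, alphabet):
    """Return residue counts for seq over alphabet, sorted descending.

    Assumption: descending order is used only so the result looks like an
    integer partition of len(seq); the paper's ranking rule is not stated
    in this record.
    """
    counts = Counter(seq)  # single pass over the sequence, O(L)
    unknown = set(counts) - set(alphabet)
    if unknown:
        raise ValueError(f"symbols outside alphabet: {sorted(unknown)}")
    return tuple(sorted((counts.get(a, 0) for a in alphabet), reverse=True))


def shannon_entropy(vector):
    """Entropy in bits of the composition described by a state vector.

    This is an entropy-like 'local' style measure, shown for contrast;
    it is NOT the global measure described in the summary.
    """
    total = sum(vector)
    return -sum((n / total) * log2(n / total) for n in vector if n > 0)


if __name__ == "__main__":
    v = state_vector("AACG", "ACGT")
    print(v, shannon_entropy(v))
```

Two sequences of different length can share the same entropy while occupying very different positions among the possible state vectors for their length, which is the kind of size dependence the summary says a global measure is meant to address.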