Model-Based Genomic Studies of Protein Sequence Evolution: Convergence, Epistasis, and Amino Acid Acceptance Rates

[Abstract] Protein sequence changes are a major contributor to phenotypic evolution and biodiversity. While the genomic revolution has drastically increased the amount of protein sequence data available for comparative studies, the development of analytic tools lags behind. In particular, current mathematical models of sequence evolution are oversimplified and typically ignore many heterogeneities in evolutionary processes. As a result, they often describe evolution inadequately and lead to misleading conclusions. My thesis uncovers some of these heterogeneities and demonstrates that incorporating them into mathematical models of protein sequence evolution offers new insights into evolutionary mechanisms. For instance, convergent evolution of morphological traits has long interested biologists because it is a strong indicator of similar natural selection acting in independent evolutionary lineages. Similarly, convergent evolution of protein sequences is commonly thought to result from natural selection. In Chapter 2 of this thesis, however, I show that such interpretations are problematic, because sequence convergence can be explained by neutral evolution as long as among-site variation in amino acid composition is considered. I also find that the level of convergence decreases with genetic distance. In Chapter 3, I evaluate two hypotheses that could explain this decrease: (i) divergent epistasis in distantly related organisms and (ii) gene tree discordance. I demonstrate that both are at work, but that their contributions vary depending on how closely related the species of interest are. In Chapter 4, I revisit a high-profile claim of genome-wide adaptive protein sequence convergence for echolocation in three lineages of mammals. I find that the amount of convergence observed is no greater than that in proper negative controls, suggesting that these sequence convergences are largely neutral and unrelated to echolocation. A widely believed but never critically tested hypothesis in phylogenetics is that morphological data contain more convergence and hence are less suitable for phylogenetic inference than molecular data. Analyzing a large dataset that includes thousands of morphological traits and thousands of molecular traits, I find unequivocal evidence for this hypothesis and uncover its underlying cause in Chapter 5. I subsequently design a method to identify and remove highly convergent traits, which leads to higher phylogenetic accuracy. In Chapter 6, I report a new type of evolutionary heterogeneity that potentially contributes to phylogenetic error: between-species variation in the probability with which a mutation between a specific pair of amino acids is fixed. In Chapter 7, I find that this heterogeneity leads to another previously unknown heterogeneity among species: the fitness disadvantage of nonsynonymous transversions relative to that of nonsynonymous transitions, a subject that has been studied since the dawn of the field of molecular evolution. These six chapters, along with the introductory and concluding chapters, provide an integrative study of previously unknown or neglected heterogeneities in protein sequence evolution. Together, they correct misconceptions in molecular evolution, help improve phylogenetic inference, and deepen our understanding of evolutionary mechanisms.

[Publication date] [Publishing institution] University of Michigan

[Validity level] convergence [Subject classification]

[Keywords] protein sequence evolution; convergence; Ecology and Evolutionary Biology; Genetics; Science; Bioinformatics [Timeliness]
