Phylogeny Inference in the Presence of Incomplete Lineage Sorting, Gene Duplication and Loss and Hybridization
[Abstract] A species phylogeny captures how a set of extant species split and diverged from their most recent common ancestral species. A gene tree captures the evolutionary history of an individual gene or, more generally, of a non-recombining genomic region. The relationship between the phylogeny of a set of species and the trees of genes in the genomes of those species is complex. The complexity arises from processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and hybridization, all of which can give rise to gene trees whose topologies disagree with each other as well as with that of the species phylogeny.
Species phylogeny inference in the post-genomic era, also known as phylogenomic inference, requires models and methods that account for these processes in order to relate how individual loci (genomic regions) evolve within and across the branches of species phylogenies. For example, the multispecies coalescent (MSC) was introduced to model ILS, and statistical species tree inference methods based on it have been developed. This model was later extended to allow for reticulation events (e.g., hybridization), and statistical methods for inferring phylogenetic networks were developed. Birth-death models of gene evolution have also been introduced to capture gene duplications and losses, and species tree inference methods that use them have been developed.
In this thesis, I address two computational problems that arise in this domain. The first problem concerns the inference of species trees from multiple loci, assuming that only ILS and GDL are at play, but not reticulation. The second problem concerns the inference of species (phylogenetic) networks from multiple loci when all three processes (ILS, GDL, and reticulation) are at play. My contribution to the first problem is twofold. First, I developed and implemented a heuristic for maximum a posteriori (MAP) estimation of the species tree from the sequence alignments of multiple independent loci. Second, based on a study of the accuracy of MSC-based inference methods on data where GDL is at play, I proposed a method for efficient inference of the topology of a species tree in the presence of both ILS and GDL. My contribution to the second problem is twofold as well. I first developed the first three-piece model of phylogenetic network / locus network / gene tree, which captures the three aforementioned processes and yields a generative model of genomic sequence data from a phylogenetic network. I then developed a heuristic for inferring phylogenetic networks from multi-locus data under this generative model. I studied the accuracy of all methods on both simulated and biological data sets.
The contributions of my thesis advance the field of phylogenomics by providing methods that incorporate more of the biological complexity of evolution than existing methods do. Consequently, my methods allow more of the genomic data (and signal) to be used for more accurate inference of not only the species phylogeny, but also the processes that acted upon the individual loci within the genomes of those species.
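As a small illustration of how ILS alone can make gene trees disagree with the species tree, the sketch below evaluates the standard multispecies coalescent result for three species with one sampled lineage each. It is not taken from the thesis and is not one of its methods; the function name and the species labels A, B, C are hypothetical, and the internal branch length `t` is assumed to be given in coalescent units.

```python
import math


def three_taxon_topology_probs(t):
    """Gene tree topology probabilities under the MSC for species tree ((A,B),C).

    Assumes one sampled lineage per species and an internal branch of
    length t >= 0 in coalescent units (illustrative assumption).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # The A and B lineages fail to coalesce in the internal branch with
    # probability exp(-t); the three lineages then coalesce in random order
    # in the root population, so each discordant topology gets a third of that.
    discordant = math.exp(-t) / 3.0
    concordant = 1.0 - 2.0 * discordant
    return {
        "((A,B),C)": concordant,   # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 3.0):
        print(t, three_taxon_topology_probs(t))
```

With a short internal branch the three topologies become nearly equally likely, which is why methods that ignore ILS can be misled; GDL and hybridization add further sources of discordance that this sketch does not model.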
[Publication date] [Publisher] Rice University
[Validity level] Gene [Subject classification]
[Keywords] [Timeliness]