Statistical and Computational Methods for Differential Expression Analysis in High-throughput Gene Expression Data
[Abstract] In this dissertation, we develop novel statistical and computational methods for differential expression analysis in high-throughput gene expression data. In the first part, we develop statistical models for differential expression with a variety of study designs. In project one, we present an efficient algorithm for the detection of differential expression and splicing of genes in RNA-Seq data. Our approach considers three cases for each gene: no differential expression, differential expression without differential splicing, and differential splicing. We use a Poisson regression framework to model the read counts and a hierarchical likelihood ratio test for model selection. In project two, we present a non-parametric approach for the joint detection of differential expression and splicing of genes by introducing a new statistic, the gene-level differential score, and using a permutation test to assess its statistical significance. The method can be applied to a variety of experimental designs, including those with two (unpaired or paired) or multiple biological conditions, and those with quantitative or survival outcomes. In project three, we model single-cell gene expression data using a two-part mixed model, which not only adequately accounts for the distinct features of single-cell expression data, including excess zero expression values, high variability, and clustered design, but also provides the flexibility of adjusting for covariates. Compared with existing methods, our approach achieves improved power for detecting differentially expressed genes. In the second part, we propose novel methods to improve the computational efficiency of resampling-based tests in genomics. In project four, we present a fast algorithm, based on the cross-entropy method, for evaluating small p-values from permutation tests. 
In project five, we develop an efficient algorithm for estimating small p-values in parametric bootstrap tests, using the improved cross-entropy method to approximate the optimal proposal density and the Hamiltonian Monte Carlo method to sample efficiently from that density. Together, these methods address a critical challenge for resampling-based tests in genomics, since an enormous number of resamples is needed to estimate very small p-values. Simulations and applications to real data demonstrate that our methods achieve significant gains in computational efficiency compared with existing methods.
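To illustrate why very small p-values are costly for resampling-based tests, the following is a minimal sketch of a plain Monte Carlo permutation test for a two-group difference in means. It is not the dissertation's algorithm (which relies on the cross-entropy method); the function name `permutation_p_value`, its parameters, and the choice of test statistic are assumptions made purely for illustration. With the usual add-one correction, the smallest p-value this estimator can report is 1 / (n_perm + 1), so resolving a p-value near 1e-8 would require on the order of 10^8 permutations per test.

```python
import random


def permutation_p_value(x, y, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means.

    Illustrative baseline only: hypothetical helper, not the method
    developed in the dissertation.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n_x = len(x)

    def abs_mean_diff(values):
        # Absolute difference between the means of the two label groups.
        a, b = values[:n_x], values[n_x:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        # Small tolerance guards against floating-point summation order.
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    # Add-one correction: the estimate can never fall below 1 / (n_perm + 1).
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    treated = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
    control = [0.1, -0.2, 0.3, 0.0, 0.2, -0.1]
    print(permutation_p_value(treated, control, n_perm=2000))
```

Because the resolution of this estimator is tied directly to the number of permutations, applying it across thousands of genes with stringent multiple-testing thresholds quickly becomes expensive, which is the computational bottleneck the second part of the dissertation targets.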
[Publication date] [Publisher] University of Michigan
[Validity level] differential expression [Subject classification]
[Keywords] RNA-Seq;differential expression;permutation tests;parametric bootstrap tests;cross-entropy;mixed models;Genetics;Public Health;Statistics and Numeric Data;Health Sciences;Science;Biostatistics [Timeliness]