Statistical Methods of Data Integration, Model Fusion, and Heterogeneity Detection in Big Biomedical Data Analysis
[Abstract] Interesting and challenging methodological questions arise from the analysis of Big Biomedical Data, where viable solutions are sought with the help of modern computational tools. In this dissertation, I study problems in biomedical research related to data integration, data heterogeneity, and the associated statistical learning algorithms. The overarching strategy throughout the dissertation is to treat individual datasets, rather than individual subjects, as the elements of focus. Accordingly, I generalize several traditional subject-level methods so that they are tailored to the development of Big Data methodologies.

Following an introductory overview in Chapter I, Chapter II concerns fusion learning of model heterogeneity in data integration via a regression coefficient clustering method. The statistical learning procedure is built for generalized linear models and enforces an adjacent fusion penalty on ordered parameters (Wang et al., 2016); it adapts the fused lasso (Tibshirani et al., 2005) and extends the homogeneity pursuit (Ke et al., 2015), which considers only a single dataset (an illustrative form of the penalized objective is sketched after this abstract). Using this method, we can identify regression coefficient heterogeneity across sub-datasets and fuse homogeneous subsets, which greatly simplifies the regression model and thereby improves statistical power. The proposed fusion learning algorithm (published as Tang and Song (2016)) allows the integration of a large number of sub-datasets, a clear advantage over traditional methods based on stratum-covariate interactions or random effects. The method is also useful for clustering treatment effects, so that outlying studies may be detected. We demonstrate the method with data from the Panel Study of Income Dynamics and from the Early Life Exposures in Mexico to Environmental Toxicants study, and we extend it to the Cox proportional hazards model to handle time-to-event responses.

Chapter III, under the assumption of a homogeneous generalized linear model, focuses on the development of a divide-and-combine method for extremely large datasets that may be stored on distributed file systems. By means of the confidence distribution (Fisher, 1956; Efron, 1993), I develop a procedure that combines results from the sub-datasets, where the lasso is used to reduce the model size in order to achieve numerical stability (the combination rule is sketched after this abstract). The algorithm fits the MapReduce paradigm and can be perfectly parallelized. To remove the estimation bias incurred by lasso regularization, a de-biasing step is invoked so that the proposed method yields valid inference. The method is conceptually simple, computationally scalable, and fast; its numerical performance is illustrated in comparisons with the benchmark maximum likelihood estimator based on the full data and with other competing divide-and-combine methods. We apply the method to a large public dataset from the National Highway Traffic Safety Administration to identify risk factors for accident injury.

In Chapter IV, I generalize the fusion learning algorithm of Chapter II and develop a coefficient clustering method for correlated data in the framework of generalized estimating equations (GEE; Liang and Zeger, 1986). This generalization is motivated by the need to assess model heterogeneity in the pattern mixture modeling approach (Little, 1993), in which models are stratified by missing-data patterns; this is one of the primary strategies in the literature for handling informative missing-data mechanisms. My method aims to simplify the pattern mixture model by fusing homogeneous parameters across patterns under the GEE framework.
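As an illustrative sketch of the Chapter II procedure (the notation below is mine, not quoted from the dissertation; the exact objective appears in Tang and Song (2016)), let beta_k denote a coefficient of interest in sub-dataset k = 1, ..., K with log-likelihood ell_k, and let (1), ..., (K) be a preliminary ordering of the coefficient estimates. The adjacent fusion penalty then takes the fused-lasso form

    \min_{\beta_1,\ldots,\beta_K} \; -\sum_{k=1}^{K} \ell_k(\beta_k) \;+\; \lambda \sum_{k=2}^{K} \bigl| \beta_{(k)} - \beta_{(k-1)} \bigr|,

where lambda >= 0 is a tuning parameter. Driving an adjacent difference exactly to zero fuses the two corresponding sub-datasets into one cluster, so lambda controls the number of distinct coefficient values and hence the size of the simplified model.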
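Similarly, here is a minimal sketch of the Chapter III combination step, assuming each of the K sub-datasets yields a de-biased lasso estimator \tilde{\beta}_k with estimated covariance matrix \widehat{V}_k. Under asymptotic normality, a confidence-distribution-based combination reduces to the inverse-variance weighted form

    \hat{\beta} \;=\; \Bigl( \sum_{k=1}^{K} \widehat{V}_k^{-1} \Bigr)^{-1} \sum_{k=1}^{K} \widehat{V}_k^{-1} \tilde{\beta}_k,
    \qquad
    \widehat{\operatorname{Var}}(\hat{\beta}) \;=\; \Bigl( \sum_{k=1}^{K} \widehat{V}_k^{-1} \Bigr)^{-1},

so the K per-dataset fits play the role of the parallel "map" steps, and this single weighted average is the "combine" (reduce) step.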
[Date Issued] [Institution] University of Michigan
[Level] Fusion learning [Subject Classification]
[Keywords] Data integration; Fusion learning; Distributed computing; Regularized regression; Missing data; Longitudinal data; Public Health; Statistics and Numeric Data; Health Sciences; Science; Biostatistics [Timeliness]