Parameter Estimation and Multilevel Clustering with Mixture and Hierarchical Models
[Abstract] In the big data era, data are typically collected at massive scales and often carry complex structures, which leads to unprecedented modeling and computational challenges. In numerous applications in engineering and the applied sciences, there is indisputable evidence of hidden subpopulations within the data, each with its own features. Owing to their great modeling flexibility, mixture and hierarchical models have been widely used by researchers to uncover these multilevel structures. However, several outstanding problems arise from these models. First, it has long been observed in practice that the convergence behavior of latent variables in these models is problematic. Second, state-of-the-art hierarchical models tend to perform unsatisfactorily on large-scale data with complex structure. Last but not least, in many practical problems mixture and hierarchical models are strongly affected by outliers or departures from model assumptions. The overarching themes of this thesis focus on addressing these challenges. Our main contributions include the following. We develop a systematic understanding of the statistical efficiency of parameter estimation in finite mixture models. Our studies make explicit the deep links between model singularities, parameter estimation convergence rates, and the algebraic geometry of the parameter space for mixtures of continuous distributions. Next, we develop robust estimators of the mixing measure in finite mixture models using ideas from minimum Hellinger distance estimation, model selection criteria, and the super-efficiency phenomenon. Finally, we propose efficient and scalable joint optimization approaches for clustering a potentially large, hierarchically structured corpus of data, which aim to simultaneously partition the data in each group and discover grouping patterns among groups.
[Publication date] [Publishing institution] University of Michigan
[Validity level] Wasserstein distances [Subject classification]
[Keywords] Mixture models;Wasserstein distances;Fisher singularities;system of polynomial equations;maximum likelihood estimation;robust statistics;convergence rates;multilevel clustering;minimax theory;algebraic geometry;Statistics and Numeric Data;Science;Statistics [Timeliness]