Some Advances on Modeling High-Dimensional Data with Complex Structures
[Abstract] Recent advances in technology have created an abundance of high-dimensional data and made its analysis possible. These data require new, computationally efficient methodology and new kinds of asymptotic analysis. This thesis consists of four projects that deal with high-dimensional data with complex structures. The first project focuses on the graph estimation problem for Gaussian graphical models. Graphical models are commonly used to represent conditional independence between random variables, and learning the conditional independence structure from data has attracted much attention in recent years. However, almost all commonly used graph learning methods rely on the assumption that the observations share the same mean vector. In the first project, we extend the Gaussian graphical model to the setting where the observations are connected by a network and the mean vector can differ across observations. We propose an efficient estimation method for the model and, under the assumption of network cohesion, show both theoretically and through numerical studies that our method accurately estimates the inverse covariance matrix as well as the corresponding graph structure. To further demonstrate the effectiveness of the proposed method, we also analyze a statisticians' coauthorship network dataset to learn term dependencies based on statistics publications. The second project addresses the directed acyclic graph (DAG) estimation problem. Estimating the DAG structure is often challenging because the computational complexity scales exponentially in the graph size when the total ordering of the DAG is unknown. To reduce the computational cost, and also to improve estimation accuracy via the bias-variance trade-off, we propose a two-step approach for estimating the DAG when data are generated from a linear structural equation model. In the first step, we infer the moral graph of the DAG via estimation of the inverse covariance matrix, which reduces the parameter space that one would search for the DAG. In the second step, we apply a penalized likelihood method for estimating the DAG restricted to the reduced space. Numerical studies indicate that the proposed method compares favorably with the one-step method in terms of both computational cost and estimation accuracy. The third and fourth projects investigate supervised learning problems. In the third project, we study the cointegration problem for multivariate time series data and propose a method for identifying cointegrating vectors with simultaneous group and elementwise sparse structures. Such a sparsity structure enables the elimination of certain coordinates of the original multivariate series from all cointegrated series, leading to parsimonious and potentially more interpretable cointegrating vectors. To this end, we formulate an optimization problem based on the profile likelihood and propose an iterative algorithm for solving it. The proposed method has been evaluated on synthetic data and also applied to two real-world data examples involving daily prices of financial sector stocks and monthly treasury yields of different maturities. In the fourth project, we focus on the learning-to-rank problem with sparse feature selection. In particular, we extend the rank support vector machine method to the sparse setting by applying the lasso and elastic-net penalties. We also employ the bundle method and the order statistic tree data structure to reduce the computational complexity. Numerical results indicate that the proposed method works well in both simulation studies and a real-world stock selection problem.
[Publication date] [Publishing institution] University of Michigan
[Validity level] Statistics and Numeric Data [Subject classification]
[Keywords] High-Dimensional; Statistics and Numeric Data; Science; Statistics [Timeliness]