Computation, Visualization, and Applications of Convex Clustering
[Abstract] Clustering is a ubiquitous tool for exploratory data analysis across the sciences, with the general aim of identifying groups of similar objects. Recent work has recast the clustering problem within the framework of convex optimization, addressing many shortcomings of traditional methods such as interpretability, stability, and parameter selection. The method of Convex Clustering has proven to be a canonical example of such an approach, and its extensions and applications are the focus of this work. We begin by considering the application of Convex Clustering in the novel setting of region detection for high-throughput genomic data. We illustrate the versatility of Convex Clustering by developing a novel extension, Spatial Convex Clustering (SpaCC), specifically catered to multivariate, spatially correlated genomics data. We demonstrate that SpaCC achieves state-of-the-art performance on the well-studied problem of Copy Number Segmentation, and show it to be similarly successful in the novel setting of DNA Methylation region detection. Next, we address several shortcomings of Convex Clustering, including slow computation and a lack of familiar visualizations relative to its traditional counterparts. To do so, we introduce algorithms for the fast approximation of the Convex Clustering solution path and provide both theoretical guarantees of error control and empirical investigations. We then provide a suite of visualization techniques to aid in the interpretation of the clustering solution path, exploring their insights via several real data examples. Finally, we introduce the R package clustRviz, which gives practitioners direct access to the fast computation and dynamic visualizations introduced throughout.
[Publication date] [Publisher] Rice University
[Validity level] Convex [Subject classification]
[Keywords] [Timeliness]