Tech Report: HPL-2000-6: Scale Up Center-Based Data
[Abstract] As data collection increases at an accelerating rate with advances in computers and networking technology, analyzing the data (data mining) becomes very important. Data clustering is one of the basic tools widely used as a component in many data mining solutions. Even though many data clustering algorithms have been developed in the last few decades, they face new challenges with huge data sets. Algorithms with quadratic (or higher order) computational complexity, like agglomerative algorithms, drop out very quickly. More efficient algorithms like K-Means and EM, which have linear cost per iteration, also need to be scaled up before they can be applied to very large data sets. This paper shows that many parameter estimation algorithms, including clustering algorithms like K-Means, K-Harmonic Means and EM, have intrinsic parallel structure. Many workstations over a LAN, or a multiple-processor computer, can be used efficiently to run this class of algorithms in parallel. With 60 workstations running in parallel (on a fast LAN), clustering 28.8 GBytes of 40-dimensional data into 100 clusters, the utilization of the computing units is above 80%. 23 pages.
[Publication date] [Publisher] HP Development Company
[Validity level] [Subject classification] Computer Science (General)
[Keywords] parallel algorithms; data mining; data clustering; K-Means; K-Harmonic Means; Expectation Maximization [Timeliness]
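
The parallel structure mentioned in the abstract can be illustrated for K-Means with a minimal sketch. It is not taken from the report; it only assumes the standard observation that one K-Means iteration needs just per-cluster sums and counts, which can be computed independently on each data partition and then added together. The function names `partial_stats` and `kmeans_step` and the `map_fn` parameter are hypothetical, chosen for this example.

```python
from functools import partial
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Stats = Tuple[List[List[float]], List[int]]


def partial_stats(chunk: Sequence[Point], centers: Sequence[Point]) -> Stats:
    """Per-partition statistics: coordinate sums and point counts per cluster."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for x in chunk:
        # Index of the nearest center by squared Euclidean distance.
        j = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])),
        )
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    return sums, counts


def kmeans_step(
    chunks: Sequence[Sequence[Point]],
    centers: Sequence[Point],
    map_fn: Callable = map,  # assumption: any map-like callable, e.g. an executor's map
) -> List[List[float]]:
    """One K-Means iteration: map over partitions, then combine the statistics."""
    k, dim = len(centers), len(centers[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    worker = partial(partial_stats, centers=centers)
    for sums, counts in map_fn(worker, chunks):
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # A cluster that received no points keeps its previous center.
    return [
        [s / total_counts[j] for s in total_sums[j]]
        if total_counts[j]
        else list(centers[j])
        for j in range(k)
    ]


if __name__ == "__main__":
    chunks = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_step(chunks, [(1.0, 1.0), (9.0, 9.0)]))
```

Each call to `partial_stats` touches only its own partition, so the calls can run on separate workers; passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `map_fn` is one way to try this on a single machine. Only the small sums and counts travel between workers, not the data itself, which is consistent with the high utilization the abstract reports, although the report's own implementation may differ.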