Probabilistic Methods for Distributed Learning
[Abstract]

Access to data at massive scale has proliferated in recent years. A significant challenge in machine learning concerns the development of methods that efficiently model and learn from data at this scale, while retaining flexibility and sophistication of analysis.

Many statistical learning problems are formulated as regularized empirical risk minimization [15]. As large datasets become commonplace across applications, it is desirable to extend empirical risk minimization efficiently to this large-scale setting. When the data are too large to be stored on a single machine, or at least too large to keep in a single localized memory, one popular solution is to store and process them in a distributed manner. Consequently, the focus of this dissertation is the study of distributed learning algorithms [3] for empirical risk minimization problems.
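For concreteness, regularized empirical risk minimization over $n$ samples $(x_i, y_i)$ with loss $\ell$, regularizer $R$, and trade-off parameter $\lambda$ takes the standard form (the notation here is generic and is an assumption, not taken from [15]):

$$
\hat{\theta} \;=\; \arg\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(\theta; x_i, y_i\big) \;+\; \lambda\, R(\theta).
$$

In the distributed setting considered here, the $n$ samples are randomly partitioned into blocks held on separate machines, and each machine works with a local version of this objective.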

Toward this end, we propose a series of probabilistic methods for divide-and-conquer distributed learning, with each method accounting for an increasing set of challenges. The basic Maximum Entropy Mixture (MEM) method is proposed first, to model the uncertainty caused by randomly partitioning the data across computing nodes. We then develop a hierarchical extension of MEM, termed hMEM, which facilitates sharing of statistical strength among data blocks. Finally, to address small-sample bias, we impose the constraint that the mean of the inferred parameters is the same across all data blocks, yielding a hierarchical MEM with an expectation constraint (termed hecMEM). Computations are performed with a generalized Expectation-Maximization algorithm. The hecMEM method achieves state-of-the-art results for distributed matrix completion and logistic regression at massive scale, with comparisons to MEM, hMEM, and several alternative approaches.
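To make the divide-and-conquer structure concrete, the sketch below randomly partitions a dataset across blocks, solves a regularized least-squares ERM problem on each block, and averages the local estimates. All function names and the simple averaging step are illustrative assumptions only; the MEM, hMEM, and hecMEM methods described above instead couple the blocks through a probabilistic model and perform inference with a generalized EM algorithm.

```python
# A minimal sketch of divide-and-conquer distributed ERM with plain
# parameter averaging as the aggregation step.  This is a generic baseline
# for illustration; it is NOT the MEM/hMEM/hecMEM inference of the
# dissertation, which models uncertainty across blocks probabilistically.
import numpy as np

def local_ridge_fit(X, y, lam=1.0):
    """Solve a regularized least-squares ERM problem on one data block."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def divide_and_conquer(X, y, num_blocks=4, lam=1.0, seed=None):
    """Randomly partition rows across blocks, fit locally, then average."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    blocks = np.array_split(perm, num_blocks)   # random data partition
    local_params = [local_ridge_fit(X[b], y[b], lam) for b in blocks]
    return np.mean(local_params, axis=0)        # naive aggregation step

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_true = rng.normal(size=10)
    X = rng.normal(size=(10000, 10))
    y = X @ theta_true + 0.1 * rng.normal(size=10000)
    theta_hat = divide_and_conquer(X, y, num_blocks=8, seed=1)
    print("parameter error:", np.linalg.norm(theta_hat - theta_true))
```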
