NON-PARAMETRIC GRAPH-BASED METHODS FOR LARGE SCALE PROBLEMS
[摘要] The notion of similarity between observations plays a very fundamental role in many Machine Learning and Data Mining algorithms. Inmany of these methods, the fundamental problem of prediction, which is making assessments and/or inferences about the future observations from thepast ones, boils down to how ``similar'' the future cases are to the already observed ones. However, similarity is not alwaysobtained through the traditional distance metrics. Data-driven similarity metrics, in particular, come into play where the traditional absolutemetrics are not sufficient for the task in hand due to special structure of the observed data. A common approach for computing data-driven similarityis to somehowaggregate the local absolute similarities (which are not data-driven and can be computed in a closed-from) to infer a globaldata-driven similarity value between any pair of observations. The graph-based methods offer a natural framework to do so. Incorporating these methods, many of the Machine Learning algorithms, that aredesigned to work with absolute distances, can be applied on those problems with data-driven distances. This makes graph-based methodsvery effective tools for many real-world problems.In this thesis, the major problem that I want to address is the scalability of the graph-based methods. With the rise of large-scale,high-dimensional datasets in many real-world applications, many Machine Learning algorithms do not scale up well applying to these problems. The graph-based methods are no exceptioneither. Both the large number of observations and the high dimensionality hurt graph-based methods,computationally and statistically. While the large number of observations imposes more of a computational problem, the high dimensionalityproblem has more of a statistical nature. In this thesis, I address both of these issues in depth and review the common solutions for them proposedin the literature. Moreover, for each of these problems, I propose novel solutions with experimental results depicting the merits of the proposedalgorithms. Finally, I discuss the contribution of the proposed work from a broader viewpoint and draw some future directions of the current work.
[发布日期] [发布机构] the University of Pittsburgh
[效力级别] [学科分类]
[关键词] [时效性]