On boosting and noisy labels

[Abstract] Boosting is a machine learning technique widely used across many disciplines. It enables one to learn from labeled data in order to predict the labels of unlabeled data. A central property of boosting, and one instrumental to its popularity, is its resistance to overfitting. Previous experiments provide a margin-based explanation for this resistance. The main finding of this thesis is that boosting's resistance to overfitting can be understood in terms of how it handles noisy (mislabeled) points. Confirming evidence emerged from experiments on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which is commonly used in machine learning experiments. A majority vote ensemble filter identified, on average, 2.5% of the points in the dataset as noisy. The experiments chiefly investigated boosting's treatment of noisy points from a volume-based perspective. While the cell volume surrounding noisy points did not differ significantly from that of other points, the decision volume surrounding noisy points was two to three times smaller than that of non-noisy points. Additional findings showed that decision volume not only provides insight into boosting's resistance to overfitting in the context of noisy points, but also serves as a suitable metric for identifying which points in a dataset are likely to be mislabeled.
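
The majority vote ensemble filter mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which base learners or cross-validation scheme the thesis used, so everything below is an assumption made for illustration: the three base learners (nearest centroid, 1-nearest-neighbor, 3-nearest-neighbor), the interleaved fold split, the function names, and the toy two-cluster data, which stands in for WDBC. A point is flagged as noisy when more than half of the learners, trained without that point's fold, disagree with its given label.

```python
from collections import Counter


def sq_dist(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_centroid(train_X, train_y, x):
    # Predict the class whose mean feature vector lies closest to x.
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sq_dist(x, centroid)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def k_nearest(k):
    # Build a k-nearest-neighbor predictor (hypothetical helper).
    def predict(train_X, train_y, x):
        ranked = sorted(zip(train_X, train_y), key=lambda p: sq_dist(x, p[0]))
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]
    return predict


def majority_vote_filter(X, y, learners, n_folds=4):
    # Flag indices whose given label is contradicted by a strict
    # majority of learners trained on the other folds.
    flagged = []
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in range(fold, len(X), n_folds):
            wrong = sum(f(train_X, train_y, X[i]) != y[i] for f in learners)
            if 2 * wrong > len(learners):
                flagged.append(i)
    return sorted(flagged)


# Toy data (an assumption, not WDBC): two well-separated clusters,
# with the label of index 4 deliberately flipped from 0 to 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5), (10.2, 10.8)]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

learners = [nearest_centroid, k_nearest(1), k_nearest(3)]
flagged = majority_vote_filter(X, y, learners, n_folds=4)
print(flagged)  # [4]
```

In this sketch the 1-nearest-neighbor learner is misled by the flipped point when classifying its neighbors, but the other two learners outvote it, so only the mislabeled point is flagged. Requiring a majority rather than a single dissenting learner is what keeps the filter from discarding clean points near a noisy one.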

[Publication date] [Publisher] Massachusetts Institute of Technology
