Counting Positives Accurately Despite Inaccurate Classification
[Abstract] Most supervised machine learning research assumes that the training set is a random sample from the target population, and thus that the class distribution is invariant. In real-world situations, however, the class distribution changes, and this shift is known to erode the effectiveness of classifiers and calibrated probability estimators. This paper focuses on the problem of accurately estimating the number of positives in the test set (quantification), as opposed to classifying individual cases accurately. It compares three methods: classify & count, an adjusted variant, and a mixture model. An empirical evaluation on a text classification benchmark reveals that the simple method is consistently biased, and that the mixture model is surprisingly effective even when positives are very scarce in the training set, a common case in information retrieval. Notes: Copyright 2005 Springer-Verlag. Published in and presented at the 16th European Conference on Machine Learning (ECML'05), 3-7 October 2005, Porto, Portugal. http://ecmlpkdd05.liacc.up.pt/ 12 pages.
[Publication date] [Publisher] HP Development Company
[Validity level] [Subject classification] Computer Science (General)
[Keywords] supervised machine learning;estimation;mixture models;shifting class prior;non-stationary class distribution [Timeliness]
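
The abstract names classify & count and an adjusted variant without giving formulas. The following is a minimal sketch of how the two might look, assuming the adjustment is the common correction that rescales the observed positive rate by the classifier's true positive rate and false positive rate (estimated, for example, by cross-validation on the training set). The function names and the example numbers are hypothetical and are not taken from the paper; the mixture model is not sketched here.

```python
def classify_and_count(predictions):
    """Simple method: fraction of test cases the classifier labels positive.

    `predictions` is a sequence of 0/1 classifier outputs (hypothetical input).
    """
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Adjusted variant (assumed form): correct the observed rate using the
    classifier's true positive rate `tpr` and false positive rate `fpr`.

    observed = tpr * p + fpr * (1 - p)  =>  p = (observed - fpr) / (tpr - fpr)
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the correction to be defined")
    observed = classify_and_count(predictions)
    estimate = (observed - fpr) / (tpr - fpr)
    # Clip to the valid range, since noisy tpr/fpr estimates can overshoot.
    return min(1.0, max(0.0, estimate))


# Illustrative numbers only: 1000 test cases of which 100 are truly positive,
# scored by a classifier with tpr = 0.6 and fpr = 0.1. It flags
# 0.6 * 100 + 0.1 * 900 = 150 cases as positive.
predictions = [1] * 150 + [0] * 850
naive = classify_and_count(predictions)                   # overestimates: 0.15
corrected = adjusted_count(predictions, tpr=0.6, fpr=0.1)  # close to 0.10
```

In this made-up scenario the simple count reports 15% positives while the true prevalence is 10%, which illustrates the kind of systematic bias the abstract attributes to the simple method; the corrected estimate recovers roughly 10% because the classifier's error rates are assumed known.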