Active duplicate detection with Bayesian nonparametric models

[Abstract] When multiple databases are merged, an essential step is identifying sets of records that refer to the same entity. Called duplicate detection, this task is typically tedious to perform manually, and so a variety of automated methods have been developed for partitioning a collection of records into coreference sets. This task is complicated by ambiguous or noisy field values, so systems are typically domain-specific and often fitted to a representative labeled training corpus. Once fitted, such systems can estimate a partition of a similar corpus without human intervention. While this approach has many applications, it is often infeasible to encode the appropriate domain knowledge a priori or to identify suitable training data. To address such cases, this thesis uses an active framework for duplicate detection, wherein the system initially estimates a partition of a test corpus without training, but is then allowed to query a human user about the coreference labeling of a portion of the corpus. The responses to these queries are used to guide the system in producing improved partition estimates and further queries of interest. This thesis describes a complete implementation of this framework with three technical contributions: a domain-independent Bayesian model expressing the relationship between the unobserved partition and the observed field values of a set of database records; a criterion for picking informative queries based on the mutual information between the response and the unobserved partition; and an algorithm for estimating a minimum-error partition under a Bayesian model through a reduction to the well-studied problem of correlation clustering. It also presents experimental results demonstrating the effectiveness of this method in a variety of data domains.
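The query criterion named in the abstract can be illustrated with a small sketch. This is not the thesis's implementation; it rests on three assumptions that the abstract does not state: each query asks whether one pair of records is coreferent, the user's answer is noise-free, and the posterior over partitions is approximated by a list of sampled partitions. Under those assumptions the answer is a deterministic function of the partition, so the mutual information between answer and partition equals the entropy of the answer, and the most informative pair is the one whose estimated coreference probability is closest to 0.5. All function names below are hypothetical.

```python
import math
from itertools import combinations


def pairwise_coref_prob(samples):
    """Estimate P(records i and j are coreferent) from sampled partitions.

    Assumption: each sample is a list mapping record index -> cluster label,
    standing in for a draw from a posterior over partitions.
    """
    n = len(samples[0])
    probs = {}
    for i, j in combinations(range(n), 2):
        same = sum(1 for labels in samples if labels[i] == labels[j])
        probs[(i, j)] = same / len(samples)
    return probs


def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def most_informative_pair(samples):
    """Pick the pair whose answer has maximum entropy.

    Assumption: answers are noise-free, so I(answer; partition) = H(answer).
    """
    probs = pairwise_coref_prob(samples)
    return max(probs, key=lambda pair: binary_entropy(probs[pair]))


# Four hypothetical sampled partitions of four records.
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
query = most_informative_pair(samples)
```

In this toy input, records 0 and 1 are together in every sample, so asking about them yields no information, while records 2 and 3 are together in half of the samples, so that pair is selected. With noisy answers or queries over larger record sets, the mutual information would no longer reduce to a single binary entropy.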

[Publication date] [Publisher] Massachusetts Institute of Technology
