Biblio: Automatic meta-data extraction
[Abstract] Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each given document format, it uses example-based machine learning to adapt to customer-defined document and meta-data types. We provide results from two document corpora: a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor-quality scans of fax-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to hand-coded systems, and much worse performance on the legal documents. 26 pages.
[Publication date] [Publisher] HP Development Company
[Validity level] [Subject classification] Computer Science (General)
[Keywords] document understanding; learning; support vector machines; neural networks [Timeliness]