SPOKEN DOCUMENT RETRIEVAL FOR TREC-8 AT CAMBRIDGE UNIVERSITY
[Abstract] This paper presents work done at Cambridge University on the TREC-8 Spoken Document Retrieval (SDR) Track. The 500 hours of broadcast news audio were filtered using an automatic scheme for detecting commercials, and then transcribed using a 2-pass HTK speech recogniser which ran at 13 times real time. The system gave an overall word error rate of 20.5% on the 10-hour scored subset of the corpus, the lowest in the track. Our retrieval engine used an Okapi scheme with traditional stopping and Porter stemming, enhanced with part-of-speech weighting on query terms, a stemmer exceptions list, semantic ‘poset’ indexing, parallel collection frequency weighting, both parallel and traditional blind relevance feedback, and document expansion using parallel blind relevance feedback. The final system gave an Average Precision of 55.29% on our transcriptions. For the case where story boundaries are unknown, a similar retrieval system, without the document expansion, was run on a set of “stories” derived from windowing the transcriptions after removal of commercials. Boundaries were forced at “commercial” or “music” changes, and some recombination of temporally close stories was allowed after retrieval. When scoring duplicate story hits and commercials as irrelevant, this system gave an Average Precision of 41.47% on our transcriptions. The paper also presents results for cross-recogniser experiments using our retrieval strategies on transcriptions from our own first-pass output, AT&T, CMU, 2 NIST-run BBN baselines, LIMSI and Sheffield University, and the relationship between perfor
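For readers unfamiliar with the Average Precision figures quoted in the abstract, the following is a minimal sketch of the standard non-interpolated measure for a single query. It is not the track's official scoring code: the function name `average_precision` and the story identifiers are hypothetical, and treating a repeated hit on an already-returned story as irrelevant is only an assumed reading of the abstract's remark about duplicate story hits. A reported figure would presumably be this value averaged over all queries.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query (illustrative sketch)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    seen = set()          # stories already returned higher up the ranking
    hits = 0              # distinct relevant stories found so far
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        # Assumption: a duplicate hit on an already-seen story scores as irrelevant.
        if doc_id in relevant and doc_id not in seen:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
        seen.add(doc_id)
    # Relevant stories that are never retrieved contribute zero.
    return precision_sum / len(relevant)


if __name__ == "__main__":
    # Hypothetical ranking: "s1" and "s3" are relevant, "s1" is returned twice.
    print(average_precision(["s1", "s1", "s3", "s2"], ["s1", "s3"]))
```

In this sketch, dividing by the total number of relevant stories (rather than the number retrieved) means that missing a relevant story lowers the score, which is the usual convention for this measure.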
[Subject classification] Social sciences, humanities and arts (general)