Evaluation and development of conceptual document similarity metrics with content-based recommender applications
[Abstract] The World Wide Web brought with it an unprecedented level of information overload. Computers are very effective at processing and clustering numerical and binary data; however, the conceptual clustering of natural-language data is considerably harder to automate. Most past techniques rely on simple keyword matching or probabilistic methods to measure semantic relatedness. However, these approaches do not always accurately capture conceptual relatedness as measured by humans.

In this thesis we propose and evaluate the use of novel Spreading Activation (SA) techniques for computing semantic relatedness, by modelling the article hyperlink structure of Wikipedia as an associative network structure for knowledge representation. The SA technique is adapted, and several problems are addressed, so that it can function over the Wikipedia hyperlink structure. Inter-concept and inter-document similarity metrics are developed which use SA to compute the conceptual similarity between two concepts and between two natural-language documents. We evaluate these approaches over two document similarity datasets and achieve results which compare favourably with the state of the art.

Furthermore, document preprocessing techniques are evaluated in terms of the performance gain they can provide for the well-known cosine document similarity metric and the Normalised Compression Distance (NCD) metric. Results indicate that a near two-fold increase in accuracy can be achieved for NCD by applying simple preprocessing techniques. Nonetheless, the cosine similarity metric still significantly outperforms NCD.

Finally, we show that using our Wikipedia-based method to augment the cosine vector space model provides superior results to either in isolation. Combining the two methods raises the Pearson correlation to ρ = 0.72 over the Lee (2005) document similarity dataset, which matches the reported result for the state-of-the-art Explicit Semantic Analysis (ESA) technique, while requiring less than 10% of the Wikipedia database that ESA requires.

As a use case for document similarity techniques, a purely content-based news-article recommender system is designed and implemented for a large online media company. This system is used to gather additional human-generated relevance ratings, which we use to evaluate the performance of three state-of-the-art document similarity metrics for providing content-based document recommendations.
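
For orientation, the two baseline metrics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not say which compressor, tokenizer, or term weighting was used, so zlib, a simple regex tokenizer, raw term counts, and every function name below are hypothetical choices made here.

```python
import math
import re
import zlib
from collections import Counter


def tokenize(text, stopwords=frozenset()):
    """Hypothetical preprocessing: lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two raw term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def compressed_size(data):
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalised Compression Distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)); lower means more similar."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Note the opposite orientations: cosine similarity grows with relatedness, while NCD is a distance and shrinks. Because NCD operates on raw bytes, any preprocessing (such as joining the output of `tokenize` back into a string) has to be applied to the text before it is compressed.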
[Publishing institution] Stellenbosch University