A Topic-Specific Web Crawler using Deep Convolutional Networks
[Abstract] This paper presents a new focused crawler that efficiently supports the Turkish language. The architecture is divided into multiple units: a control unit, a crawler unit, a link extractor unit, a link sorter unit, and a natural language processing unit. The crawler's units can work in parallel to process the massive number of published websites. The proposed natural language processing unit, based on a Convolutional Neural Network (CNN), can accurately classify Turkish text and web pages. Extensive experiments on three datasets were performed to evaluate the approach. The first dataset contains 50,000 Turkish web pages downloaded by the developed crawler, while the other two are publicly available and consist of 28,567 and 22,431 Turkish web pages, respectively. In addition, Vector Space Model (VSM) representations in general, and state-of-the-art word embedding techniques in particular, were investigated to find the most suitable one for the Turkish language. Overall, the results indicate that the approach achieves good performance, robustness, and stability when processing Turkish. Bidirectional Encoder Representations from Transformers (BERT) was found to be the most appropriate embedding for building an efficient Turkish language classification system. Finally, the experiments showed that the natural language processing unit outperforms seven state-of-the-art CNN classification systems: its accuracy is 10% higher than that of the second-best system and 47% higher than that of the lowest-performing one.
[Subject classification] Computer Science (General)
[Keywords] CNN; natural language processing; text classification; topic specific crawler; focused crawler; web crawling
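
The unit structure named in the abstract (control, crawler, link extractor, link sorter, natural language processing) can be illustrated with a minimal single-threaded sketch. This is not the authors' implementation: the in-memory `PAGES` dictionary, the `TOPIC_TERMS` vocabulary, the `relevance` scorer, and the 0.3 threshold are all hypothetical stand-ins, and the keyword-overlap score merely occupies the place where the paper uses a CNN classifier.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs offline; a real crawler
# unit would download pages over HTTP instead.
PAGES = {
    "http://example.test/": '<a href="/spor">spor haberleri</a> <a href="/hava">hava durumu</a>',
    "http://example.test/spor": 'futbol maç gol <a href="/spor/lig">lig futbol</a>',
    "http://example.test/hava": "yağmur bulut",
    "http://example.test/spor/lig": "futbol lig puan",
}
TOPIC_TERMS = {"futbol", "lig", "spor", "gol"}  # assumed topic vocabulary


class LinkExtractor(HTMLParser):
    """Link extractor unit: collects page text and (URL, anchor text) pairs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links, self.text = [], []
        self._href, self._anchor = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href, self._anchor = dict(attrs).get("href"), []

    def handle_data(self, data):
        self.text.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = urljoin(self.base_url, self._href)
            self.links.append((url, " ".join(self._anchor)))
            self._href = None


def relevance(text):
    """Stand-in for the NLP unit: fraction of tokens that are topic terms."""
    tokens = text.lower().split()
    return sum(t in TOPIC_TERMS for t in tokens) / len(tokens) if tokens else 0.0


def crawl(seed, threshold=0.3, limit=10):
    """Control unit: repeatedly visit the most promising unvisited URL."""
    frontier = [(-1.0, seed)]  # link sorter unit: max-heap via negated scores
    seen, relevant, visited = {seed}, [], 0
    while frontier and visited < limit:
        _, url = heapq.heappop(frontier)
        visited += 1
        parser = LinkExtractor(url)
        parser.feed(PAGES.get(url, ""))  # crawler unit (offline stub)
        if relevance(" ".join(parser.text)) >= threshold:
            relevant.append(url)
        for link, anchor in parser.links:
            if link not in seen:
                seen.add(link)
                # Anchor text is used here to estimate how promising a link is.
                heapq.heappush(frontier, (-relevance(anchor), link))
    return relevant


if __name__ == "__main__":
    print(crawl("http://example.test/"))
```

The priority queue is what makes the crawl focused: links whose anchor text looks on-topic are visited before others. The abstract states that the units can run in parallel; that is omitted here for brevity.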