Audio-Visual Person Recognition Using Deep Convolutional Neural Networks

[Abstract] Protection of data integrity and personal identity has been an active research area for many years. Among the techniques investigated, multi-modal recognition systems that authenticate people from audio and face signals are promising because of their ease of use. A challenge in developing such a system is making it reliable enough for practical application. This paper proposes an efficient audio-visual bimodal recognition system that uses deep convolutional neural networks (CNNs) as its primary model architecture. First, two separate deep CNN models are trained on audio and facial features, respectively. The outputs of these CNN models are then fused to predict the identity of the subject. Implementation details of the data fusion are discussed at length in the paper. Experimental verification shows that the proposed bimodal fusion approach is more accurate than either single-modal recognition system and than published results on the same dataset.
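
The abstract does not say which fusion rule the paper uses, so the following is only a minimal sketch of one common option: score-level (late) fusion by weighted averaging of the softmax outputs of the two models. The function names, the `audio_weight` parameter, and the example logits are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def fuse_scores(audio_logits, face_logits, audio_weight=0.5):
    """Weighted-average fusion of two classifiers' class probabilities.

    Assumption: both models score the same ordered set of identities.
    audio_weight is a hypothetical tuning parameter in [0, 1].
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    p_audio = softmax(np.asarray(audio_logits, dtype=float))
    p_face = softmax(np.asarray(face_logits, dtype=float))
    fused = audio_weight * p_audio + (1.0 - audio_weight) * p_face
    # Predicted identity is the class with the highest fused probability.
    return fused, np.argmax(fused, axis=-1)


# Made-up logits for one sample and three enrolled identities.
audio_logits = [[2.0, 1.0, 0.1]]   # audio model favours identity 0
face_logits = [[0.5, 2.5, 0.3]]    # face model favours identity 1
fused, identity = fuse_scores(audio_logits, face_logits, audio_weight=0.5)
print(fused, identity)
```

With equal weights, the more confident face model decides the fused prediction in this example. In practice the weight would be chosen on held-out data.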

[Keywords] CNN; Face recognition; Mel-spectrogram; Multi-modal; Speaker recognition; VGG16 model
