Comparison of machine learning algorithms and acoustic features in emotion recognition from spontaneous speech
[Abstract] Unlike recognizing emotion categories from acted emotional speech, estimating emotion from spontaneous speech remains a difficult task. There have been many studies on emotion recognition, the vast majority of which use acted or exaggerated emotional speech datasets [1–7]. Reportedly, the accuracy of a seven-class classification is only about 33.4% in UAR (unweighted average recall) [1], and that of a nine-class classification is only about 21.7% in UAR [8]. Recent studies have shown that deep neural models can implicitly learn to extract features that are useful for estimating emotions from speech signals [2]. For small amounts of data, however, hand-crafted features are known to often outperform methods based on deep representation learning, and so remain useful [9,10]. This may also apply to recognizing emotion from spontaneous speech, because the existing public emotion-labeled spontaneous speech corpora are all small: 1,018 utterances in VAM [11], 1,308 utterances in RECOLA [12], and 4,784 utterances in the "improvised" portion of IEMOCAP.
[Subject classification] Acoustics and ultrasonics
[Keywords] Emotion recognition; Machine learning; Emotional speech corpus; Spontaneous speech