Investigation of training data size for real-time neural vocoders on CPUs
[Abstract] In recent years, neural speech synthesis technology for text-to-speech has developed rapidly. WaveNet [1], an autoregressive generative model of raw audio, was a major turning point in this development. Neural vocoders such as the WaveNet vocoder [2] have achieved much better quality than traditional source-filter vocoders [3]. In particular, Tacotron 2 with a WaveNet vocoder can synthesize high-quality speech that is indistinguishable from natural speech [4]. Because of its autoregressive architecture, however, the WaveNet vocoder suffers from slow inference, which limits its application. To solve this problem, various neural vocoder models have been proposed, and some can synthesize speech waveforms in real time, even in restricted environments such as mobile CPUs [5,6]. These models are mostly trained on large training sets of more than 10 hours of speech. In real scenarios, however, there are many cases in which it is difficult to collect such a large training set. Therefore, it is important to investigate how much training data is required for real-time neural vocoders.
[Publication date] [Publishing institution]
[Validity level] [Subject classification] Acoustics and ultrasonics
[Keywords] Speech synthesis, Neural vocoder, LPCNet, Parallel WaveGAN [Timeliness]