Fast accurate diphone-based phoneme recognition
[Abstract] Statistical speech recognition systems typically utilise a set of statistical models of subword units based on the set of phonemes in a target language. However, in continuous speech it is important to consider co-articulation effects and the interactions between neighbouring sounds, as over-generalisation of the phonetic models can negatively affect system accuracy. Traditionally, co-articulation in continuous speech is handled by incorporating contextual information into the subword model by means of context-dependent models, which exponentially increase the number of subword models. In contrast, transitional models aim to handle co-articulation by modelling the interphone dynamics found in the transitions between phonemes. This research aimed to perform an objective analysis of diphones as subword units for use in hidden Markov model-based continuous-speech recognition systems, with special emphasis on a direct comparison to a context-dependent biphone-based system in terms of complexity, accuracy and computational efficiency in similar parametric conditions. To simulate practical conditions, the experiments were designed to evaluate these systems in a low-resource environment (limited supply of training data, computing power and system memory) while still attempting fast, accurate phoneme recognition. Adaptation techniques designed to exploit characteristics inherent in diphones, as well as techniques used for effective parameter estimation and state-level tying, were used to reduce resource requirements while simultaneously increasing parameter reliability. These techniques include diphthong splitting, utilisation of a basic diphone grammar, diphone set completion, maximum a posteriori estimation and decision-tree based state clustering algorithms.
The experiments were designed to evaluate the contribution of each adaptation technique individually and subsequently compare the optimised diphone-based recognition system to a biphone-based recognition system that received similar treatment. Results showed that diphone-based recognition systems perform better than both traditional phoneme-based systems and context-dependent biphone-based systems when evaluated in similar parametric conditions. Therefore, diphones are effective subword units, which carry suprasegmental knowledge of speech signals and provide an excellent compromise between detailed co-articulation modelling and acceptable system performance.
[Publication date] [Publishing institution] Stellenbosch University
[Validity level] [Subject classification]
[Keywords] [Timeliness]