Unsupervised clustering of audio data for acoustic modelling in automatic speech recognition systems

[Abstract] This thesis presents a system designed to replace the manual process of generating a pronunciation dictionary for use in automatic speech recognition. The proposed system has several stages. The first stage segments the audio into subword units using a frequency domain method. In the second stage, dynamic time warping is used to determine the similarity between each possible pair of these acoustic segments. These similarities are used to group similar acoustic segments into acoustic clusters. The final stage derives a pronunciation dictionary from the orthography of the training data and the corresponding sequence of acoustic clusters. This process begins with an initial mapping between words and their sequences of clusters, established by Viterbi alignment with the orthographic transcription. The dictionary is then refined iteratively, with each iteration pruning redundant mappings, re-estimating the hidden Markov models and performing Viterbi re-alignment. This approach is evaluated experimentally by applying it to two subsets of the TIMIT corpus. It is found that, when test words are repeated often in the training material, the approach leads to a system whose accuracy is almost as good as one trained using the phonetic transcriptions. When test words are not repeated often in the training set, the proposed approach leads to better results than those achieved using the phonetic transcriptions, although the recognition is poor overall in this case.
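
The second stage described in the abstract can be illustrated with a minimal sketch of dynamic time warping between acoustic segments. The abstract does not specify the local distance measure, the feature representation or any path constraints, so everything here is an assumption: segments are lists of feature vectors, the local cost is the Euclidean distance, and the function names `dtw_distance` and `pairwise_dtw` are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors.

    Assumption: Euclidean local cost and the basic step pattern
    (insertion, deletion, match); the thesis may use something else.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def pairwise_dtw(segments):
    """Symmetric matrix of DTW costs for every pair of segments."""
    k = len(segments)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            d = dtw_distance(segments[i], segments[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist
```

A matrix like the one returned by `pairwise_dtw` could then be passed to a clustering method to form the acoustic clusters. The abstract does not name the clustering method, so that step is left out of this sketch.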

[Publication date] [Publisher] Stellenbosch University

