Combining linguistics and statistics for high-quality limited domain English-Chinese machine translation
[Abstract] Second language learning is a compelling activity in today's global markets. This thesis focuses on the critical technology needed to produce a computer spoken translation game for learning Mandarin Chinese in a relatively broad travel domain. Three main aspects are addressed: efficient Chinese parsing, high-quality English-Chinese machine translation, and how these technologies can be integrated into a translation game system. In the language understanding component, the TINA parser is enhanced with bottom-up and long-distance constraint features. The results show that with these features, the Chinese grammar ran ten times faster and covered 15% more of the test set. In the machine translation component, a method that combines linguistic and statistical systems is introduced. English-Chinese translation is done via an intermediate language, "Zhonglish". The English-Zhonglish step is accomplished by a parse-and-paraphrase paradigm using hand-coded rules, mainly for structural reconstruction. The Zhonglish-Chinese step is accomplished by a standard phrase-based statistical machine translation system, which mostly performs word sense disambiguation and lexicon mapping. The system was evaluated on an independent test set from the IWSLT travel-domain spoken language corpus. Substantial improvements were achieved for GIZA alignment crossover: we obtained a 45% decrease in crossovers compared to a traditional phrase-based statistical MT system. Furthermore, the BLEU score improved by 2 points. Finally, a framework for the translation game system is described, and the feasibility of integrating the components to produce reference translations and to automatically assess a student's translation is verified.
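The abstract reports a decrease in alignment crossovers but does not define the measure. A minimal sketch follows, assuming the common definition in which two word-alignment links cross when their source order and target order disagree. The function name `count_crossovers` and the toy alignments are hypothetical illustrations, not taken from the thesis, and the thesis may count crossovers differently.

```python
from itertools import combinations


def count_crossovers(links):
    """Count pairs of alignment links that cross each other.

    links: iterable of (source_index, target_index) pairs.
    Assumed definition (not confirmed by the thesis): links (s1, t1)
    and (s2, t2) cross when s1 < s2 but t1 > t2, or the reverse.
    """
    crossings = 0
    for (s1, t1), (s2, t2) in combinations(sorted(set(links)), 2):
        # Opposite signs mean source order and target order disagree.
        if (s1 - s2) * (t1 - t2) < 0:
            crossings += 1
    return crossings


# Toy alignments (hypothetical): one monotone, one with reordering.
monotone = [(0, 0), (1, 1), (2, 2)]
reordered = [(0, 2), (1, 0), (2, 1)]

print(count_crossovers(monotone))
print(count_crossovers(reordered))
```

Under this definition, a source sentence that has already been restructured toward the target word order yields fewer crossing links, which is the kind of effect the intermediate "Zhonglish" step is meant to produce.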
[Publication date] [Publisher] Massachusetts Institute of Technology