Automatic acquisition of two-level morphological rules

[Abstract] ENGLISH SUMMARY: There are numerous applications for computational systems with a natural language processing capability. All these applications, which include free-text information retrieval, machine translation and computer-assisted language learning, require a detailed and correctly structured database (or lexicon) of language information on all the levels of language analysis (phonology, morphology, syntax, semantics, etc.). To hand-code this information can be time-consuming and error-prone. An alternative approach is to attempt the automation of the lexicon construction process. The contribution of this thesis is to present a method to automatically construct rule sets for the morphological and phonological levels of language analysis. The particular computational morphological framework used is that of two-level morphology. The lexicon, which contains the language-specific information of two-level analyzers/generators, consists of two components: (1) a morphotactic description of the words to be processed, and (2) a set of two-level phonological (or spelling) rules. The input to the acquisition process is source-target word pairs, where the target is an inflected form of the source word. It is assumed that the target word is formed from the source through the optional addition of a prefix and/or a suffix. There are two phases in the acquisition process: (1) segmentation of the target into morphemes and (2) determination of the optimal two-level rule set with minimal discerning contexts. In phase one, an acyclic deterministic finite state automaton (ADFSA) is constructed from string edit sequences of the input pairs. Segmentation of the words into morphemes is achieved through viewing the ADFSA as a directed acyclic graph (DAG) and applying heuristics using properties of the DAG as well as the elementary string edit operations.
For phase two, the determination of the optimal rule set is made possible with a novel representation of rule contexts, with morpheme boundaries added, in a new DAG. We introduce the notion of a delimiter edge. Delimiter edges are used to select the correct two-level rule type as well as to extract minimal discerning rule contexts from the DAG. To illustrate the language independence of an acquired rule set, results are presented for English adjectives, Xhosa noun locatives, Afrikaans noun plurals and Spanish adjectives. Furthermore, it is shown how rules are acquired from thousands of input source-target word pairs. Finally, the excellent generalization of an acquired rule set is shown by applying a slightly manually modified rule set to previously unseen words. The recognition accuracy on unseen words was 98.9% while the generation accuracy was 97.8%.
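
The abstract does not spell out how the string edit sequences of phase one are computed. As a rough illustration only, the sketch below aligns one source-target word pair with a standard minimum-edit-distance (Levenshtein) alignment and returns it as a list of lexical:surface character pairs, using the "0" null symbol customary in two-level notation. The function name `edit_sequence`, the unit edit costs and the tie-breaking order are assumptions made for this example, not details taken from the thesis.

```python
def edit_sequence(source, target, null="0"):
    """Return one minimum-cost alignment of source to target as
    (lexical, surface) character pairs; `null` marks an empty side.
    Unit costs and tie-breaking are illustrative assumptions."""
    n, m = len(source), len(target)
    # cost[i][j] = minimum number of edits turning source[:i] into target[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])):
            pairs.append((source[i - 1], target[j - 1]))  # copy or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((source[i - 1], null))           # deletion
            i -= 1
        else:
            pairs.append((null, target[j - 1]))           # insertion
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    # Example pair in the spirit of the English adjective data.
    print(" ".join(f"{a}:{b}" for a, b in edit_sequence("happy", "happier")))
```

Several alignments can share the same minimum cost, and this sketch simply returns whichever one its trace-back reaches first. Sequences of this kind, collected over many word pairs, are the sort of input from which an automaton such as the ADFSA could be built.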

[Publication date] [Publisher] Stellenbosch University

