Statistical pattern recognition approaches for retrieval-based machine translation systems
Mansjur, Dwi Sianto
MetadataShow full item record
This dissertation addresses the problem of Machine Translation (MT), which is defined as an automated translation of a document written in one language (the source language) to another (the target language) by a computer. The MT task requires various types of knowledge of both the source and target language, e.g., linguistic rules and linguistic exceptions. Traditional MT systems rely on an extensive parsing strategy to decode the linguistic rules and use a knowledge base to encode those linguistic exceptions. However, the construction of the knowledge base becomes an issue as the translation system grows. To overcome this difficulty, real translation examples are used instead of a manually-crafted knowledge base. This design strategy is known as the Example-Based Machine Translation (EBMT) principle. Traditional EBMT systems utilize a database of word or phrase translation pairs. The main challenge of this approach is the difficulty of combining the word or phrase translation units into a meaningful and fluent target text. A novel Retrieval-Based Machine Translation (RBMT) system, which uses a sentence-level translation unit, is proposed in this study. An advantage of using the sentence-level translation unit is that the boundary of a sentence is explicitly defined and the semantic, or meaning, is precise in both the source and target language. The main challenge of using a sentential translation unit is the limited coverage, i.e., the difficulty of finding an exact match between a user query and sentences in the source database. Using an electronic dictionary and a topic modeling procedure, we develop a procedure to obtain clusters of sensible variations for each example in the source database. The coverage of our MT system improves because an input query text is matched against a cluster of sensible variations of translation examples instead of being matched against an original source example. In addition, pattern recognition techniques are used to improve the matching procedure, i.e., the design of optimal pattern classifiers and the incorporation of subjective judgments. A high performance statistical pattern classifier is used to identify the target sentences from an input query sentence in our MT system. The proposed classifier is different from the conventional classifier in terms of the way it addresses the generalization capability. A conventional classifier addresses the generalization issue using the parsimony principle and may encounter the possibility of choosing an oversimplified statistical model. The proposed classifier directly addresses the generalization issue in terms of training (empirical) data. Our classifier is expected to generalize better than the conventional classifiers because our classifier is less likely to use over-simplified statistical models based on the available training data. We further improve the matching procedure by the incorporation of subjective judgments. We formulate a novel cost function that combines subjective judgments and the degree of matching between translation examples and an input query. In addition, we provide an optimization strategy for the novel cost function so that the statistical model can be optimized according to the subjective judgments.