English-Persian phrase-based statistical machine translation : enhanced models, search and training : a thesis presented in fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Engineering at Massey University, Albany (Auckland), New Zealand
Machine translation (MT), as applied to natural language processing, has
undergone substantial development over the past sixty years. While there are a
number of different approaches to MT, there has been increasing interest in statistical
machine translation (SMT) as the preferred approach to MT. Advances in
computational power, together with the exploration of new methods and algorithms,
have enabled a general improvement in output quality across a number of systems for
various language pairs using this approach. However, there is a significant lack of
research work in the area of English/Persian SMT, mainly due to the scarcity of data
for this language pair, and the shortage of fundamental resources such as large-scale
bilingual corpora. Several research studies have been published on work in the area of
machine translation involving the Persian language; however, systems producing
fluent, usable output are rare.
This thesis shows how SMT was implemented with this language pair for the first
time, and how we created a cutting-edge hybrid SMT system capable of delivering
high-quality translation output.
We present the development of what is currently the largest English/Persian parallel
corpus, constructed using a web crawler to source usable online data, together with
the concatenation of existing parallel corpora. As yet another contribution of the
research, we propose an improved hybrid corpus alignment method involving
sentence length-based and word correspondence-based models to align words, phrases
and sentences in the corpus. We also show the impact that the corpus domain can
have on the translation output, and the need to consider the domains of both bilingual
and monolingual corpora where they are included in the training and language models.
Two open-source toolkits, Moses and Joshua, were modified to work with the Persian
language, and their behaviour and performance were compared to determine
which performed better for this language pair.
We present our work in designing, testing, and implementing a novel, three-level
Transfer-based automatic post-editing (APE) component based on grammatical rules,
which operates by analysing, parsing, and POS-tagging the output, and implements
transformer functions that correct the text, ranging from lexical
transformation to complex syntactical rearrangement. We show that rule-based
approaches to the task of post-editing are superior to the commonly used statistical
models, since they incorporate linguistic knowledge, and are strong in terms of
syntax, morphology, and structural semantics – qualities which are very desirable
when performing grammatical correction and syntactical restructuring.
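As a rough illustration of the idea of rule-based transformers applied to POS-tagged output, the following minimal Python sketch chains a lexical replacement step and a syntactic reordering step. It is not the component described in this thesis: the tag names, the replacement table, the function names, and the single verb-final rule are all hypothetical, chosen only because Persian is generally verb-final. Tokens are given in Latin transliteration for readability.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag); the tag set is hypothetical
Transformer = Callable[[List[Token]], List[Token]]


def make_lexical_transformer(replacements: Dict[str, str]) -> Transformer:
    """Build a lexical-level transformer from a (hypothetical) replacement table."""
    def transform(tokens: List[Token]) -> List[Token]:
        # Replace any surface form found in the table; keep its tag unchanged.
        return [(replacements.get(word, word), tag) for word, tag in tokens]
    return transform


def move_verb_to_end(tokens: List[Token]) -> List[Token]:
    """Syntactic-level rule: place the single verb in clause-final position."""
    verbs = [tok for tok in tokens if tok[1] == "VERB"]
    if len(verbs) != 1:
        # Leave anything other than a simple one-verb clause untouched.
        return list(tokens)
    rest = [tok for tok in tokens if tok[1] != "VERB"]
    return rest + verbs


def post_edit(tokens: List[Token], transformers: List[Transformer]) -> List[Token]:
    """Apply each transformer in order to the tagged output."""
    for transformer in transformers:
        tokens = transformer(tokens)
    return tokens


# Toy tagged output with one untranslated word and the verb in the wrong place.
tagged = [("man", "PRON"), ("khandam", "VERB"), ("book", "NOUN"), ("ra", "PART")]
pipeline = [make_lexical_transformer({"book": "ketab"}), move_verb_to_end]
edited = post_edit(tagged, pipeline)
print(" ".join(word for word, _ in edited))
```

Ordering the transformers from lexical to syntactic mirrors the range of corrections described above; a real component would need a parser and a far richer rule set than this single reordering rule.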
We use independent manual evaluation as well as standard automatic
techniques to assess the translation output more accurately. This evaluation
shows that the APE component improves translation output
significantly, that is, by at least 25%, resulting in high-quality translation output.
Our system performs well by combining the capabilities of two main MT
approaches – SMT and rule-based MT (RBMT) – in different parts of the system. SMT
provides the main system with consistent, mathematically based translation, and the
Transfer-based algorithm in the APE component applies comprehensive
linguistic rules to correct faulty sentences and fine-tune translation
output. This results in a robust, state-of-the-art system, which noticeably outperforms
other currently available solutions for this language pair.