
    English-Persian phrase-based statistical machine translation : enhanced models, search and training : a thesis presented in fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Engineering at Massey University, Albany (Auckland), New Zealand

    Abstract
    Machine translation (MT), as applied to natural language processing, has undergone substantial development over the past sixty years. Of the several approaches to MT, statistical machine translation (SMT) has attracted increasing interest as the preferred one. Advances in computational power, together with the exploration of new methods and algorithms, have enabled a general improvement in output quality for a number of systems and language pairs using this approach. However, there is a significant lack of research in English/Persian SMT, mainly because of the scarcity of data for this language pair and the shortage of fundamental resources such as large-scale bilingual corpora. Several studies have been published on machine translation involving the Persian language; however, results producing fluent, usable output are rare. This thesis shows how SMT was implemented with this language pair for the first time, and how we created a cutting-edge hybrid SMT system capable of delivering high-quality translation output. We present the development of what is currently the largest English/Persian parallel corpus, constructed by using a web crawler to source usable online data and by concatenating existing parallel corpora. As a further contribution, we propose an improved hybrid corpus alignment method that combines sentence length-based and word correspondence-based models to align words, phrases and sentences in the corpus. We also show the impact that the corpus domain can have on the translation output, and the need to consider the domains of both the bilingual and the monolingual corpora used in the training and language models.
    Two open-source toolkits, Moses and Joshua, were modified to work with the Persian language, and their behaviour and performance were compared to determine which performed better for Persian. We present our work in designing, testing, and implementing a novel, three-level transfer-based automatic post-editing (APE) component based on grammatical rules. The component analyses, parses, and POS-tags the output, and implements functions as transformers that correct the text, from lexical transformation to complex syntactic rearrangement. We show that rule-based approaches to post-editing are superior to the commonly used statistical models, since they incorporate linguistic knowledge and are strong in syntax, morphology, and structural semantics, qualities that are very desirable when performing grammatical correction and syntactic restructuring.
    We implement independent manual evaluation as well as standard automatic techniques to assess the translation output more accurately. This evaluation shows that the APE component improves translation output significantly, that is, by at least 25%, resulting in high-quality translation output. Our system performs well by combining the capabilities of two main MT approaches, SMT and rule-based MT (RBMT), in different areas of the system as a whole. SMT provides the main system with consistent, mathematically based translation, and the transfer-based algorithm in the APE component applies comprehensive linguistic rules to improve incorrect sentences and fine-tune translation output. This results in a robust, state-of-the-art system, which noticeably exceeds other currently available solutions for this language pair.
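The abstract describes the APE component as a set of transformer functions applied to POS-tagged output. As a rough illustration of that general idea only, the sketch below chains two hypothetical transformers over (word, tag) pairs. The tag names, the spelling lexicon, the verb-final rule, and the transliterated example sentence are all assumptions made for this sketch; they are not the rules, levels, or data used in the thesis.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag set here is hypothetical
Transformer = Callable[[List[Token]], List[Token]]


def normalise_spelling(tokens: List[Token]) -> List[Token]:
    """Lexical-level rule: swap known variant spellings for a preferred form."""
    preferred = {"ketaab": "ketab"}  # hypothetical one-entry lexicon
    return [(preferred.get(word, word), tag) for word, tag in tokens]


def move_verb_to_end(tokens: List[Token]) -> List[Token]:
    """Syntactic-level rule: place a lone verb clause-finally (verb-final order).

    Sentences with zero or several verbs are left unchanged, since this toy
    rule does not attempt clause segmentation.
    """
    verb_positions = [i for i, (_, tag) in enumerate(tokens) if tag == "VERB"]
    if len(verb_positions) != 1:
        return tokens
    idx = verb_positions[0]
    verb = tokens[idx]
    rest = tokens[:idx] + tokens[idx + 1:]
    # Keep sentence-final punctuation after the moved verb.
    if rest and rest[-1][1] == "PUNCT":
        return rest[:-1] + [verb, rest[-1]]
    return rest + [verb]


def post_edit(tokens: List[Token], transformers: List[Transformer]) -> List[Token]:
    """Apply each transformer in order to the tagged sentence."""
    for transform in transformers:
        tokens = transform(tokens)
    return tokens


# Transliterated toy input with the verb in the wrong position.
raw_output = [
    ("man", "PRON"), ("khandam", "VERB"), ("ketaab", "NOUN"),
    ("ra", "PART"), (".", "PUNCT"),
]
edited = post_edit(raw_output, [normalise_spelling, move_verb_to_end])
print(" ".join(word for word, _ in edited))
```

Ordering the transformers from lexical to syntactic mirrors the progression the abstract mentions; a real component would operate on parser output rather than a flat tag sequence.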
    Date
    2012
    Author
    Mohaghegh, Mahsa
    Rights
    The Author
    Publisher
    Massey University
    URI
    http://hdl.handle.net/10179/4703
    Collections
    • Theses and Dissertations

    Copyright © Massey University