Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

  • Item
    English-Persian phrase-based statistical machine translation : enhanced models, search and training : a thesis presented in fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Engineering at Massey University, Albany (Auckland), New Zealand
    (Massey University, 2012) Mohaghegh, Mahsa
    Machine translation (MT), as applied to natural language processing, has undergone substantial development over the past sixty years. While there are a number of different approaches to MT, there has been increasing interest in statistical machine translation (SMT) as the preferred approach to MT. Advances in computational power, together with the exploration of new methods and algorithms have enabled a general improvement in the output quality in a number of systems for various language pairs using this approach. However, there is a significant lack of research work in the area of English/Persian SMT, mainly due to the scarcity of data for this language pair, and the shortage of fundamental resources such as large-scale bilingual corpora. Several research studies have been published on work in the area of machine translation involving the Persian language; however, results producing fluent, usable output are rare. This thesis shows how SMT was implemented with this language pair for the first time, and how we created a cutting-edge hybrid SMT system capable of delivering high-quality translation output. We present the development of what is currently the largest English/Persian parallel corpus, constructed using a web crawler to source usable online data, together with the concatenation of existing parallel corpora. As yet another contribution of the research, we propose an improved hybrid corpus alignment method involving sentence length-based and word correspondence-based models to align words, phrases and sentences in the corpus. We also show the impact that the corpus domain can have on the translation output, and the necessity to consider domains of both bilingual and monolingual corpora where they are included in the training and language models. 
    Two open-source toolkits, Moses and Joshua, were modified to work with the Persian language, and their behaviour and performance were compared to determine which performed better for this language. We present our work in designing, testing, and implementing a novel, three-level Transfer-based automatic post-editing (APE) component based on grammatical rules, which operates by analysing, parsing, and POS-tagging the output, and implements functions as transformers which perform corrections to the text, from lexical transformation to complex syntactical rearrangement. We show that rule-based approaches to the task of post-editing are superior to the commonly used statistical models, since they incorporate linguistic knowledge, and are strong in terms of syntax, morphology, and structural semantics – qualities which are very desirable when performing grammatical correction and syntactical restructuring. We implement independent manual evaluation as well as standard automatic techniques, in order to assess the translation output more accurately. This evaluation shows that the use of the APE component is able to improve translation output significantly, that is, by at least 25%, resulting in high-quality translation output. Our system performs well by using a combination of the capabilities of two main MT approaches – SMT and RBMT – in different areas of the system as a whole. SMT provides the main system with consistent, mathematically based translation, and the Transfer-based algorithm in the APE component operates with comprehensive linguistic rules in order to improve incorrect sentences, and fine-tune translation output. This results in a robust, state-of-the-art system, which noticeably exceeds other currently available solutions for this language pair.
  • Item
    Home and away : blogging emotions in a Persian virtual dowreh : a thesis presented in fulfilment of the requirements for the degree of Doctor of Philosophy in Linguistics and Second Language Teaching at Massey University
    (Massey University, 2011) Zare, Samad
    This study explores the creation of a virtual dowreh (family/social circle) via Persian language weblogs among a group of Iranian migrants in Australia. The motivation and inspiration for this study arose from my own experience as a migrant. I became interested in looking at how the new generation of Iranian migrants use weblogs to form digital diasporas and why they publish their emotional experiences online, thereby adding to the understanding of a relatively under-researched community. The study draws upon a sociocultural approach in order to bring to light the role of weblogs in the context of the most recent Iranian migration and the way Iranian migrants use them to replace dowrehs disrupted by the migration experience where they could perform cultural identities and express and share their emotions. Applying a grounded theory approach and discourse analysis to blog posts, the study investigates the expression of emotional challenges, expectations, and cultural performances of a group of Persian diasporic bloggers. The exploration of a diasporic virtual dowreh produced several interesting results. The findings suggest the possibility of online community formation via weblogs where Iranians could meet and perform cultural identities which are not available to them in the host society. Two characteristics that marked the virtual dowreh were the type of Persian language used and the interaction between the bloggers and their audience. The analysis demonstrated that interactions between the bloggers and their audience via commenting functions were noticeably governed by Iranian notions of politeness and other Persian rules of decorum and cultural practices. The analysis also illustrated that the language used in the virtual dowreh was a combination of written and spoken Persian, Internet jargon, weblog terms, and concepts from the host society.
    Furthermore, the exploration of the emotional challenges of the bloggers revealed that certain emotions such as homesickness and self-conscious emotions were among the major sources of emotion in the diaspora and indexed the bloggers' Iranian diasporic identities online. The study concludes with the importance of weblogs for Iranian migrants in creating virtual dowrehs where they could practise/perform cultural identities and express and thereby share their emotional experience.