Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

Now showing 1 - 5 of 5
  • Item
    Deep learning for low-resource machine translation : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand. EMBARGOED until further notice.
    (Massey University, 2025-09-01) Gao, Yuan
Machine translation, a key task in natural language processing, aims to automatically translate text from one language to another while preserving semantic integrity. This thesis builds upon existing research and introduces three deep-learning methods to enhance translation performance under low-resource conditions: (i) an effective transfer learning framework that leverages knowledge from high-resource language pairs, (ii) a pre-ordering-aware training method that explicitly utilizes contextualized representations of pre-ordered sentences, and (iii) a data augmentation strategy that expands the training data. Firstly, we develop a two-step fine-tuning (TSFT) transfer learning framework for low-resource machine translation. Due to the inherent linguistic divergence between the languages in the parent (high-resource) and child (low-resource) translation tasks, the parent model often serves as a suboptimal initialization point for directly fine-tuning the child model. Our TSFT framework addresses this limitation by incorporating a pre-fine-tuning stage that adapts the parent model to the characteristics of the child source language, improving child model initialization and overall translation quality. Secondly, we propose a training method that enables the model to learn pre-ordering knowledge and encode word-reordering information within the contextualized representations of source sentences. Pre-ordering refers to rearranging source-side words to better align with the target-side word order before translation, which helps mitigate word-order differences between languages. Existing methods typically integrate the information of pre-ordered source sentences at the token level, where each token is assigned a local representation that fails to capture broader contextual dependencies. Moreover, these methods still require pre-ordered sentences during inference, which incurs additional inference cost.
In contrast, our method enables the model to encode the pre-ordering information in the contextualized representations of source sentences. In addition, our method eliminates the need for pre-ordered sentences at inference time while preserving their benefits for translation quality. Thirdly, to address data scarcity in low-resource scenarios, we propose a data augmentation strategy that employs high-quality translation models trained bidirectionally on high-resource language pairs. This strategy generates diverse, high-fidelity pseudo-training data through systematic sentence rephrasing, producing multiple target translations for each source sentence. The increased diversity on the target side enhances the model's robustness, as demonstrated by significant performance improvements on eight low-resource language pairs. Finally, we conduct an empirical study to explore the potential of applying ChatGPT to machine translation. We design a set of translation prompts incorporating various auxiliary information to assist ChatGPT in generating translations. Our findings indicate that, with carefully designed prompts, ChatGPT can achieve results comparable to those of commercial translation systems for high-resource languages. Moreover, this study establishes a foundation for future research, offering insights into prompt engineering strategies for leveraging large language models in machine translation tasks.
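The core of the augmentation strategy, generating several target-side variants per source sentence by pivoting through models trained in both directions, can be sketched as follows. This is purely illustrative: the word-level lexicons below are toy stand-ins for trained forward and backward NMT models, and all names and data are invented.

```python
from itertools import product

def rephrase(sentence, fwd, bwd):
    """Generate target-side paraphrases of `sentence` by pivoting each word
    through the backward (target->source) then forward (source->target)
    lexicon -- a toy stand-in for round-trip translation with NMT models."""
    options = []
    for word in sentence.split():
        alts = {t for s in bwd.get(word, []) for t in fwd.get(s, [])}
        options.append(sorted(alts) if alts else [word])
    return [" ".join(combo) for combo in product(*options)]

def augment(pairs, fwd, bwd, max_variants=2):
    """Expand a parallel corpus: keep each original (src, tgt) pair and add
    up to `max_variants` paraphrased targets for the same source sentence."""
    out = []
    for src, tgt in pairs:
        out.append((src, tgt))
        variants = [v for v in rephrase(tgt, fwd, bwd) if v != tgt]
        out.extend((src, v) for v in variants[:max_variants])
    return out
```

In the actual setting described in the abstract, the backward model would translate each target sentence into the source language and the forward model would re-translate it, with beam search supplying the alternative outputs.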
  • Item
    Cross-lingual learning in low-resource : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Natural and Computational Sciences, Massey University, Auckland, New Zealand
    (Massey University, 2022) Zhao, Jiawei
Current machine translation techniques were developed predominantly on resource-rich language pairs. However, a far broader range of languages is used in practice around the world; for instance, machine translation between Finnish, Chinese and Russian is still not suitable for high-quality communication. This dissertation focuses on building cross-lingual models to address this issue. I aim to analyse the relationships between embeddings of different languages, especially low-resource languages, and investigate four phenomena that can improve the translation of low-resource languages. The first study concentrates on the non-linearity of cross-lingual word embeddings. Current approaches primarily focus on linear mappings between the word embeddings of different languages. However, those approaches do not work as well for some language pairs, particularly when the two languages belong to different language families, e.g. English and Chinese. I hypothesise that linearity, which is often assumed in the geometric relationship between monolingual word embeddings of different languages, may not hold for all language pairs. I focus on investigating the relationship between word embeddings of languages in different language families, and show on multiple datasets that a non-linear mapping better describes the relationship for those language pairs. The second study focuses on unsupervised cross-lingual word embeddings for low-resource languages. The conventional approach to constructing cross-lingual word embeddings requires a large dictionary, which is hard to obtain for low-resource languages. I propose an unsupervised approach to learning cross-lingual word embeddings for low-resource languages: by incorporating kernel canonical correlation analysis, the proposed approach learns high-quality cross-lingual word embeddings without supervision. The third study investigates a dictionary augmentation technique for low-resource languages.
A key challenge for constructing an accurately augmented dictionary is the high variance issue. I propose a semi-supervised method that can bootstrap a small dictionary into a larger high-quality dictionary. The fourth study concentrates on the data insufficiency issue in speech translation. The lack of training data availability for low-resource languages limits the performance of end-to-end speech translation. I investigate the use of knowledge distillation to transfer knowledge from the machine translation task to the speech translation task and propose a new training methodology. The results and analyses presented in this work show that a wide range of techniques can address issues that arise with low-resource languages in the machine translation field. This dissertation provides a deeper insight into understanding the word representations and structures in low-resource translation and should aid future researchers to better utilise their translation models.
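The linear-mapping assumption that the first study questions is usually instantiated as orthogonal Procrustes: given embeddings of seed dictionary pairs, find the orthogonal matrix that best maps one space onto the other. A minimal sketch of that standard baseline, assuming NumPy:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes: the orthogonal W minimising ||XW - Y||_F,
    where rows of X and Y are embeddings of seed dictionary word pairs.
    This is the linear cross-lingual mapping that is argued to break down
    for distant language pairs such as English-Chinese."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

When the true relationship between the two embedding spaces is far from orthogonal, no such W fits well, which is precisely what motivates non-linear alternatives such as the kernel canonical correlation analysis used in the second study.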
  • Item
    Segmentation of continuous sign language : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Engineering, Massey University, Palmerston North, New Zealand
    (Massey University, 2014) Khan, Shujjat
Sign language is a natural language of deaf people comprising hand gestures, facial expressions and body postures. It has all the constituents that are normally attributed to a natural language, such as variations, lexical/semantic processes, coarticulations, regional dialects, and all the linguistic features required for successful communication. However, sign language is an alien language for the vast majority of the hearing community, so there is a large communication barrier between the two sides. To bridge this gap, sign language interpreting services are provided at various public places like courts, hospitals and airports. Apart from these special needs, the digital divide is also growing for deaf people because most existing voice-based technologies and services are of no use to them. Many attempts have been made to develop an automatic sign language interpreter that can understand a sign discourse and translate it into speech, and vice versa. Unfortunately, existing solutions are designed with tight constraints, so they are only suitable for use in a controlled environment (such as a laboratory). These constraints include specialized lighting, a fixed background and many restrictions on signing style, such as slow gestures, exaggerated or artificial pauses between signs, and wearing special gloves. To develop a useful translator, these challenges must be addressed so that it can be installed in any public place. In this research, we have investigated the main challenges of a practical sign language interpreting system and their existing solutions. We have also proposed new solutions (for robust articulator detection, sign segmentation, and the availability of reliable scientific data) and compared them with the existing ones. Our analysis suggests that the major shortcoming of existing solutions is that they are not equipped to address the varying needs of operational environments.
Therefore, we designed the algorithms so that they stay functional in dynamic environments. In our experiments, the proposed articulator segmentation technique and boundary detection method outperformed all the existing static approaches when tested in a practical situation. Through these findings, we do not claim superior performance for our algorithms in terms of quantitative results; rather, system testing in practical settings (offices) shows that our solutions give consistent results in dynamic environments compared with existing solutions. Temporal segmentation of continuous sign language is a new area and is the main contribution of this thesis. Based on the conceptual underpinnings of this field, a novel tool called the DAD signature has been proposed and tested on real sign language data. This segmentation tool has proven useful for sign boundary detection using the segmentation features (pauses, repetitions and directional variations) embedded in a sign stream. The DAD signature deciphers these features and provides reliable word boundaries for sentences recorded in a practical environment. Unlike existing boundary detectors, the DAD approach does not rely on artificial constraints (such as slow signing, an external trigger or exaggerated prosody) that restrict the usability of an interpreting system. This makes DAD viable for practical sign language interpreting solutions. As this dissertation demonstrates, the development of the much-awaited practical sign language interpreter is now achievable. We have established that, by making use of our proposed techniques, the strict design constraints of existing interpreters can be relaxed without affecting overall system performance in a public place. In a nutshell, our research is a step towards turning the idea of a practical automatic interpreter into reality.
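The DAD signature itself is specific to this thesis, but one of the boundary cues it exploits, pauses in the motion stream, can be illustrated with a generic pause detector over per-frame articulator speeds. The threshold, minimum run length, and the speed signal below are all invented for illustration:

```python
def pause_boundaries(speeds, thresh=0.2, min_len=3):
    """Return (start, end) frame ranges where motion speed stays below
    `thresh` for at least `min_len` consecutive frames -- candidate sign
    boundaries in a continuous signing stream. A toy stand-in for
    signature-based segmentation; a real system tracks articulators first
    and also uses repetitions and directional variations as cues."""
    runs, start = [], None
    for i, s in enumerate(speeds):
        if s < thresh:
            if start is None:
                start = i       # a low-motion run begins
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(speeds) - start >= min_len:
        runs.append((start, len(speeds)))   # run extends to the last frame
    return runs
```

Requiring a minimum run length is what distinguishes a genuine inter-sign pause from a momentary slowdown inside a sign, which is why naive thresholding alone fails in unconstrained signing.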
  • Item
    English-Persian phrase-based statistical machine translation : enhanced models, search and training : a thesis presented in fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Engineering at Massey University, Albany (Auckland), New Zealand
    (Massey University, 2012) Mohaghegh, Mahsa
    Machine translation (MT), as applied to natural language processing, has undergone substantial development over the past sixty years. While there are a number of different approaches to MT, there has been increasing interest in statistical machine translation (SMT) as the preferred approach to MT. Advances in computational power, together with the exploration of new methods and algorithms have enabled a general improvement in the output quality in a number of systems for various language pairs using this approach. However, there is a significant lack of research work in the area of English/Persian SMT, mainly due to the scarcity of data for this language pair, and the shortage of fundamental resources such as large-scale bilingual corpora. Several research studies have been published on work in the area of machine translation involving the Persian language; however, results producing fluent, usable output are rare. This thesis shows how SMT was implemented with this language pair for the first time, and how we created a cutting-edge hybrid SMT system capable of delivering high-quality translation output. We present the development of what is currently the largest English/Persian parallel corpus, constructed using a web crawler to source usable online data, together with the concatenation of existing parallel corpora. As yet another contribution of the research, we propose an improved hybrid corpus alignment method involving sentence length-based and word correspondence-based models to align words, phrases and sentences in the corpus. We also show the impact that the corpus domain can have on the translation output, and the necessity to consider domains of both bilingual and monolingual corpora where they are included in the training and language models. 
Two open-source toolkits, Moses and Joshua, were modified to work with the Persian language, and their behaviour and performance results were compared to determine which performed better when implemented with the Persian language. We present our work in designing, testing, and implementing a novel, three-level Transfer-based automatic post-editing (APE) component based on grammatical rules, which operates by analysing, parsing, and POS-tagging the output, and implements functions as transformers which perform corrections to the text, from lexical transformation to complex syntactical rearrangement. We show that rule-based approaches to the task of post-editing are superior to the commonly-used statistical models, since they incorporate linguistic knowledge, and are strong in terms of syntax, morphology, and structural semantics – qualities which are very desirable when performing grammatical correction and syntactical restructuring. We implement independent manual evaluation as well as standard automatic techniques, in order to assess more accurately the translation output. This evaluation shows that the use of the APE component is able to improve translation output significantly, that is, by at least 25%, resulting in high-quality translation output. Our system performs well by using a combination of the capabilities of two main MT approaches – SMT and RBMT – in different areas of the system as a whole. SMT provides the main system with consistent, mathematical-based translation, and the Transfer-based algorithm in the APE component operates with comprehensive linguistic rules in order to improve incorrect sentences, and fine-tune translation output. This results in a robust, state-of-the-art system, which noticeably exceeds other currently available solutions for this language pair.
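The lexical level of a rule-based APE component of this kind can be illustrated as an ordered cascade of correction rules applied to raw MT output. The rules below are invented English examples; the system described in the abstract additionally parses and POS-tags the Persian output to drive syntactic restructuring:

```python
import re

# Ordered (pattern, replacement) rules: each rule sees the output of the
# previous one, so rule order matters -- e.g. punctuation spacing is fixed
# before duplicate spaces are collapsed.
RULES = [
    (re.compile(r"\ba\s+([aeiou])"), r"an \1"),   # article agreement
    (re.compile(r"\s+([,.;:])"), r"\1"),          # no space before punctuation
    (re.compile(r"\s{2,}"), " "),                 # collapse repeated spaces
]

def post_edit(sentence, rules=RULES):
    """Apply each correction rule in order to the MT output sentence."""
    for pattern, repl in rules:
        sentence = pattern.sub(repl, sentence)
    return sentence.strip()
```

Because each rule encodes an explicit grammatical fact, the cascade is transparent and easy to extend, which is the advantage the thesis claims for rule-based post-editing over statistical APE models.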
  • Item
    Fluency enhancement : applications to machine translation : thesis for Master of Engineering in Information & Telecommunications Engineering, Massey University, Palmerston North, New Zealand
    (Massey University, 2009) Manion, Steve Lawrence
The quality of Machine Translation (MT) can often be poor because the output appears incoherent and lacks fluency. These problems include poor word ordering, awkward use of words and grammar, and overly literal translation. However, we should not consider such translations failures until we have done our best to enhance their quality, or more simply, their fluency. In the same way that various processes can be applied to touch up a photograph, various processes can also be applied to touch up a translation. This research outlines the improvement of MT quality through the application of Fluency Enhancement (FE), a process we have created that reforms and evaluates text to enhance its fluency. We have tested our FE process on our own MT system, which operates on what we call the SAM fundamentals: Simplicity - being simple in design in order to be portable across different language pairs; Adaptability - compensating for the evolution of language; and Multiplicity - determining a final set of translations from as many candidate translations as possible. Based on our research, the SAM fundamentals are the key to developing a successful MT system, and are what have piloted the success of our FE process.
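The Multiplicity fundamental, choosing a final output from many candidate translations, is commonly realised by scoring each candidate with a target-side language model and keeping the most fluent one. A toy sketch with an add-alpha-smoothed bigram model; the counts, sentences, and smoothing choice are invented and stand in for whatever fluency evaluation the FE process actually uses:

```python
from math import log

def fluency_score(sentence, bigrams, unigrams, alpha=1.0):
    """Smoothed bigram log-probability of `sentence` as a crude fluency
    measure. `bigrams` maps (prev, word) pairs and `unigrams` maps words
    to counts gathered from a monolingual target-language corpus."""
    words = ["<s>"] + sentence.split()
    vocab = len(unigrams) + 1          # +1 for unseen words
    score = 0.0
    for prev, word in zip(words, words[1:]):
        num = bigrams.get((prev, word), 0) + alpha
        den = unigrams.get(prev, 0) + alpha * vocab
        score += log(num / den)
    return score

def most_fluent(candidates, bigrams, unigrams):
    """Pick the candidate translation the language model finds most fluent."""
    return max(candidates, key=lambda c: fluency_score(c, bigrams, unigrams))
```

Scoring many candidates and keeping the best one rewards natural word order without needing any reference translation, which is what makes this kind of reranking usable at translation time.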