Massey Documents by Type
Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294
Search Results
2 results
Item: Deep learning for low-resource machine translation : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical and Computational Sciences, Massey University, Albany, Auckland, New Zealand. EMBARGOED until further notice. (Massey University, 2025-09-01) Gao, Yuan

Machine translation, a key task in natural language processing, aims to automatically translate text from one language to another while preserving semantic integrity. This thesis builds upon existing research and introduces three deep-learning methods to enhance translation performance under low-resource conditions: (i) an effective transfer learning framework that leverages knowledge from high-resource language pairs, (ii) a pre-ordering-aware training method that explicitly utilizes contextualized representations of pre-ordered sentences, and (iii) a data augmentation strategy that expands the size of the training data.

Firstly, we develop a two-step fine-tuning (TSFT) transfer learning framework for low-resource machine translation. Owing to the inherent linguistic divergence between the languages in the parent (high-resource) and child (low-resource) translation tasks, the parent model often serves as a suboptimal initialization point for directly fine-tuning the child model. Our TSFT framework addresses this limitation by incorporating a pre-fine-tuning stage that adapts the parent model to the characteristics of the child source language, improving child-model initialization and overall translation quality.
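To make the two-step structure concrete, here is a minimal, self-contained PyTorch sketch of the TSFT control flow. Everything in it is illustrative: the toy model, random token batches, and step counts are stand-ins for the thesis's actual architecture, corpora, and hyperparameters, none of which are specified in the abstract.

```python
# Illustrative sketch only: a toy encoder-decoder and random token batches
# stand in for a real parent NMT model and real training corpora.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, PAD = 100, 32, 0

class TinySeq2Seq(nn.Module):
    """Toy Transformer encoder-decoder standing in for an NMT model."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM, padding_idx=PAD)
        self.core = nn.Transformer(d_model=DIM, nhead=4,
                                   num_encoder_layers=1, num_decoder_layers=1,
                                   batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, src, tgt):
        return self.out(self.core(self.emb(src), self.emb(tgt)))

def random_batches(n=4, bsz=8, length=10):
    """Random (source, target-in, target-out) batches; real data in practice."""
    return [tuple(torch.randint(1, VOCAB, (bsz, length)) for _ in range(3))
            for _ in range(n)]

def train(model, batches, steps, lr=1e-3):
    """Plain teacher-forced training loop with token-level cross-entropy."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)
    model.train()
    for step in range(steps):
        src, tgt_in, tgt_out = batches[step % len(batches)]
        loss = loss_fn(model(src, tgt_in).flatten(0, 1), tgt_out.flatten())
        opt.zero_grad(); loss.backward(); opt.step()

# The parent model is assumed to be already trained on a high-resource pair.
parent = TinySeq2Seq()

# Step 1 (pre-fine-tuning): adapt the parent to the child *source language*
# so its encoder suits the new source side before seeing the child pair.
train(parent, random_batches(), steps=20)

# Step 2: fine-tune the adapted model on the child low-resource pair itself.
train(parent, random_batches(), steps=20)
```

The point of the sketch is the ordering: an adaptation pass on the child source side precedes fine-tuning on the child parallel data, rather than fine-tuning the parent model on the child pair directly.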
Secondly, we propose a training method that enables the model to learn pre-ordering knowledge and to encode word-reordering information within the contextualized representations of source sentences. Pre-ordering rearranges source-side words to better match the target-side word order before translation, which helps mitigate word-order differences between languages. Existing methods typically integrate the information from pre-ordered source sentences at the token level, where each token is assigned a local representation that fails to capture broader contextual dependencies. Moreover, these methods still require pre-ordered sentences during inference, incurring additional inference cost. In contrast, our method encodes the pre-ordering information in the contextualized representations of source sentences and eliminates the need for pre-ordered sentences at inference time while preserving their benefits for translation quality.

Thirdly, to address data scarcity in low-resource scenarios, we propose a data augmentation strategy that employs high-quality translation models trained bidirectionally on high-resource language pairs. The strategy generates diverse, high-fidelity pseudo-training data through systematic sentence rephrasing, producing multiple target translations for each source sentence. The increased diversity on the target side enhances the model's robustness, as demonstrated by significant performance improvements on eight low-resource language pairs.

Finally, we conduct an empirical study exploring the potential of ChatGPT for machine translation. We design a set of translation prompts that incorporate various auxiliary information to help ChatGPT generate translations. Our findings indicate that, with carefully designed prompts, ChatGPT can achieve results comparable to those of commercial translation systems for high-resource languages. Moreover, this study establishes a foundation for future research, offering insights into prompt-engineering strategies for leveraging large language models in machine translation tasks.

Item: Cross-lingual learning in low-resource : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Natural and Computational Sciences, Massey University, Auckland, New Zealand (Massey University, 2022) Zhao, Jiawei

Current machine translation techniques were developed using predominantly resource-rich language pairs, yet a much broader range of languages is used in practice around the world. For instance, machine translation between Finnish, Chinese and Russian is still not suitable for high-quality communication. This dissertation focuses on building cross-lingual models to address this issue. I aim to analyse the relationships between embeddings of different languages, especially low-resource languages, and I investigate four approaches that can improve the translation of low-resource languages.

The first study concentrates on the non-linearity of cross-lingual word embeddings. Current approaches primarily rely on a linear mapping between the word embeddings of different languages, but such mappings do not work as well for some language pairs, especially when the two languages belong to different language families, e.g. English and Chinese. I hypothesise that the linearity often assumed in the geometric relationship between monolingual word embeddings of different languages may not hold for all language pairs. Focusing on language pairs from different families, I show on multiple datasets that a non-linear mapping better describes the relationship.

The second study focuses on unsupervised cross-lingual word embeddings for low-resource languages. The conventional approach to constructing cross-lingual word embeddings requires a large dictionary, which is hard to obtain for low-resource languages. I propose an unsupervised approach to learning cross-lingual word embeddings for low-resource languages: by incorporating kernel canonical correlation analysis, it can learn high-quality cross-lingual word embeddings in an unsupervised scenario.
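For context on the linear baseline the first study questions, the following NumPy sketch shows the standard orthogonal Procrustes alignment of two monolingual embedding spaces: given a seed dictionary, the best orthogonal map has a closed-form SVD solution. The random matrices are stand-ins for real embeddings; the thesis itself argues that a non-linear (e.g. kernelised) mapping fits distant language pairs better.

```python
# Orthogonal Procrustes alignment: the standard linear-mapping baseline.
# Random matrices stand in for real monolingual word embeddings.
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                                      # embedding dim, seed-dictionary size
X = rng.normal(size=(n, d))                         # source-language vectors for seed pairs
W_true = np.linalg.qr(rng.normal(size=(d, d)))[0]   # hidden orthogonal "true" map
Y = X @ W_true + 0.01 * rng.normal(size=(n, d))     # noisy target-language vectors

# W* = argmin ||XW - Y||_F over orthogonal W, solved from the SVD of X^T Y:
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print("relative alignment error:", np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))
```

When the underlying relationship really is a rotation plus small noise, as in this synthetic setup, the recovered error is tiny; the thesis's observation is that for typologically distant pairs the real relationship deviates from this linear assumption.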
The third study investigates a dictionary augmentation technique for low-resource languages. A key challenge in constructing an accurate augmented dictionary is high variance. I propose a semi-supervised method that bootstraps a small dictionary into a larger, high-quality one.

The fourth study concentrates on the data-insufficiency issue in speech translation: the lack of training data for low-resource languages limits the performance of end-to-end speech translation. I investigate the use of knowledge distillation to transfer knowledge from the machine translation task to the speech translation task and propose a new training methodology (a generic sketch of such a distillation loss appears below).

The results and analyses presented in this work show that a wide range of techniques can address the issues that arise with low-resource languages in machine translation. This dissertation provides a deeper insight into word representations and structures in low-resource translation and should help future researchers better utilise their translation models.
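As a generic illustration of the distillation idea in the fourth study, and not the thesis's exact method, here is a small PyTorch sketch of a token-level distillation loss that mixes gold-label cross-entropy with a temperature-scaled KL term pulling a speech-translation student toward an MT teacher. The temperature, mixing weight, and all tensors are assumed placeholders.

```python
# Generic token-level knowledge distillation loss (illustrative assumptions:
# temperature T, mixing weight alpha, and random logits/labels as stand-ins).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold, T=2.0, alpha=0.5):
    """Mix cross-entropy on gold tokens with a KL term toward the MT teacher."""
    ce = F.cross_entropy(student_logits, gold)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)   # rescale softened gradients
    return alpha * ce + (1 - alpha) * kl

# Toy usage with random tensors standing in for real model outputs.
s = torch.randn(8, 100)            # student (speech translation) logits
t = torch.randn(8, 100)            # teacher (machine translation) logits
g = torch.randint(0, 100, (8,))    # gold target tokens
print(distillation_loss(s, t, g).item())
```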
