End-to-end automatic speech recognition for low-resource languages : a thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in Computer Science at the School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand. EMBARGOED until 1 August 2025.

Automatic speech recognition (ASR) for low-resource languages presents numerous challenges due to the lack of crucial linguistic resources, including annotated speech corpora, lexicons, and raw language text. In this thesis, we propose several approaches to improve fundamental frequency estimation and speech recognition for low-resource languages.

Firstly, we propose DeepF0, a new deep learning technique for fundamental frequency (F0) estimation. Existing models have limited learning capacity because of their narrow receptive fields; DeepF0 widens the receptive field by using dilated convolutional blocks. Additionally, we improve training efficiency and speed by incorporating residual blocks with residual connections. DeepF0 achieves state-of-the-art results while using 77.4% fewer network parameters.

Secondly, we introduce a new meta-learning framework for low-resource speech recognition that improves on the model-agnostic meta-learning (MAML) approach. Our framework addresses MAML's training instabilities and slow convergence by using a multi-step loss (MSL). MSL computes a loss at each step of MAML's inner loop and combines these losses using a weighted importance vector that prioritizes the loss at the last step.

Thirdly, we propose an end-to-end ASR approach for low-resource languages that exploits synthesized datasets alongside real speech datasets. We evaluate this approach on Punjabi, a low-resource language spoken by millions of speakers across the globe that nevertheless lacks annotated speech datasets. Our empirical results show that our synthesized datasets (Google-synth and CMU-synth) significantly improve the accuracy of our ASR model.

Lastly, we introduce a self-training approach, also known as pseudo-labeling, to enhance the performance of low-resource speech recognition. While most self-training research has centered on high-resource languages such as English, our work focuses on the low-resource Punjabi language. To filter out low-quality pseudo-labels, we employ a length-normalized confidence score. Overall, our experimental evaluation validates the efficacy of the proposed approaches and shows that they outperform existing baseline approaches for F0 estimation and low-resource speech recognition.
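The multi-step loss described above can be sketched in a few lines. This is a minimal illustration under our own assumptions: the weighting schedule (a fixed share for the final step, the remainder split evenly over earlier steps) and the names `importance_vector` and `multi_step_loss` are hypothetical, not taken from the thesis.

```python
def importance_vector(num_inner_steps, last_step_weight=0.5):
    """Per-step weights summing to 1; the final inner-loop step
    receives `last_step_weight`, earlier steps share the rest equally.
    (Illustrative schedule, not the thesis's exact weighting.)"""
    if num_inner_steps == 1:
        return [1.0]
    early = (1.0 - last_step_weight) / (num_inner_steps - 1)
    return [early] * (num_inner_steps - 1) + [last_step_weight]

def multi_step_loss(per_step_losses, last_step_weight=0.5):
    """Combine the query-set losses computed after each inner-loop
    update into a single weighted meta-objective."""
    weights = importance_vector(len(per_step_losses), last_step_weight)
    return sum(w * l for w, l in zip(weights, per_step_losses))

# Example: five inner steps; the final step's loss dominates the sum.
losses = [2.0, 1.5, 1.1, 0.9, 0.8]
print(multi_step_loss(losses))  # 0.125*(2.0+1.5+1.1+0.9) + 0.5*0.8 = 1.0875
```

Because every inner step contributes to the meta-gradient, the outer loop receives a training signal even when early adaptation steps are unstable, which is the intuition behind MSL's smoother convergence.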
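The length-normalized confidence filter can likewise be sketched as follows. The threshold value, the toy hypotheses, and the helper names are illustrative assumptions; the core idea, averaging per-token log-probabilities so long transcripts are not penalized merely for their length, is what the abstract describes.

```python
def length_normalized_confidence(token_log_probs):
    """Average per-token log-probability of an ASR hypothesis.
    Dividing by length removes the bias against longer transcripts."""
    return sum(token_log_probs) / len(token_log_probs)

def filter_pseudo_labels(hypotheses, threshold=-0.5):
    """Keep transcripts whose normalized confidence clears `threshold`.
    Each hypothesis is a (transcript, token_log_probs) pair.
    (Threshold is a placeholder, not a value from the thesis.)"""
    return [text for text, log_probs in hypotheses
            if length_normalized_confidence(log_probs) >= threshold]

# Toy example: a confident hypothesis survives, a noisy one is dropped.
hyps = [
    ("sat sri akal", [-0.1, -0.2, -0.15]),  # avg = -0.15, kept
    ("unclear guess", [-1.2, -2.0]),        # avg = -1.60, dropped
]
print(filter_pseudo_labels(hyps))  # ['sat sri akal']
```

In a self-training loop, the surviving transcripts would be added to the training set as pseudo-labels and the model retrained on the union of real and pseudo-labeled data.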
Automatic speech recognition, Deep learning (Machine learning), Panjabi language