End-to-end automatic speech recognition for low-resource languages : a thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in Computer Science at the School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand

dc.confidentialEmbargo : yes (en_US)
dc.contributor.advisor: Wang, Ruili
dc.contributor.author: Satwinder Singh
dc.date.accessioned: 2023-08-07T02:42:48Z
dc.date.accessioned: 2023-08-27T23:15:15Z
dc.date.available: 2023-08-07T02:42:48Z
dc.date.available: 2023-08-27T23:15:15Z
dc.date.issued: 2023
dc.description.abstract: Automatic speech recognition (ASR) for low-resource languages presents numerous challenges because crucial linguistic resources are lacking, including annotated speech corpora, lexicons, and raw language text. In this thesis, we propose several approaches to improve fundamental frequency estimation and speech recognition for low-resource languages. Firstly, we propose DeepF0, a new deep learning technique for fundamental frequency (F0) estimation. Existing models have limited learning capabilities because they use a shallow receptive field. DeepF0 extends the receptive field by using dilated convolutional blocks. Additionally, we enhance training efficiency and speed by incorporating residual blocks with residual connections. We achieve state-of-the-art results with DeepF0 while using 77.4% fewer network parameters. Secondly, we introduce a new meta-learning framework for low-resource speech recognition that improves on the previous model-agnostic meta-learning (MAML) approach. Our framework addresses issues of MAML, such as training instabilities and slower convergence, by using a multi-step loss (MSL). MSL calculates the loss at each step of MAML's inner loop and combines these losses using a weighted importance vector, which prioritizes the loss at the last step. Thirdly, we propose an end-to-end ASR approach for low-resource languages that exploits synthesized datasets along with real speech datasets. We evaluate our approach on the low-resource Punjabi language, which is spoken by millions of people across the globe but still lacks annotated speech datasets. Our empirical results show that our synthesized datasets (Google-synth and CMU-synth) can significantly improve the accuracy of our ASR model. Lastly, we introduce a self-training approach, also known as the pseudo-labeling approach, to enhance the performance of low-resource speech recognition.
While most self-training research has centered on high-resource languages such as English, our work focuses on the low-resource Punjabi language. To weed out low-quality pseudo-labels, we employ a length-normalized confidence score. Overall, our experimental evaluation validates the efficacy of our proposed approaches and shows that they outperform existing baseline approaches for F0 estimation and low-resource speech recognition. (en_US)
dc.identifier.uri: http://hdl.handle.net/10179/19790
dc.publisher: Massey University (en_US)
dc.rights: © The Author (en_US)
dc.subject: Automatic speech recognition (en)
dc.subject: Deep learning (Machine learning) (en)
dc.subject: Panjabi language (en)
dc.subject.anzsrc: 460212 Speech recognition (en)
dc.title: End-to-end automatic speech recognition for low-resource languages : a thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in Computer Science at the School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand (en_US)
dc.type: Thesis (en_US)
massey.contributor.author: Satwinder Singh (en_US)
thesis.degree.discipline: Computer Science (en_US)
thesis.degree.grantor: Massey University (en_US)
thesis.degree.level: Doctoral (en_US)
thesis.degree.name: Doctor of Philosophy (PhD) (en_US)
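
Illustrative note: the abstract names two mechanisms that are easy to state in code, the multi-step loss (a weighted combination of the losses from each MAML inner-loop step, with the last step weighted most) and the length-normalized confidence score used to filter pseudo-labels. The sketch below is only a minimal illustration under stated assumptions and is not taken from the thesis: the function names, the particular weighting scheme in `last_step_weights`, the use of mean token log-probability as the confidence score, and the threshold value are all hypothetical.

```python
from typing import List, Sequence, Tuple


def last_step_weights(num_steps: int, last_weight: float = 0.6) -> List[float]:
    """Hypothetical importance vector: `last_weight` goes to the final
    inner-loop step and the remainder is shared evenly by earlier steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [1.0]
    rest = (1.0 - last_weight) / (num_steps - 1)
    return [rest] * (num_steps - 1) + [last_weight]


def multi_step_loss(step_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the losses recorded at each inner-loop step."""
    if len(step_losses) != len(weights):
        raise ValueError("need exactly one weight per inner-loop step")
    return sum(w * loss for w, loss in zip(weights, step_losses))


def length_normalized_confidence(token_log_probs: Sequence[float]) -> float:
    """Assumed definition: mean log-probability per output token, so that
    long hypotheses are not penalized simply for being long."""
    if not token_log_probs:
        raise ValueError("hypothesis has no tokens")
    return sum(token_log_probs) / len(token_log_probs)


def filter_pseudo_labels(
    hypotheses: Sequence[Tuple[str, Sequence[float]]], threshold: float
) -> List[str]:
    """Keep only transcripts whose normalized confidence meets the threshold."""
    return [
        text
        for text, log_probs in hypotheses
        if length_normalized_confidence(log_probs) >= threshold
    ]
```

For example, `multi_step_loss(losses, last_step_weights(len(losses)))` combines per-step losses into one training signal, and `filter_pseudo_labels(hypotheses, threshold=-0.5)` would discard hypotheses whose average token log-probability falls below -0.5. In a real training setup the per-step losses would be differentiable tensors rather than floats.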

Files

Original bundle

Name: SatwinderSinghPhDThesis.pdf
Size: 1.24 MB
Format: Adobe Portable Document Format