Massey Documents by Type
Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294
Search Results
5 results
Item
End-to-end speech synthesis for Chinese-English code-switching scenario : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Sciences at Massey University, Auckland, New Zealand
(Massey University, 2022) Zhang, Qingci

Text-To-Speech (TTS), namely speech synthesis, enables machines to convert textual information into corresponding audio by producing human-like voices. TTS has been widely used in various monolingual speech synthesis tasks such as broadcasting systems and audiobooks. However, it is still a challenge for machines to process multilingual input and output sequences. Challenges arise from the scarcity of code-switching speech data, the mapping problem between mixed languages, and the linguistic complexity of Chinese, such as polyphones and tone sandhi in text frontend processing. In this thesis, we propose an end-to-end speech synthesis system based on a traditional monolingual Tacotron model to synthesize speech for Chinese-English code-switching sentences. Firstly, we pre-process the speech data for low-frequency noise removal, frequency smoothness, and volume consistency by applying a high-pass filter and normalizing the speech volume. Secondly, we apply g2pm and python-pinyin as our grapheme-to-phoneme (G2P) tools and merge them into our mixed Chinese-English code-switching frontend processing. We address the failure of current monolingual-only speech synthesis markup language (SSML) processing to speak and process the switched language, extending it to handle mixed Chinese-English code-switching SSML input. We also further extend the rules for polyphones and tone sandhi in the Chinese parts of code-switching sentences. Thirdly, we improve the attention module of the current Tacotron model to avoid a possible posterior collapse issue: instead of transferring only the last frame, which loses context information, we transfer all intermediate frames to the next processing step to preserve the contextual correlation of adjacent frames. Fourthly, we accelerate training by adding a six-layer unidirectional sequence-to-sequence gated recurrent unit that predicts more non-overlapping multi-frame outputs at each decoder step. On our test data, the best scores reach 3.163 PESQ raw MOS and 3.065 MOS-LQO, with average scores of 2.672 PESQ raw MOS and 2.520 MOS-LQO.

Item
End-to-end automatic speech recognition for low-resource languages : a thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in Computer Science at the School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand
(Massey University, 2023) Satwinder Singh

Automatic speech recognition (ASR) for low-resource languages presents numerous challenges due to the lack of crucial linguistic resources, including annotated speech corpora, lexicons, and raw language text. In this thesis, we propose different approaches to improve fundamental frequency estimation and speech recognition for low-resource languages. Firstly, we propose DeepF0, a new deep learning technique for fundamental frequency (F0) estimation. Existing models have limited learning capability due to their shallow receptive fields; DeepF0 extends the receptive field by using dilated convolutional blocks.
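As a rough illustration of how stacked dilated convolutions widen a model's receptive field, here is a minimal PyTorch sketch; the channel count, kernel size, and doubling dilation schedule are illustrative assumptions, not DeepF0's published configuration.

```python
# Illustrative sketch only: a stack of dilated 1-D convolutions whose
# dilation rates double per layer, so the receptive field grows
# exponentially with depth while the parameter count grows linearly.
import torch
import torch.nn as nn

class DilatedConvStack(nn.Module):
    def __init__(self, channels=64, kernel_size=3, num_layers=6):
        super().__init__()
        layers = []
        for i in range(num_layers):
            dilation = 2 ** i  # 1, 2, 4, 8, ...
            layers.append(nn.Conv1d(
                channels, channels, kernel_size,
                dilation=dilation,
                padding=(kernel_size - 1) // 2 * dilation,  # keep length
            ))
            layers.append(nn.ReLU())
        self.stack = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, channels, time)
        return self.stack(x)

# Receptive field: 1 + (kernel_size - 1) * sum(dilations)
# = 1 + 2 * (1 + 2 + 4 + 8 + 16 + 32) = 127 frames for the defaults above.
```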
Additionally, we enhance training efficiency and speed by incorporating residual blocks with residual connections. DeepF0 achieves state-of-the-art results even while using 77.4% fewer network parameters. Secondly, we introduce a new meta-learning framework for low-resource speech recognition that improves on the previous model-agnostic meta-learning (MAML) approach. Our framework addresses issues of MAML, such as training instabilities and slow convergence, by using a multi-step loss (MSL). MSL calculates a loss at each step of MAML's inner loop and combines them using a weighted importance vector that prioritizes the loss at the last step. Thirdly, we propose an end-to-end ASR approach for low-resource languages that exploits synthesized datasets alongside real speech datasets. We evaluate our approach on the low-resource Punjabi language, which is spoken by millions of speakers across the globe but still lacks annotated speech datasets. Our empirical results show that our synthesized datasets (Google-synth and CMU-synth) can significantly improve the accuracy of our ASR model. Lastly, we introduce a self-training approach, also known as pseudo-labeling, to enhance the performance of low-resource speech recognition. While most self-training research has centered on high-resource languages such as English, our work focuses on the low-resource Punjabi language. To filter out low-quality pseudo-labels, we employ a length-normalized confidence score. Overall, our experimental evaluation validates the efficacy of the proposed approaches and shows that they outperform existing baseline approaches for F0 estimation and low-resource speech recognition.

Item
Speech processing with deep learning for voice-based respiratory diagnosis : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
(Massey University, 2022) Ma, Zhizhong

Voice-based respiratory diagnosis research aims to automatically screen and diagnose respiratory-related symptoms (e.g., smoking status, COVID-19 infection) from human-generated sounds (e.g., breath, cough, speech). It has the potential to serve as an objective, simple, reliable, and less time-consuming alternative to traditional biomedical diagnosis methods. In this thesis, we conduct a comprehensive literature review and propose three novel deep learning methods to enrich voice-based respiratory diagnosis research and improve its performance. Firstly, we conduct a comprehensive investigation of the effects of voice features on the detection of smoking status. Secondly, we propose a novel method that combines high-level and low-level acoustic features with deep neural networks for smoking status identification. Thirdly, we investigate various feature extraction/representation methods and propose a SincNet-based CNN method for feature representation to further improve the performance of smoking status identification. To the best of our knowledge, this is the first systematic study that applies speech processing with deep learning to voice-based smoking status identification. Moreover, we propose a novel transfer learning scheme and a task-driven feature representation method for diagnosing respiratory diseases (e.g., COVID-19) from human-generated sounds.
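A SincNet-style front end constrains each first-layer convolution kernel to be a windowed sinc band-pass filter, so only two cut-off frequencies per filter need to be learned. The following is a minimal NumPy sketch of one such kernel; the kernel size, sample rate, window choice, and band edges are illustrative assumptions, not the configuration used in the thesis.

```python
# Illustrative sketch: a windowed sinc band-pass kernel, built as the
# difference of two low-pass sinc filters. In a SincNet-style layer,
# only f_low and f_high would be learnable parameters.
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, kernel_size=251, sample_rate=16000):
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
    # np.sinc is the normalised sinc: sin(pi*x) / (pi*x)
    low = 2 * f_low * np.sinc(2 * f_low * t)
    high = 2 * f_high * np.sinc(2 * f_high * t)
    window = np.hamming(kernel_size)  # reduce spectral leakage
    return (high - low) * window

# Example: a 300-3400 Hz band-pass kernel applied to placeholder audio.
kernel = sinc_bandpass_kernel(300.0, 3400.0)
audio = np.random.randn(16000)  # 1 s of dummy audio at 16 kHz
filtered = np.convolve(audio, kernel, mode="same")
```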
We find that transfer learning methods using VGGish, wav2vec 2.0, and PASE+, as well as our proposed task-driven method Sinc-ResNet, achieve competitive performance compared with other work. The findings of this study provide a new perspective and insights for voice-based respiratory disease diagnosis. The experimental results demonstrate the effectiveness of the proposed methods and show that they achieve better performance than other existing methods.

Item
Computational methods for a generalised acoustics analysis workflow : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science at Massey University, Auckland, New Zealand
(Massey University, 2022) Fukuzawa, Yukio

In this thesis, I develop a computational workflow for analysing birdsong dialects and implement new software, named Koe, that facilitates real-world analyses of this kind. Koe can also be used for other kinds of birdsong analysis and for other animal species. At the time of writing, Koe has nearly 800 users around the world, including bird enthusiasts, biology students, and researchers. Koe now hosts more than 1300 datasets, many of which contain non-bird sounds.

Item
Deep learning for speech enhancement : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
(Massey University, 2022) Qiu, Yuanhang

Speech enhancement, which aims to improve the intelligibility and overall perceptual quality of a contaminated speech signal, is an effective way to improve speech communication. In this thesis, we propose three novel deep learning methods to improve speech enhancement performance. Firstly, we propose adversarial latent representation learning for latent space exploration in generative adversarial network (GAN) based speech enhancement. Based on adversarial feature learning, this method employs an extra encoder to learn an inverse mapping from the generated data distribution to the latent space. The encoder establishes an inner connection with the generator and contributes to latent information learning. Secondly, we propose an adversarial multi-task learning method with inverse mappings for effective speech representation. This speech enhancement method focuses on enhancing the generator's capability for speech information capture and representation learning. To implement it, two extra networks are developed to learn the inverse mappings from the generated distribution to the input data domains. Thirdly, we propose a self-supervised, phone-fortified method that improves the learning of specific speech characteristics for speech enhancement. This method explicitly imports phonetic characteristics into a deep complex convolutional network via a contrastive predictive coding model pre-trained with self-supervised learning. The experimental results demonstrate that the proposed methods outperform previous speech enhancement methods and achieve state-of-the-art performance in terms of speech intelligibility and overall perceptual quality.
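Contrastive predictive coding models of the kind mentioned in the last abstract are typically pre-trained with an InfoNCE-style contrastive loss. Below is a minimal PyTorch sketch of that objective; the use of in-batch negatives, the tensor shapes, and the temperature value are illustrative assumptions rather than the thesis's exact setup.

```python
# Illustrative sketch of an InfoNCE-style contrastive loss: score each
# context vector against its true paired representation (the diagonal)
# versus all other items in the batch, which act as negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(context, future, temperature=0.1):
    """context, future: (batch, dim) paired representations."""
    context = F.normalize(context, dim=-1)
    future = F.normalize(future, dim=-1)
    # (batch, batch) similarity matrix; diagonal entries are positives.
    logits = context @ future.t() / temperature
    targets = torch.arange(context.size(0), device=context.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random placeholder embeddings.
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
```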
