End-to-end speech synthesis for Chinese-English code-switching scenario : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Sciences at Massey University, Auckland, New Zealand

Zhang, Qingci

dc.contributor.author	Zhang, Qingci
dc.date.accessioned	2023-09-13T23:27:47Z
dc.date.available	2023-09-13T23:27:47Z
dc.date.issued	2022
dc.description.abstract	Text-to-speech (TTS), also called speech synthesis, lets machines convert text into corresponding human-like audio. TTS is widely used in monolingual applications such as broadcasting systems and audiobooks. However, processing multilingual input and output sequences remains a challenge. Difficulties arise from the scarcity of code-switching speech data, the mapping problem of mixed languages, and the linguistic complexity of Chinese, such as polyphones and tone sandhi in text frontend processing. In this thesis, we propose an end-to-end speech synthesis system, based on a traditional monolingual Tacotron model, that synthesizes Chinese-English code-switching sentences. Firstly, we pre-process the speech data for low-frequency noise removal, frequency smoothness, and volume consistency, using a high-pass filter to smooth the speech frequency range and normalizing the speech volume. Secondly, we apply g2pm and python-pinyin as our grapheme-to-phoneme (G2P) tools, which are merged into our mixed Chinese-English code-switching frontend processing. Current speech synthesis markup language (SSML) processing supports only one language and fails to speak or process the switched language; we improve it so that it can process mixed Chinese-English code-switching SSML input. We also extend the polyphone and tone sandhi rules for the Chinese part of code-switching sentences. Thirdly, we improve the attention mechanism module of the current Tacotron model to avoid possible posterior collapse: all intermediate frames are passed to the next processing step to preserve the contextual correlation of adjacent frames, instead of only the last frame, which loses context information.
Fourthly, we accelerate training by adding a six-layer unidirectional sequence-to-sequence gated recurrent unit (GRU) that predicts multiple non-overlapping output frames at each decoder step. On our test data, the system reaches a highest score of 3.163 PESQ raw MOS and 3.065 MOS-LQO, and an average score of 2.672 PESQ raw MOS and 2.520 MOS-LQO.	en
dc.identifier.uri	http://hdl.handle.net/10179/20088
dc.language.iso	en	en
dc.publisher	Massey University	en
dc.rights	The Author	en
dc.subject	TTS	en
dc.subject	speech synthesis	en
dc.subject	end-to-end	en
dc.subject	code-switching	en
dc.subject	polyphone	en
dc.subject	tone sandhi	en
dc.subject	G2P	en
dc.subject.anzsrc	460212 Speech recognition	en
dc.title	End-to-end speech synthesis for Chinese-English code-switching scenario : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Sciences at Massey University, Auckland, New Zealand	en
dc.type	Thesis	en
massey.contributor.author	Zhang, Qingci
thesis.degree.discipline	Computer Sciences	en
thesis.degree.level	Masters	en
thesis.degree.name	Master of Information Sciences (MInfSc)	en
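
The pre-processing step named in the abstract (high-pass filtering followed by volume normalization) can be illustrated with a minimal sketch. The abstract does not state the filter design, cutoff frequency, or normalization target, so the first-order RC filter, the function names `high_pass` and `peak_normalize`, and the default `target_peak` below are all assumptions chosen for illustration, not the thesis's implementation.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz):
    """Apply a first-order RC high-pass filter (assumed design, not from the thesis)."""
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)  # filter time constant
    dt = 1.0 / sample_rate                  # sampling interval
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for i in range(1, len(samples)):
        # Pass changes between samples; let slowly varying content decay.
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak (assumed target)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

A typical call would chain the two, for example `peak_normalize(high_pass(samples, 16000, 80.0))`, where the 16 kHz sample rate and 80 Hz cutoff are again hypothetical values.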
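
The frontend idea in the abstract (separating the Chinese and English parts of a code-switching sentence, then applying tone sandhi rules to the Chinese part) can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, language detection uses only the basic CJK Unified Ideographs range, and only the common third-tone rule (a third tone before another third tone is read as a second tone) is shown. The pinyin input is assumed to come from a G2P tool in tone-number form; the thesis's extended polyphone and sandhi rules are not reproduced here.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def split_by_language(text):
    """Split mixed text into ordered (lang, chunk) runs, lang being 'zh' or 'en'."""
    runs = []
    for ch in text:
        if is_cjk(ch):
            lang = "zh"
        elif ch.isascii() and ch.isalpha():
            lang = "en"
        else:
            # Spaces, digits, punctuation: stay with the current run.
            lang = runs[-1][0] if runs else "en"
        if runs and runs[-1][0] == lang:
            runs[-1][1] += ch
        else:
            runs.append([lang, ch])
    return [(lang, chunk.strip()) for lang, chunk in runs if chunk.strip()]


def third_tone_sandhi(syllables):
    """Change tone 3 to tone 2 when the next syllable is also tone 3.

    Syllables are pinyin with a trailing tone digit, e.g. "ni3".
    Longer third-tone runs depend on phrasing in real speech; this
    simplification changes every third tone that precedes another one.
    """
    out = list(syllables)
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            out[i] = syllables[i][:-1] + "2"
    return out
```

In a fuller pipeline, each `zh` run would be passed to the G2P tool and then through the sandhi step, while each `en` run would go to an English frontend.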

Files

Original bundle

Name: ZhangMInfScThesis.pdf
Size: 2.48 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 3.32 KB
Format: Item-specific license agreed upon to submission

Collections

Theses and Dissertations