End-to-end speech synthesis for Chinese-English code-switching scenario : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Sciences at Massey University, Auckland, New Zealand

Massey University
The Author
Text-to-speech (TTS), also known as speech synthesis, lets machines convert text into corresponding audio by imitating human speech. TTS is widely used in monolingual speech synthesis tasks such as broadcasting systems and audiobooks. However, processing multilingual input and output sequences remains a challenge for machines. The difficulties include the lack of code-switching speech data, the mapping problem of mixed languages, and the linguistic complexity of Chinese, such as polyphones and tone sandhi in text frontend processing. In this thesis, we propose an end-to-end speech synthesis system, based on a traditional monolingual Tacotron model, that synthesizes speech for Chinese-English code-switching sentences. First, we pre-process the speech data for low-frequency noise removal, frequency smoothness, and volume consistency, using a high-pass filter to smooth the speech frequency range and normalizing the speech volume. Second, we apply g2pm and python-pinyin as our grapheme-to-phoneme (G2P) tools and merge them into our mixed Chinese-English code-switching frontend processing. Current Speech Synthesis Markup Language (SSML) processing supports only monolingual input and fails to speak or process the switched language; we improve it so that it can process mixed Chinese-English code-switching SSML input. We also extend the polyphone and tone sandhi rules for the Chinese part of code-switching sentences. Third, we improve the attention mechanism module of the current Tacotron model to avoid a possible posterior collapse issue: all intermediate frames are passed to the next processing step to preserve the contextual correlation of adjacent frames, instead of only the last frame, which loses context information. Fourth, we accelerate training by adding a six-layer unidirectional sequence-to-sequence gated recurrent unit that predicts more non-overlapping frames at each decoder step. On our test data, the system reaches a highest score of 3.163 PESQ raw MOS and 3.065 MOS-LQO, and an average score of 2.672 PESQ raw MOS and 2.520 MOS-LQO.
TTS, speech synthesis, end-to-end, code-switching, polyphone, tone sandhi, G2P
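To make the tone sandhi keyword concrete, the sketch below applies the well-known Mandarin third-tone rule: a third-tone syllable is pronounced with the second tone when the following syllable is also third tone. This is a minimal sketch under stated assumptions, not the extended rule set developed in the thesis. The function name `apply_third_tone_sandhi` is hypothetical, syllables are assumed to be pinyin strings ending in a tone digit, and real systems also take word and prosodic grouping into account for runs of three or more third tones.

```python
def apply_third_tone_sandhi(syllables):
    """Change tone 3 to tone 2 when the next syllable is also tone 3.

    Assumption: each syllable is pinyin with a trailing tone digit,
    for example "ni3". The lookahead uses the original tones, so every
    third tone in a run except the last one is changed.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            result[i] = syllables[i][:-1] + "2"
    return result


if __name__ == "__main__":
    print(apply_third_tone_sandhi(["ni3", "hao3"]))
```

Applied to `["ni3", "hao3"]`, the function returns `["ni2", "hao3"]`; sequences without adjacent third tones are returned unchanged.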