End-to-end speech synthesis for Chinese-English code-switching scenario : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Sciences at Massey University, Auckland, New Zealand

dc.contributor.author: Zhang, Qingci
dc.date.accessioned: 2023-09-13T23:27:47Z
dc.date.available: 2023-09-13T23:27:47Z
dc.date.issued: 2022
dc.description.abstract: Text-to-speech (TTS), or speech synthesis, allows machines to convert text into corresponding audio that imitates the human voice. TTS has been widely used in monolingual speech synthesis tasks such as broadcasting systems and audiobooks. However, processing multilingual input and output sequences remains a challenge for machines. Challenges arise from the scarcity of code-switching speech data, the mapping of mixed languages, and the linguistic complexity of Chinese, such as polyphones and tone sandhi in text frontend processing. In this thesis, we propose an end-to-end speech synthesis system based on a conventional monolingual Tacotron model to synthesize Chinese-English code-switching sentences. First, we pre-process the speech data for low-frequency noise removal, frequency smoothness, and volume consistency, using a high-pass filter to smooth the speech frequency range and normalizing the speech volume. Second, we use g2pm and python-pinyin as our grapheme-to-phoneme (G2P) tools, merged into our mixed Chinese-English code-switching frontend processing. We address the failure of the current monolingual Speech Synthesis Markup Language (SSML) support to speak and process the switched language, improving it so that it can process mixed Chinese-English code-switching SSML input. We also extend the polyphone and tone sandhi rules for the Chinese part of code-switching sentences. Third, we improve the attention mechanism module of the current Tacotron model to avoid possible posterior collapse by passing all intermediate frames to the next processing step, which keeps the contextual correlation of adjacent frames, instead of passing only the last frame, which loses context information. Fourth, we accelerate the training process by adding a six-layer unidirectional sequence-to-sequence gated recurrent unit to predict multiple non-overlapping frames at each decoder step. On our test data, the system reaches a highest score of 3.163 PESQ raw MOS and 3.065 MOS-LQO, and an average score of 2.672 PESQ raw MOS and 2.520 MOS-LQO. [en]
dc.identifier.uri: http://hdl.handle.net/10179/20088
dc.language.iso: en [en]
dc.publisher: Massey University [en]
dc.rights: The Author [en]
dc.subject: TTS [en]
dc.subject: speech synthesis [en]
dc.subject: end-to-end [en]
dc.subject: code-switching [en]
dc.subject: polyphone [en]
dc.subject: tone sandhi [en]
dc.subject: G2P [en]
dc.subject.anzsrc: 460212 Speech recognition [en]
dc.title: End-to-end speech synthesis for Chinese-English code-switching scenario : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Sciences at Massey University, Auckland, New Zealand [en]
dc.type: Thesis [en]
massey.contributor.author: Zhang, Qingci
thesis.degree.discipline: Computer Sciences [en]
thesis.degree.level: Masters [en]
thesis.degree.name: Master of Information Sciences (MInfSc) [en]
Files
Original bundle
Name: ZhangMInfScThesis.pdf
Size: 2.48 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 3.32 KB
Format: Item-specific license agreed upon to submission