Massey Documents by Type

Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294

Search Results

Now showing 1 - 8 of 8
  • Item
    Speech processing with deep learning for voice-based respiratory diagnosis : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
    (Massey University, 2022) Ma, Zhizhong
    Voice-based respiratory diagnosis research aims at automatically screening and diagnosing respiratory-related symptoms (e.g., smoking status, COVID-19 infection) from human-generated sounds (e.g., breath, cough, speech). It has the potential to be an objective, simple, reliable, and less time-consuming method than traditional biomedical diagnosis methods. In this thesis, we conduct a comprehensive literature review and propose three novel deep learning methods to enrich voice-based respiratory diagnosis research and improve its performance. Firstly, we conduct a comprehensive investigation of the effects of voice features on the detection of smoking status. Secondly, we propose a novel method that combines high-level and low-level acoustic features with deep neural networks for smoking status identification. Thirdly, we investigate various feature extraction/representation methods and propose a SincNet-based CNN method for feature representation to further improve the performance of smoking status identification. To the best of our knowledge, this is the first systematic study to apply speech processing with deep learning to voice-based smoking status identification. Moreover, we propose a novel transfer learning scheme and a task-driven feature representation method for diagnosing respiratory diseases (e.g., COVID-19) from human-generated sounds. We find that transfer learning methods using VGGish, wav2vec 2.0, and PASE+, as well as our proposed task-driven method Sinc-ResNet, achieve performance competitive with other work. The findings of this study provide a new perspective and insights for voice-based respiratory disease diagnosis. The experimental results demonstrate the effectiveness of the proposed methods and show that they achieve better performance than existing methods.
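    The SincNet-based front end mentioned above can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch layer, not the thesis code: each convolution kernel is a band-pass sinc filter whose cutoff frequencies are the learnable parameters, so the front end learns where in the spectrum to listen rather than arbitrary FIR taps. The channel count, kernel size, and initial cutoffs are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn

    class SincConv(nn.Module):
        """Sketch of a SincNet-style learnable band-pass convolution."""
        def __init__(self, out_channels=32, kernel_size=101, sample_rate=16000):
            super().__init__()
            # Learnable lower cutoff and bandwidth (Hz) for each filter.
            low = torch.linspace(30.0, sample_rate / 2 - 300.0, out_channels)
            self.low_hz = nn.Parameter(low.unsqueeze(1))
            self.band_hz = nn.Parameter(torch.full((out_channels, 1), 100.0))
            # Fixed pieces: symmetric time axis (seconds) and a Hamming window.
            n = torch.arange(kernel_size) - (kernel_size - 1) / 2
            self.register_buffer("t", (n / sample_rate).unsqueeze(0))
            self.register_buffer("window", torch.hamming_window(kernel_size))

        def forward(self, x):  # x: (batch, 1, samples)
            f1 = torch.abs(self.low_hz)        # lower cutoff per filter
            f2 = f1 + torch.abs(self.band_hz)  # upper cutoff per filter
            # A band-pass kernel is the difference of two low-pass sinc kernels.
            kernels = (2 * f2 * torch.sinc(2 * f2 * self.t)
                       - 2 * f1 * torch.sinc(2 * f1 * self.t)) * self.window
            return nn.functional.conv1d(x, kernels.unsqueeze(1))

    waveform = torch.randn(4, 1, 16000)   # a batch of one-second waveforms
    features = SincConv()(waveform)       # -> (4, 32, 15900)
    ```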
  • Item
    Deep learning for speech enhancement : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
    (Massey University, 2022) Qiu, Yuanhang
    Speech enhancement, which aims to improve the intelligibility and overall perceptual quality of a contaminated speech signal, is an effective way to improve speech communications. In this thesis, we propose three novel deep learning methods to improve speech enhancement performance. Firstly, we propose an adversarial latent representation learning method for exploring the latent space of generative adversarial network (GAN) based speech enhancement. Based on adversarial feature learning, this method employs an extra encoder to learn an inverse mapping from the generated data distribution to the latent space. The encoder establishes an inner connection with the generator and contributes to latent information learning. Secondly, we propose an adversarial multi-task learning method with inverse mappings for effective speech representation. This speech enhancement method focuses on enhancing the generator's capability to capture speech information and learn representations. To implement this method, two extra networks are developed to learn inverse mappings from the generated distribution to the input data domains. Thirdly, we propose a self-supervised, phone-fortified method that improves the learning of specific speech characteristics for speech enhancement. This method explicitly imports phonetic characteristics into a deep complex convolutional network via a contrastive predictive coding model pre-trained with self-supervised learning. The experimental results demonstrate that the proposed methods outperform previous speech enhancement methods and achieve state-of-the-art performance in terms of speech intelligibility and overall perceptual quality.
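    The inverse-mapping idea in the first method can be sketched in a few lines. This is a deliberately simplified, hypothetical PyTorch fragment: the real networks are convolutional, and the latent reconstruction term shown here would be combined with the adversarial and enhancement losses.

    ```python
    import torch
    import torch.nn as nn

    latent_dim, frame_dim = 64, 256  # illustrative sizes
    # G: latent code + noisy frame -> enhanced frame (stands in for the generator).
    G = nn.Sequential(nn.Linear(latent_dim + frame_dim, 512), nn.ReLU(),
                      nn.Linear(512, frame_dim))
    # E: enhanced frame -> latent estimate (the extra inverse-mapping encoder).
    E = nn.Sequential(nn.Linear(frame_dim, 512), nn.ReLU(),
                      nn.Linear(512, latent_dim))

    noisy = torch.randn(8, frame_dim)   # batch of noisy feature frames
    z = torch.randn(8, latent_dim)      # sampled latent codes
    enhanced = G(torch.cat([z, noisy], dim=1))
    z_hat = E(enhanced)

    # Penalising the latent reconstruction error ties the encoder to the
    # generator and encourages an informative latent space.
    latent_loss = nn.functional.mse_loss(z_hat, z)
    latent_loss.backward()
    ```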
  • Item
    The voice activity detection (VAD) recorder and VAD network recorder : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science at Massey University
    (Massey University, 2001) Liu, Feng
    This project provides a feasibility study for the AudioGraph tool, focusing on two application areas: the VAD (voice activity detector) recorder and the VAD network recorder. The first achieves low bit-rate speech recording on the fly, using a GSM compression coder with a simple VAD algorithm; the second provides two-way speech over IP, achieving echo cancellation over a simplex channel. The latter is required for implementing a synchronous AudioGraph. In the first chapter we introduce the background of the project, specifically the VoIP technology, the AudioGraph tool, and VAD algorithms; we also discuss the problems set for the project. The second chapter presents the relevant techniques in detail, including sound representation, speech-coding schemes, sound file formats, PowerPlant and Macintosh programming issues, and the simple VAD algorithm we have developed. The third chapter discusses implementation issues, including the systems' objectives and architecture, and the problems encountered and solutions adopted. The fourth chapter illustrates the results for the two applications: user documentation for the applications is given, after which we analyse the parameters based on the results and present default parameter settings that could be used in the AudioGraph system. The last chapter provides conclusions and future work.
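    As a rough illustration of the kind of simple energy-based VAD described above, here is a short NumPy sketch; the frame length and threshold are illustrative assumptions, not the thesis's values.

    ```python
    import numpy as np

    def vad(signal, frame_len=160, threshold_db=-40.0):
        """Flag each frame as speech (True) if its energy exceeds the threshold."""
        n_frames = len(signal) // frame_len
        frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
        energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        return energy_db > threshold_db

    # Frames flagged False are skipped by the recorder, which is what keeps
    # the bit rate low before GSM compression is applied.
    ```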
  • Item
    Spoken affect classification : algorithms and experimental implementation : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science at Massey University, Palmerston North, New Zealand
    (Massey University, 2005) Morrison, Donn Alexander
    Machine-based emotional intelligence is a requirement for natural interaction between humans and computer interfaces and a basic level of accurate emotion perception is needed for computer systems to respond adequately to human emotion. Humans convey emotional information both intentionally and unintentionally via speech patterns. These vocal patterns are perceived and understood by listeners during conversation. This research aims to improve the automatic perception of vocal emotion in two ways. First, we compare two emotional speech data sources: natural, spontaneous emotional speech and acted or portrayed emotional speech. This comparison demonstrates the advantages and disadvantages of both acquisition methods and how these methods affect the end application of vocal emotion recognition. Second, we look at two classification methods which have gone unexplored in this field: stacked generalisation and unweighted vote. We show how these techniques can yield an improvement over traditional classification methods.
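    For readers unfamiliar with the two ensemble schemes, a minimal scikit-learn sketch follows; the base learners and their settings are placeholders, not the configuration used in the thesis.

    ```python
    from sklearn.ensemble import StackingClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical base learners standing in for the thesis's classifiers.
    base = [("svm", SVC(probability=True)),
            ("tree", DecisionTreeClassifier()),
            ("lr", LogisticRegression(max_iter=1000))]

    # Stacked generalisation: a meta-learner is trained on the base
    # learners' cross-validated outputs rather than on the raw features.
    stack = StackingClassifier(estimators=base,
                               final_estimator=LogisticRegression())

    # Unweighted vote: each base learner casts one equal vote for a class.
    vote = VotingClassifier(estimators=base, voting="hard")
    ```

    Either ensemble is then trained with the usual fit(X, y) call on the extracted acoustic features.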
  • Item
    Real-time implementation of a dual microphone beamformer : a thesis presented in partial fulfilment of the requirements for the degree of Master of Engineering in Computer Systems at Massey University, Albany, New Zealand
    (Massey University, 2005) Yoganathan, Vaitheki
    The main objective of this project is to develop a microphone array system that captures the speech signal for speech-related applications. The system should allow the user to move freely and acquire speech in adverse acoustic environments. The most important problem as the distance between the speaker and the microphone increases is that the quality of the speech signal is often degraded by background noise and reverberation. As a result, speech-related applications fail to perform well under these circumstances. The unwanted noise components present in the acquired signal have to be removed in order to improve the performance of these applications. This thesis describes the development of a dual microphone beamformer on a Digital Signal Processor (DSP). The development kit used in this project is the Texas Instruments TMS320C6711 DSP Starter Kit (DSK). The switched Griffiths-Jim beamformer, developed by Van Compernolle in 1990 by modifying the Griffiths-Jim beamformer structure, was selected as the algorithm to be implemented on the DSK. This beamformer improves the quality of the desired speech signal by reducing the background noise, and it requires at least two input channels to obtain the spatial characteristics of the acquired signal; the PCM3003 audio daughter card is therefore used to access the two microphone signals. The software implementation of the switched Griffiths-Jim beamformer has two main stages. The first stage identifies the presence of speech in the acquired signal: a simple Voice Activity Detector (VAD) based on the energy of the acquired signal is used to distinguish between the wanted speech signal and unwanted noise. The second stage is the adaptive beamformer, which uses the VAD decisions to reduce the background noise. The adaptive beamformer consists of two adaptive filters based on the Normalised Least Mean Squares (NLMS) algorithm. The first filter behaves like a beam-steering filter and is updated only while both speech and noise are present; the second behaves like an Adaptive Noise Canceller (ANC) and is updated only during noise-only periods. The VAD algorithm controls the updating of these NLMS filters, and only one of them is updated at any given time. The algorithm was successfully implemented on the chosen DSK using the Code Composer Studio (CCS) software, and the implementation was tested in real time with a speech recognition system programmed in Visual Basic using the Microsoft Speech SDK components. The dual microphone system allows the user to move around freely while the desired speech signal is acquired. The results show a reasonable amount of enhancement in the output signal and a significant improvement in the usability of the speech recognition system.
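    The NLMS update at the heart of both adaptive filters can be written in a few lines. The sketch below is a generic textbook NLMS step in NumPy (the step size and regulariser are illustrative), not the code that ran on the DSK.

    ```python
    import numpy as np

    def nlms_step(w, x_buf, desired, mu=0.5, eps=1e-6):
        """One NLMS update: w holds the filter taps, x_buf the input delay line."""
        y = np.dot(w, x_buf)    # filter output
        e = desired - y         # error signal
        w += mu * e * x_buf / (np.dot(x_buf, x_buf) + eps)  # normalised step
        return w, e

    # In the switched structure, the VAD decision selects which filter receives
    # this update: the beam-steering filter while speech is present, the
    # adaptive noise canceller during noise-only periods.
    ```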
  • Item
    Multi-microphone speech enhancement technique using a novel neural network beamformer : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Engineering at Massey University, Albany, New Zealand
    (Massey University, 2014) Yoganathan, Vaitheki
    This thesis presents a novel speech enhancement algorithm to reduce background noise in the acquired speech signal. It introduces an innovative speech beamformer that uses an input-delay neural network based adaptive filter for noise reduction. Speech communication is considered the most popular and natural way for humans to communicate with computers. In the past few decades there has been an increasing demand for speech-based applications; examples include personal dictation devices, hands-free telephony, voice recognition for robotics, speech-controlled equipment, and automated phone systems. However, these applications require a high signal-to-noise ratio to function effectively, and background noise sources such as factory machinery, television, radio, computers, or a competing speaker often degrade the performance of the acquired signals. The problem of removing these unwanted signals from the acquired speech signal has been investigated by various authors, but there is still room for improvement over existing methods. A multi-microphone, neural network based switched Griffiths-Jim beamformer structure was implemented in LabVIEW. The conventional noise reduction section of the Griffiths-Jim beamformer structure was improved with a non-linear neural network approach, and a partially connected three-layer neural network structure was implemented for rapid real-time processing. The error back-propagation algorithm was used to train the network; although it is a slow gradient learning algorithm, it could easily be replaced with alternatives such as the fast back-propagation algorithm. The proposed algorithm shows promising noise reduction improvements over previous adaptive algorithms such as the normalised least mean squares adaptive filter. However, the performance of the neural network depends on its chosen parameters, such as the learning rate, the amount of training given, and the size of the network structure. Tests with a speech-controlled system demonstrate that the neural network based beamformer significantly improves the recognition rate of the system.
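    A toy sketch of the input-delay neural network filter, written in PyTorch rather than LabVIEW, with all sizes and data invented for illustration: the network sees a tapped delay line of the noise reference and is trained by back-propagation to predict the noise component in the primary channel.

    ```python
    import torch
    import torch.nn as nn

    taps = 32  # length of the input delay line (illustrative)
    net = nn.Sequential(nn.Linear(taps, 16), nn.Tanh(), nn.Linear(16, 1))
    opt = torch.optim.SGD(net.parameters(), lr=1e-3)  # plain gradient descent

    ref = torch.randn(64, taps)   # delay-line snapshots of the noise reference
    primary = torch.randn(64, 1)  # primary channel samples (speech + noise)

    noise_est = net(ref)
    # Minimising the residual power drives the network to cancel the noise,
    # as in a conventional adaptive noise canceller, but with a non-linear map.
    loss = ((primary - noise_est) ** 2).mean()
    loss.backward()
    opt.step()
    ```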
  • Item
    AutoTC : Automatic Time-Code recognition for the purpose of synchronisation of subtitles in the broadcasting of motion pictures using the SMPTE standard : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science, School of Engineering and Advanced Technology (S.E.A.T.) at Massey University, (Albany), New Zealand
    (Massey University, 2011) Kashyap, Vineet
    Time-coding requires manual marking of the in/out points of every dialogue spoken in the film. This process is manual, cumbersome, and time-consuming, taking about 8-10 hours for an average film duration of two and a half hours. AutoTC, a multi-threaded client-server application, has been built for the automatic recognition of time-codes for the purpose of automatic synchronisation of subtitles in the broadcasting of motion pictures. It generates time-codes programmatically based on the frame rate of the input video to be subtitled, and uses the audio to recognise in/out points automatically using the principles of Voice Activity Detection. Results show that the time taken to recognise time-codes automatically is approximately one sixth of the time taken by a professional time-coder using 'Poliscript' [18], a commercial tool used in the production of subtitles. 'IN-SYNC', a new performance metric, has been proposed to evaluate the accuracy of the developed system, which will foster further research and development in the field of automatic subtitling in an attempt to make it the de facto standard. The application has been tested on the NOIZEUS [30] corpus, giving an IN-SYNC accuracy of 65% on clean data with 6 mis-detections and an average of 51.56% on noisy data with 13 mis-detections, which is very encouraging. The application can also send data to the MOSES [32] server developed for producing draft translations from Hindi to English, which will make the subtitling process much faster, more efficient, and quality-centric.
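    Generating a non-drop-frame SMPTE timecode from a frame count and frame rate reduces to integer arithmetic; a minimal sketch follows (the function name and the 25 fps rate are illustrative).

    ```python
    def smpte(frame, fps=25):
        """Convert an absolute frame number to HH:MM:SS:FF (non-drop-frame)."""
        ff = frame % fps
        s = frame // fps
        return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}:{ff:02d}"

    print(smpte(90125, fps=25))  # -> 01:00:05:00
    ```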
  • Item
    Continuous speech recognition : an analysis of its effect on listening comprehension, listening strategies and notetaking : a thesis presented in part fulfilment of the requirements for the degree of Doctorate in Education, Massey University
    (Massey University, 2006) McIvor, Tom
    This thesis presents an investigation into the effect of Liberated Learning Project (LLP) technology on academic listening comprehension, notetaking, and listening strategies in an English as a foreign language (L2) context. Two studies are reported: an exploratory study and a subsequent main study. The exploratory study was undertaken to determine L2 and native speaker (L1) students' perceptions of the effectiveness of the technology for academic listening and notetaking. The main study took a more focused approach and extended the exploratory study, which was done in an authentic lecture context, in order to gather data to measure listening comprehension and notetaking quality. The participants in the main study comprised six L2 students, five of whom intended to go to university. The methodology was a multimethod one: data were gathered from notetaking samples, protocol analysis, email responses, and a questionnaire. Results indicated that continuous speech recognition (CSR) has the potential to support the listening comprehension and notetaking abilities of L2 students, as well as to facilitate metacognitive listening strategy use and enhance affective factors in academic listening. However, it is important to note that as CSR is an innovative technology, it first needs to meet a number of challenges before its full potential can be realised. Consequently, recommendations for future research and potential innovative uses for the technology are discussed. This thesis contributes to L2 academic listening and notetaking measurement in two areas: (1) the measurement of LLP-supported notetaking; and (2) the measurement of LLP-supported academic listening comprehension.