Massey Documents by Type
Permanent URI for this community: https://mro.massey.ac.nz/handle/10179/294
Search Results (10 results)
Item: Real and synthetic Punjabi speech datasets for automatic speech recognition (Elsevier Inc, 2024-02). Singh S; Hou F; Wang R.
Automatic speech recognition (ASR) has been an active area of research. Training with large annotated datasets is the key to the development of robust ASR systems. However, most available datasets focus on high-resource languages like English, leaving a significant gap for low-resource languages. Punjabi is among these languages: despite its large number of speakers, it lacks high-quality annotated datasets for accurate speech recognition. To address this gap, we introduce three labeled Punjabi speech datasets: Punjabi Speech (a real speech dataset) and Google-synth and CMU-synth (synthesized speech datasets). The Punjabi Speech dataset consists of read speech recordings captured in various environments, including both studio and open settings. The Google-synth dataset is synthesized using Google's Punjabi text-to-speech cloud services (a sketch of this synthesis step follows below), and the CMU-synth dataset is created using the Clustergen model available in the Festival speech synthesis system developed by CMU. These datasets aim to facilitate the development of accurate Punjabi speech recognition systems, bridging the resource gap for this important language.
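The abstract does not detail the synthesis pipeline, but a Google-synth-style corpus can be produced with Google's cloud text-to-speech API. The following is a minimal sketch assuming the google-cloud-texttospeech Python client and valid credentials; the voice choice, sample rate, and file name are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch: synthesizing one Punjabi utterance with Google Cloud
# Text-to-Speech, along the lines of the Google-synth dataset. Assumes the
# google-cloud-texttospeech package is installed and credentials are set up.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

def synthesize(text: str, out_path: str) -> None:
    # Punjabi (India) is language code "pa-IN" in Google Cloud TTS.
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code="pa-IN",
            ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,  # illustrative
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,  # 16-bit PCM
            sample_rate_hertz=16000,  # a common rate for ASR corpora
        ),
    )
    with open(out_path, "wb") as f:
        f.write(response.audio_content)

# "Sat Sri Akal", a common Punjabi greeting, as sample input text.
synthesize("ਸਤ ਸ੍ਰੀ ਅਕਾਲ", "sample_0001.wav")
```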
Item: End-to-end automatic speech recognition for low-resource languages: a thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in Computer Science at the School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand (Massey University, 2023). Satwinder Singh.
Automatic speech recognition (ASR) for low-resource languages presents numerous challenges due to the lack of crucial linguistic resources, including an annotated speech corpus, a lexicon, and raw language text. In this thesis, we propose several approaches to improve fundamental frequency estimation and speech recognition for low-resource languages. Firstly, we propose DeepF0, a new deep learning technique for fundamental frequency (F0) estimation. Existing models have limited learning capability because of their shallow receptive fields; DeepF0 extends the receptive field by using dilated convolutional blocks, and enhances training efficiency and speed by incorporating residual blocks with residual connections. DeepF0 achieves state-of-the-art results while using 77.4% fewer network parameters. Secondly, we introduce a new meta-learning framework for low-resource speech recognition that improves on the model-agnostic meta-learning (MAML) approach. Our framework addresses MAML's training instabilities and slow convergence with a multi-step loss (MSL), which calculates a loss at each step of MAML's inner loop and combines them using a weighted importance vector that prioritizes the loss at the last step. Thirdly, we propose an end-to-end ASR approach for low-resource languages that exploits synthesized datasets alongside real speech datasets. We evaluate this approach on Punjabi, a language spoken by millions of speakers across the globe that nevertheless lacks annotated speech datasets. Our empirical results show that the synthesized datasets (Google-synth and CMU-synth) can significantly improve the accuracy of our ASR model. Lastly, we introduce a self-training approach, also known as pseudo-labeling, to enhance the performance of low-resource speech recognition. While most self-training research has centered on high-resource languages such as English, our work focuses on the low-resource Punjabi language. To weed out low-quality pseudo-labels, we employ a length-normalized confidence score (see the sketch below). Overall, our experimental evaluation validates the efficacy of the proposed approaches and shows that they outperform existing baseline approaches for F0 estimation and low-resource speech recognition.
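The exact form of the length-normalized confidence score is not given in the abstract; a common formulation averages per-token log-probabilities so that short hypotheses are not unfairly favoured. A minimal sketch under that assumption (the threshold value is illustrative):

```python
# Illustrative sketch (not the thesis code): filtering pseudo-labels with a
# length-normalized confidence score before adding them to the training set.

def length_normalized_score(token_log_probs: list[float]) -> float:
    # Average per-token log-probability; dividing by length removes the
    # bias toward shorter hypotheses that raw sequence scores carry.
    return sum(token_log_probs) / max(len(token_log_probs), 1)

def filter_pseudo_labels(hypotheses, threshold=-0.5):
    # hypotheses: iterable of (transcript, per-token log-probs) pairs.
    # Keep only transcripts whose normalized confidence clears the
    # threshold; the threshold here is an assumed, tunable value.
    return [
        (text, scores)
        for text, scores in hypotheses
        if length_normalized_score(scores) >= threshold
    ]

batch = [
    ("hypothesis one", [-0.1, -0.3, -0.2]),  # confident: kept
    ("hypothesis two", [-2.5, -3.0]),        # low confidence: discarded
]
print(filter_pseudo_labels(batch))
```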
Item: Speech driven user interface for an intelligent house: a thesis presented in partial fulfilment of the requirements for the degree of Master of Engineering in Information Engineering at Massey University, Albany, New Zealand (Massey University, 2005). Liu, Zhenqing.
The speech driven user interface for an intelligent house is one of a number of graduate research projects at Massey University and is part of the 'Smart House' project. This thesis details the development of a control system whose input is speech rather than manual operation. The control system consists of several subsystems: speech recognition, command generation, signal transmission, signal reception, and command manipulation. The completed speech driven user interface is intended to operate in conjunction with the real-time microphone-array beamformer and the personal identity recognition system that were developed concurrently with this project. The speech recognition and command generation subsystems are implemented on a PC, whereas the signal transmission, signal reception, and command manipulation subsystems are designed at the embedded-board level. The remote controller can control electrical appliances, such as a TV and CD player, and can switch and dim the lights.

Item: Spoken affect classification: algorithms and experimental implementation: a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science at Massey University, Palmerston North, New Zealand (Massey University, 2005). Morrison, Donn Alexander.
Machine-based emotional intelligence is a requirement for natural interaction between humans and computer interfaces, and a basic level of accurate emotion perception is needed for computer systems to respond adequately to human emotion. Humans convey emotional information both intentionally and unintentionally via speech patterns; these vocal patterns are perceived and understood by listeners during conversation. This research aims to improve the automatic perception of vocal emotion in two ways. First, we compare two emotional speech data sources: natural, spontaneous emotional speech and acted or portrayed emotional speech. This comparison demonstrates the advantages and disadvantages of both acquisition methods and how these methods affect the end application of vocal emotion recognition. Second, we look at two classification methods which have gone unexplored in this field: stacked generalisation and unweighted vote (see the sketch below). We show how these techniques can yield an improvement over traditional classification methods.
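Stacked generalisation and unweighted voting are both standard ensemble schemes; a minimal scikit-learn sketch shows the two constructions side by side. The component classifiers and the features are placeholders, not the ones used in the thesis:

```python
# Minimal sketch of the two ensemble schemes the thesis explores, using
# scikit-learn; the base learners here are illustrative assumptions.
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

base_learners = [
    ("svm", SVC(probability=True)),  # probability=True so stacking can use it
    ("knn", KNeighborsClassifier()),
    ("nb", GaussianNB()),
]

# Stacked generalisation: a meta-learner is trained on the base learners'
# cross-validated predictions rather than on the raw features.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())

# Unweighted vote: each base learner contributes one equal (hard) vote.
vote = VotingClassifier(estimators=base_learners, voting="hard")

# X would hold acoustic feature vectors (e.g. pitch/energy statistics per
# utterance) and y the emotion labels; both are placeholders here.
# stack.fit(X_train, y_train); vote.fit(X_train, y_train)
```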
Item: An evaluation of conversational interfaces for pedestrian navigation: a thesis presented in partial fulfilment of the requirements for the degree of Master of Information Technology, Institute of Natural and Mathematical Sciences, Massey University, Albany, New Zealand (Massey University, 2017). Longprasert, Nattakan.
The aim of this research was to compare the performance of the OsmAnd application and three types of conversational interface, to test whether a conversational interface is the preferred navigation tool. We designed and tested four different navigation systems: a map-with-command interface, a conversational-only interface, a conversational-with-map interface, and a conversational-with-image interface. The research involved 100 participants with different levels of experience with navigation systems. Participants were divided into three groups and given different navigation interfaces. The research combined quantitative and qualitative usability testing along a pre-defined route on the Massey University campus with a USE questionnaire to gather user feedback. The results indicated that the OsmAnd and conversational interfaces each performed well on different criteria; however, most participants preferred the conversational interface over the visual interface.

Item: Voice recognition system for Massey University Smarthouse: a thesis presented in partial fulfilment of the requirements for the degree of Master of Engineering in Information Engineering at Massey University (Massey University, 2006). Gadalla, Rafik.
The concept of a smarthouse is to integrate technology into houses to a level where most daily tasks are automated, providing comfort, safety, and entertainment to the residents. The concept is mainly aimed at the elderly population, to improve their quality of life. To maintain a natural medium of communication, the house employs a speech recognition system capable of analysing spoken language and extracting commands from it. This project focuses on the development and evaluation of a Windows application, written in a high-level programming language, that incorporates a commercial speech recognition engine. The speech recognition system acts as a hub within the Smarthouse, receiving user commands and delegating them to the various switching and control systems. Initial trials used Dragon NaturallySpeaking as the recognition engine; however, that proved inappropriate for the Smarthouse project because it is speaker dependent and requires each user to train it with his or her own voice. The application now utilizes the Microsoft Speech Application Programming Interface (SAPI), a software layer that sits between applications and speech engines, together with the Microsoft Speech Recognition Engine, which is freely distributed with some Microsoft products. Although Dragon NaturallySpeaking offers better recognition for dictation, the Microsoft engine can be optimized using a Context Free Grammar (CFG) to give enhanced recognition in the intended application. The application is designed to be speaker independent and can handle continuous speech. It connects to a database-oriented expert system to carry out full conversations with users. Audible prompts and confirmations are achieved through speech synthesis using any SAPI-compliant text-to-speech engine. Further development focused on designing a telephony system using the Microsoft Telephony Application Programming Interface (TAPI), which allows the house to be controlled remotely: residents can call their house from anywhere in the world, and the house will respond to and fulfil their commands.

Item: Impaired speech recognition: a thesis presented in partial fulfilment of the requirements for the degree of Master of Information Sciences in Computer Science at Massey University, Albany, New Zealand (Massey University, 2015). Almujil, Mohammed Nasser.
The purpose of this thesis is to present a novel mobile health application that can recognize impaired speech (from audio signals) and turn it into understandable speech. The system is developed to help patients with dysarthria of speech communicate better with others in their everyday life. The thesis provides background on motor speech disorders, dysarthria of speech, and the technical aspects of the application. It then explains and tests algorithms for recognizing impaired speech using audio fingerprinting technology (see the sketch below). Finally, it discusses the test results and recommends future work to improve the current algorithms.
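The thesis's fingerprinting algorithms are not specified in the abstract; a common spectral-peak formulation reduces each utterance to its strongest time-frequency peaks, which can then be matched against stored templates. A minimal sketch under that assumption, with illustrative parameter values:

```python
# Illustrative sketch of spectral-peak audio fingerprinting, the general
# family of technique the thesis builds on; parameters are assumptions.
import numpy as np
from scipy import signal

def fingerprint(audio: np.ndarray, rate: int, peaks_per_frame: int = 3):
    # Short-time spectrogram of the utterance.
    freqs, times, spec = signal.spectrogram(audio, fs=rate, nperseg=512)
    prints = []
    for t in range(spec.shape[1]):
        # Keep the indices of the strongest frequency bins in this frame.
        top = np.argsort(spec[:, t])[-peaks_per_frame:]
        prints.append(tuple(sorted(int(freqs[i]) for i in top)))
    return prints

def match_score(fp_a, fp_b) -> float:
    # Crude similarity: fraction of aligned frames whose peak sets coincide.
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(fp_a, fp_b)) / n
```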
Item: Gesture and voice control of internet of things: a thesis presented in partial fulfilment of the requirements for the degree of Master of Engineering in Electronics and Computer Engineering at Massey University, Auckland, New Zealand (Massey University, 2015). Han, Xiao.
Nowadays, people's lives have been remarkably changed by various intelligent devices that provide ever more convenient communication with people and with each other. Gesture and voice control are becoming more important and more widely used, and such biological control makes a control system feel humanised and individualised. In this thesis, an approach combining voice and gesture control of the Internet of Things is proposed, and a prototype is built to show the accuracy and practicality of the system. A Cortex-A8 processor (S5PV210) is used, and embedded Linux version 3.0.8 has been cross-compiled. Qt 4.8.5 has been ported as the UI (user interface) framework, and OpenCV 2.4.5 is employed as the vision-processing library. Two ZigBee modules provide wireless communication for device control. The system is divided into a control station and an appliance station. The control station comprises the development board, a USB camera, a voice recognition module, an LCD screen, and a ZigBee module; it receives input signals (from the camera or microphone), analyses them, and sends control signals to the appliance station. The appliance station consists of relays, a ZigBee module, and the appliances; its ZigBee module receives the control signals and sends digital signals to the connected relays. The appliance station is a modular unit that can be expanded to multiple appliances. The system can detect and track the user's hand and, after recognizing the user's gesture, control appliances based on certain gestures. Voice control is included as an additional control approach, and voice commands can be adjusted for different devices.

Item: Multi-microphone speech enhancement technique using a novel neural network beamformer: a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Engineering at Massey University, Albany, New Zealand (Massey University, 2014). Yoganathan, Vaitheki.
This thesis presents a novel speech enhancement algorithm to reduce background noise in the acquired speech signal. It introduces an innovative speech beamformer that uses an input-delay neural-network-based adaptive filter for noise reduction. Speech is considered the most popular and natural way for humans to communicate with computers, and in the past few decades there has been increased demand for speech-based applications; examples include personal dictation devices, hands-free telephony, voice recognition for robotics, speech-controlled equipment, and automated phone systems. However, these applications require a high signal-to-noise ratio to function effectively. Background noise sources such as factory machines, television, radio, a computer, or a competing speaker often degrade the acquired signals. The problem of removing these unwanted signals from the acquired speech signal has been investigated by various authors, but there is still room for improvement on existing methods. A multi-microphone neural-network-based switched Griffiths-Jim beamformer structure was implemented in LabVIEW. The conventional noise reduction section of the Griffiths-Jim beamformer structure was improved with a non-linear neural network approach: a partially connected three-layer neural network structure was implemented for rapid real-time processing and trained with the error back-propagation algorithm. Although back-propagation is a slow gradient learning algorithm, it can easily be replaced with other algorithms, such as the fast back-propagation algorithm. The proposed algorithms show promising noise reduction improvement over previous adaptive algorithms such as the normalised least-mean-squares (NLMS) adaptive filter (see the sketch below), although the performance of the neural network depends on its chosen parameters, such as the learning rate, the amount of training given, and the size of the network structure. Tests with a speech-controlled system demonstrate that the neural-network-based beamformer significantly improves the recognition rate of the system.
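The NLMS baseline mentioned above is a standard adaptive noise canceller: a filter on the noise-reference channel is adapted so that its output cancels the noise component of the primary channel. A minimal sketch (the filter length and step size are illustrative, and the neural replacement developed in the thesis is not reproduced here):

```python
# Sketch of a normalised least-mean-squares (NLMS) adaptive noise canceller,
# the baseline algorithm the thesis compares its neural beamformer against.
import numpy as np

def nlms(reference: np.ndarray, desired: np.ndarray,
         taps: int = 32, mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    # reference: noise-reference channel; desired: primary (speech + noise).
    w = np.zeros(taps)
    out = np.zeros(len(desired))
    for n in range(taps, len(desired)):
        x = reference[n - taps:n][::-1]   # most recent samples first
        y = w @ x                          # filter output (noise estimate)
        e = desired[n] - y                 # error = enhanced speech sample
        w += mu * e * x / (x @ x + eps)    # normalised weight update
        out[n] = e
    return out
```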
Item: Real-time adaptive noise cancellation for automatic speech recognition in a car environment: a thesis presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering at Massey University, School of Engineering and Advanced Technology, Auckland, New Zealand (Massey University, 2008). Qi, Ziming.
This research is concerned with a robust method for improving real-time speech enhancement and noise cancellation for automatic speech recognition (ASR). The thesis, titled "Real-time adaptive beamformer for Automatic speech Recognition in a car environment", presents an application of a beamforming method together with ASR, offering a novel answer to the question: how can the driver's voice control the car using ASR? The solution is an ASR front end built from a hybrid system of an acoustic beamformer, a voice activity detector (VAD), and an adaptive Wiener filter. The beamforming approach is based on the theory of normalised least-mean-squares (NLMS) to improve the signal-to-noise ratio (SNR), and the microphone array implements a VAD that uses time-delay estimation together with magnitude-squared coherence (MSC; see the sketch below). An experiment clearly shows the ability of the composite system to reduce noise originating outside of a defined active zone. In a real-time environment, a speech recognition system in a car has to receive only the driver's voice whilst suppressing background noise, e.g. a voice from the radio. This research therefore presents a hybrid real-time adaptive filter that operates within a geometric zone defined around the head of the desired speaker; any sound from outside this zone is considered noise and suppressed. As the defined zone is small, it is assumed that only the driver's speech comes from within it. The technique uses three microphones to build a geometry-based VAD that cancels unwanted speech coming from outside the zone. When only unwanted speech arrives from outside the desired zone, it is muted at the output of the hybrid noise canceller. When unwanted and desired speech arrive at the same time, the VAD cannot distinguish them; in that situation an adaptive Wiener filter is switched on for noise reduction, improving the SNR by as much as 28 dB. To assess the quality of the signal filtered by the Wiener filter, a template-matching speech recognition system is designed for testing, and a commercial speech recognition system is also applied to test the proposed beamforming-based noise cancellation and the adaptive Wiener filter.
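Magnitude-squared coherence is a standard two-channel statistic, MSC(f) = |G_ab(f)|^2 / (G_aa(f) G_bb(f)), where G_ab is the cross-spectral density and G_aa, G_bb are the auto-spectral densities. A minimal sketch of a coherence-based activity test follows; the frequency band and threshold are illustrative assumptions, and the thesis's three-microphone geometry and time-delay estimation are not reproduced here:

```python
# Illustrative sketch of the magnitude-squared coherence (MSC) cue behind a
# geometric voice-activity detector: two microphone channels are strongly
# coherent at speech frequencies when the source lies in the defined zone.
import numpy as np
from scipy import signal

def in_zone(mic_a: np.ndarray, mic_b: np.ndarray, rate: int,
            band=(300.0, 3400.0), threshold: float = 0.7) -> bool:
    # scipy computes MSC(f) = |G_ab(f)|^2 / (G_aa(f) * G_bb(f)) directly.
    freqs, coh = signal.coherence(mic_a, mic_b, fs=rate, nperseg=256)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    # Declare in-zone activity when the mean coherence over the speech
    # band exceeds the (assumed) threshold.
    return float(np.mean(coh[mask])) > threshold
```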
