Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author.

Spoken Affect Classification: Algorithms and Experimental Implementation

A thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Computer Science at Massey University, Palmerston North, New Zealand.

Donn Alexander Morrison
2005

Para Consuela, mi fiel furgoneta

Abstract

Machine-based emotional intelligence is a requirement for natural interaction between humans and computer interfaces, and a basic level of accurate emotion perception is needed for computer systems to respond adequately to human emotion. Humans convey emotional information both intentionally and unintentionally via speech patterns. These vocal patterns are perceived and understood by listeners during conversation. This research aims to improve the automatic perception of vocal emotion in two ways. First, we compare two emotional speech data sources: natural, spontaneous emotional speech and acted or portrayed emotional speech. This comparison demonstrates the advantages and disadvantages of both acquisition methods and how these methods affect the end application of vocal emotion recognition. Second, we look at two classification methods which have gone unexplored in this field: stacked generalisation and unweighted vote. We show how these techniques can yield an improvement over traditional classification methods.

Acknowledgements

I would like to thank my supervisor, Dr. Ruili Wang, for putting faith in me and allowing me to pursue this degree under scholarship. Without this financial help, it would have been unfeasible. His tireless direction and advanced motivational techniques also helped keep my focus.

My co-supervisors, Dr. Liyanage C. De Silva and Dr.
Peter Xu, also lent support throughout the research. Their extensive experience was indispensable at times when I needed support.

I also owe gratitude to Pete Morrison of Mabix International for his unique insight into the research. He provided the data used in this research, and without it this work would not have been possible.

The second speech database was provided by Tin Lay Nwe of the National University of Singapore. This database was collected and compiled by her and was graciously provided to aid in this research.

And to my partner Jen, who endured countless late nights, filled me with confidence when I lacked it, fed me when I didn't have time to feed myself, bathed me when... well, you get the picture.

Of course, I thank my family, Bill, Debb, Bugs, Michael, Todd, and Jodybird, for their constant support and love, all the way across the Pacific Ocean.

To my postgraduate friends at Massey University, who helped create a fun and relaxed working environment: Cath, Frank, Matthew, Michael, Stefan, and Yiming, among others. And Francis, for always seeming to be in Australia when I needed it the most.

Last, I would like to thank the developers of the many free and open-source software tools I used during the research. Packages such as LaTeX, Gnuplot, OpenOffice, Graphviz, Vim, Octave, donnrisk, Praat, The Speech Filing System, WEKA, The GIMP, Mozilla Firefox, and most of all Debian GNU/Linux were where I spent most of my time this past year and were instrumental in the development and completion of this work.

Table of Contents

Table of Contents
List of Figures
List of Tables
1 Introduction
  1.1 Introduction
  1.2 Research motivations and applications
    1.2.1 Health and public safety
    1.2.2 Education
    1.2.3 Fraud and crime prevention
    1.2.4 Leisure and entertainment
    1.2.5 Employment
    1.2.6 Call-centres
  1.3 Methodology
  1.4 Structure of the thesis
2 Foundations and background
  2.1 Introduction
  2.2 A brief history of emotion research
  2.3 Theoretical representations of emotion
    2.3.1 Discrete emotion theory
    2.3.2 Dimensional emotion theory
    2.3.3 Summary of theoretical representations of emotion
  2.4 Defining the emotion classes
    2.4.1 Primary, secondary, and tertiary emotions
    2.4.2 Primitive emotions
    2.4.3 The basic emotions
    2.4.4 Summary of emotion classes
  2.5 Emotional expression in humans
    2.5.1 Channels of expression
    2.5.2 Ekman's display rules
    2.5.3 The human speech production apparatus
    2.5.4 Physiological responses to the emotions
  2.6 Review of the research on vocal emotion recognition
    2.6.1 Instance-based learners
    2.6.2 Artificial neural networks
    2.6.3 Probabilistic methods
    2.6.4 Decision trees
  2.7 Areas for improvement
  2.8 Summary
3 Emotional Speech Data Acquisition
  3.1 Introduction
  3.2 Emotional speech acquisition
    3.2.1 Natural expression
    3.2.2 Induced expression
    3.2.3 Simulated expression
    3.2.4 Summary of acquisition methods
  3.3 Databases of emotional speech
    3.3.1 Natural data collected from a call-centre
    3.3.2 Simulated data from the ESMBS database
  3.4 Summary
4 Acoustic Correlates to Emotional States
  4.1 Introduction
  4.2 Prosody-based features
    4.2.1 Fundamental frequency and emotional speech
    4.2.2 Formant frequencies and emotional speech
    4.2.3 The use of energy as an emotional marker
    4.2.4 Rhythm-based characteristics
  4.3 Summary
5 Feature Extraction
  5.1 Introduction
  5.2 Features used in past works
  5.3 Chosen features
  5.4 Extraction methods
    5.4.1 Methods for pitch tracking
    5.4.2 Formant frequencies
    5.4.3 Short-time energy
    5.4.4 Rhythm-based statistics
  5.5 Summary
6 Classification
  6.1 Introduction
  6.2 Traditional classification approaches
    6.2.1 Support vector machines
    6.2.2 Random Forests
    6.2.3 Artificial neural networks
    6.2.4 K* instance-based classifier
    6.2.5 K-nearest neighbours
  6.3 Ensemble classification methods
    6.3.1 Unweighted vote
    6.3.2 Stacked generalisation
  6.4 Stratified cross-validation
  6.5 Feature selection
    6.5.1 Principal components analysis
    6.5.2 Forward selection
    6.5.3 Genetic search
  6.6 Summary
7 Experimental Results and Prototype Implementation
  7.1 Introduction
  7.2 Experimental results
    7.2.1 Performance of base classifiers
    7.2.2 Performance of ensemble classifiers
    7.2.3 Performance after feature selection
    7.2.4 Summary of results
  7.3 Prototype implementation
    7.3.1 Endpoint detection
    7.3.2 Feature extraction
    7.3.3 Real-time processing
    7.3.4 Classification
    7.3.5 Summary of prototype development
  7.4 Summary
8 Conclusion and Future Work
  8.1 Summary of main findings
  8.2 Contributions
  8.3 Future work
A Emotional Speech Database Annotation System (ESDAS)
  A.1 Introduction
  A.2 How it works
  A.3 Screenshot
  A.4 Use studies
  A.5 Conclusions
B Other Figures
Bibliography

List of Figures

1.1 Applications of vocal emotion recognition
1.2 A flow diagram of the methodology followed for this thesis.
1.3 A data flow diagram of the real-time emotion recognition system.
2.1 The dimensional representation of emotion (from (Scherer, 2001))
2.2 Channels of expression and their relation to perception in humans
2.3 A cross-sectional X-ray of the human speech system (from (Flanagan et al., 1970)).
2.4 A schematic model of the human speech production system (from (Flanagan, 1972)).
4.1 Example pitch contours for anger and neutral utterances from the NATURAL dataset. The contour for angry speech typically has a much wider range, while neutral speech is narrow and monotonous.
5.1 Data flow diagram for the feature extraction process
5.2 Comparison between the autocorrelation method and RAPT for pitch tracking for two sample utterances from the NATURAL dataset. (a) and (b) show the differences for the first sample, and (c) and (d) show the differences for the second sample.
5.3 Block diagrams depicting the (a) extraction of and (b) reconstruction using the linear prediction coding coefficients.
5.4 Two sample formant frequency contours calculated using a 20 ms window on example utterances from the NATURAL dataset. The first formant (F1) has the lowest frequency, followed by F2, followed by F3 with the highest frequency.
5.5 Two sample energy envelopes calculated using a 10 ms window on example utterances from the NATURAL dataset.
6.1 Example support vector mapping from input space to feature space.
6.2 An example support vector machine using the radial basis function. The support vectors are represented by the outlined shapes and constitute a maximum margin from the decision surface (solid line).
6.3 An example one-hidden layer artificial neural network architecture. Circles represent the nodes in each layer. The input layer contains nodes which correspond to each feature in the input vector.
The output layer contains nodes that carry the result of the propagation of information throughout the network.
6.4 Illustration of Stacking and StackingC on a three-class dataset (a, b, c) with n training examples and N base classifiers. P_{i,jk} denotes the class prediction from classifier i for class j on example k (from (Seewald, 2002b)).
6.5 An illustration of a cross-validation example where the dataset has been partitioned into four sets. The dark rectangle represents the partition used as the test set, and the white rectangles represent the training sets (from (Haykin, 1999)).
6.6 Pseudocode describing ten x ten-fold cross-validation.
6.7 Pseudocode describing the forward selection algorithm.
6.8 Data flow diagram describing the process of genetic search over a feature space (adapted from (Dieterle, 2003)).
7.1 Data flow diagram for prototype system
7.2 A sample utterance with endpoints highlighted. The dark grey regions indicate the silence preceding and following the utterance.
7.3 Graphical representation of the ANN architecture used in the prototype implementation.
7.4 C++ source code function for the dynamic loading of the ANN module for classification. The module is loaded (lines 6 and 9), the address of the classification procedure is then located (lines 16 and 18), the procedure is invoked (lines 24 and 26), and finally the module is unloaded (lines 35 and 37). The feature vector corresponding to the input layer is contained in the variable in and the prediction corresponding to the output layer is contained in the variable out.
7.5 Screen capture of the prototype implementation.
A.1 ESDAS interface
B.1 The relationships between primary, secondary, and tertiary emotions (after (Parrot, 2001))

List of Tables

3.1 Summary of the datasets used in this study. The NATURAL dataset is collected from a call-centre and the ESMBS dataset is obtained from a previous study and consists of utterances by non-professional actors and actresses.
3.2 Distribution of perceived speaker affect from natural corpus (NATURAL)
3.3 Sample utterances from the NATURAL database.
3.4 Human classification performance by emotion categories (from (Nwe, 2003))
4.1 Speech correlations of the basic emotions.
5.1 38 prosodic features selected for input into classification algorithms. Features are divided into six groups: fundamental frequency (F0), first three formant frequencies (F1, F2, F3), short-time energy, and rhythm.
6.1 Initial ranking of base classification algorithms on the NATURAL dataset.
6.2 Results for the selection of the number of nodes in the hidden layer of the multi-layer perceptron
7.1 Confusion matrices for the support vector machine with RBF kernel on the NATURAL and ESMBS datasets.
7.2 Confusion matrices for the multi-layer perceptron on the NATURAL and ESMBS datasets.
7.3 Confusion matrices for the K-nearest neighbour classifier (with K = 5) on the NATURAL and ESMBS datasets.
7.4 Confusion matrices for the K* instance-based learner on the NATURAL and ESMBS datasets.
7.5 Confusion matrices for the random forest on the NATURAL and ESMBS datasets.
7.6 Confusion matrices for the StackingC classifier on the NATURAL and ESMBS datasets
7.7 Confusion matrices for the unweighted vote classifier on the NATURAL and ESMBS datasets.
7.8 Resulting feature subsets after feature selection. PCA = principal components analysis; FW = forward selection; GA = genetic algorithm. PCA datasets have been transformed back into the original feature space for labelling purposes and have the top 25 principal components retained.
7.9 Average percentages of correctly classified instances from the NATURAL and ESMBS datasets for all classification methods. For acronyms in the dataset column, ORIG = original feature set; PCA = principal components analysis; FW = forward selection; GA = genetic algorithm.
7.10 Average times for feature extraction compared with the average length of an utterance in the database. The statistics calculations include the maximum, minimum, mean, standard deviation, range (for pitch, energy, formants) and speaking rate.

Chapter 1

Introduction

1.1 Introduction

With the ever-increasing importance of, and reliance on, computers in our society comes the unnatural burden of interacting with those systems. This increase in human-computer interaction has, in turn, led to a marked increase in research on modelling such systems against human behaviour in an effort to enable more natural interaction. For this to succeed, these systems must have at least a basic level of emotional intelligence. Emotional intelligence is defined by Salovey et al.
(2004) as having four branches: the perception of emotion, emotions facilitating thought, understanding emotions, and managing emotions. These are discussed below, with the exception of emotions facilitating thought, as this assumes the ability to think independently, which current computer systems lack.

The perception of emotion is the ability to recognise emotion in oneself and others. These perceptions generally come from three channels: sight, sound, and language or contextual information present in text or prose. For example, a person may recognise that his or her friend feels distraught from the expression on the face or the tone of the voice. The perception of emotion also covers the recognition of emotion in oneself. An emotionally intelligent being is aware of the emotions expressed in itself at any time.

Following perception, an emotionally intelligent being must be able to understand emotions and emotional characteristics in order to correctly process and respond to emotional information. This consists of the knowledge of how emotions relate to one another, what causes them, what follows them, etc. Take, for example, a person who becomes angry at him or herself after missing the bus to work before an important meeting. The ability to determine the causes of this anger (e.g., the bus that is missed) is a critical part of emotional intelligence. An emotionally intelligent being will be aware of emotional changes and their nature.

Emotional understanding is a prerequisite for managing emotions. An emotionally intelligent being is one that can be open to all types of emotion, reflect on them, manage them in oneself, and engage, prolong, or detach from an emotional state in oneself or others (Oatley, 2004). A hypothetical situation may involve a doctor tending to a critically injured relative. The doctor must manage his or her emotions in order to operate in an effective manner.
Humans feel most natural communicating with other humans because the extra information conveyed in their emotional expressions can be recognised, processed, and reflected. This information is conveyed through several modes: facial expressions, vocal properties, bodily gestures, and behaviour. This added information helps people understand each other and interact more naturally and efficiently.

The work in this thesis is dedicated to the perception of human emotion from the prosodic properties of speech. In other words, this thesis aims to build a system that can capture and interpret the vocal expression of emotion in humans. More specifically, we seek to improve on traditional emotional speech classification methods using ensemble or multi-classifier system (MCS) approaches. We also aim to examine the differences in perceiving emotion in human speech that is derived from different methods of acquisition. For example, how is the perception of acted emotion different from that of spontaneous or naturally occurring emotion?

1.2 Research motivations and applications

There are wide-ranging applications for emotionally intelligent systems in real-world situations. Taking advantage of the emotional information in speech allows more effective processing of the contextual (language) information and a much more natural interaction between humans and machines. The following are some examples of how emotion recognition can yield improvement in the field of human-computer interaction. Figure 1.1 shows the relationships between vocal emotion recognition and potential application areas.

1.2.1 Health and public safety

Situations in which public safety is a major issue would greatly benefit from real-time automatic affect recognition. For example, such a system could be placed in the cockpits of airliners, ocean liners, and buses, where one or two principal operators control the fate of the vessel.
These systems would be used to detect pilot boredom, inattention, or fatigue (Pantic and Rothkrantz, 2003). In private vehicles, detection of anger could reduce incidents of road rage by alerting drivers and making them aware of the situation (Fragopanagos and Taylor, 2005).

Affect recognition could also ease concerns about having observers constantly monitoring or recording in situations where security or safety is at stake, for example in hospitals, closed-circuit security systems, and prisons (Pantic and Rothkrantz, 2003). These systems could alert personnel to certain situations such as disputes, accidents, riots or fighting.

Figure 1.1: Applications of vocal emotion recognition

1.2.2 Education

Perception of human affect is important in areas where subjects are being taught or instructed. Human teachers can recognise student boredom, fatigue, and confusion and are then able to take steps to revive attention levels, or perhaps terminate the instruction if too many students are unable to process it effectively.

Emotion and affect recognition from speech would be beneficial in an automated tutoring environment. The system could determine the affective states of the students and, depending on how well they appear to be learning or on feedback (levels of frustration, confusion, boredom, fatigue, etc.), adjust the rate at which the information is presented to make the learning as efficient as possible (Picard, 1997).

1.2.3 Fraud and crime prevention

Voice profiling is directly related to vocal affect recognition. Voice profiling aims to classify speech samples according to predefined psychological profiles. These profiles can be generated or trained on pathological examples. Voice profiling for fraud detection can be a useful measure for insurance companies to reduce the number of fraudulent claims.
The time needed to process claims can be reduced if claims that are potentially fraudulent are eliminated early on in the process. A system could be developed that allows claimants to provide information about their claim over the telephone, with a disclaimer stating that their voice profile will be analysed for signs of fraud. If the analysis comes back positive for possible fraud, the customer can be notified of the result and offered an opportunity to retract their claim without penalty. Such a system does have obvious drawbacks: for example, people may be discouraged from submitting a valid claim for fear of a false positive from the voice profile analysis.

Another practical use of voice profiling would be for police and security in interviewing suspects for criminal cases. Suspects could be interviewed and their speech analysed by profiling software that could detect pathological patterns correlating to lying or nervousness. As with the above scenario, however, there are many ethical issues relating to this application, and its output would have to be used only as one of many sources of information during interrogation.

1.2.4 Leisure and entertainment

An area ripe for new applications of emotion perception is that of leisure and entertainment. Here, the technology is applied in anecdotal ways. An example is the Sony ERS-7 Aibo Entertainment Robot. This robotic pet dog learns from interaction with its "owner" and can express different emotional states.

Computer video games are the result of billions of dollars of research and development investment aimed at making the player feel like he or she is experiencing reality. Emotion detection and synthesis in these games could greatly improve the gaming experience. Online games such as Everquest, where human players interact with other human and computer players, can benefit from both emotion recognition and synthesis to enhance the experience.
Interaction with computer characters is often unnatural due to the lack of emotional understanding on the part of the computer character. Adding an affective element to these characters would introduce an entirely new level to the gaming experience, providing a much more natural environment that would more closely model reality. This can be accomplished by integrating speech and facial expression recognition using cameras and microphones to measure the human player's affect. This affect can then be transmitted to other human or computer players in the game (Nakatsu et al., 1999).

The research of Breazeal and Aryananda (2002) has primarily focused on the integration of a multi-modal emotion classification system in a robot. This robot, named Kismet, responds to caretakers by way of sight and sound. An integrated affective intent classification system allows the basic recognition and modelling of primary emotions. The robot approximately models an infant that responds to affirmation, prohibition, attention and soothing. After more research, this could be extended to a fuller set of emotions or affective states, allowing the robot to interact naturally with human operators.

1.2.5 Employment

Voice profiling can help streamline the processing of job applicant interviews. By interviewing applicants through an automated telephone system, the responses can be analysed for specific qualities which can be mapped to different positions within the company. For example, if a company is screening applicants for job openings in multiple departments, e.g., sales or customer support, the applicants can be automatically sorted into groups based on how their voice profile fits the target profile for each category. Job positions where an employee is constantly interacting with customers may require specific voice qualities.
An applicant with a monotone pitch contour can be screened out automatically, and an applicant with a melodic pitch contour can be placed in a sales category for further inspection. Such a system would not be designed to completely take the place of human interviewers, but can greatly reduce the time requirements for selecting candidates.

1.2.6 Call-centres

Last, we look at applications of emotionally intelligent systems in call-centres. This is the primary focus of the end result of this research. Call-centres often have the difficult task of managing customer disputes. Ineffective resolution of these disputes can often lead to customer discontent, loss of business and, in extreme cases, general customer unrest where a large number of customers move to a competitor. It is therefore important for call-centres to take note of isolated disputes and effectively train customer service representatives (CSRs) to handle disputes in a way that keeps the customer satisfied (Petrushin, 2000). A team lead or manager may also want to inquire about the status of any currently active calls in order to help coach new or inexperienced CSRs.

A manager can use the information provided by a spoken affect recognition system in several other ways. First, if such a system is deployed with each CSR, then a manager or senior member of staff can preview the emotional states of every caller at once, having an "overview" snapshot in real time. Other uses include the generation of statistics on the number of angry or upset callers each CSR has, or on whether any CSRs are becoming angry with customers. This can lead to action to correct this behaviour or to identify things that a CSR can improve on, and in turn help the call-centre manage the customer base more effectively.

Automated telephone systems, which people find themselves interacting with more and more, are another potential application area. These systems have speech recognition units that process user requests through spoken language.
A spoken affect recognition system can help process callers according to perceived urgency. If a caller is detected as being angry or confused in the automated system, their call can be switched over to a human operator for assistance. This could be particularly useful for the elderly, who can often be disoriented when interacting with automated telephone systems. Petrushin (2000) built a system to monitor voice-mail messages in a call-centre and prioritise them with respect to emotional content. Such systems can make interaction with automated call-centres more efficient and less daunting.

1.3 Methodology

In this section we present the methodology followed during the development of this thesis. Figure 1.2 shows a flow diagram describing the methodology. Because the research focus is primarily a classification problem, namely the classification of different emotions, the methodology followed is much like that of any other classification problem.

The first step is a review of the literature relevant to the field. Previous research on automatic emotion recognition was surveyed to build a knowledge of the state of the art.

Once a general knowledge of the state of the art was achieved, data had to be collected. Fortunately, a natural speech database was provided through the partner company for this project. A second speech database was obtained from a previous study on emotion research (Nwe, 2003). Unlike the natural set, this database used actors and actresses. This provided a way to compare the classification methods on different types of data as well as investigate inherent differences between the two datasets. To gain a ground truth on the natural database, a system was developed to allow human listeners to judge the emotions present in the database.

Figure 1.2: A flow diagram of the methodology followed for this thesis.

Next, characteristics of emotional speech from the existing literature were reviewed.
Prominent psychologists such as Klaus Scherer, who have explored emotion research for many years, provide a strong basis for this area. These characteristics were extracted and compiled into feature vectors. These feature vectors describe the most relevant characteristics of emotional speech. Briefly, these include the fundamental frequency, energy, and formant frequency contours as well as features relating to rhythm such as the rate of speech.

Classification algorithms were then reviewed. As a starting point, artificial neural networks were experimented with, as they have proven quite useful in previous studies. These were subsequently improved upon using support vector machines. Feature selection techniques such as forward selection, genetic search, and principal component analysis were compared to reduce dimensionality in the feature space.

We then tested two ensemble classification approaches that are novel in this field: stacked generalisation and a simple voting scheme. Stacked generalisation takes as input base-classifier predictions and target classes and attempts to predict when the base-classifiers are incorrect. The voting scheme takes the predicted classes from each base-level classifier and determines the class with the greatest popularity.

The last step was to build an implementation of the theoretical system. This took all previous steps (the algorithms for endpoint detection and feature extraction, the use of the feature-selected sets, and classification) and brought them together into a single, modular system. This application reads input from a microphone or WAVE file and outputs a prediction based on the recorded speech sample. A modular artificial neural network functions as a plug-in to facilitate efficient replacement. Figure 1.3 shows the data flow for the emotion recognition system.
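The voting scheme described earlier in this section can be illustrated with a minimal Python sketch. This is not the implementation used in the thesis; the function name, the example labels, and the tie-breaking rule (the first label to reach the top count wins) are assumptions made purely for illustration.

```python
from collections import Counter


def unweighted_vote(predictions):
    """Return the class label predicted by the most base classifiers.

    `predictions` holds one predicted label per base classifier, and every
    classifier counts equally. Ties go to the label that appears first in
    the list (an assumed rule; the text does not specify one).
    """
    if not predictions:
        raise ValueError("at least one base-classifier prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    # Scan in input order so that tie-breaking is deterministic.
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical labels from five base classifiers for a single utterance.
base_predictions = ["anger", "neutral", "anger", "anger", "neutral"]
print(unweighted_vote(base_predictions))
```

Because each base classifier contributes exactly one vote, no confidence scores or classifier weights are needed, which is what distinguishes this scheme from weighted voting.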
[Figure 1.3 blocks: speech signal, end-point detection, feature extraction and post-processing, training data, classification, output]

Figure 1.3: A data flow diagram of the real-time emotion recognition system.

1.4 Structure of the thesis

This thesis is organised as follows. In Chapter 2, a brief history of emotion research and theoretical representations of emotion are presented. This chapter also introduces the expression of emotion in humans and lists previous work in automatic spoken emotion recognition. Some areas which require additional attention are defined.

Chapter 3 presents the three data acquisition methods that are applied to vocal emotion research. Next, the two emotional speech datasets used in this research are introduced. The first database is collected from a call-centre and consists of natural interactions between humans. The second database is collected from non-professional actors and actresses. The advantages and disadvantages of each collection method and how it affects the research are discussed in detail.

Different emotions induce different physiological changes in the body, which in turn directly affect prosodic patterns in speech. Chapter 4 formalises and reviews correlations and characteristics of emotional speech.

Building on Chapter 4, Chapter 5 explores features chosen to describe emotional content contained in speech. These features are taken from previous research, and experimental features based on the formant frequencies are investigated.

Chapter 6 introduces several classification algorithms used in this research. These algorithms are compared against each other in an attempt to reveal the most efficient and suitable candidate for use in the system. Feature selection methods are also compared. Next, we introduce two ensemble techniques: stacked generalisation and unweighted vote.

Chapter 7 presents the experimental results based on the classification and feature selection algorithms described in Chapter 6.
This chapter also offers an in-depth look at the building of a prototype emotion classification system. The system is developed using existing algorithms and is brought together using C and C++. It functions in real-time and performs automatic classification via a modular artificial neural network. Finally, Chapter 8 presents a conclusion and directions for future work.