Copyright is owned by the Author of the thesis.  Permission is given for 
a copy to be downloaded by an individual for the purpose of research and 
private study only.  The thesis may not be reproduced elsewhere without 
the permission of the Author. 
 

Spoken Affect Classification: Algorithms and 
Experimental Implementation 

A thesis presented in partial 
fulfilment of the requirements 

for the degree of 
Master of Science 

in Computer Science 

at Massey University, 
Palmerston North, New Zealand. 

Donn Alexander Morrison 
2005 


Para Consuela, mi fiel furgoneta


Abstract 

Machine-based emotional intelligence is a requirement for natural interaction between humans 

and computer interfaces, and a basic level of accurate emotion perception is needed for

computer systems to respond adequately to human emotion. Humans convey emotional information

both intentionally and unintentionally via speech patterns. These vocal patterns are perceived 

and understood by listeners during conversation. This research aims to improve the automatic 

perception of vocal emotion in two ways. First, we compare two emotional speech data sources: 

natural, spontaneous emotional speech and acted or portrayed emotional speech. This

comparison demonstrates the advantages and disadvantages of both acquisition methods and how

these methods affect the end application of vocal emotion recognition. Second, we look at two 

classification methods which have gone unexplored in this field: stacked generalisation and 

unweighted vote. We show how these techniques can yield an improvement over traditional 

classification methods. 


Acknowledgements 

I would like to thank my supervisor, Dr. Ruili Wang, for putting faith in me and allowing me to

pursue this degree under scholarship. Without this financial help, it would have been unfeasible. 

His tireless direction and advanced motivational techniques also helped keep my focus. 

My co-supervisors, Dr. Liyanage C. De Silva and Dr. Peter Xu, also lent support

throughout the research. Their extensive experience was indispensable at times when I needed support.

I also owe gratitude to Pete Morrison of Mabix International for his unique insight into the 

research. He provided the data used in this research; without it, this work would not have been possible.

The second speech database was provided by Tin Lay Nwe of the National University of 

Singapore. This database was collected and compiled by her and was graciously provided to aid 

in this research. 

And to my partner Jen, who endured countless late nights, filled me with confidence when 

I lacked it, fed me when I didn't have time to feed myself, bathed me when ... well, you get the

picture. 

Of course, I thank my family, Bill, Debb, Bugs, Michael, Todd, and Jodybird, for their 

constant support and love, all the way across the Pacific Ocean. 

To my postgraduate friends at Massey University, who helped create a fun and relaxed

working environment: Cath, Frank, Matthew, Michael, Stefan, and Yiming, among others. And

Francis, for always seeming to be in Australia when I needed it the most. 

Last, I would like to thank the developers of the many free and open-source software tools I 

used during the research. Packages such as LaTeX, Gnuplot, OpenOffice, Graphviz, Vim, Octave,

donnrisk, Praat, The Speech Filing System, WEKA, The GIMP, Mozilla Firefox, and most of 

all Debian GNU/Linux were where I spent most of my time this past year and were instrumental 

in the development and completion of this work. 


Table of Contents

Table of Contents
List of Figures
List of Tables

1 Introduction
    1.1 Introduction
    1.2 Research motivations and applications
        1.2.1 Health and public safety
        1.2.2 Education
        1.2.3 Fraud and crime prevention
        1.2.4 Leisure and entertainment
        1.2.5 Employment
        1.2.6 Call-centres
    1.3 Methodology
    1.4 Structure of the thesis

2 Foundations and background
    2.1 Introduction
    2.2 A brief history of emotion research
    2.3 Theoretical representations of emotion
        2.3.1 Discrete emotion theory
        2.3.2 Dimensional emotion theory
        2.3.3 Summary of theoretical representations of emotion
    2.4 Defining the emotion classes
        2.4.1 Primary, secondary, and tertiary emotions
        2.4.2 Primitive emotions
        2.4.3 The basic emotions
        2.4.4 Summary of emotion classes
    2.5 Emotional expression in humans
        2.5.1 Channels of expression
        2.5.2 Ekman's display rules
        2.5.3 The human speech production apparatus
        2.5.4 Physiological responses to the emotions
    2.6 Review of the research on vocal emotion recognition
        2.6.1 Instance-based learners
        2.6.2 Artificial neural networks
        2.6.3 Probabilistic methods
        2.6.4 Decision trees
    2.7 Areas for improvement
    2.8 Summary

3 Emotional Speech Data Acquisition
    3.1 Introduction
    3.2 Emotional speech acquisition
        3.2.1 Natural expression
        3.2.2 Induced expression
        3.2.3 Simulated expression
        3.2.4 Summary of acquisition methods
    3.3 Databases of emotional speech
        3.3.1 Natural data collected from a call-centre
        3.3.2 Simulated data from the ESMBS database
    3.4 Summary

4 Acoustic Correlates to Emotional States
    4.1 Introduction
    4.2 Prosody-based features
        4.2.1 Fundamental frequency and emotional speech
        4.2.2 Formant frequencies and emotional speech
        4.2.3 The use of energy as an emotional marker
        4.2.4 Rhythm-based characteristics
    4.3 Summary

5 Feature Extraction
    5.1 Introduction
    5.2 Features used in past works
    5.3 Chosen features
    5.4 Extraction methods
        5.4.1 Methods for pitch tracking
        5.4.2 Formant frequencies
        5.4.3 Short-time energy
        5.4.4 Rhythm-based statistics
    5.5 Summary

6 Classification
    6.1 Introduction
    6.2 Traditional classification approaches
        6.2.1 Support vector machines
        6.2.2 Random Forests
        6.2.3 Artificial neural networks
        6.2.4 K* instance-based classifier
        6.2.5 K-nearest neighbours
    6.3 Ensemble classification methods
        6.3.1 Unweighted vote
        6.3.2 Stacked generalisation
    6.4 Stratified cross-validation
    6.5 Feature selection
        6.5.1 Principal components analysis
        6.5.2 Forward selection
        6.5.3 Genetic search
    6.6 Summary

7 Experimental Results and Prototype Implementation
    7.1 Introduction
    7.2 Experimental results
        7.2.1 Performance of base classifiers
        7.2.2 Performance of ensemble classifiers
        7.2.3 Performance after feature selection
        7.2.4 Summary of results
    7.3 Prototype implementation
        7.3.1 Endpoint detection
        7.3.2 Feature extraction
        7.3.3 Real-time processing
        7.3.4 Classification
        7.3.5 Summary of prototype development
    7.4 Summary

8 Conclusion and Future Work
    8.1 Summary of main findings
    8.2 Contributions
    8.3 Future work

A Emotional Speech Database Annotation System (ESDAS)
    A.1 Introduction
    A.2 How it works
    A.3 Screenshot
    A.4 Use studies
    A.5 Conclusions

B Other Figures

Bibliography


List of Figures

1.1 Applications of vocal emotion recognition
1.2 A flow diagram of the methodology followed for this thesis.
1.3 A data flow diagram of the real-time emotion recognition system.
2.1 The dimensional representation of emotion (from (Scherer, 2001))
2.2 Channels of expression and their relation to perception in humans
2.3 A cross-sectional X-ray of the human speech system (from (Flanagan et al., 1970)).
2.4 A schematic model of the human speech production system (from (Flanagan, 1972)).
4.1 Example pitch contours for anger and neutral utterances from the NATURAL dataset. The contour for angry speech typically has a much wider range, while neutral speech is narrow and monotonous.
5.1 Data flow diagram for the feature extraction process
5.2 Comparison between the autocorrelation method and RAPT for pitch tracking for two sample utterances from the NATURAL dataset. (a) and (b) show the differences for the first sample, and (c) and (d) show the differences for the second sample.
5.3 Block diagrams depicting the (a) extraction of and (b) reconstruction using the linear prediction coding coefficients.
5.4 Two sample formant frequency contours calculated using a 20 ms window on example utterances from the NATURAL dataset. The first formant (F1) has the lowest frequency, followed by F2, followed by F3 with the highest frequency.
5.5 Two sample energy envelopes calculated using a 10 ms window on example utterances from the NATURAL dataset.
6.1 Example support vector mapping from input space to feature space.
6.2 An example support vector machine using the radial basis function. The support vectors are represented by the outlined shapes and constitute a maximum margin from the decision surface (solid line).
6.3 An example one-hidden layer artificial neural network architecture. Circles represent the nodes in each layer. The input layer contains nodes which correspond to each feature in the input vector. The output layer contains nodes that carry the result of the propagation of information throughout the network.
6.4 Illustration of Stacking and StackingC on a three-class dataset (a, b, c) with n training examples and N base classifiers. P_{i,jk} denotes the class prediction from classifier i for class j on example k (from (Seewald, 2002b)).
6.5 An illustration of a cross-validation example where the dataset has been partitioned into four sets. The dark rectangle represents the partition used as the test set, and the white rectangles represent the training sets (from (Haykin, 1999)).
6.6 Pseudocode describing ten x ten-fold cross-validation.
6.7 Pseudocode describing the forward selection algorithm.
6.8 Data flow diagram describing the process of genetic search over a feature space (adapted from (Dieterle, 2003)).
7.1 Data flow diagram for prototype system
7.2 A sample utterance with endpoints highlighted. The dark grey regions indicate the silence preceding and following the utterance.
7.3 Graphical representation of the ANN architecture used in the prototype implementation.
7.4 C++ source code function for the dynamic loading of the ANN module for classification. The module is loaded (lines 6 and 9), the address of the classification procedure is then located (lines 16 and 18), the procedure is invoked (lines 24 and 26), and finally the module is unloaded (lines 35 and 37). The feature vector corresponding to the input layer is contained in the variable in and the prediction corresponding to the output layer is contained in the variable out.
7.5 Screen capture of the prototype implementation.
A.1 ESDAS interface
B.1 The relationships between primary, secondary, and tertiary emotions (after (Parrot, 2001))


List of Tables

3.1 Summary of the datasets used in this study. The NATURAL dataset is collected from a call-centre and the ESMBS dataset is obtained from a previous study and consists of utterances by non-professional actors and actresses.
3.2 Distribution of perceived speaker affect from natural corpus (NATURAL)
3.3 Sample utterances from the NATURAL database.
3.4 Human classification performance by emotion categories (from (Nwe, 2003))
4.1 Speech correlations of the basic emotions.
5.1 38 prosodic features selected for input into classification algorithms. Features are divided into six groups: fundamental frequency (F0), first three formant frequencies (F1, F2, F3), short-time energy, and rhythm.
6.1 Initial ranking of base classification algorithms on the NATURAL dataset.
6.2 Results for the selection of the number of nodes in the hidden layer of the multi-layer perceptron
7.1 Confusion matrices for the support vector machine with RBF kernel on the NATURAL and ESMBS datasets.
7.2 Confusion matrices for the multi-layer perceptron on the NATURAL and ESMBS datasets.
7.3 Confusion matrices for the K-nearest neighbour classifier (with K = 5) on the NATURAL and ESMBS datasets.
7.4 Confusion matrices for the K* instance-based learner on the NATURAL and ESMBS datasets.
7.5 Confusion matrices for the random forest on the NATURAL and ESMBS datasets.
7.6 Confusion matrices for the StackingC classifier on the NATURAL and ESMBS datasets.
7.7 Confusion matrices for the unweighted vote classifier on the NATURAL and ESMBS datasets.
7.8 Resulting feature subsets after feature selection. PCA = principal components analysis; FW = forward selection; GA = genetic algorithm. PCA datasets have been transformed back into the original feature space for labelling purposes and have the top 25 principal components retained.
7.9 Average percentages of correctly classified instances from the NATURAL and ESMBS datasets for all classification methods. For acronyms in the dataset column, ORIG = original feature set; PCA = principal components analysis; FW = forward selection; GA = genetic algorithm.
7.10 Average times for feature extraction compared with the average length of an utterance in the database. The statistics calculations include the maximum, minimum, mean, standard deviation, range (for pitch, energy, formants) and speaking rate.


Chapter 1 

Introduction 

1.1 Introduction 

With the ever-increasing importance of and reliance on computers in our society comes the

unnatural burden of interacting with those systems. This increase in human-computer interaction

has, in turn, led to a marked increase in research on modelling such systems against human

behaviour in an effort to enable more natural interaction. For this to succeed, these systems must

have at least a basic level of emotional intelligence. 

Emotional intelligence is defined by Salovey et al. (2004) as having four branches: the 

perception of emotion, emotions facilitating thought, understanding emotions, and managing 

emotions. These will be discussed below, with the exception of emotions facilitating thought, 

as this assumes the ability to think independently, which current computer systems cannot do.

The perception of emotion is the ability to recognise emotion in oneself and others. These 

perceptions generally come from three channels: sight, sound, and language or contextual

information present in text or prose. For example, a person may recognise that his or her friend

feels distraught by the expression in the face or the tone of the voice. The perception of emotion 

also covers the recognition of emotion in oneself. An emotionally intelligent being is aware of 

the emotions expressed in itself at any time. 

Following perception, an emotionally intelligent being must be able to understand emotions 

and emotional characteristics in order to correctly process and respond to emotional

information. This consists of the knowledge of how emotions relate to one another, what causes them,

what follows them, etc. Take, for example, a person who becomes angry at him or herself for

missing the bus to work before an important meeting. The ability to determine the causes of this 

anger (e.g., the bus that is missed) is a critical part of emotional intelligence. An emotionally 

intelligent being will be aware of emotional changes and their nature. 

Emotional understanding is a prerequisite for managing emotions. An emotionally

intelligent being is one that can be open to all types of emotion, reflect on them, manage them in


oneself, and engage, prolong, or detach from an emotional state in oneself or others (Oatley, 

2004). A hypothetical situation may involve a doctor tending to a critically injured relative. The 

doctor must manage his or her emotions in order to operate in an effective manner. 

Humans feel most natural communicating with other humans because the extra information 

conveyed in their emotional expressions can be recognised, processed, and reflected. This

information is conveyed through several modes: facial expressions, vocal properties, bodily gestures,

and behaviour. This added information helps people understand each other and interact more 

naturally and efficiently. 

The work in this thesis is dedicated to the perception of human emotion from the prosodic 

properties of speech. In other words, this thesis aims to build a system that can capture and 

interpret the vocal expression of emotion in humans. More specifically, we seek to improve on 

traditional emotional speech classification methods using ensemble or multi-classifier system 

(MCS) approaches. We also aim to examine the differences in perceiving emotion in human 

speech that is derived from different methods of acquisition. For example, how is the perception 

of acted emotion different from that of spontaneous or naturally occurring emotion? 

1.2 Research motivations and applications 

There are wide-ranging applications for emotionally intelligent systems in real-world situations. 

Taking advantage of the emotional information in speech allows more effective processing of 

the contextual (language) information and a much more natural interaction between humans and 

machines. The following are some examples of how emotion recognition can yield improvement 

in the field of human-computer interaction. Figure 1.1 shows the relationships between vocal 

emotion recognition and potential application areas. 

1.2.1 Health and public safety 

Situations in which public safety is a major issue would greatly benefit from real-time automatic 

affect recognition. For example, such a system could be placed in the cockpits of airliners, 

ocean liners, and buses, where one or two principal operators control the fate of the vessel. These

systems would be used to detect pilot boredom, inattention, or fatigue (Pantic and Rothkrantz,

2003). In private vehicles, detection of anger could reduce incidents of road rage by alerting the 

driver and trying to make them aware of the situation (Fragopanagos and Taylor, 2005). 

Affect recognition could also remove the need for human observers to monitor or record

constantly in settings where security or safety is a concern, such as hospitals, prisons, and sites

covered by closed-circuit security systems (Pantic and Rothkrantz, 2003). Such systems could alert

personnel to situations such as disputes, accidents, riots, or fighting.


Figure 1.1: Applications of vocal emotion recognition

1.2.2 Education 

Perception of human affect is important in areas where subjects are being taught or instructed. 

Human teachers can recognise student boredom, fatigue, and confusion and are then able to 

take steps to revive attention levels, or perhaps terminate the instruction if too many students are 

unable to process the material effectively.

Emotion and affect recognition from speech would be beneficial in an automated tutoring 

environment. The system could determine the affective states of the students and depending 

on how well they appear to be learning, or based on feedback (levels of frustration, confusion, 

boredom, fatigue, etc.), adjust the rate at which the information is presented to make the learning 

as efficient as possible (Picard, 1997). 

1.2.3 Fraud and crime prevention 

Voice profiling is directly related to vocal affect recognition. Voice profiling aims to classify 

speech samples according to predefined psychological profiles. These profiles can be generated 

or trained on pathological examples. 

The use of voice profiling for fraud detection can be a useful measure to reduce the number 

of fraudulent insurance claims for insurance companies. The time needed to process claims can 

be reduced if claims that are potentially fraudulent are eliminated early on in the process. A 

system could be easily developed that allows claimants to provide information about their claim 

over the telephone with a disclaimer stating that their voice profile will be analysed for signs of 


fraud. If the analysis comes back positive for possible fraud, the customer can be notified of the 

result and offered an opportunity to retract their claim without penalty. Such a system does have

obvious drawbacks: for example, people may be discouraged from submitting a valid claim over

fears of a false positive from the voice profile analysis.

Another practical use of voice profiling would be for police and security in interviewing 

suspects for criminal cases. Suspects could be interviewed and their speech analysed by profiling 

software that could detect pathological patterns correlating to lying or nervousness. As with the 

above scenario, however, there are many ethical issues relating to this application and its output 

would have to be used only as one of many sources of information during interrogation. 

1.2.4 Leisure and entertainment 

An area ripe for new applications of emotion perception is that of leisure and entertainment. 

Here, the technology is applied in anecdotal ways. An example is the Sony ERS-7 Aibo

Entertainment Robot. This robotic pet dog learns from interaction with its "owner" and can express

different emotional states. 

Computer video games are the result of billions of dollars of research and development 

investment aimed at making the player feel like he or she is experiencing reality. Emotion 

detection and synthesis in these games could greatly improve the gaming experience. Online 

games such as Everquest where human players interact with other human and computer players 

can benefit from both emotion recognition and synthesis to enhance the experience. Interaction 

with computer characters is often unnatural due to the lack of emotional understanding on the 

part of the computer character. Adding an affective element to these characters would introduce 

an entire new level to the gaming experience, providing a much more natural environment that 

would more closely model reality. This can be accomplished by integrating speech and facial 

expression recognition using cameras and microphones to measure the human player's affect. 

This affect can then be transmitted to other human or computer players in the game (Nakatsu 

et al., 1999).

The research of Breazeal and Aryananda (2002) has primarily focused on the integration of 

a multi-modal emotion classification system in a robot. This robot, named Kismet, responds to 

caretakers by way of sight and sound. An integrated affective intent classification system allows 

the basic recognition and modelling of primary emotions. The robot approximately models an 

infant that responds to affirmation, prohibition, attention and soothing. After more research, this 

could be extended to a fuller set of emotions or affective states, allowing the robot to interact

naturally with human operators. 


1.2.5 Employment 

Voice profiling can help streamline the processing of job applicant interviews. By interviewing 

applicants through an automated telephone system, the responses can be analysed for specific 

qualities which can be mapped to different positions within the company. For example, if a 

company is screening applicants for job openings in multiple departments, e.g., sales or

customer support, the applicants can be automatically sorted into groups based on how their voice

profile fits the target profile for each category. Job positions where an employee is constantly 

interacting with customers may require specific voice qualities. An applicant with a monotone 

pitch contour can be screened out automatically, and an applicant with a melodic pitch contour 

can be placed in a sales category for further inspection. Such a system would not be designed to 

completely take the place of human interviewers, but can greatly reduce the time requirements 

for selecting candidates. 

1.2.6 Call-centres 

Last, we look at applications of emotionally intelligent systems in call-centres. This is the

primary focus of the end result of this research. Call-centres often have the difficult task of managing

customer disputes. Ineffective resolution of these disputes can often lead to customer

discontent, loss of business and, in extreme cases, general customer unrest where a large number of

customers move to a competitor. It is therefore important for call-centres to take note of isolated 

disputes and effectively train service representatives to handle disputes in a way that keeps the 

customer satisfied (Petrushin, 2000). 

A team lead or manager may also want to check the status of any currently

active calls in order to help coach new or inexperienced customer service representatives (CSRs). A manager can

use the information provided by a spoken affect recognition system in several other ways. First,

if such a system is deployed with each CSR, then a manager or senior member of staff can 

preview the emotional states of every caller at once, having an "overview" snapshot in real time. 

Other uses include the generation of statistics on the number of angry or upset callers each CSR 

has, or on whether any CSRs are expressing anger towards customers. This can lead to action to correct

this behaviour or find things that a CSR can improve on and in turn help the call-centre more 

effectively manage the customer base. 

Automated telephone systems, with which people find themselves interacting more and more, are another

potential application area. These systems have speech recognition units that process

user requests through spoken language. A spoken affect recognition system can help process 

callers according to perceived urgency. If a caller is detected as being angry or confused in 

the automated system, their call can be switched over to a human operator for assistance. This 

could be particularly useful for the elderly who can often be disoriented when interacting with 


automated telephone systems. Petrushin (2000) built a system to monitor voice-mail messages 

in a call-centre and prioritise them with respect to emotional content. Such systems can make 

interaction with automated call-centres more efficient and less daunting. 

1.3 Methodology 

In this section we present the methodology followed during the development of this thesis.

Figure 1.2 shows a flow diagram describing the methodology. Because the research focus is

primarily a classification problem, that being the classification of different emotions, the methodology

followed is much like that for any other classification problem. The first step was a review of the literature

relevant to the field. Previous research on automatic emotion recognition was surveyed to build 

a knowledge of the state of the art. 

Once a general knowledge of the state of the art was achieved, data had to be collected.

Fortunately, a natural speech database was provided through the partner company for this project. A

second speech database was obtained from a previous study on emotion research (Nwe, 2003).

Unlike the natural set, this database used actors and actresses. This provided a way to compare 

the classification methods on different types of data as well as investigate inherent differences 

between the two datasets. To gain a ground truth on the natural database, a system was

developed to allow human listeners to judge the emotions present in the database.

Figure 1.2: A flow diagram of the methodology followed for this thesis. 

Next, characteristics of emotional speech from the existing literature were reviewed.

Prominent psychologists such as Klaus Scherer, who have explored emotion research for many years,

provide a strong basis for this area. These characteristics were extracted and compiled into 

feature vectors. These feature vectors describe the most relevant characteristics of emotional 


speech. Briefly, these include the fundamental frequency, energy, and formant frequency

contours, as well as features relating to rhythm such as the rate of speech.
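
As a rough illustration of how a variable-length contour can be reduced to a fixed-length feature vector, the following Python sketch computes a short-time energy envelope and a handful of summary statistics over it. The function names, the non-overlapping framing, the 10 ms frame at an 8 kHz sampling rate, and the synthetic sine input are all assumptions made for the example; they are not taken from the extraction code used in this thesis.

```python
import math
from statistics import mean, pstdev

def short_time_energy(samples, frame_len):
    """Sum of squared amplitudes over consecutive, non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def contour_statistics(contour):
    """Reduce a variable-length contour to five summary statistics."""
    return {
        "max": max(contour),
        "min": min(contour),
        "mean": mean(contour),
        "std": pstdev(contour),
        "range": max(contour) - min(contour),
    }

# Hypothetical input: 1 s of a 100 Hz sine sampled at 8 kHz (not thesis data).
rate = 8000
signal = [math.sin(2 * math.pi * 100 * n / rate) for n in range(rate)]
energy = short_time_energy(signal, frame_len=80)  # 80 samples = 10 ms at 8 kHz
features = contour_statistics(energy)
```

The same `contour_statistics` helper could in principle be applied to a pitch or formant contour, so that every utterance yields a vector of the same length regardless of its duration.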

Classification algorithms were then reviewed. As a starting point, we experimented with artificial neural networks,

as they have proven quite useful in previous studies. These were

subsequently improved upon using support vector machines. Feature selection techniques such as

forward selection, genetic search, and principal component analysis were compared to reduce 

dimensionality in the feature space. 
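
To make the idea of forward selection concrete, the sketch below shows one common greedy formulation in Python: starting from an empty set, the feature that most improves a score is added until no candidate helps. The `score` callable, the feature names, and the toy scoring function are hypothetical; in practice the score would typically be a cross-validated classification accuracy, and the exact variant used in this thesis may differ.

```python
def forward_selection(features, score):
    """Greedily add the feature that most improves score(subset); stop when none helps."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        candidate_scores = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidate_scores, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Toy score (an assumption for illustration): two features are useful,
# and every extra feature costs a small penalty.
USEFUL = {"f0_mean": 0.30, "energy_range": 0.20}

def toy_score(subset):
    return 0.40 + sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)

chosen, accuracy = forward_selection(
    ["f0_mean", "f1_mean", "energy_range", "speaking_rate"], toy_score
)
```

With this toy score the search keeps the two useful features and stops once a third would only add the penalty.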

We then tested two ensemble classification approaches that are novel in this field: stacked

generalisation and a simple voting scheme. Stacked generalisation takes as input base-classifier

predictions and target classes and attempts to predict when the base-classifiers are incorrect. 

The voting scheme takes the predicted classes from each base-level classifier and determines 

the class with the greatest popularity. 
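
The two ensemble schemes described above can be sketched in a few lines of Python. This is a deliberately simplified illustration: the meta-learner here is a lookup table over hard class labels, the class name `TableMetaLearner` and the example predictions are invented for the sketch, and a proper stacked generalisation setup would train the meta-learner on held-out (cross-validated) base predictions, often using class probabilities rather than labels.

```python
from collections import Counter, defaultdict

def unweighted_vote(predictions):
    """Return the class predicted by the most base classifiers (ties: first seen)."""
    return Counter(predictions).most_common(1)[0][0]

class TableMetaLearner:
    """Toy level-1 learner: maps a tuple of base predictions to the most
    frequent true class seen with that tuple during training."""

    def fit(self, base_predictions, targets):
        counts = defaultdict(Counter)
        for preds, target in zip(base_predictions, targets):
            counts[tuple(preds)][target] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        return self

    def predict(self, preds):
        # Fall back to a plain vote for prediction patterns never seen in training.
        return self.table.get(tuple(preds), unweighted_vote(preds))

# Hypothetical predictions from three base classifiers, with the true classes.
train_preds = [
    ("anger", "anger", "neutral"),
    ("neutral", "anger", "neutral"),
    ("neutral", "anger", "neutral"),
    ("neutral", "neutral", "neutral"),
]
train_targets = ["anger", "anger", "anger", "neutral"]

meta = TableMetaLearner().fit(train_preds, train_targets)
pattern = ("neutral", "anger", "neutral")
vote = unweighted_vote(pattern)   # the majority of base classifiers say "neutral"
stacked = meta.predict(pattern)   # the meta-learner has learned this pattern means "anger"
```

In this toy data the second base classifier is the reliable one, so the meta-learner overrides the majority where the plain vote cannot.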

The last step was to build an implementation of the theoretical system. This brought together all previous

steps (the algorithms for endpoint detection and feature extraction, the use of the feature-selected

sets, and classification) into a single, modular system. This application

reads input from a microphone or WAVE file and outputs a prediction based on the recorded 

speech sample. A modular artificial neural network functions as a plug-in to facilitate efficient 

replacement. Figure 1.3 shows the data flow for the emotion recognition system. 
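
The modular structure described above can be pictured with a small Python sketch in which the classifier is passed in as a replaceable component. The energy-threshold endpoint detector, the frame length, the threshold value, and the toy extractor and classifier are all assumptions chosen for brevity; the prototype itself is written in C and C++ and its algorithms are described in Chapter 7.

```python
def detect_endpoints(samples, frame_len, threshold):
    """Return (start, end) sample indices bounding the frames whose mean energy exceeds threshold."""
    active = [
        i for i in range(0, len(samples) - frame_len + 1, frame_len)
        if sum(x * x for x in samples[i:i + frame_len]) / frame_len > threshold
    ]
    if not active:
        return None
    return active[0], active[-1] + frame_len

def recognise(samples, extract, classify, frame_len=80, threshold=0.01):
    """Run endpoint detection, feature extraction, then a pluggable classifier."""
    bounds = detect_endpoints(samples, frame_len, threshold)
    if bounds is None:
        return None  # no speech found
    start, end = bounds
    return classify(extract(samples[start:end]))

# Hypothetical stand-ins for the real feature extractor and trained classifier.
def toy_extract(utterance):
    return [max(utterance) - min(utterance)]

def toy_classify(features):
    return "anger" if features[0] > 1.0 else "neutral"

silence = [0.0] * 160
loud = [0.9, -0.9] * 80  # 160 samples of high-amplitude signal
label = recognise(silence + loud + silence, toy_extract, toy_classify)
```

Because `classify` is just a callable, a different trained model can be swapped in without touching the endpoint detection or feature extraction stages.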

[Figure 1.3 blocks: Speech signal; End-point detection; Feature extraction & post-processing; Training data; Classification; Output]

Figure 1.3: A data flow diagram of the real-time emotion recognition system. 

1.4 Structure of the thesis 

This thesis is organised as follows. In Chapter 2, a brief history of emotion research and

theoretical representations of emotion are presented. This chapter also introduces the expression

of emotion in humans and lists previous work in automatic spoken emotion recognition. Some 

areas which require additional attention are defined. 

Chapter 3 presents the three data acquisition methods that are applied to vocal emotion

research. Next, the two emotional speech datasets used in this research are introduced. The first

database is collected from a call-centre and consists of natural interactions between humans. 

The second database is collected from non-professional actors and actresses. The advantages 

and disadvantages of each collection method and how it affects the research are discussed in 


detail. 

Different emotions induce different physiological changes in the body, which in turn directly 

affect prosodic patterns in speech. Chapter 4 formalises and reviews correlations and

characteristics of emotional speech.

Building on Chapter 4, Chapter 5 explores features chosen to describe emotional content 

contained in speech. These features are taken from previous research, and experimental features

based on the formant frequencies are investigated. 

Chapter 6 introduces several classification algorithms used in this research. These algorithms 

are compared against each other in an attempt to reveal the most efficient and suitable candidate 

for use in the system. Feature selection methods are also compared. Next, we introduce two 

ensemble techniques: stacked generalisation and unweighted vote. 

Chapter 7 presents the experimental results based on the classification and feature selection 

algorithms described in Chapter 6. This chapter also offers an in-depth look at the building of 

a prototype emotion classification system. The system is developed using existing algorithms 

and is brought together using C and C++. It functions in real time and performs automatic

classification via a modular artificial neural network. 

Finally, Chapter 8 presents a conclusion and directions for future work.