Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author.

A COMPARISON OF TREE-BASED AND TRADITIONAL CLASSIFICATION METHODS

A thesis presented in partial fulfilment of the requirements for the Degree of PhD in Statistics at Massey University.

Robert D Lynn
1994

ABSTRACT

Tree-based discrimination methods provide a way of handling classification and discrimination problems by using decision trees to represent the classification rules. The principal aim of tree-based methods is the segmentation of a data set, in a recursive manner, such that the resulting subgroups are as homogeneous as possible with respect to the categorical response variable.

Problems often arise in the real world involving cases with a number of measurements (variables) taken from them. Traditionally, in such circumstances involving two or more groups or populations, researchers have used parametric discrimination methods, namely linear and quadratic discriminant analysis, as well as the well-known non-parametric kernel density estimation and Kth nearest neighbour rules. In this thesis, all the above types of methods are considered and presented from a methodological point of view. Tree-based methods are summarised in chronological order of introduction, beginning with the Automatic Interaction Detector (AID) method of Morgan and Sonquist (1963) through to the IND method of Buntine (1992).

Given a set of data, the proportion of observations incorrectly classified by a prediction rule is known as the apparent error rate. This error rate is known to underestimate the actual or true error rate associated with the discriminant rule applied to a set of data. Various methods for estimating this actual error rate are considered. Cross-validation is one such method, which involves omitting each observation in turn from the data set, calculating a classification rule based on the remaining (n - 1) observations and classifying the observation that was omitted. This is carried out n times, that is, once for each observation in the data set, and the total number of misclassified observations is used as the estimate of the error rate.

Simulated continuous explanatory data were used to compare the performance of two traditional discrimination methods, linear and quadratic discriminant analysis, with two tree-based methods, Classification and Regression Trees (CART) and Fast Algorithm for Classification Trees (FACT), using cross-validation error rates. The results showed that linear and/or quadratic discriminant analysis are preferred for normal, less complex data and parallel classification problems, while CART is best suited to lognormal, highly complex data and sequential classification problems. Simulation studies using categorical explanatory data also showed linear discriminant analysis to work best for parallel problems and CART for sequential problems, while CART was also preferred for smaller sample sizes. FACT was found to perform poorly for both continuous and categorical data. Simulation studies involving the CART method alone identified certain situations where the 0.632 error rate estimate is preferred to cross-validation, and where the one standard error rule is preferred to the zero standard error rule. Studies undertaken using real data sets showed that most of the conclusions drawn from the continuous and categorical simulation studies were valid.
Some recommendations are made, both from the literature and from personal findings, as to which characteristics of tree-based methods are best in particular situations. Final conclusions are given and some proposals for future research regarding the development of tree-based methods are also discussed. A question worth considering in any future research into this area is the use of non-parametric tests for determining the best splitting variable.

ACKNOWLEDGEMENTS

Firstly, I would like to thank my three supervisors, Associate Professor Dick Brook, Mr Greg Arnold and Dr S Ganesalingam, for their constant support and helpful advice throughout my PhD study. I would also like to thank Mum and Dad, Judith and Robin Lynn, for providing me with cheap board and lodgings over the years as well as encouraging me to persevere to the end. I am indebted to Massey University for the use of their computer facilities, and in particular to the Department of Statistics for providing me with employment over the past five years. Last, and by no means least, I owe a great deal of thanks to Paula McMillan for her efforts in typing this thesis; without her skill in reading my often illegible script this thesis may never have been completed!

ADDITIONAL PUBLICATIONS

Ganesalingam, S and Lynn, R D (1991). Posterior probability based estimator for the overall error rate associated with a linear discriminant function. Occasional Publications in Mathematics and Statistics, 23, Massey University.

Lynn, R D and Brook, R J (1991). Classification by decision trees and discriminant analysis. New Zealand Statistician, 26, pp 18-26.

Lynn, R D, Brook, R J and Arnold, G C (1993). A comparison of four classification methods: linear and quadratic discriminant analysis, CART and FACT. Mathematical and Information Sciences Report, Series B: 1, Massey University.

Table of Contents

1. INTRODUCTION
2. TRADITIONAL DATA DISCRIMINATION METHODS
   2.1 INTRODUCTION
   2.2 LINEAR DISCRIMINANT ANALYSIS
       2.2.1 Stepwise Discriminant Analysis
   2.3 QUADRATIC DISCRIMINANT ANALYSIS
   2.4 THE ROBUSTNESS OF LINEAR AND QUADRATIC DISCRIMINANT ANALYSIS
       2.4.1 Modifications to Linear Discriminant Analysis
   2.5 KERNEL DENSITY ESTIMATION
   2.6 Kth NEAREST NEIGHBOUR METHODS
   2.7 CRITIQUES OF KERNEL DENSITY ESTIMATION AND KTH NEAREST NEIGHBOUR METHODS
3. A TABULAR COMPARISON OF TEN TREE-BASED METHODS
   3.1 ORIGINS OF TREE-BASED METHODS
   3.2 INTRODUCTION
   3.3 AID
   3.4 THAID
   3.5 ID3
   3.6 CHAID
   3.7 CART
   3.8 C4.5
   3.9 FACT
   3.10 KnowledgeSeeker
   3.11 Splus Trees()
   3.12 IND
4. SIMULATION STUDIES INVOLVING CONTINUOUS DATA
   4.1 INTRODUCTION
   4.2 ERROR RATES
   4.3 SIMULATION STUDY I
       4.3.1 Study Plan
       4.3.2 Results
       4.3.3 Summary
   4.4 SIMULATION STUDY II
       4.4.1 Study Plan
       4.4.2 Results
       4.4.3 Summary and Discussion
   4.5 THE EFFECTS OF PRIORS ON ERROR RATES
       4.5.1 Introduction
       4.5.2 Purpose of this Study
       4.5.3 Study Plan
       4.5.4 Results
       4.5.5 Summary
   4.6 SIMULATION STUDY III
       4.6.1 Introduction
       4.6.2 Study Plan
       4.6.3 Results
       4.6.4 Summary
   4.7 CONCLUSIONS
5. SIMULATION STUDIES INVOLVING CATEGORICAL DATA
   5.1 INTRODUCTION
   5.2 PREVIOUS STUDIES
   5.3 SIMULATION STUDY I
       5.3.1 Study Plan
       5.3.2 Results
       5.3.3 Summary
   5.4 SIMULATION STUDY II
       5.4.1 Introduction
       5.4.2 Study Plan
       5.4.3 Results
       5.4.4 Summary
   5.5 CONCLUSIONS
6. CART SIMULATION STUDY
   6.1 INTRODUCTION
   6.2 ERROR RATE ESTIMATION FOR CONTINUOUS DATA IN CART
       6.2.1 Previous Studies
       6.2.2 Study Plan
       6.2.3 Results
       6.2.4 Summary
   6.3 ERROR RATE ESTIMATION FOR CATEGORICAL DATA IN CART
       6.3.1 Study Plan
       6.3.2 Results
       6.3.3 Summary
   6.4 THE STANDARD ERROR RULE IN CART
       6.4.1 Previous Studies
       6.4.2 Study Plan
       6.4.3 Results
       6.4.4 Summary
   6.5 TRANSFORMATIONS OF ERROR RATES
       6.5.1 Study Plan
       6.5.2 Results
       6.5.3 Summary
   6.6 CONCLUSIONS
7. CASE STUDIES
   7.1 INTRODUCTION
   7.2 PREVIOUS STUDIES
   7.3 COMPARATIVE STUDIES
       7.3.1 Methods and Data Sets
       7.3.2 Cross-Validation Error Rate Results
       7.3.3 0.632 Error Rate Results
       7.3.4 Individual Class Error Rates
       7.3.5 The Standard Error Rule in CART
       7.3.6 Splus Trees() versus CART
       7.3.7 Summary
   7.4 ILLUSTRATIVE CASE STUDY
       7.4.1 Methods and Data
       7.4.2 Linear Discriminant Analysis
       7.4.3 CART
       7.4.4 FACT
       7.4.5 KnowledgeSeeker
       7.4.6 Splus Trees()
       7.4.7 Summary
8. WHICH CHARACTERISTICS OF TREE-BASED METHODS ARE PREFERRED
   8.1 INTRODUCTION
   8.2 WHICH CHARACTERISTICS OF TREE-BASED METHODS ARE PREFERRED?
       8.2.1 The Method of Splitting
       8.2.2 Binary versus Multiway Splits
       8.2.3 Univariate versus Linear Combination Splits
       8.2.4 Costs and Priors
       8.2.5 Stopping Rules and Tree Pruning
   8.3 HUMAN COMPREHENSIBILITY AND USER-FRIENDLINESS OF ...
9. CONCLUSIONS AND PROPOSALS FOR THE FUTURE
NOTATION INDEX
BIBLIOGRAPHY

1. INTRODUCTION

Data often arise in the real world involving many objects with a number of measurements (variables) taken from them. These measurements may be quantitative (continuous or discrete) or qualitative (ordered or unordered categories). The latter may, in some cases, be defined by only two categories and are then binary variables. When more than two categories are involved, instances where the categories can be ordered in a meaningful way are known as ordinal variables, while examples where the categories have no natural ordering are defined as nominal variables. For example, plants may be measured for stem length, stem width and plant height. These measurements are all continuous. A medical study would usually contain information on a patient's age, whether he/she smokes or not and whether there is a family history of cancer or not. Age (to the nearest year) is a discrete quantitative variable while the other two variables are binary. A sample survey might ask questions relating to the respondents' educational qualifications, attitudes to race relations and current marital status.
Marital status is a nominal variable while the other two variables are ordinal.

Often, the objective of such studies is to distinguish between several groups or populations based on the measurements collected. A botanist may be interested to know which measurements can best distinguish between two related species of plants. A medical practitioner would like to know which variables are best able to predict whether a person will develop cancer or not. A sociologist could be trying to determine if there is any relationship between a person's religious beliefs and various sociological and demographic variables. In such cases involving two or more groups or populations, a large number of methods are available to the botanist/medical practitioner/sociologist to handle the above types of data. The desired intention is that the methods will produce a set of classification or prediction rules which are both accurate and informative, and which serve as a basis for future decisions. The aim of this thesis is to study and compare the performance of classification methods, both tree-based and more traditional approaches, over a variety of data types, with the main goal being to determine in which situations tree-based methods are the preferred approach.

In Chapter 2, the focus is on traditional discrimination and the four most common methods for estimating the conditional density functions of each population in the data set, thereby approximating the Bayes rule. The four methods investigated are linear and quadratic discriminant analysis, kernel density estimation and Kth nearest neighbour rules. The first two methods are based on parameter estimates while the latter two are wholly non-parametric. A summary table is provided which compares and contrasts each of the above four methods.

In Chapter 3, the focus switches to tree-based classification methods, whose classification rules are portrayed in the form of a decision tree. After surveying the foundations of the tree-based approach to classification, ten tree-based methods are presented from a methodological point of view, examining characteristics such as splitting criteria, stopping rules and interactive and graphical ability, among others, as well as critiques of each method from articles in the literature. To conclude the chapter, a summary table is presented comparing all ten tree-based methods.

In Chapter 4, after surveying the various types of error rate estimates that are used in the field of classification, a number of simulation studies are carried out involving continuous explanatory data. In Section 4.3, a comparison is made between two traditional discrimination methods, linear and quadratic discriminant analysis, and two tree-based methods, CART and FACT, in terms of overall accuracy, over every possible combination of five factors involving dimension, sample size, Mahalanobis distance between populations, distribution and priors-covariance structure. In Section 4.4, the same study plan is used except that one of the distribution types is changed in order to make comparisons with previous studies. Section 4.5 deals with the estimation of individual class error rates for each of the four methods and how these error rates are affected when the prior probabilities of class membership are altered. The final section of this chapter investigates the reliability of various error rate estimators for three of the methods in predicting the correct class of future observations of the same type.
In Chapter 5, a comparative study is undertaken comparing the four methods used in Chapter 4 for categorical explanatory variables, in particular five- and ten-dimensional binary data. After providing a literature review of previous studies comparing classification methods for such data, a simulation study is carried out using overall accuracy as the measure of classifier performance. In Section 5.4, the reliability of various error rate estimators is determined, as carried out for continuous data in Section 4.6.

Chapter 6 concentrates exclusively on the CART method. Firstly, in Section 6.2, the reliability of various error rate estimation techniques is investigated for continuous data. Data sets are of varying distances between populations, sample sizes and data structure. Four performance criteria are used to evaluate the error rate estimators. In Section 6.3, the same error rate estimators are compared for the categorical data sets used in Chapter 5. In Section 6.4, the so-called standard error rule used in CART is analysed, while Section 6.5 explores the effects of transforming the error rates.

Chapter 7 firstly reports the results from an empirical comparison of five classification methods on a number of real-world data sets. In Section 7.4, a case study is carried out using some family planning data from India in order to illustrate the approaches taken by linear discriminant analysis and four tree-based methods.

Chapter 8 firstly compares the approaches taken by tree-based methods to grow a classification tree, through both a survey of the literature and the results of the simulation and case studies undertaken in this thesis. Secondly, a subjective comparison of traditional discrimination and tree-based methods is made. A summary of the literature containing critical assessments of the interpretability of the two approaches is presented. This is followed by a personal assessment of which method(s) provide the most interpretable and humanly comprehensible models, based on the results of the simulation and empirical studies presented in this thesis, as well as personal experience.

In conclusion, a set of recommendations is made, based on the findings of this thesis, as to which methods should be used in which situations. Some proposals for the future development of tree-based programs and research are also presented, after tracing the links and developments of tree-based methods.

2. TRADITIONAL DATA DISCRIMINATION METHODS

2.1 INTRODUCTION

The optimal rule of classification in a p-dimensional, k-class problem is the Bayes rule, which assigns an observation x to class i if

\[
\pi_i f_i(x) = \max_{1 \le j \le k} \pi_j f_j(x) \tag{2.1.1}
\]

where f_i(x) is the conditional density of x, given that x belongs to class i, and π_i is the prior probability that x belongs to class i. The proportion of observations falsely classified under this optimal rule is called the Bayes misclassification error rate. This is calculated as

\[
e_B = 1 - \int \max_i \{\pi_i f_i(x)\}\, dx \tag{2.1.2}
\]

It is very unusual, however, for either the f_i(x) or the π_i to be known. The π_i can easily be estimated by class sample proportions, but the f_i(x) are another matter. This chapter focuses on the four most commonly used methods for estimating the f_i(x), thereby approximating the Bayes rule. The four methods, which attempt to correctly classify a random observation into one of k classes, are linear and quadratic discriminant analysis, kernel density estimation and Kth nearest neighbour rules. The methods are described both algebraically and in words. A table of the assumptions and properties of the four methods is presented in conclusion.
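The Bayes rule in (2.1.1) can be illustrated with a short sketch. The following Python fragment is not part of the thesis; it assumes a hypothetical univariate two-class problem in which the class-conditional densities happen to be known normals, and simply assigns x to the class maximising π_i f_i(x).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical problem: class 1 ~ N(0, 1), class 2 ~ N(2, 1),
# with prior probabilities pi_1 = 0.7 and pi_2 = 0.3.
priors = np.array([0.7, 0.3])
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]

def bayes_classify(x):
    """Assign x to the class i maximising pi_i * f_i(x), as in (2.1.1)."""
    scores = [p * d.pdf(x) for p, d in zip(priors, densities)]
    return int(np.argmax(scores)) + 1   # classes labelled 1, ..., k

print(bayes_classify(0.4))   # close to the class 1 mean -> 1
print(bayes_classify(1.9))   # close to the class 2 mean -> 2
```

In practice the f_i(x) are unknown, so each of the four methods below amounts to replacing `densities` with an estimate.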
2.2 LINEAR DISCRIMINANT ANALYSIS

Suppose that an object is to be allocated to one of two p-dimensional multivariate ellipsoidal populations on the basis of an observation vector x. Let us assume that observations from the first population, Π₁, occur in a proportion π₁ and the remainder are from Π₂ in the proportion π₂ = (1 - π₁). Let f_i(x) be the multivariate density of x in Π_i, with mean μ_i and covariance matrix Σ_i, where

\[
f_i(x) = (2\pi)^{-p/2} |\Sigma_i|^{-1/2} \exp\{-\tfrac{1}{2}(x - \mu_i)' \Sigma_i^{-1}(x - \mu_i)\} \tag{2.2.1}
\]

Suppose that we assign x to Π₁ if x is in some region A₁ and to Π₂ if x is in a region A₂, where A₁ and A₂ form an exhaustive and mutually exclusive partition of the sample space, that is, Pr(A₁ ∩ A₂) = 0 and Pr(A₁ ∪ A₂) = 1. Then the total probability of misclassification, T(A, f), is the proportion of observations from Π₁ that are falsely classified as belonging to Π₂, and vice-versa. Thus

\[
T(A, f) = \pi_1 \int_{A_2} f_1(x)\,dx + \pi_2 \int_{A_1} f_2(x)\,dx
        = \pi_1\Big[1 - \int_{A_1} f_1(x)\,dx\Big] + \pi_2 \int_{A_1} f_2(x)\,dx
        = \pi_1 + \int_{A_1} [\pi_2 f_2(x) - \pi_1 f_1(x)]\,dx \tag{2.2.2}
\]

T(A, f) will be a minimum if π₂f₂(x) - π₁f₁(x) < 0 for all observations in A₁. That is, the minimum error will occur if the product of the class priors and density functions in A₁ is much larger for Π₁ than for Π₂. With the assumption that Σ₁ = Σ₂ = Σ, that is, the covariance matrices are equal, the optimal rule of allocation assigns x to Π₁ if

\[
f_1(x)/f_2(x) > \pi_2/\pi_1 \tag{2.2.3}
\]

otherwise x is assigned to Π₂, where the likelihood ratio f₁(x)/f₂(x) is given by

\[
f_1(x)/f_2(x) = \exp\big\{[x - \tfrac{1}{2}(\mu_1 + \mu_2)]' \Sigma^{-1}(\mu_1 - \mu_2)\big\} \tag{2.2.4}
\]

Taking logarithms produces the rule: assign x to Π₁ if

\[
D(x) = [x - \tfrac{1}{2}(\mu_1 + \mu_2)]' \Sigma^{-1}(\mu_1 - \mu_2) > \ln(\pi_2/\pi_1) \tag{2.2.5}
\]

otherwise to Π₂. The above quantity, D(x), is known as the true discriminant function, and it is a linear function of x. Now if x is multivariate ellipsoidal then D(x) will also be multivariate ellipsoidal, so the means and variances of D(x) can be used to calculate the estimated error rates from using D(x) as the allocation rule. Now E[D(x) | Π₁] is the mean value of D(x) given that x is from Π₁. Thus

\[
E[D(x) \mid \Pi_1] = [\mu_1 - \tfrac{1}{2}(\mu_1 + \mu_2)]' \Sigma^{-1}(\mu_1 - \mu_2)
                   = \tfrac{1}{2}(\mu_1 - \mu_2)' \Sigma^{-1}(\mu_1 - \mu_2) = \tfrac{1}{2}\Delta^2 \tag{2.2.6}
\]

where Δ² is the square of the true Mahalanobis distance between Π₁ and Π₂. Similarly,

\[
E[D(x) \mid \Pi_2] = -\tfrac{1}{2}\Delta^2 \tag{2.2.7}
\]

The common variance can be calculated thus:

\[
E[D(x) - D(\mu_i)]^2 = E\{[(x - \mu_i)' \Sigma^{-1}(\mu_1 - \mu_2)]^2\}
= (\mu_1 - \mu_2)' \Sigma^{-1} E[(x - \mu_i)(x - \mu_i)'] \Sigma^{-1}(\mu_1 - \mu_2)
= (\mu_1 - \mu_2)' \Sigma^{-1}(\mu_1 - \mu_2) = \Delta^2 \tag{2.2.8}
\]

since E[(x - μ_i)(x - μ_i)'] = Σ. Let R₁(T) be the probability of misclassifying an observation from Π₁, so that

\[
R_1(T) = \Pr[D(x) < \ln(\pi_2/\pi_1) \mid x \in \Pi_1] \tag{2.2.9}
\]

Under the assumption of normality, (2.2.9) can be expressed as

\[
R_1(T) = \Pr\left[\frac{D(x) - E(D(x))}{se[D(x)]} < \frac{\ln(\pi_2/\pi_1) - E(D(x))}{se[D(x)]}\right] \tag{2.2.10}
\]

and

\[
R_1(T) = \Phi\left(\frac{\ln(\pi_2/\pi_1) - \tfrac{1}{2}\Delta^2}{\Delta}\right) \tag{2.2.11}
\]

where Φ(·) is the cumulative normal distribution function.

If a sample of size n₁ is drawn from Π₁ and of size n₂ from Π₂, then μ_i can be replaced by the sample estimate x̄_i = Σⱼ x_ij / n_i (i = 1, 2), and Σ by the estimate of the pooled sample variance, S_p, given by

\[
S_p = [(n_1 - 1)S_1 + (n_2 - 1)S_2] / (n_1 + n_2 - 2) \tag{2.2.12}
\]

where S_i are the estimates of Σ_i (i = 1, 2). If these sample estimates are placed into equation (2.2.5), then the optimal sample allocation rule is to assign a random observation x to Π₁ if

\[
\hat D(x) = [x - \tfrac{1}{2}(\bar x_1 + \bar x_2)]' S_p^{-1}(\bar x_1 - \bar x_2) > \ln(\pi_2/\pi_1) \tag{2.2.13}
\]

where D̂(x), the linear discriminant function, is the sample estimate of D(x).
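A minimal sketch of the sample allocation rule (2.2.13), again not from the thesis: it builds the pooled covariance estimate of (2.2.12) from two hypothetical training samples and returns the linear discriminant function D̂(x).

```python
import numpy as np

def lda_discriminant(X1, X2):
    """Return the sample linear discriminant function D(x) built from two
    training samples (rows are observations, columns are variables)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled within-class covariance estimate S_p, as in (2.2.12).
    Sp = ((n1 - 1) * np.cov(X1, rowvar=False) +
          (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    Sp_inv = np.linalg.inv(Sp)
    return lambda x: (x - 0.5 * (m1 + m2)) @ Sp_inv @ (m1 - m2)

# Hypothetical data: assign x to population 1 when D(x) > ln(pi2/pi1).
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(25, 2))
X2 = rng.normal(2.0, 1.0, size=(25, 2))
D = lda_discriminant(X1, X2)
x = np.array([0.2, 0.1])
pi1 = pi2 = 0.5
print("population 1" if D(x) > np.log(pi2 / pi1) else "population 2")
```

Because the rule needs only the two class means and one pooled covariance matrix, D̂(x) is very cheap to evaluate for new observations, which is one reason LDA is described as very fast in the summary table of Section 2.8.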
This discrimination rule assumes that the cost of misclassifying an observation from Π₁ to Π₂, C(2|1), is the same as that of misclassifying an observation from Π₂ to Π₁, C(1|2). If C(2|1) ≠ C(1|2), then (2.2.13) takes the form given so that an observation x is assigned to Π₁ if

\[
\hat D(x) > \ln\left[\frac{C(1|2)\,\pi_2}{C(2|1)\,\pi_1}\right] \tag{2.2.14}
\]

otherwise to Π₂. In the case of equal a priori probabilities of belonging to a certain class, (2.2.13) simplifies to "classify x to Π₁ if D̂(x) > 0", otherwise to Π₂.

An alternative way of tackling the problem of classification in the linear discriminant analysis context is to use the group classification functions, L̂_i(x), where

\[
\hat L_i(x) = \ln \pi_i + \bar x_i' S_p^{-1}\big(x - \tfrac{1}{2}\bar x_i\big)
            = \ln \pi_i - \tfrac{1}{2}\bar x_i' S_p^{-1}\bar x_i + \bar x_i' S_p^{-1} x \tag{2.2.15}
\]
\[
            = a + b' x \tag{2.2.16}
\]

In the case of k groups, there are k group classification functions, so the rule is to assign x to Π_m if

\[
\hat L_m(x) = \max_i \hat L_i(x), \quad i = 1, \ldots, k \tag{2.2.17}
\]

The above L̂_i(x) can be used to form what has previously been called the linear discriminant function, D̂(x), in the case of k = 2 groups, where

\[
\hat D(x) = \hat L_1(x) - \hat L_2(x) \tag{2.2.18}
\]

In the case of k ≥ 3 groups, the situation becomes more complex. A set of linear discriminant functions, D̂_ij(x), can be defined as

\[
\hat D_{ij}(x) = \hat L_i(x) - \hat L_j(x), \quad i, j = 1, \ldots, k \tag{2.2.19}
\]

Jennrich (1977) calls these types of linear discriminant functions group separation functions. In general, there are C(k, 2) = k!/[2!(k-2)!] = k(k-1)/2 such group separation functions in a sample consisting of k distinct classes. The rule is to assign x to Π_i if

\[
\hat D_{ij}(x) > 0 \;\;\forall\, i < j, \quad \text{and} \quad \hat D_{ij}(x) < 0 \;\;\forall\, i > j \tag{2.2.20}
\]

Otherwise, x is assigned to one of the other (k - 1) classes. For example, in the case of k = 4 classes, the rule is:

Assign to Π₁ if D̂₁ⱼ(x) > 0, j = 2, 3, 4.
Assign to Π₂ if D̂₁₂(x) < 0, D̂₂₃(x) > 0 and D̂₂₄(x) > 0.
Assign to Π₃ if D̂₁₃(x) < 0, D̂₂₃(x) < 0 and D̂₃₄(x) > 0.
Otherwise assign x to Π₄.

Figure 2.1 illustrates this procedure for a set of twenty-six urinary samples (see data set R from Table 7.1) involving two chemical measurements (androsterone and etiocholanolone) taken from eleven healthy heterosexual and fifteen healthy homosexual males. Linear discriminant analysis is ideally suited to this problem, with the linear discriminant function (LDF) providing perfect discrimination between the two classes, showing that for men with the same values of androsterone, homosexuals have higher values of etiocholanolone than heterosexuals. Therefore, the separation is in a linear combination of the two variables.

[Figure 2.1: Plot of Etiocholanolone (mg/24 hours) against Androsterone (mg/24 hours) from Urinary Samples for 26 Healthy Heterosexuals (A) and Homosexuals (B), with the Linear Discriminant Function (LDF) shown.]

2.2.1 Stepwise Discriminant Analysis

A special application of linear discriminant analysis is stepwise discriminant analysis, whereby only a subset of the original p variables is selected to carry out the discriminant analysis. As above, suppose that a sample of dimension p contains n₁ observations from Π₁ and n₂ observations from Π₂. Variables are chosen to either enter or leave the model according to whether the Wilks' Lambda ratio of between- to within-class variance is greater than or less than a pre-specified significance level, while also taking into account the variables that are already in the model.
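The forward step of the selection procedure just described can be sketched as follows. This is a hypothetical, simplified illustration: it greedily adds the variable that most reduces Wilks' Lambda (computed here as the ratio of within-group to total variation for the current subset) and ignores the significance tests that govern entry and removal in the full stepwise procedure.

```python
import numpy as np

def wilks_lambda(X, y, cols):
    """Wilks' Lambda |W|/|T| for the variable subset `cols`
    (smaller values indicate better group separation)."""
    Xs = X[:, cols]
    grand = Xs.mean(axis=0)
    T = (Xs - grand).T @ (Xs - grand)                 # total SSCP matrix
    W = sum((Xs[y == g] - Xs[y == g].mean(axis=0)).T
            @ (Xs[y == g] - Xs[y == g].mean(axis=0))
            for g in np.unique(y))                    # within-group SSCP
    return np.linalg.det(W) / np.linalg.det(T)

def forward_select(X, y, q):
    """Greedily add, q times, the variable that most reduces Wilks' Lambda."""
    chosen = []
    while len(chosen) < q:
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        chosen.append(min(rest, key=lambda j: wilks_lambda(X, y, chosen + [j])))
    return chosen
```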
Alternatively, the partial correlation coefficients between each predictor variable and the class variable can be used to force a variable to either enter or leave the model. In essence, the variables that contribute most to the discriminatory power of the model are selected to carry out the discriminant analysis. However, authors such as Habbema et al (1974) have pointed out that the best q variables selected by stepwise discriminant analysis may not necessarily be the "best" variables for this type of data, just the best for this particular sample. Snapinn and Knoke (1989) give illustrations of why the apparent error rate, the error rate found from resubstituting the original sample, should never be used on a data set which contains only the best q variables selected by stepwise discriminant analysis, though this has been found to hold for most discrimination methods.

2.3 QUADRATIC DISCRIMINANT ANALYSIS

In real-world situations, the requirement of equal covariance matrices is rarely satisfied, though the differences are often too small to cause any deterioration in the performance of linear discriminant analysis. In cases where the covariance matrices are quite different, though, and x is p-dimensional multivariate ellipsoidal, quadratic discriminant analysis is the appropriate method to use, where the discriminant function is

\[
Q(x) = \ln[f_1(x) / f_2(x)] \tag{2.3.1}
\]

and we assign x to Π₁ if Q(x) > ln(π₂/π₁). Expanding (2.3.1) gives

\[
Q(x) = -\tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} - \tfrac{1}{2}(x - \mu_1)' \Sigma_1^{-1}(x - \mu_1) + \tfrac{1}{2}(x - \mu_2)' \Sigma_2^{-1}(x - \mu_2) \tag{2.3.2}
\]

If we again replace μ_i and Σ_i by the sample estimates x̄_i and S_i, then the result is the sample estimate of Q(x), the quadratic discriminant function:

\[
\hat Q(x) = -\tfrac{1}{2}\ln\frac{|S_1|}{|S_2|} - \tfrac{1}{2}(x - \bar x_1)' S_1^{-1}(x - \bar x_1) + \tfrac{1}{2}(x - \bar x_2)' S_2^{-1}(x - \bar x_2) \tag{2.3.3}
\]
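As a companion to the linear sketch above, here is a hedged sketch of the sample quadratic discriminant function (2.3.3), using a separate covariance estimate for each population; the data and any parameter choices are again hypothetical.

```python
import numpy as np

def qda_discriminant(X1, X2):
    """Sample quadratic discriminant function Q(x), as in (2.3.3),
    with a separate covariance estimate for each population."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    # ln(|S1|/|S2|) computed stably via log-determinants.
    log_det_ratio = np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S2)[1]
    def Q(x):
        return (-0.5 * log_det_ratio
                - 0.5 * (x - m1) @ S1_inv @ (x - m1)
                + 0.5 * (x - m2) @ S2_inv @ (x - m2))
    return Q

# Assign x to population 1 when Q(x) > ln(pi2/pi1), as in Section 2.3.
```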
2.4 THE ROBUSTNESS OF LINEAR AND QUADRATIC DISCRIMINANT ANALYSIS

Simulation studies previously undertaken by authors such as Lachenbruch et al (1973), Marks and Dunn (1974), Aitchison et al (1977), Krzanowski (1977) and Wahl and Kronmal (1977), among others, have made many interesting discoveries about the robustness of linear and quadratic discriminant analysis, henceforth called LDA and QDA respectively. Seber (1984) has summarised many of these findings, noting in particular that:

(i) LDA and QDA should perform equally well when covariances are roughly equal and the number of variables, p, is small (p ≤ 6).

(ii) For small samples (n₁, n₂ < 25) and small covariance differences and/or p large, LDA is preferred, but when both covariance differences and p are large, neither method is recommended.

(iii) QDA is better than LDA when both covariance differences and the number of variables are large (p > 6) and when sample sizes are large. It is suggested that n₁ = n₂ = 25 and p = 4 as a minimum, with 25 additional observations per class for every extra two dimensions.

(iv) QDA should not be used in poorly posed situations, that is, where the number of variables is not much less than the class sample sizes, resulting in S_i being a poor estimate of Σ_i. The extreme case occurs where the data is ill-posed, when p > n_i, meaning the inverse of S_i does not exist.

From the above findings and many applications in case studies, both LDA and QDA should be best when each class is multivariate normal with equal covariance matrices and the ratio of class sample size to dimension is large. LDA is fairly robust to any departures from these conditions, while QDA is only robust to differences in the class covariance matrices.

Morgan and Sonquist (1963), while introducing their Automatic Interaction Detector (AID) program, came to the conclusion that the usual parametric methods of classification were often inadequate in analysing survey data, noting in particular that parametric methods were:

(i) Unable to handle interaction effects without adding many extra terms to the model, as interactions may be quite complex, affecting different parts of the data set in different ways.

(ii) Unable to handle variables that do not have linear effects, so that many extra terms need to be created (for example, quadratic, cubic, etc).

(iii) Not good at handling categorical explanatory variables, especially those with many categories. Parametric methods usually treat categorical explanatory variables as a number of binary variables and create linear functions from those variables. As with (i) and (ii) above, the number of variables in the data set could increase dramatically and the data matrix will be sparse.

(iv) Not robust to errors in the variables, such as decimal points in the wrong place. As parametric methods make use of all the data at once, any errors in the measurements will lead to false classification rules.

(v) Affected adversely by intercorrelations among the explanatory variables used in the analysis. These correlations interfere with assessing the importance of individual variables.

2.4.1 Modifications to Linear Discriminant Analysis

Friedman (1989) and Raveh (1989) have tried two modifications to LDA in an attempt to solve the problems mentioned in the last section. Friedman developed a method called regularized discriminant analysis (RDA), especially for ill-posed situations, as outlined earlier. He noted that when the sample covariance is singular, the p - n_i + 1 smallest eigenvalues are estimated to be 0. The net effect of this biasing phenomenon on discriminant analysis is, sometimes, to dramatically exaggerate the importance associated with the low-variance subspace spanned by the eigenvectors corresponding to the eigenvalues near zero. Friedman tackles the problem using regularization, whereby a reduction in the variance of the sample-based estimates is carried out so as to minimise a potentially increased bias. Two regularization parameters, 0 ≤ λ ≤ 1 and 0 ≤ γ ≤ 1, are selected in order to jointly minimise future misclassification errors. The two parameters are incorporated into a variance function that controls the degree of shrinkage of the individual class covariance matrices that contribute to the pooled estimate. Simulation studies showed that RDA was much better than LDA and QDA in cases where the covariances were spherical. In those cases where the covariance matrices were highly ellipsoidal and equal, LDA did best, but when covariance matrices were unequal, RDA did best.

Raveh developed non-metric discriminant analysis (NDA), a method that requires none of the parametric assumptions required by both LDA and QDA (for example, the assumption of multivariate normality). NDA uses a separation measure so that as many observations as possible from Π₁ are greater than or less than the observations from Π₂. Thus NDA is based solely on the ranks of the individual observations and not the actual values. Through simulation studies, Raveh has shown that NDA is error-free for non-overlapping distributions and that NDA outperforms LDA in cases where the distribution of the data is highly non-normal or where covariance matrices are quite different.
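The two-parameter shrinkage at the heart of Friedman's RDA can be sketched as follows. This is a deliberately simplified version: Friedman's actual estimator weights the class and pooled components by their degrees of freedom rather than combining the covariance matrices directly, so the fragment below should be read as illustrating the roles of λ and γ only.

```python
import numpy as np

def rda_covariance(S_class, S_pooled, lam, gamma):
    """Simplified two-parameter regularisation in the spirit of RDA:
    lam shrinks the class covariance toward the pooled covariance;
    gamma then shrinks the result toward a multiple of the identity."""
    p = S_class.shape[0]
    S_lam = (1 - lam) * S_class + lam * S_pooled
    return (1 - gamma) * S_lam + (gamma / p) * np.trace(S_lam) * np.eye(p)

# In this simplified form, lam = 0, gamma = 0 recovers QDA's class
# covariance, while lam = 1, gamma = 0 recovers LDA's pooled covariance.
```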
2.5 KERNEL DENSITY ESTIMATION

Often it occurs that a parametric form cannot be assumed for the f_i(x), so that in order to apply the likelihood-ratio test the f_i(x) have to be estimated using an unstructured approach. An example where this approach is necessary is a sample exhibiting gross non-normality and unequal covariance matrices. Such an approach is called non-parametric estimation. Kernel density estimation is one form of non-parametric estimation. Hand (1981), Seber (1984) and Fukunaga (1990) all give excellent summaries of how kernel density estimation works.

The basic idea behind kernel density estimation is to use the sample data (x_ij, i = 1, ..., k and j = 1, ..., n_i) to estimate each of the f_i(x). Hand (1981) first considers the case p = 1. Suppose that ν(x|Π_m) is the number of sample points belonging to class m, 1 ≤ m ≤ k, with values less than or equal to x, and F̂(x|Π_m) is the estimate of the cumulative distribution, given by

\[
\hat F(x \mid \Pi_m) = \frac{\text{number of class } m \text{ observations} \le x}{\text{total number of class } m \text{ observations}} \tag{2.5.1}
\]

This function cannot be differentiated because the probabilities are not continuous, but an approximation to the derivative of F̂(x|Π_m) can be made:

\[
\hat f(x \mid \Pi_m) = \frac{\hat F(x + h \mid \Pi_m) - \hat F(x - h \mid \Pi_m)}{2h}
                     = \frac{\nu(x + h \mid \Pi_m) - \nu(x - h \mid \Pi_m)}{2h\, n_m} \tag{2.5.2}
\]

This can then be rewritten as

\[
\hat f(x \mid \Pi_m) = \frac{1}{n_m h} \sum_{i=1}^{n_m} k_0\!\left(\frac{x - x_i}{h}\right) \tag{2.5.3}
\]

where x_i, i = 1, 2, ..., n_m are the class m sample observations and

\[
k_0(z) = \begin{cases} \tfrac{1}{2}, & |z| \le 1 \\ 0, & |z| > 1 \end{cases} \tag{2.5.4}
\]

where z = (x - x_i)/h.

The above implies that every point in the interval (x - h, x + h) contributes 1/(2n_m h) to the estimation of the density function at x, while any points that lie outside that interval contribute nothing. It seems wrong that a point near the boundary of (x - h, x + h) carries the same weight as a point very close to x, while a point just outside the interval contributes nothing. To overcome this problem, a smoothing weighting function is used. For instance, let k₀(z) be from the normal distribution with zero mean, so that observations closest to x have the greatest weighting but all observations in the sample contribute to some degree to the calculation of the density function. Another alternative would be to use the uniform distribution as the weighting function, so that every observation is equally weighted. "Any other unimodal density could be used as a kernel." (Seber, 1984, p 322.)

Classification is determined by the likelihood ratio statistic, ln(f̂₁(x)/f̂₂(x)), and whether this value is greater than or less than a threshold value. In the case of two populations and p ≥ 2, the kernel density discrimination function, K(x), is given by

\[
K(x) = \ln\left[\frac{(1/n_1)\sum_{j=1}^{n_1} k_1(x - x_{1j})}{(1/n_2)\sum_{j=1}^{n_2} k_2(x - x_{2j})}\right] \tag{2.5.5}
\]

An observation is assigned to Π₁ if

\[
K(x) > \ln(\pi_2/\pi_1) \tag{2.5.6}
\]

otherwise to Π₂. That is, the density estimates of f_i(x) are based on the number of points from Π_i within the region (x - h, x + h), where h defines a p-dimensional area.
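A sketch of kernel-based classification in the two-population case, using a normal kernel as the smoothing weighting function discussed above. The bandwidth h and the data are hypothetical; the thesis does not prescribe a particular choice of h.

```python
import numpy as np

def kernel_density(X, x, h):
    """Kernel estimate of f(x) from sample X (rows are observations),
    using a p-dimensional normal kernel with common bandwidth h."""
    p = X.shape[1]
    z = (x - X) / h                                   # (n, p) scaled distances
    k = np.exp(-0.5 * np.sum(z**2, axis=1)) / ((2 * np.pi)**(p / 2) * h**p)
    return k.mean()

def kernel_classify(X1, X2, x, h, pi1=0.5, pi2=0.5):
    """Assign x to population 1 when K(x) > ln(pi2/pi1), as in (2.5.6)."""
    K = np.log(kernel_density(X1, x, h) / kernel_density(X2, x, h))
    return 1 if K > np.log(pi2 / pi1) else 2
```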
2.6 Kth NEAREST NEIGHBOUR METHODS

The Kth nearest neighbour method (K-NN) is another tool that is used whenever the class density functions, f_i(x), are unknown. In fact, this was the first non-parametric method for classification, introduced by Fix and Hodges (1951). The idea behind the method is relatively simple. Cover and Hart (1967) define a random observation x_m, x_m ∈ {x₁, ..., x_n}, as the nearest neighbour to x if

\[
\min_j d(x_j, x) = d(x_m, x), \quad j = 1, 2, \ldots, n \tag{2.6.1}
\]

where d(x_j, x) is a distance function. The nearest neighbour rule decides that x belongs to the class Π_i of its neighbour x_m. The above is the single nearest neighbour rule, that is, K = 1, and it applies only to the single nearest neighbour to x; all other observations are ignored. The idea extends naturally to the K nearest neighbours of x.

Lachenbruch (1975) describes the general K-NN rule as follows. Suppose there are n₁ and n₂ sample observations from Π₁ and Π₂ respectively, and that the objective is to classify an observation x to one of Π₁ or Π₂. Using a distance function d(x_ij, x), order the values x_ij. Let K_i be the number of observations from Π_i among the K closest observations to x. The rule is to assign x to Π₁ if

\[
K_1/n_1 > K_2/n_2 \tag{2.6.2}
\]

otherwise to Π₂. In other words, the procedure involves the relatively simple concept of assigning a random observation x to the class having the greater proportion of observations closest to x. As n_i → ∞, it has been found that (2.6.2) tends to the maximum likelihood rule.
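The K-NN rule (2.6.2) is short enough to sketch directly. Euclidean distance is used here as the distance function d; this is one possible metric, and the arbitrariness of that choice is one of the criticisms raised in the next section.

```python
import numpy as np

def knn_classify(X1, X2, x, K):
    """Kth nearest neighbour rule: assign x to population 1 when
    K1/n1 > K2/n2 among the K observations closest to x (2.6.2)."""
    X = np.vstack([X1, X2])
    labels = np.array([1] * len(X1) + [2] * len(X2))
    d = np.linalg.norm(X - x, axis=1)          # Euclidean distances to x
    nearest = labels[np.argsort(d)[:K]]        # labels of the K closest points
    K1, K2 = np.sum(nearest == 1), np.sum(nearest == 2)
    return 1 if K1 / len(X1) > K2 / len(X2) else 2
```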
2.7 CRITIQUES OF KERNEL DENSITY ESTIMATION AND KTH NEAREST NEIGHBOUR METHODS

Simulation studies by various authors, including Habbema et al (1974), have shown that the kernel density method for classification was just as efficient as LDA in the case of normally distributed data, but when non-normality occurred, kernel density estimation was superior. Feng et al (1993) have shown that the K-NN method was often slow in terms of running time, as was kernel density estimation. For most of the case studies tested in that paper, the K-NN method produced a very low apparent error rate, but quite often the test sample error rate, found from applying the classification rules to another set of data that was not used to construct the classifier, was comparatively high. This fact calls into question the reliability of the classification rules proposed by the K-NN method.

Breiman et al (1984), p 17, have criticised both the above non-parametric classification methods on the following grounds:

(i) They are sensitive to the choice of a metric ||x|| and there is usually no intrinsically preferred definition.

(ii) There is no natural or simple way to handle categorical variables and missing data.

(iii) They are computationally expensive as classifiers. The learning sample must be stored, and the inter-point distances and classification rule recomputed for each new observation.

(iv) Most serious, they give very little usable information regarding the structure of the data. That is, neither of the two methods provides a simple and intuitive set of classification rules.

2.8 SUMMARY TABLE OF THE ASSUMPTIONS AND PROPERTIES OF TRADITIONAL DATA DISCRIMINATION METHODS

Optimality (under what conditions is the method optimal?):
- LDA: multivariate ellipsoidality and equal covariance matrices within each group.
- QDA: multivariate ellipsoidality within each group; the covariance matrices need not be equal.
- Kernel density: no assumptions are made about the distribution of the variables.
- K-NN: no assumptions are made about the distribution of the variables.

Types of variables (for what type of explanatory variables is the method suited?):
- LDA, QDA, kernel density and K-NN: quantitative.

Computations (what statistics/values are used in the calculation of the discrimination rules?):
- LDA: class sample means and the pooled covariance matrix of the class sample covariances.
- QDA: class sample means and the individual class sample covariances.
- Kernel density: the individual data values and a weighting function k₀(z).
- K-NN: the K observations that are closest to x.

Discrimination rule (how is an observation x classified to one of the two populations?):
- LDA: assign x to Π₁ if D(x) > ln(π₂/π₁), where D(x) = ln[f₁(x)/f₂(x)]; otherwise assign x to Π₂.
- QDA: assign x to Π₁ if Q(x) > ln(π₂/π₁), where Q(x) = ln[f₁(x)/f₂(x)]; otherwise assign x to Π₂.
- Kernel density: assign x to Π₁ if K(x) > ln(π₂/π₁), where K(x) is the log ratio of the kernel density estimates (2.5.5); otherwise assign x to Π₂.
- K-NN: assign x to Π₁ if K₁/n₁ > K₂/n₂, where K_i is the number of class i observations among the K nearest neighbours to x; otherwise assign x to Π₂.

Critiques of the method (from the literature, what are four key advantages or disadvantages of the method?):
- LDA: very fast; robust to mild non-normality in the variables; unable to properly handle interaction effects; not good at handling categorical predictor variables.
- QDA: robust to departures from equal covariance matrices; not suited when the ratio of dimension to sample size is small; very sensitive to departures from normality in the variables; not good at handling categorical predictor variables.
- Kernel density: not affected by either non-normality or unequal covariance matrices; produces reliable classification rules; gives very little usable information about the data; not good at handling categorical predictor variables.
- K-NN: not affected by either non-normality or unequal covariance matrices; produces unreliable classification rules; gives very little usable information about the data; not good at handling categorical predictor variables.

3. A TABULAR COMPARISON OF TEN TREE-BASED METHODS

3.1 ORIGINS OF TREE-BASED METHODS

Tree-based methods of classification are children of the computer age. The idea of decision trees could only have been dreamed of before the introduction of the computer, as the amount of number crunching required to construct a data-intensive method such as a decision tree classifier would have been far too much for the simple adding machine. The ideas behind decision tree methods were originally developed by Belson (1959). The approach he proposed was the binary segmentation of a data set, in a recursive manner, so that each of the subgroups formed would be as homogeneous as possible with respect to the response variable. At each stage of the analysis, the predictor variable providing the "best" dichotomous partition would be chosen to partition the subgroup into two further subgroups. Belson's proposals form the foundations from which all tree-based methods have been built, being the result of a dissatisfaction with standard statistical techniques. In conclusion, Belson states, "[t]he method as I have described it is, it is true, a movement towards a more empirical way of doing things; but it is just as much a movement away from a sophistication which is too often either baffling or misleading" (Belson, 1959, p 75).

3.2 INTRODUCTION

In this chapter, ten tree-based methods are summarised in chronological order of introduction. The methods are:

(i) AID: Automatic Interaction Detector
(ii) THAID: THeta AID
(iii) ID3
(iv) CHAID: CHi-squared AID
(v) CART: Classification And Regression Trees
(vi) C4.5
(vii) FACT: Fast Algorithm for Classification Trees
(ii) THAID THeta AID (iii) ID3 (iv) CHAID CHi-squared AID (v) CART Classification And Regression Trees (vi) C4.5 (vii) FACT Fast Algo1ithm for Classification Trees 21 22 (viii) KnowledgeSeeker (ix) Splus Trees() (x) IND The ten methods are to be tabulated on the following bases: Author(s) Introduction Classification/Regression Tree Growth Tree Pruning Validation Procedures Interactive Ability Graphical Ability Who developed the method? The year the method was introduced and a short summary of how the author(s) describe(s) the method. What type of response variable is handled? Is the tree grown on all the data or only on a subset of the data? (a) (b) (c) (d) (e) Splitting Method. What rules are used by each method to partition the data? Type of Splits. How does the method partition the data? B inary/Multiway splits on a single variable (US) or a linear combination of variables? Costs/Priors. Are these incorporated into the splitting algorithm? Stopping Rules. What types of stopping rules are employed? Node Classification/Prediction. How are the nodes classified/predicted? What pruning procedure, if any, exists in the method? Is there validation of the decision rules constructed by means of a test sample or cross-validation? Do facili ties exist in the program for the user to easily interact wi th the tree-growing procedures, so as to automatically change the splitting variable, stop splitting, etc? Can the program display the decision tree graphically? One-Stage Optimality Missing Values Criticisms Examples in the literature Does the method only look for the optimal split of the current node? If not, do facilities exist to examine the effects of splits at the next one or two stages of the tree? growing process? How are missing values handled? What is written about the program in the literature? What problems have been identitied? A list of important papers using the method. Finally, a short, summary table comparing all ten tree-based methods is given, over all the attributes described above_ Safavian and Landgrebe ( 1 99 1 ) have conducted a survey of a large number of tree-based methods. Their paper includes a summary table comparing each of the methods in terms of the assumptions each approach makes, their performance criterion and some of the specific requirements for each method. The approach adopted in this chapter is intended to be more than a mere enumeration of material or collated bibliography of tree-based methods. Therefore, only ten such methods have been selected and presented in detail with some attempt at critical comparison. A major difference between the tree-based methods studied in this chapter is the way in which the aims of Belson are carried out, that is, the method of splitting . A decision tree procedure either uses binary splitting where the data is segmented into two groups, or uses multi way splitting , where the data can be split_ ipto more than two groups. These splits can either be carried out on a single variable, called univariate splits (US), or on a l inear combination of variables. Figure 3 . 1 illustrates the method of binary splits for Fisher's Iris data (data set H from Table 7 . 1 ) involving three species of iris (L virginica, L setosa and L versicolor) each with 50 cases and measurements taken on four variables (sepal length, sepal width, petal length and petal width). Although the problem involves four variables, only two variables are used to form the tree using the CART algorithm (see Section 3.7) . 
The first split is on petal length and asks the question as to whether petal length < 1.95; if so, observations are sent to the right. It would be possible for the next split also to be on petal length, but here the next split is whether or not petal width < 1.75, for those cases where petal length > 1.95. The tree produced by CART is shown in Figure 3.2. As only two splits were made, there are three terminal nodes, where a node is defined as a subset of the data and a terminal node is a terminal subset of the data which is assigned to one of k classes. The terminal node at top left consisted of 50 class 1 (I. virginica) flowers and was classified as class 1. The terminal node at bottom left was classified as class 2 (I. setosa) and consisted of 49 class 2 and 5 class 3 (I. versicolor) flowers. The terminal node at bottom right was classified as class 3, consisting of 1 and 45 flowers in classes 2 and 3 respectively. Notice that 6 flowers overall were misclassified by the classification tree.

Figure 3.3 illustrates the method of multiway splitting for the same data. The FACT algorithm (see Section 3.9) was used to partition the data. In this case, only one split is carried out, that being on petal width, but it is a three-way split, dividing the data into three subgroups. The first subgroup corresponds to the case where petal width < 0.787 (split 1a), while the second subgroup corresponds to flowers where 0.787 < petal width < 1.677 (split 1b), with the third subgroup corresponding to all those cases where petal width > 1.677. The FACT tree is shown in Figure 3.4. Basically, the FACT tree has split the data into three homogeneous terminal nodes using only one multiway split, compared with the two splits needed in the binary splits example. As with Figure 3.2, 6 flowers have been misclassified by this classification tree.

The above examples were both carried out using only one variable at a time. Figure 3.5 provides an illustration of a linear combination split, when used with CART. The first split is the same as that of Figure 3.1. The second split, however, involves both petal length and petal width and asks, for those cases where petal length > 1.95, whether 0.209 * petal length + 0.977 * petal width < 2.51. The CART tree for this example is given in Figure 3.6. The major difference from Figure 3.2 is that no class 3 cases were misclassified, and two fewer cases overall were misclassified.

[Figure 3.1: Binary splits example — plot of petal length against petal width for Fisher's iris data using CART, with the two splits marked. A = I. virginica, B = I. setosa, C = I. versicolor.]
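A tree of the same shape can be reproduced with scikit-learn's DecisionTreeClassifier, a present-day CART-style implementation used here purely as a stand-in (it is not the software used in the thesis, and its thresholds may differ slightly from those reported above). A minimal sketch:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    # A depth-2 binary tree, comparable to the two univariate splits described above
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)
    print(export_text(tree, feature_names=iris.feature_names))

Running this prints the two splits (on petal measurements) and the class counts in the three terminal nodes, mirroring the structure of Figure 3.2.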
[Figure 3.2: CART tree for Fisher's iris data. The root node is split by "petal length < 1.95?" and the remaining cases by "petal width < 1.75?", giving three terminal nodes:
    class 1: n = 50; (n1, n2, n3) = (50, 0, 0); r(t) = 0
    class 2: n = 54; (n1, n2, n3) = (0, 49, 5); r(t) = 0.09
    class 3: n = 46; (n1, n2, n3) = (0, 1, 45); r(t) = 0.02
n = sample size at each node; ni = sample size for class i at each node; r(t) = purity measure at each node, that is, the proportion of observations not from the class with the largest number of observations at each node. Circles represent decision nodes, which have to be split on, while rectangles represent terminal nodes, which are assigned to the particular class given below the node.]

[Figure 3.3: Multiway splits example — plot of petal length against petal width for Fisher's iris data using FACT, with the three-way split marked (splits 1a and 1b). A = I. virginica, B = I. setosa, C = I. versicolor.]

[Figure 3.4: FACT tree for Fisher's iris data. A single three-way split on petal width gives three terminal nodes:
    petal width in (0, 0.787): class 1; n = 50; (n1, n2, n3) = (50, 0, 0); r(t) = 0
    petal width in (0.788, 1.677): class 2; n = 52; (n1, n2, n3) = (0, 48, 4); r(t) = 0.08
    petal width in (1.678, ∞): class 3; n = 48; (n1, n2, n3) = (0, 2, 46); r(t) = 0.04]

[Figure 3.5: Linear combination split example — plot of petal length against petal width for Fisher's iris data using CART. A = I. virginica, B = I. setosa, C = I. versicolor.]

[Figure 3.6: CART tree for Fisher's iris data — linear combination split. The root node is split by "petal length < 1.95", the remaining cases by "0.209 * (petal length) + 0.977 * (petal width) < 2.51", giving three terminal nodes:
    class 1: n = 50; (n1, n2, n3) = (50, 0, 0); r(t) = 0
    class 2: n = 46; (n1, n2, n3) = (0, 46, 0); r(t) = 0
    class 3: n = 54; (n1, n2, n3) = (0, 4, 50); r(t) = 0.07]

3.3 AID

Author(s): J N Morgan and J A Sonquist (USA).

Introduction: The Automatic Interaction Detector was published in 1963. The essence of the algorithm is the sequential application of a one-way analysis of variance model (Morgan and Sonquist, 1963). The purpose of the program was to handle interactions and inter-correlations among the data in a more explicit way.

Classification/Regression: Designed to perform regression using a continuous dependent variable, although dichotomous dependent variables can be handled by transforming one of the two categories into a proportion.

Tree Growth: The tree is grown on all the data set.

- Splitting Method: Sonquist (1964) summarises the four steps in the tree-growing procedure as follows:
(i) Choose, for splitting, the node, t, with the largest total sum of squares,
    TSSt = Σ yi² − (Σ yi)²/n.
(ii) Split each variable, xj, into two subgroups such that this division leads to the biggest decrease in the unexplained sum of squares, i.e. maximise
    BSSj = n1·ȳ1² + n2·ȳ2² − n·ȳ².
(iii) Partition variable xm over node t, where BSSm = maxj BSSj.
(iv) Return to step (i).
(A small numerical sketch of this split search is given at the end of this section.)

- Type of Splits: Binary splits are the only method used and are carried out on only one variable at a time.

- Priors/Costs: No.

- Stopping Rules: Direct stopping rules are used. A number of different criteria exist for stopping tree growth:
(i) TSSt < R * TSS1, where R is a parameter, 0 ≤ R ≤ 1, and TSS1 is the total sum of squares for the whole sample.
(ii) BSSj < Q * TSS1 over all xj, 0 ≤ Q ≤ 1.
(iii) Number of unsplit nodes > P.
(iv) Sample size of each unsplit node < L.

- Node Classification/Prediction: All observations within a node are assigned the average value for the response variable in that node.

Tree Pruning: No.

Validation Procedures: No.

Interactive Ability: No.

Graphical Ability: No.

One-Stage Optimality? Fielding (1977) describes the r-step lookahead option used by AID III, the then current version of AID. The earlier version of AID grew the tree on a sequential basis, so that the prediction error was only minimised at each step of the analysis, that is, "stage by stage optimization". The r-step lookahead option is an attempt to improve on the stage by stage procedure. For m predictors there will be m tentative splits under consideration. The best splits for each predictor on these subgroups are then obtained. One now has m² possible two-stage trees under consideration. This process could be continued for r stages, with m^r possible trees. Clearly this lookahead option could involve a tremendous amount of computation and information storage were it not restricted. The current version of AID III limits the lookahead steps to three, including the first split (ibid, p 249). Morgan (1993), in a personal correspondence, noted that repeated use of the lookahead feature failed to find any useful applications or examples, and it was dropped in later versions of the program. One might think it would find offsetting effects, as when young men and old women are more likely to go to the hospital, but according to Morgan the sequential strategy seems to uncover these too.

Missing Values: Missing values are replaced by class means estimated from non-missing values in the learning sample.
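The following is a toy Python/NumPy reconstruction of steps (i)–(iii), not AID itself: it searches a single numeric predictor for the binary cut point that maximises BSSj. Variable names are invented, and ties in x are ignored for simplicity.

    import numpy as np

    def best_aid_split(x, y):
        """Return the cut point on x maximising the between sum of squares of y."""
        order = np.argsort(x)
        x, y = x[order], y[order]
        n, ybar = len(y), y.mean()
        best_bss, best_cut = -np.inf, None
        for i in range(1, n):                 # candidate split after the i-th ordered case
            n1, n2 = i, n - i
            y1, y2 = y[:i].mean(), y[i:].mean()
            # BSS = n1*(ybar1 - ybar)^2 + n2*(ybar2 - ybar)^2
            bss = n1 * (y1 - ybar) ** 2 + n2 * (y2 - ybar) ** 2
            if bss > best_bss:
                best_bss, best_cut = bss, (x[i - 1] + x[i]) / 2
        return best_cut, best_bss

Applying best_aid_split to each predictor and keeping the variable with the largest BSS mimics the choice of xm in step (iii).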
Criticisms: The AID algorithm has been criticised by a number of authors, including Einhorn (1972), Doyle (1973), Kass (1975), Doyle and Fenwick (1975), Kass (1980) and de Ville (1990). The principal reasons for this criticism are:
(i) It requires very large sample sizes, usually > 1000 observations.
(ii) It does not take the intercorrelations among the predictors into account.
(iii) It is not robust to deviations from normality in the variables.
(iv) The tree size is affected too much by noise in the data.
(v) Only binary splits are carried out.
(vi) Most importantly, there is no validation of the prediction rules constructed, either by testing for significance or by using an independent test sample.

Morgan (1993) has responded to these criticisms. He believes that (ii) is wrong, except that once a split is made on one predictor, it may leave groups where a second predictor has lost whatever power it had, but that information is useful to know. For example, in searching for what makes people happy, the program splits first on the quality of the network of friends, then on health, and only then on income! Of criticism (iii), Morgan affirms that this is true of any least squares procedure, though AID alerts the user to isolated cases by splitting these off into a separate subgroup. Problem (v) is irrelevant according to Morgan, since multiple splits on the same predictor are possible, and it is wasteful to start with k subgroups when k−1 will do. The loss of information from grouping data is small, and a very few subgroups contain almost all the information. The last criticism, claims Morgan, is not a function of the program but of the user, who can always grow the tree on three quarters of the sample and see how well the final groups account for the variance in the other quarter. It must be noted, however, that it is wasteful not to use all the data in the tree-growing phase, and test sample estimates of error are highly variable for small samples (see Section 4.2 and Breiman et al, 1984).

Examples in the Literature:
Assael, H (1970). Segmenting markets by group purchasing behaviour: An application of the AID technique, Journal of Marketing Research, 7, pp 153-158.
Heald, J I (1972). The application of the automatic interaction detector programme and multiple regression techniques to assessment of store performance and site selection, Operational Research Quarterly, 23, pp 445-457.
Muxworthy, D T (1972). Review of AID III, British Sociological Association Maths, Statistics and Computing Applications Group Newsletter, 9.

3.4 THAID

Author(s): J N Morgan and R C Messenger (USA).

Introduction: Developed in 1973, THeta AID was designed as an extension of the AID algorithm (Morgan and Sonquist, 1963) to handle categorical dependent variables. It was "... viewed as a simplified version of the present AID" (Messenger and Mandell, 1972, p 18).

Classification/Regression: Designed for classification, specifically for use on nominally scaled dependent variables.

Tree Growth: The tree is grown on all the data set.

- Splitting Method: Two methods for splitting are used (a sketch of both criteria is given below).
(i) The Theta criterion, or what Messenger and Mandell (ibid, p 12) call the "optimal prediction-to-the-mode strategy". The objective is to find the split at the unsplit node t which maximises
    θy/x = Σi (ni − mi) / nt ,
where
    nt = total number of observations in node t,
    ni = total number of observations in the ith split group,
    mi = total number of misclassified observations in the ith split group.
(ii) Otherwise, the Delta criterion can be used. Messenger and Mandell (ibid, p 15) define this as "... based on the simple notion that one should find split groups whose probability distributions differ maximally from the original group and hence from each other". The basic idea is to find the split on the variable which maximises
    δy/x = n1 Σ(j=1 to k) |pj − p1j| + n2 Σ(j=1 to k) |pj − p2j| ,
where
    pj = proportion of observations from class j in node t, j = 1, ..., k,
    p1j = proportion of observations from class j in split group 1 (and similarly p2j for split group 2).
Note that the authors of THAID recommend the Delta criterion for splitting if the ratio of the sample size of the largest group to the second largest group is greater than 2:1 (Messenger and Mandell, 1973).
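The sketch below (Python/NumPy; hypothetical data, not THAID's own code) evaluates both criteria for one candidate binary split, following the definitions above; note that ni − mi is just the count of the modal class in split group i.

    import numpy as np

    def theta_delta(y_left, y_right, classes):
        """Theta and delta criteria for a binary split of a node."""
        y = np.concatenate([y_left, y_right])
        n, n1, n2 = len(y), len(y_left), len(y_right)
        p = np.array([np.mean(y == c) for c in classes])        # node proportions
        p1 = np.array([np.mean(y_left == c) for c in classes])  # split group 1
        p2 = np.array([np.mean(y_right == c) for c in classes]) # split group 2
        # theta: proportion correctly classified when each group predicts its mode
        theta = (n1 * p1.max() + n2 * p2.max()) / n
        # delta: total absolute departure of the group distributions from the node
        delta = n1 * np.abs(p1 - p).sum() + n2 * np.abs(p2 - p).sum()
        return theta, delta

Scanning all candidate splits and keeping the one with the largest theta (or delta) value reproduces the search described above.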
- Type of Splits: Only binary splits are carried out, using only one variable at a time.

- Priors/Costs: No.

- Stopping Rules: Direct stopping rules are used. Stop if:
(i) nt/2 < nmin, where nmin is a preset parameter, and
(ii) either θy/x < θmin or δy/x < δmin.

- Node Classification/Prediction: Assign a terminal node to the class with the largest number of observations in that node.

Tree Pruning: No.

Validation Procedures: No.

Interactive Ability: No.

Graphical Ability: No.

One-Stage Optimality? Yes.

Missing Values: Missing values are replaced by class means estimated from non-missing values in the learning sample.

Criticisms: Most of the criticisms levelled at AID are also valid for THAID. Basically, the method does not know when to stop. Kass (1980), p 120, also states that "[k]nowledge of the theoretical behaviour of the Theta criterion is lacking still". Morgan (1993) has written in, noting that there is a maximum-likelihood χ² splitting option available in the new SEARCH program which has replaced THAID. The procedure is designed to maximise stability and remove the chance of erratic results.

Examples in the Literature:
Morgan, J N (1990). A conditional analysis of movers' housing responses, Journal of Economic Behaviour and Organisation.

3.5 ID3

Author(s): J R Quinlan (Australia).

Introduction: Introduced in 1979, this procedure is in the family of recursive partitioning, tree-based algorithms, although it comes from the machine learning rather than the statistical literature. Quinlan (1983) describes the method as "... recover[ing] valuable information from large masses of low grade data by a process of inductive inference".

Classification/Regression: Handles classification problems only.

Tree Growth: A subset of the original learning sample, called a 'window', is chosen at random and a decision tree formed that correctly classifies all observations in the window. All objects in the learning sample, but not in the window, are then classified using this tree. If the tree gives the correct classification for all objects then this tree is declared optimal; otherwise some more observations are added to the window, with the tree-growing and evaluation process being repeated. The process continues until all cases in the learning sample are correctly classified.

- Splitting Method: Splitting is achieved by means of an information measure (a numerical sketch is given below). An observation is determined to belong to class 1 with probability p1 = m/(n+m) and to class 2 with probability p2 = n/(m+n), where m and n are the number of observations from class 1 and class 2 respectively. The expected information needed to classify an object using a tree is
    I(m, n) = −p1 log2 p1 − p2 log2 p2.
Let a variable, xj, considered for partitioning, contain v distinct categories {A1, ..., Av}. The node t that is to be considered for splitting will be split into v descendant nodes, t1, ..., tv, each described by one particular category of xj. The information required for the subtree with root ti is I(mi, ni), where mi and ni are the number of class 1 and class 2 observations in the ith node. The expected information required for trees partitioned on xj at the root node is
    E(xj) = Σ(i=1 to v) [(mi + ni)/(m + n)] * I(mi, ni),
where the weight of the ith branch is the proportion of objects in t that belong to ti. The information gained by branching on xj is
    gain(xj) = I(m, n) − E(xj).
ID3 examines all variables, xj, j = 1, ..., p, and chooses the xj that maximises gain(xj). This process is continued on the recursively found nodes, t1, ..., tv. It is known as the gain criterion.
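A minimal Python/NumPy sketch of the gain criterion follows (an illustration, not Quinlan's program); the helper generalises I(m, n) to any number of classes via the usual entropy formula.

    import numpy as np

    def entropy(counts):
        """I(m, n) generalised: expected information for a vector of class counts."""
        p = np.array(counts, dtype=float)
        p = p[p > 0] / p.sum()
        return -(p * np.log2(p)).sum()

    def gain(node_counts, child_counts):
        """gain(xj) = I(m, n) - E(xj) for a candidate multiway split."""
        n = sum(node_counts)
        e = sum(sum(c) / n * entropy(c) for c in child_counts)
        return entropy(node_counts) - e

For example, gain([9, 5], [[2, 3], [4, 0], [3, 2]]) evaluates to about 0.25 bits; ID3 would compute this quantity for every variable and branch on the one with the largest gain.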
- Type of Splits: Multiway splits are used; in fact, splitting is carried out using every possible value of a variable. If a predictor variable is continuous, some form of clustering of the values is carried out before splitting. Only univariate splits are carried out.

- Priors/Costs: No.

- Stopping Rules:
(i) Stop when all cases in the learning sample are correctly classified.
(ii) An alternative stopping rule is: use the χ² statistic to determine whether the categories of variable xj are independent of the classes of objects in S. No further testing of the variables (splitting) is done if that variable's irrelevance cannot be rejected at a very high confidence level.

- Node Classification/Prediction: Assign a terminal node to the class with the largest number of observations in that node.

Tree Pruning: No.

Validation Procedures: No.

Interactive Ability: No.

Graphical Ability: No.

One-Stage Optimality? Yes.

Missing Values: Observations can either be discarded from the data set before splitting, or, alternatively, the ratio of class sample sizes multiplied by what Quinlan calls a 'token' can be used to find a predicted value of the variable for which a particular observation is missing.

Criticisms: Many of the problems inherent in AID have also been found to be present in this algorithm. de Ville (1990) discusses the following problems.
(i) It is biased towards the selection of variables with many categories, though they may not be the best predictors.
(ii) It does not know when to stop. ID3 continues splitting on nodes with only a small number of observations. The resultant decision tree would not hold up in the real world, being principally a function of the data at hand.
(iii) Overly large trees are too complex and not easy to understand.
Quinlan et al (1986), p 164, state that "[e]mpirical investigations have found that trees generated from such sets are usually simpler and more accurate than those constructed from random samples". However, in the next paragraph, it is argued that "... decision trees produced by any top-down approach are more complex than can be justified by the data" (ibid, p 164).

Examples in the Literature:
Schwartz, S, Wiles, J, Gough, I and Phillips, S (1993). Connectionist, rule-based and Bayesian decision aids: an empirical comparison, in "Artificial Intelligence Frontiers in Statistics", D J Hand (ed), London: Chapman & Hall, pp 264-278.

3.6 CHAID

Author(s): G V Kass (South Africa).

Introduction: Developed in 1980 as an offshoot of AID for use with categorical response variables. It was designed to tackle the criticisms of AID by "... embedding the partitioning problem in a significance testing framework" (Kass, 1980, p 120). Known as CHi-squared AID.

Classification/Regression: Handles only classification problems.

Tree Growth: The tree is grown on all the data set.

- Splitting Method: According to Kass, the splitting method proceeds as follows (a toy sketch of the merging step (ii) is given at the end of this section):
(i) For each predictor variable in turn, with the dependent variable having k classes, cross-tabulate the categories of the predictor with the dependent variable. Go to step (ii).
(ii) Find the pair of categories of the 2*k subtable that are least significantly different. Merge the two categories into one compound category if the significance does not exceed a critical value. Repeat this step until no more mergers can be found.
(iii) For each compound category consisting of three or more of the original categories, find the most significant binary split into which the merger may be resolved. If the significance exceeds a critical value, implement the split and return to step (ii).
(iv) Calculate the significance level of each optimally merged predictor and isolate the most significant one. If this significance is beyond a criterion value, split the data according to the merged categories of the chosen predictor.
(v) Return to step (i) for each, as yet, unsplit node t.

- Type of Splits: Multiway splitting can be used with this method, for example three-way, four-way or larger splits. Single variable splits only are carried out.

- Costs/Priors: No.

- Stopping Rules: Stop if:
(i) ns < nmin.
(ii) The χ² statistic for the split on the optimally merged predictor < χ²(g−1),α on the g compound categories, for a preset value of α.

Tree Pruning: No.

Validation Procedures: No.

Interactive Ability: A current version of CHAID runs on SPSS for Windows. A series of menus with a mouse button allows the user to set values for the parameters used in the tree-growing process and begin the analysis. The tree-growing process can be interrupted at any point and the values of the parameters altered.

Graphical Ability: The graphical display of the tree works in unison with the tree-building process. As a node is split into a number of sub-nodes, the results are displayed immediately on screen by means of a decision tree. The level of detail about each node and each split carried out can also be altered.

One-Stage Optimality? Yes.

Missing Values: These are excluded from the tree-growing process.

Criticisms: No criticism of CHAID directly has been discovered in the literature. The weakness of the method, however, would appear to be the lack of any procedure for validating the results. As Einhorn (1972), p 368, stated, eight years before CHAID appeared, "... the results should be subjected to a more rigorous criterion than statistical significance or some other statistical criterion". He noted, "replication is the backbone of science and when techniques capitalise on chance fluctuations in the particular sample at hand, it is imperative to replicate (or cross-validate) the results on a new set of cases" (ibid, p 368). Another possible criticism of CHAID is the use of a direct stopping rule. The authors of CART, Breiman et al (1984), criticised this type of stopping rule on the following basis. If the significance level is set too high (large p-values), then there is too much splitting, so that the tree is too large and just a reflection of the sample data. If the significance level is too low, then one may cease splitting too early and declare a node terminal when there still existed splits with large decreases in impurity.

Examples in the Literature:
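A toy version of the category-merging step (ii) might look as follows; it is a loose sketch rather than Kass's algorithm, and it assumes SciPy's chi2_contingency (the stopping threshold alpha and the data layout are illustrative). The routine repeatedly merges the pair of predictor categories whose 2 x k subtable is least significantly different, stopping once every remaining pair differs significantly.

    import numpy as np
    from itertools import combinations
    from scipy.stats import chi2_contingency

    def merge_categories(rows, categories, alpha=0.05):
        """rows[i] holds the class counts (length k) for predictor category i."""
        table = {c: np.asarray(r) for c, r in zip(categories, rows)}
        while len(table) > 2:
            best_p, best_pair = -1.0, None
            for a, b in combinations(table, 2):
                # p-value for the 2 x k subtable of categories a and b
                _, p, _, _ = chi2_contingency(np.vstack([table[a], table[b]]))
                if p > best_p:
                    best_p, best_pair = p, (a, b)
            if best_p <= alpha:          # all pairs significantly different: stop
                break
            a, b = best_pair
            table[f"{a}+{b}"] = table.pop(a) + table.pop(b)
        return table

The compound categories returned would then be candidates for the resolution and splitting tests of steps (iii) and (iv).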
3.7 CART

Author(s): L Breiman, J H Friedman, R A Olshen and C J Stone (USA).

Introduction: Breiman et al began work on recursive partitioning in the 1970s. Their work was completed with the publication of the CART monograph (Breiman et al, 1984). The algorithm had the dual goals of providing a set of accurate decision rules, in the form of a tree, that were easily interpretable, while seeking to solve the problems inherent in some of the earlier methods of the above type, such as AID and THAID.

Classification/Regression: Standing for Classification and Regression Trees, CART handles both numeric and categorical response variables. For comparison with the other methods, and for simplicity, everything henceforth will be described in the classification context only.

Tree Growth: The tree is grown on the whole data set. If, however, the data set is overly large, a tree can be grown on only a subsample of the data.

- Splitting Method: At the root node of the tree, the splitting variable is chosen to maximise the class purity of the two descendant nodes (that is, as many observations as possible are from the same class), these being the two sets of points that went either left or right when the chosen variable was split, as well as aiding the future growth of the tree. Two different criteria are available in CART to achieve the above two aims, namely the Gini and twoing splitting criteria. Breiman et al (1984) have found that the final classification tree generated is fairly insensitive to the choice of a splitting rule. (A small numerical sketch of the Gini criterion is given below.)

The Gini splitting criterion works in the following way. Suppose at a node t an object can be assigned to class i with estimated probability p(i|t), while the estimated probability that the object is in class j is p(j|t); then the estimated probability of misclassification under the Gini index is
    i(t) = Σ(i ≠ j) p(i|t) p(j|t).
The Gini criterion seeks to maximise the function
    Δi(s, t) = i(t) − pL i(tL) − pR i(tR),
where pL and pR are the proportions of observations at node t sent left and right respectively by the split.

The twoing criterion seeks to amalgamate the set of k classes into two superclasses, C1 and C2. The measure of goodness-of-split, Δi(s, t), is computed as if it were a two-class problem. "The idea is then, at every node, to select the conglomeration of classes into two superclasses so that considered as a two-class problem, the greatest decrease in node impurity is realised" (Breiman et al, 1984, p 105). The twoing splitting rule thus maximises
    Φ(s|t) = pL Σj |p(j|tL) − p(j|t)| + pR Σj |p(j|tR) − p(j|t)|.
In general, Breiman et al state that Gini tends to split into one small pure node and one large impure node, that is, to separate the classes out one at a time. Breiman et al call this end cut preference. In contrast, twoing favours splits that tend to make the two descendant nodes as pure as possible in the two superclasses. "It gives strategic splits and informs the user of class similarities .... [It] attempts to group together large numbers of classes that are similar in some characteristic [near the top of the tree] ... [and] attempts to isolate single classes [near the bottom of the tree]" (ibid, p 105). The twoing criterion is in fact the same as the delta criterion used in THAID (see Morgan and Messenger, 1973).
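The Gini index and the goodness-of-split it induces are easy to compute directly; the sketch below (Python/NumPy, illustrative only) uses the identity Σ(i ≠ j) p(i|t) p(j|t) = 1 − Σj p(j|t)².

    import numpy as np

    def gini(counts):
        """Gini index i(t) = 1 - sum_j p(j|t)^2 for a vector of class counts."""
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        return 1.0 - (p ** 2).sum()

    def gini_decrease(parent, left, right):
        """Delta-i(s, t) = i(t) - pL*i(tL) - pR*i(tR) for a candidate binary split."""
        n = sum(parent)
        pL, pR = sum(left) / n, sum(right) / n
        return gini(parent) - pL * gini(left) - pR * gini(right)

For instance, gini_decrease([50, 50, 50], [50, 0, 0], [0, 50, 50]) measures how much the first iris split above purifies the data; CART evaluates this quantity for every candidate split and takes the maximum.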
- Type of Splits: CART is a binary recursive partitioning algorithm that sends a case either left or right. Univariate splits using both ordered and categorical variables can be carried out, as well as linear combination splits for ordered (quantitative) variables only.

- Costs/Priors: Both priors and misclassification costs can be varied and incorporated into the CART tree-building process. Priors can be varied to take account of samples that are not representative of the populations from which they came, that is, where the class sample proportions differ from the class proportions in the population. Misclassification costs can also be varied to take into account instances where it is more serious to misclassify some class(es) than other class(es). At a node t, the estimated probability of misclassification using the Gini index is then
    Σ(i, j) C(i|j) p(i|t) p(j|t),
where C(i|j) is the cost of classifying a class j object into class i.

- Stopping Rules: The process is terminated only when all nodes that have not yet been split are pure, or if the node size for all unsplit nodes falls below a specified value.

- Node Classification/Prediction: A node is assigned to the class with the largest number of observations in that node, in the case of priors proportional to sample size and unit costs. If either priors or costs are varied, then the classification of a node must also take into account the values of the costs of misclassification and the class priors.

Tree Pruning: It is clear that CART's stopping rule produces a tree that could be very large, with an overly optimistic error rate, and with many of the splits near the bottom of the tree occurring only because of noise in the data. To guard against this possibility, CART employs a backwards recursive node recombination or pruning algorithm on the completed tree. The algorithm proceeds as follows. For any subtree T of Tmax, where Tmax is the fully grown tree, define the cost-complexity measure Rα(T) as
    Rα(T) = R(T) + α L(T),
where R(T) is the resubstitution error estimate of the subtree, L(T) is the number of terminal nodes in T and α ≥ 0 is the complexity parameter. For each value of α, find the subtree Tα that minimises Rα(T) above. In practice, each successive pair of descendant nodes is recombined and an estimate of accuracy is compared for the split/unsplit situations using the cost-complexity function above; a larger penalty is assigned to a larger sized tree. If there is no improvement in accuracy then the two descendant nodes are recombined and tested for accuracy in the same manner. If there is some improvement from splitting, this subtree is retained and the process continues, trying to find the next smallest subtree which produces an increase in accuracy through splitting. The end result is a sequence of trees with decreasing size and increasing resubstitution error rate.
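The same minimal cost-complexity idea is exposed in the present-day scikit-learn library (used here only as a stand-in for CART). Assuming the iris data as illustration, the sketch below lists the sequence of effective α values and the progressively smaller trees they select.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    full = DecisionTreeClassifier(random_state=0)
    # Each alpha on the path defines the subtree minimising R(T) + alpha * L(T)
    path = full.cost_complexity_pruning_path(X, y)
    for alpha in path.ccp_alphas:
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
        print(f"alpha = {alpha:.4f}: {pruned.get_n_leaves()} terminal nodes")

Printing the leaf counts makes the decreasing-size, increasing-error sequence described above visible, ready for test sample or cross-validated selection.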
An 'honest' sized tree can be obtained either by running an independent test sample down this sequence of subtrees and selecting the tree having the minimum error rate, or by using g-fold cross-validation, 2 ≤ g ≤ n. With cross-validation in the CART context, a tree of maximum size is grown in each of the g sets of size (n − n/g) and the pruning algorithm is carried out as for the original sample. Then, each of the g sets of omitted observations (of size n/g) can be used as an independent test sample for the sequence of subtrees created by the pruning algorithm. The sizes of each of the g trees are averaged, and the tree from the learning sample that is closest in size to the average of the chosen cross-validated trees is then selected. Breiman et al recommend using a measure of error as well, whereby
    se(R(T)) = [R(T)(1 − R(T))/n]^(1/2)
is the standard error estimate for the test sample or cross-validated error rate R(T). The idea is to choose the smallest tree within one standard error of R(T). This was recommended to reduce the size of the decision tree created, as it was found that independent error rate estimates were fairly constant over quite a wide range of tree sizes.

Validation Procedures: As seen above, test sample validation and cross-validation are used by CART to select the "right-sized" tree. These two techniques are also used to estimate the true error rate of the prediction rules obtained.

Interactive Ability: CART by Systat provides an enhanced version of CART that is more user-friendly than the original. There is some interaction with a menu system; however, a whole tree must be grown before any alterations can be made. CART, though, does not allow the user to change the splitting variable at an intermediate stage of the tree-growing process.

Graphical Ability: CART by Systat produces files that can be displayed by an independent graphics program after the CART analysis has been carried out. This is not possible in the original CART.

One-Stage Optimality? Yes.

Missing Values: Missing values are handled by what Breiman et al call surrogate splits, defined as follows. Suppose that s* is the optimal partition of a node t into tL and tR. If a split, sj, is carried out on a variable, xj, then the probability that sj sends the cases in t the same way as s* is
Using the zero standard error rule, however, can mean that trees are too large if there is noise in the data. Grajski, K A, Breiman, L, Viano Di Prisco, G and Freeman, W J ( i 986). Classification of EEG spatial patterns with tree? structured methodology: CART, IEEE Transactions on B iomedical Engineering, 33, pp 1076- 1 086. Ildiko, E F and Lanteri, S ( 1 989) . Classification models: discriminant analysis, SIMCA, CART, Chemometrics and Intelligent Laboratory Systems, 5, pp 247-256. Crawford, S L and Souders. S K ( 1 990). A comparison of two conceptual clustering algorithms, International Journal of Pattern Recognition and Artificial Intelligence, 4, pp 409-420. 3.8 C4.5 Author(s) J R Quinlan (Australia). Introduction Published in 1 986, C4.5 is a descendant of ID3 (Quinlan, 1 979). It is described by Quinlan et al ( 1 986), p 1 57 , as ". . . a new inductive inference tool that is capable of dealing with large volumes of messy, real-world data". Unlike ID3 , though, the tree-growing process is followed by a number of pruning procedures. In addition, a larger range of options and parameter settings is available in C4.5 than ID3. Classification/Regression Handles classification problems only. Tree Growth: The first stage of the process is very similar to ID3. According to Quinlan et al ( 1 986) , a subset (approximately 1 0% ) of the learning sample is chosen at random. This subset is known as a working set. A decision tree is grown on the working set. The remaining 90% or so of cases in the learning sample are classified using this decision tree. If all the observations from the learning sample are correctly classified, then the process stops and the decision tree is satisfactory. Otherwise, another set of observations from the learning sample is added to the working set and a completely new tree is grown. - Splitting Method The gain criterion, as used by ID3, can also be used in C4.5. An alternative is the gain ratio criterion. If a variable, xj , has v distinct values then v possible descendant nodes can be found from splitting on xi . The information measure or 'correctness of the answer' , IV(xi) from splitting on xj is found by V IV(x?) = - I. mi + ni log mi + ni .1 ? 1 m + n 2 m + n I= where m and n are number of observations from class 1 and class 2 respectively, while mi and ni are the number of class 1 and class 2 observations in the ith node. Let the expected information content from a split on Xj be defined as v m? + n? E(x?) = I. 1 1 IV(x?) . .1 i= l m + n J 47 48 - Type of Splits - Priors/Costs - Stopping Rules The gain in splitting on xj is: The gain ratio criterion chooses the variable with the maximum ratio of gain(xj)/IV(xj) to split on, subject to a number of minor constants. Multiway splitting, as in ID3 , is carried out. Splitting is done on every distinct value of the splitting variable. Only univariate splits are carried out. No. The process is halted, when after a few iterations of the tree? growing process, no decrease in misclassification error rate has been observed. This would usually occur if there were inconsistencies in the learning sample. - Node Classification/Prediction Tree Pruning A node is assigned to the class with the maximum number of observations in that node. In contrast with Breiman et al ( 1 984), C4.5 uses pessimistic prun ing, to decide tree size. Quinlan ( 1 987), defines the method as follows. 
- Type of Splits: Multiway splitting, as in ID3, is carried out. Splitting is done on every distinct value of the splitting variable. Only univariate splits are carried out.

- Priors/Costs: No.

- Stopping Rules: The process is halted when, after a few iterations of the tree-growing process, no decrease in the misclassification error rate has been observed. This would usually occur if there were inconsistencies in the learning sample.

- Node Classification/Prediction: A node is assigned to the class with the maximum number of observations in that node.

Tree Pruning: In contrast with Breiman et al (1984), C4.5 uses pessimistic pruning to decide tree size. Quinlan (1987) defines the method as follows. Let T be a subtree of the tree Tmax, containing L(T) terminal nodes, and let K and J be the total number of observations and the number of misclassified observations respectively in subtree T. A pessimistic view of T is that it will misclassify
    L = J + L(T)/2
out of the K unseen cases, with standard error
    se(L) = [L(K − L)/K]^(1/2).
The above involves using the continuity correction as in binomial probabilities. Let E be the number of observations misclassified by the best terminal node within T. The pessimistic pruning algorithm replaces T by the best terminal node whenever E is within the limits of L + se(L). A number of repetitions of the tree-growing and pruning process are carried out. As the initial working set is selected purely at random, the same learning sample can give rise to completely different trees, as completely different parts of the learning sample may be chosen. The pruning process selects the best trees based on a combination of low misclassification error and small tree size. (A sketch of the pessimistic test is given after this section.) Quinlan (1987) states the following as the two prime advantages of pessimistic pruning:
(i) It is faster than other pruning methods.
(ii) It does not need a test sample distinct from the learning sample.
The validity of the second advantage is questionable, as some form of bootstrapping or cross-validation could be used in the pruning process.

Validation Procedures: Validation of the rules is carried out after the creation of the decision trees. Validation by an independent test sample is done by dividing the data into a test sample and a learning sample before analysis.

Interactive Ability: No.

Graphical Ability: No.

One-Stage Optimality? Yes.

Missing Values: Either omit observations with missing values from the analysis, or use class sample sizes multiplied by some parameter to estimate the missing values.

Criticisms: Recent studies by Feng et al (1993) and Schwartz et al (1993) have compared C4.5 with other classification methods. Schwartz et al found that C4.5 was very robust to noise in the data, produced sensibly sized trees and provided new insight into a particular set of data by uncovering important relationships among the variables. C4.5, however, most probably due to its creation in the machine learning environment, has no mechanism for incorporating a cost structure. Schwartz et al note that C4.5 produced large differences in group misclassification error rates. Feng et al (1993) carried out a more thorough study than Schwartz et al. Their results tentatively showed that C4.5 produced larger trees than CART, hence produced rules that were biased towards the learning sample. They note "[r]eliability is negatively related to the difference between [learning] and testing accuracy" (Feng et al, 1993, p 48). As a tree increases in size, node sample sizes decrease, so rules are being generated from smaller and smaller samples, which makes those rules less reliable. They conclude by saying "[as] our tests show, even simply introduc[ing] a cost-handling mechanism in the testing phase (in CART) can make a visible improvement compared to, say, C4.5".

Examples in the Literature:
Quinlan, J R, Compton, P J, Horn, K A and Lazarus, L (1986). Inductive knowledge acquisition: a case study, in "Applications of Expert Systems, Volume 1", J R Quinlan (ed), Wokingham: Addison-Wesley, pp 157-183.
Schaffer, C (1993). Selecting a classification method by cross-validation, Personal Communication.
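A minimal sketch of the pessimistic pruning test described in Section 3.8 follows (Python; a loose reading of the account above, with the leaf's error count given the same half-unit continuity correction). The function names and calling convention are invented for illustration.

    import math

    def should_prune(K, J, n_leaves, leaf_errors):
        """Pessimistic test: replace subtree T by its best terminal node when
        that node's (corrected) errors fall within L + se(L), L = J + L(T)/2."""
        L = J + n_leaves / 2.0                 # continuity-corrected subtree errors
        se = math.sqrt(L * (K - L) / K)        # binomial-style standard error
        return leaf_errors + 0.5 <= L + se     # corrected leaf error count

For example, should_prune(K=100, J=10, n_leaves=6, leaf_errors=12) tests whether a six-leaf subtree misclassifying 10 of its 100 cases should collapse to a single node misclassifying 12.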
3.9 FACT

Author(s): W Y Loh (USA) and N Vanichsetakul (Thailand).

Introduction: Published by Loh and Vanichsetakul in 1988 at the University of Wisconsin, the full title of the program is Fast Algorithm for Classification Trees. The goal of the procedure is an algorithm sharing the best features of LDA and CART, namely the speed of linear techniques and the readily comprehensible structure of decision trees.

Classification/Regression: FACT deals with categorical dependent variables only.

Tree Growth: The tree is grown on the whole data set.

- Splitting Method: Three splitting algorithms are used by FACT. The first deals with univariate splits. Univariate F-ratios for variable selection are used at each node to obtain the variable with the highest F-ratio for splitting, and LDA is then carried out on the selected variable to partition the co-ordinate axis (a small sketch of this variable selection is given below). If the largest F-ratio of between- to within-class variance is less than a specified threshold, F0, no split is formed and the node is declared terminal. Linear combination splits can also be generated by FACT, using principal component analysis of the correlation matrix at each node; LDA is then carried out on the scores of the m largest principal components, with m depending on user input. Loh and Vanichsetakul (1988) prefer linear combination splits over univariate splits. A third method of splitting can be used whenever spherical symmetry is detected in a node, where univariate and linear combination splits would be ineffective. Polar coordinate splits solve this problem, which involves transforming the best splitting variable to polar coordinates, after subtracting the node mean from each observation yij, and splitting on the resulting transformation, where yij is the ith principal component score for the jth observation.
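The univariate F-ratio selection is easily sketched with SciPy's one-way ANOVA routine (a stand-in used for illustration, not FACT itself); X is an n x p data matrix and y the class labels.

    import numpy as np
    from scipy.stats import f_oneway

    def best_f_ratio_variable(X, y):
        """Return the column of X with the largest between- to within-class F-ratio."""
        classes = np.unique(y)
        f_stats = [f_oneway(*(X[y == c, j] for c in classes)).statistic
                   for j in range(X.shape[1])]
        return int(np.argmax(f_stats)), max(f_stats)

If the returned F-ratio falls below the threshold F0, the node would be declared terminal, as described above; otherwise LDA would be applied to the selected variable to place the split points.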
- Type of Splits: Multiway splitting can be used by FACT, dividing a node into two, three or more descendant nodes. As mentioned previously, splitting can be carried out using only one variable at a time, or a linear combination of the variables.

- Priors/Costs: As FACT uses LDA to split each node, both different priors and cost matrices can be incorporated into the tree-building process.

- Stopping Rules: A direct stopping rule is used to determine tree size. Splitting is stopped if the error rate found from resubstituting the original sample does not decrease with splitting, or if the node size falls below a certain value.

- Node Classification/Prediction: A node is assigned to the class with the largest number of observations in that node, unless priors and/or costs are altered.

Tree Pruning: No.

Validation Procedures: The final decision tree generated by FACT can be validated by g-fold cross-validation, 2 ≤ g ≤ 25.

Interactive Ability: No.

Graphical Ability: Draws trees using Splus functions.

One-Stage Optimality? Yes.

Missing Values: Missing values are replaced by class means estimated from non-missing values in the learning sample.

Criticisms: The principal criticisms of FACT appeared in Breiman and Friedman (1988), who regard FACT as a step back in the evolution of binary decision trees. They criticise FACT on the following grounds:
(i) The principal motivation for FACT is computational; accuracy is sacrificed for running time and simplicity.
(ii) Linear combination splits are not better than univariate splits. In most cases where recursive partitioning has performed better than traditional parametric methods, it has been through univariate splits.
(iii) Top-down stopping rules, as used in AID, THAID, etc, were one of the main reasons why those methods were never really recognised. "The optimal-complexity tree pruning algorithm (based on cross-validatory choice) implemented in CART is probably the most important contribution of Breiman et al (1984)" (Breiman and Friedman, 1988, p 726).
(iv) FACT cannot handle categorical variables in a clean and elegant way.
(v) FACT is not invariant under transformations of the variables.
(vi) There are no surrogate variables to handle missing values.

Examples in the Literature:
Wolberg, W H, Tanner, M A, Loh, W Y and Vanichsetakul, N (1987). Statistical approach to fine needle aspiration diagnosis of breast masses, Acta Cytologica, 31, pp 731-741.

3.10 KnowledgeSeeker

Author(s): B de Ville, E Suen and D Biggs (Canada).

Introduction: Released commercially in 1989, KnowledgeSeeker is a decision tree package that, according to its authors, "... mine[s] a database for its critical decision-making and problem-solving information" (de Ville, 1990, p 30). The results are presented in a graphical display with an easy-to-use interactive ability which "... provides both end users and specialists with high levels of interaction of accurate, illuminating and reliable decision-making information and knowledge based rules" (ibid, p 30).

Classification/Regression: Both numeric and categorical response variables can be handled. For comparison with the other methods and simplicity, everything henceforth will be described in the classification context only.

Tree Growth: The tree is grown on the whole data set.

- Splitting Method: KnowledgeSeeker seeks to overcome the problems inherent in AID and ID3 by using a significance testing approach to splitting, as used in the CHAID program (Kass, 1980). Two alternatives exist for splitting using the significance testing approach (a sketch of the test statistic involved is given below). The first uses exhaustive partitioning, as in CART, searching over all possible combinations of values of every variable to find the split which maximises the χ² statistic with respect to the class variable. This method is guaranteed to find the optimal split for the data at hand based on statistical inference. The second approach is to use a heuristic clustering technique. Values of a particular variable are grouped with one another on the basis of their similarity in the response variable. This merging of values continues until no further merging is significant at a specified level of significance. Once values are merged, they can be split again using a more stringent level of significance. This approach is not optimal, but de Ville (1990) regards it as intuitive and appealing.
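The χ² statistic that both approaches maximise can be computed for any candidate partition of a node, as in this minimal sketch (Python with SciPy; the function and data layout are hypothetical).

    import numpy as np
    from scipy.stats import chi2_contingency

    def chi2_for_partition(y, groups):
        """Chi-squared statistic for a candidate partition of a node.
        `groups` assigns each case to a branch; `y` holds the class labels."""
        branches = np.unique(groups)
        classes = np.unique(y)
        # Branch-by-class contingency table for the candidate partition
        table = np.array([[np.sum((groups == b) & (y == c)) for c in classes]
                          for b in branches])
        stat, p, dof, _ = chi2_contingency(table)
        return stat, p

Evaluating chi2_for_partition over all candidate groupings of a predictor's values and keeping the largest statistic mimics the exhaustive search described above.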
- Type of Splits: Since the most alike values of a particular predictor are clustered together, multiple branches can accrue from the same node. That is, multiway partitioning is used by KnowledgeSeeker. Splits are carried out on only one variable at a time.

- Costs/Priors: No.

- Stopping Rules: Splitting is stopped if either:
(i) the node size falls below a certain value, or
(ii) the optimal split on a predictor at a particular node does not exceed a specified significance level.

- Node Classification/Prediction: A node is assigned to the class with the largest number of observations in that node.

Tree Pruning: According to de Ville (1990), KnowledgeSeeker supports both validation and tree pruning methods, which either verify the decision tree or rate the quality of new branches on the decision tree and truncate them if their quality fails to pass a certain threshold value. In the literature of decision tree methods, the above is NOT a pruning method, but rather a set of tests used in the tree-growing process. Thus KnowledgeSeeker, like CHAID, its closest ancestor, does not have a tree pruning method.

Validation Procedures: Version 2.0 of KnowledgeSeeker does incorporate a validation procedure whereby part of the data is used to grow the tree while the other part of the data is used to test the rules created. It is relatively quick and easy to divide the data in two and use each half in turn as a learning sample and a test sample. Unfortunately, the latest version (2.1) of KnowledgeSeeker does not contain any validation procedure.

Interactive Ability: With the click of a mouse button, the user can investigate other possible partitionings of the decision tree at every step of the tree-growing process. KnowledgeSeeker automatically calculates the best alternate splits at every node, so the user can examine the effects on the tree of changing from the best split to the best alternate/second best alternate split, etc, so as to "... correspondingly mould the creation of the decision tree or rule base to support their understanding of the problem area and decision making task at hand" (de Ville, 1990, p 30). In addition, the user has the choice of either growing the tree automatically or growing it on a node-by-node (stepwise) basis. All operations are started by the use of a pull-down menu.

Graphical Ability: The graphical display of the tree works in unison with the tree-building process. As a node is split into a number of sub-nodes, the results are displayed immediately on the screen by means of a decision tree. The user can interrupt the process to investigate the current state of the tree and then continue from the point where splitting ceased. The level of detail about each node and each split carried out can also be altered.

One-Stage Optimality? KnowledgeSeeker is, in theory, one-stage optimal. In practice, however, it is relatively quick and easy to investigate the effects on the tree of changing the partition of a particular node to one of the other significant partitions.

Missing Values: KnowledgeSeeker handles missing values in two different ways:
(i) They are excluded from the decision tree growing process.
(ii) They are treated as an additional category of a variable, and so can be combined with the categories that they most resemble.

Criticisms: Very few, if any, reviews of KnowledgeSeeker have appeared in the literature. Biggs et al (1991) conducted some simulation studies with different significance levels. They found that KnowledgeSeeker can confidently be used with either small or large data sets involving categorical predictors and a categorical response. They also found that the same confidence applied with continuous responses, provided that the response was approximately normally distributed with roughly equal variances. One major criticism that could be made of KnowledgeSeeker is that it moves away from any form of validation of the results by independent test samples.
As Breiman et al (1984) state, the use of the resubstitution estimate of error rate as an estimate of the true error rate can give an overly optimistic picture of the set of decision rules constructed. The omission of a validation procedure goes against current statistical practice in the decision tree field and even contradicts what was written in de Ville (1990). In a personal communication, de Ville (1994) affirms that the forthcoming version of KnowledgeSeeker does support a hold-back sample and validation facility.

Examples in the Literature:

3.11 Splus trees()

Author(s): L Clark (USA) and D Pregibon (Canada).

Introduction: Developed in 1991, using the Splus language, to carry out a CART-like decision tree modelling method. Of all the decision tree-based methods, the Splus tree routines are the closest to those used by CART, with binary recursive partitioning, pruning and cross-validation.

Classification/Regression: Splus trees() handles both categorical and numeric dependent variables, hence can be used for classification and regression problems. For comparison with the other methods and simplicity, everything henceforth will be described in the classification context only.

Tree Growth: The tree is grown on the whole data set.

- Splitting Method: The deviance function for an observation yi is defined as (Clark and Pregibon, 1992)
    D(μi; yi) = −2 Σ(j=1 to k) yij log(μij),
that is, negative two times the log-likelihood function, where μij denotes the probability that the ith response falls in the jth class. At a given node, the mean parameter μ is constant for all observations. The maximum likelihood (or minimum deviance) estimate of μ is given by the node proportions. The deviance of a node is defined as the sum of the deviances of all the observations in the node,
    D(μ; y) = Σi D(μ; yi).
A node where all the observations belong to the same class will have a deviance of zero. Splitting is achieved by comparing the deviance of the current node to that of the two descendant nodes, where the combined deviance of the two descendant nodes is D(μL; yL) + D(μR; yR), and the split that maximises the reduction in deviance,
    D(μ; y) − D(μL; yL) − D(μR; yR),
is the split used at the given node. (A small numerical sketch of this criterion is given below.)
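With the node proportions as the estimate of μ, the node deviance reduces to −2 Σj nj log(pj), which the following Python/NumPy sketch computes (an illustration of the criterion, not the Splus code itself).

    import numpy as np

    def node_deviance(counts):
        """D(mu; y) for a node: -2 * sum_j n_j * log(p_j), p_j = node proportions."""
        n = np.asarray(counts, dtype=float)
        p = n / n.sum()
        nz = n > 0                      # classes absent from the node contribute zero
        return -2.0 * (n[nz] * np.log(p[nz])).sum()

    def deviance_reduction(parent, left, right):
        """Reduction in deviance achieved by splitting the parent node."""
        return node_deviance(parent) - node_deviance(left) - node_deviance(right)

A pure node gives node_deviance([n, 0, 0]) = 0, matching the statement above, and the split chosen at each node is the one maximising deviance_reduction.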
- Type of Splits: Only binary splits are used, and only on one variable at a time.

- Priors/Costs: No.

- Stopping Rules: Splitting is stopped by one of two different rules. The first sets a minimum node size below which splitting cannot be done, while the second stops splitting if the ratio of deviances between a tree with r terminal nodes and the root node is less than some threshold value.

- Node Classification/Prediction: A node is assigned to the class with the largest number of observations in that node.

Tree Pruning: As with CART, an overly large tree, biased towards the learning sample, can be grown by Splus trees(). The next step is to apply a pruning procedure that determines a nested sequence of subtrees of the original tree by cutting off branches containing relatively unimportant splits. This is achieved by means of a cost-complexity measure
    Dα(T) = D(T) + α L(T),
where D(T) is the deviance of the subtree T, L(T) is the number of terminal nodes contained in T and α is the cost-complexity parameter. By default, the procedure produces a sequence of subtrees that minimise the cost-complexity measure. Note that this algorithm is very similar to the one used by CART, except that CART uses a measure of misclassification error rate rather than deviance.

A similar procedure used by Splus trees() is the shrink.tree() function, which determines a sequence of subtrees from the original tree that differ in their fitted values. The function uses the recursion relation
    ỹ(node) = α ŷ(node) + (1 − α) ỹ(parent),
where ŷ(node) is the usual fitted value for each node and ỹ(parent) is the shrunken fitted value for the node's parent, which was in turn obtained in the same way. The technique basically uses a parameterization of α that optimally shrinks the descendant nodes to their parent nodes, based on the magnitude of the difference between ŷ(node) and ỹ(parent). A plot of deviance against size for the subtrees found by shrinking gives a smooth decreasing curve which flattens out as the size of the subtrees increases.

A second approach Splus trees() uses to test the sequence of subtrees produced by either pruning or shrinking is g-fold cross-validation, 2 ≤ g ≤ n, selecting the tree with the minimum cross-validated deviance rather than the minimum error rate. No standard error rule is used by Splus trees(). Thus, like CART, the decision tree that is produced should be relatively robust, giving a set of rules that remain valid when applied to another set of data from the same population.

Interactive Ability: Splus trees() has an option to allow the user to examine the goodness of split for each variable at a particular node. This information is conveyed by means of a scatter plot for categorical variables and a high density bar graph for numeric predictors. The user is able to see which variable (and which value(s) of that variable) is the best discriminator of the class variable. In order to change the splitting variable from that chosen by the splitting algorithm, however, the edit.tree() function is called, with the variable to be split and its value explicitly stated.

Graphical Ability: Trees can be drawn and labelled through Splus graphics. Either the mouse or Splus commands can be used to edit the tree, examine splits and examine the distribution of the classes in the terminal nodes.

One-Stage Optimality? The method is one-stage optimal, examining only the best splits at the current node. As seen above, though, the splitting variable at a particular node can easily be changed, but a new tree cannot be grown after changing the splitting variable.

Missing Values: The function na.tree.replace() is used to handle missing values in Splus trees(). The function creates a new level, coded 'NA', for any variable containing missing values. Numeric predictors are first grouped into c categories. Clark and Pregibon (1992) describe how missing values are predicted as follows: "The approach we adopt is that once an NA is detected while dropping a (new) observation down a fitted tree, the observation 'stops' at that point where the observation is required to continue the path down the tree. This is equivalent to sending an observation down both sides of any split requiring the missing value and taking the weighted average of the vector of predictions in the resulting set of terminal nodes."

Criticisms: Many of the criticisms levelled at CART would also apply to Splus trees(). However, "[t]he S computing language ... is currently one of the most developed interactive programming environments for data analysis and graphics" (Le Blanc and Crowley, 1993, p 466). The interactive facilities that Splus trees() has appear to give it a distinct advantage over CART. As Clark and Pregibon (1992), p 415, state, "[o]ur recommended approach to tree building is far less automatic than that provided by other software for the same purpose, as the unbundling of procedures for growing, displaying and challenging trees requires user initiation in all phases". Perhaps one other criticism of Splus trees() is that the method requires an adequate knowledge of the Splus language, which is neither menu-driven nor user-friendly.
As Clark and Pregibon (1992), p 415, state: "[o]ur recommended approach to tree building is far less automatic than that provided by other software for the same purpose, as the unbundling of procedures for growing, displaying and challenging trees requires user initiation in all phases". Perhaps one other criticism of Splus trees() is that the method requires an adequate knowledge of the Splus language, which is neither menu-driven nor user-friendly.

Examples in the Literature:
Bradford, E (1993). Tree-based models in S, New Zealand Statistician, 28, pp 36-51.
Morton, S C (1992). Personal crunching: new advances in statistical dendrology, Chance, 5, pp 76-79.

3.12 IND

Author(s): W Buntine (Australia).

Introduction: Introduced in 1991, the technique tries to combine the simplicity of decision tree rules with the power of Bayesian methods, which are used for splitting, smoothing and tree averaging. "IND provides a potentially bewildering number of options to allow the user to precisely control how data is interpreted, how trees are grown and tested, and how results are displayed" (Buntine and Caruana, 1993, p 1-4). As well, IND has the ability to simulate the CART and C4.5 tree-based methods, to follow a minimum message length idea such as that used in Wallace and Patrick (1993), or indeed to follow the newly developed decision graph approach of Oliver (1993).

Classification/Regression: Used only for classification.

Tree Growth: The tree is grown on the whole data set.

Splitting Method: IND can choose from a number of different criteria when evaluating the quality of different splits or tests. For example, the Gini and twoing splitting criteria (see Section 3.7) can be used, or the gain ratio criterion as used with C4.5 (see Section 3.8). Buntine recommends the use of Bayesian splitting, which evaluates each possible split of a particular node into several sub-nodes: the Bayesian estimate of the posterior probability of each possible split being correct is evaluated, and the split producing the maximum posterior probability is carried out.

Type of Splits: Multiway splitting is used, splitting on each distinct value of a variable. Only univariate splits are carried out.

Priors/Costs: In the Bayesian mode, "class priors" in the CART sense of the word do not exist. The algorithm incorporates priors, but these are Bayesian prior probabilities of the class probabilities in each terminal node, prior to seeing any data. IND does have cost structures, and as the technique is Bayesian, their incorporation is trivial: the tree returns a class probability vector, which is combined with the cost vector so that a minimum cost decision can be made. It is a simple add-on to the tree interpretation routine.

Stopping Rules: Splitting stops when the quality measure for the best split at a particular node fails to exceed a prespecified criterion.

Node Classification/Prediction: If classification is not determined by cost, each terminal node is assigned to the class with the maximum probability within the terminal node. If costs are incorporated into the tree growing process, the terminal node is assigned to the minimum cost class.

Tree Pruning and Validation Procedures: IND adopts a Bayesian approach to pruning, which uses a smoothing technique. The usual approach is to find the class probabilities of observations in the terminal nodes.
The smoothing approach also takes into account the class probability vectors for all the intermediate (decision) nodes en route to the terminal nodes. The resultant final tree may have widely differing class probabilities for the terminal nodes. If the class probabilities of two or more terminal nodes are similar, and come from the same parent, they can be pruned upwards. One has to divide the data into a learning sample and a test sample before the tree growing process is begun; a routine exists in IND to carry this out. In a personal communication, however, Buntine (1993) recommends growing the tree on the full data set, believing that if one wants to produce the best predictions possible, all of the data should be used to grow the tree, as per standard Bayesian theory. As several different class probability trees are grown for each data set, the weighted average of the class probability vectors each tree assigns to an observation can be taken. This is the Bayesian averaging approach, which Buntine favours instead of learning/test samples.

Interactive Ability: Advanced features allow the user to interactively search for the best splits and control the tree-growing process.

Graphical Ability: IND contains a graphical display routine that is used to display the tree classifiers in various forms.

One-Stage Optimality?: No. The method provides an N-ply lookahead facility, whereby not only are the best splits examined at the current node, but also how the resulting descendant nodes and their sons should be split. For example, a 2-ply lookahead scheme would search for the best split at the current node as well as the best splits at the resulting son nodes. This may lead the user to find that the so-called 'best' split of the current node was not the optimal split in terms of future tree development, as this variable may not interact with any other variables in the data set. A 'lesser' split may in fact lead to a greater reduction in misclassification at the next stage of tree development. The N-ply lookahead facility allows the user to uncover such structure in the data. (A rough sketch of a 2-ply search is given at the end of this section.)

Missing Values: Missing values are handled using Quinlan's preferred strategy; that is, IND sends a case down each branch in the proportion found in the learning sample at that node. In effect, each case with missing variables is split into a number of parts, with the largest part going down the branch where most other cases have gone. Otherwise, a routine exists to send a case down the branch of the tree most commonly taken by other examples.

Criticisms: IND is a relatively new method, so there has yet to be an article either criticising or praising it in the literature. It would perhaps be criticised for having no mechanism to incorporate "class priors" in the CART sense of the word. Buntine (1993) feels that the main criticism of IND is that it has not been optimised to handle important real-world requirements such as real-valued splits; these can be handled, but without much refinement.
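The following is a minimal sketch, in S-compatible R, of what a 2-ply lookahead search might look like. It reuses the hypothetical node.deviance() and split.gain() functions from the Splus trees() sketch above, so deviance reduction stands in for IND's Bayesian quality measure; the function names are invented for illustration.

    ## Best one-step (1-ply) deviance reduction over all variables and cuts.
    best.gain <- function(X, y) {
      best <- 0
      for (v in seq_len(ncol(X))) {
        for (cut in unique(X[, v]))
          best <- max(best, split.gain(X[, v], y, cut))
      }
      best
    }

    ## 2-ply lookahead score for the candidate split X[, v] < cut:
    ## the immediate gain plus the best gains attainable within each of
    ## the two resulting son nodes.
    two.ply.score <- function(X, y, v, cut) {
      left <- X[, v] < cut
      if (!any(left) || all(left)) return(0)
      split.gain(X[, v], y, cut) +
        best.gain(X[left, , drop = FALSE], y[left]) +
        best.gain(X[!left, , drop = FALSE], y[!left])
    }

Ranking candidate root splits by two.ply.score() rather than split.gain() can promote a variable whose immediate gain is modest but whose descendants then split cleanly, which is exactly the structure the lookahead facility is intended to uncover.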
3.13 SUMMARY TABLE COMPARING THE TEN TREE-BASED METHODS

Table 3.13: Summary comparison of the ten tree-based methods.

Attribute | AID | THAID | ID3 | CHAID | CART
Author(s) | J N Morgan, J A Sonquist | J N Morgan, R C Messenger | J R Quinlan | G V Kass | L Breiman, J H Friedman, R A Olshen, C J Stone
Introduction | 1963 | 1973 | 1979 | 1980 | 1984
Classification/Regression | Regression | Classification | Classification | Classification | Both
Tree Growth | Uses all the data | Uses all the data | Uses a subsample of the data | Uses all the data | Uses either all or a subsample of the data
Splitting Method | Maximum between- to within-group variance | Theta or delta splitting criterion | Maximisation of the gain criterion | Maximising the χ² statistic of grouped categories | Gini or twoing splitting criterion
Type of Splits | Binary/US | Binary/US | Multiway/US | Multiway/US | Binary/US or LC
Costs/Priors | No | No | No | No | Yes
Stopping Rules | Direct stopping | Direct stopping | All cases in the learning sample are correctly classified | No significant splits | All nodes are pure to one class
Node Classification/Prediction | Average value of the cases in the terminal node | Class with largest number of observations in the node | As for THAID | As for THAID | As for THAID, after accounting for costs and priors
Tree Pruning | No | No | No | No | Cost-complexity
Validation Procedures | No | No | No | No | Yes
Interactive Ability | No | No | No | Yes | Yes/No
Graphical Ability | No | No | No | Yes | Yes
One-Stage Optimal? | Yes | Yes | Yes | Yes | Yes
Missing Values | Estimated using class means | Estimated using class means | Estimated using class proportions | Omitted | Estimated using surrogate splits
Criticisms | Produces overly large trees; too dependent on sizes of stopping rules; no validation of results | Produces overly large trees; too dependent on sizes of stopping rules; no validation of results | Does not know when to stop; no validation of the results | No validation of the results | Instability of cost-complexity pruning; tree size affected too much by the standard error rule

Table 3.13 (cont'd)

Attribute | C4.5 | FACT | KnowledgeSeeker | Splus trees() | IND
Author(s) | J R Quinlan | W Y Loh, N Vanichsetakul | B de Ville, E Suen | L Clark, D Pregibon | W Buntine, D Diggs
Introduction | 1986 | 1988 | 1989 | 1991 | 1991
Classification/Regression | Classification | Classification | Both | Both | Classification
Tree Growth | Uses a subsample of the data | Uses all the data | Uses all the data | Uses all the data | Uses all the data
Splitting Method | Maximisation of the gain ratio criterion | Discriminant analysis | Maximising the χ² statistic of grouped categories | Likelihood ratio (deviance) statistic | Quality measure
Type of Splits | Multiway/US | Multiway/US or LC | Multiway/US | Binary/US | Binary/US
Costs/Priors | No | Yes | No | No | Yes/No
Stopping Rules | Direct stopping | Direct stopping | No significant splits | Deviance below a certain value | Quality measure below a certain value
Node Classification/Prediction | As for THAID | As for CART | As for THAID | As for THAID | Maximum class probabilities after accounting for cost
Tree Pruning | Pessimistic | No | No | Cost-complexity with deviances | Bayesian
Validation Procedures | Yes | Yes | No | Yes | Yes
Interactive Ability | No | No | Yes | Yes | Yes
Graphical Ability | No | Yes | Yes | Yes | Yes
One-Stage Optimal? | Yes | Yes | Yes/No | Yes | No
Missing Values | Estimated using class proportions | Estimated using class means | Creation of a new category | Creation of a new category | Estimated using class proportions
Criticisms | No mechanism for incorporating a cost structure; tends to produce overly large trees | Decision rules not simple and accurate; direct stopping rules are used; not robust to non-normality | No validation of the results | Instability of cost-complexity pruning | No mechanism for incorporating a priors structure; not designed for real-valued splits

4. SIMULATION STUDIES INVOLVING CONTINUOUS DATA

4.1 INTRODUCTION

Data sets involving distinct groups or populations arise in many disciplines, including the social sciences, business and medicine.
Multivariate data sets are often not easy to analyse, so it is important that the method used should be both powerful and easy to understand. In the classification context, the prediction rule should be accurate, that is, have a low, unbiased error rate, yet be as easy to interpret as possible. In Section 4.2, through a literature survey, an investigation is carried out into the various types of error rate estimates used in the field of classification. Next, in Sections 4.3 and 4.4, a comparison is made of four classification methods from the domains of traditional discrimination and tree-based methods. LDA and QDA are the two classification methods most commonly used in statistics to handle this type of data. These two parametric techniques are compared and contrasted with two tree-based methods, CART and FACT. (See Sections 2.2, 2.3, 3.7 and 3.9 for details on LDA, QDA, CART and FACT respectively.) This chapter compares both the accuracy and the reliability of these four classification methods in classifying individuals into two multivariate populations under certain combinations of parameters. Three types of data distribution will be investigated: normal, lognormal and standardised lognormal. The robustness of each method to a change in the value of the a priori probabilities of class membership will also be determined.

4.2 ERROR RATES

A great deal has been written on the subject of error rates in classification analysis over the past three or more decades. In this section, a review of the literature on error rates is given, as well as a formal definition of each of the error rate estimators, including those to be used later in this chapter. Extensive reviews of error rate estimation may be found in Kanal (1974), Toussaint (1974), Lachenbruch (1975), Efron (1982, 1983), Hand (1986) and McLachlan (1986, 1987), among others.

In any classification problem, the object is to assign a random observation x to one of k populations, Π1, ..., Πk. Paraphrasing Toussaint (1974), one of the most important problems with the use of any classification method is estimating the probability of misclassification. The optimal error rate of any classifier is the Bayes error rate, defined as

    R(B) = 1 − ∫ maxi [fi(x) πi] dx    (4.2.1)

where fi(x) is the class conditional density function for Πi and πi is the a priori probability of belonging to Πi. According to Hand (1986), p 335, "[this] is the minimum possible error rate given a set of [variables]." This is the error rate that would result if the class conditional density functions were known. The actual or true error rate, R(T), is defined as the expected probability of misclassification when the class conditional density functions are known and the Bayes rule is not used. In practice, this is the error rate that would result when applying the classifier to an infinite test sample. In notation, with Ai denoting the region of allocation to Πi, and assuming that π1 = π2,

    R(T) = ½ ∫A2 f1(x) dx + ½ ∫A1 f2(x) dx    (4.2.2)

The probability that x falls in A1, given that x ∈ Π2, is given by (4.2.3), where D(x) is as defined in (2.2.5); it follows that Pr(D(x) < 0 | Π1) is the corresponding probability of misallocating an observation from Π1.
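A minimal sketch in S-compatible R of what (4.2.1) means in practice, using a univariate two-class normal problem for simplicity (the studies in this chapter are multivariate; the means and priors here are illustrative assumptions):

    ## Monte Carlo check of the Bayes error rate R(B) for two univariate
    ## normal populations with equal priors.
    set.seed(1)
    pi1 <- pi2 <- 0.5
    x <- c(rnorm(50000, mean = 0), rnorm(50000, mean = 2))  # mixture draws
    truth <- rep(1:2, each = 50000)

    ## Bayes rule: allocate to the class with the larger pi_i * f_i(x).
    bayes <- ifelse(pi1 * dnorm(x, 0) >= pi2 * dnorm(x, 2), 1, 2)
    mean(bayes != truth)      # approximates R(B) = 1 - pnorm(1), about 0.159

Here the true densities are known, so the rule attains R(B); a rule built from estimated densities has a long-run error equal to the actual rate R(T), which is necessarily at least R(B).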
The second analysis compared LDA, QDA, CART and FACT using the ratio of the two group error rates. The results of the split-plot ANOVA show that, as in the previous analysis, the R main effect was by far the most important effect (F = 164.12), with the R * f(.) interaction (F = 71.8) also being important. As with the previous analysis, the R * f(.) * e interaction (F = 21.14) was significant, so results will be presented in terms of these three factors. Analysis of the residuals showed a rather funnel-like pattern, implying that the variation of the residuals was not constant: increasing the fitted value increased the variation of the residuals. Box plots of the residuals for each method showed, as in the previous analysis, that the mean square error for FACT was much larger than that for the other three methods. Figures 4.2a and 4.2b give the mean ratios of the group error rates using the adjustment factor mentioned earlier. The ideal situation in this case is where R(1/2)/R(2/1) = 1. The trends observed are similar to, yet somewhat different from, those in Figures 4.1a and 4.1b. Figure 4.2a shows that, in the case of f(.) being normally distributed, the LDA and QDA error rates were the most stable of all the methods over the given priors-covariance levels. CART performed very badly in the unequal priors case, while the results for FACT when e = 3 and e = 4 were omitted from the graph due to the excessively high values recorded. Figure 4.2b shows that, in the case of lognormal data, QDA was the best method. It must be remembered, as noted in Section 4.3.1, that the variance for the second class, where x2 > 0, will be substantially larger than that for the first class, where x1 = 0; therefore, the quadratic rules should be expected to work better than the linear rules. LDA again performed badly but, as in Figures 4.1a and 4.1b, it did so consistently, whereas the ratio of priors for CART was affected mostly by sample size. The results for FACT were not really worth quoting, due to the excessively high ratios of error rates.

In the third analysis, it was decided to compare the mean difference between the cross-validation and apparent group error rates using CART and LDA, but this time using equal priors. The results from the method stratum again show the R main effect (F = 222.46) to be most important, with the R * f(.) interaction (F = 130.8) also being very important. The fourth analysis compared the CART and LDA apparent and cross-validation group error rates using the ratio of group error rates as the measure of performance. The results from the ANOVA show rather different trends from those exhibited in the previous analysis, with the magnitude of the F-ratios having dramatically decreased. For instance, for the R main effect, the F-ratio was only 2.43, which is not significant even at the 5% level. This indicates that differences between the group misclassification error rates did indeed increase with the total error rate, so that the transformations were not really necessary here, though the analysis is included for completeness. Both analyses showed that the R * f(.) * e second order interaction was highly significant, so results are presented in terms of these three factors, as in the first two analyses. Figure 4.3a shows that, in the case of the normally distributed data sets, the apparent group error rates for CART were very similar, but there was a large difference between the cross-validated group error rates, with the class having the larger sample size having the substantially larger error rate. In contrast, the apparent and cross-validated error rates for LDA were very close together, being affected most by the change in covariance structure rather than by sample size.
This is illustrated by the error rate for class 2 being larger than that for class 1 when e = 4 and 5. Figure 4.3b shows a similar pattern to Figure 4.3a, except that the differences in group error rates for LDA have greatly increased using lognormally rather than normally distributed data sets. These patterns are further exemplified in Figures 4.4a and 4.4b. It has thus been shown, in the case of equal priors, that class sample size is the main factor in determining group error rates for CART when cross-validation is done, though this problem does not manifest itself when the group error rates are calculated from the learning sample. Sample size, however, as expected, had no influence on the group error rates for LDA in the case of equal priors; these were instead influenced by the covariance structure and distribution of the data set.

[Figures 4.3a and 4.3b: Comparison of the difference between cross-validation and apparent group misclassification error rates using LDA and CART with equal priors: (a) normally distributed data; (b) lognormally distributed data.]

[Figures 4.4a and 4.4b: Comparison of the ratio of cross-validation and apparent group misclassification error rates using LDA and CART with equal priors: (a) normally distributed data; (b) lognormally distributed data.]

The fifth analysis compared the CART and LDA cross-validated group error rates using the difference between error rates as the measure of performance, while the sixth analysis used the ratio of error rates, in the case of both equal priors and PPSS. As in the previous analyses, the R * f(.) * e interaction was very important, so the results will be presented in terms of those three factors. Figures 4.5a and 4.5b show that the choice of priors did not affect the absolute difference in group error rates, no matter what the distribution of the data set. As in the previous analyses, the group error rates for LDA were most affected by change in the covariance structure of the data set. With CART, a different pattern emerges: the use of equal priors rather than priors proportional to sample size meant a reduction in the difference between group error rates, and the greatest influence on individual group error rates for CART was sample size. Figures 4.6a and 4.6b confirm these trends. The next step in the analysis was to compare the overall LDA and CART error rates using equal priors. As with the group misclassification error rates, the results were analysed using a split-plot ANOVA.
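For readers unfamiliar with the layout, the following is a minimal sketch in S-compatible R of how such a split-plot ANOVA might be set up: each simulated data set is a whole plot, and the methods compared (the R factor) are the sub-plot treatments applied to that same data set. The data frame res and its error rates are placeholders, not the thesis's results.

    ## Hypothetical long-format results: one row per (data set, method) pair.
    set.seed(1)
    res <- expand.grid(R = c("LDA", "QDA", "CART", "FACT"),
                       f = c("normal", "lognormal"),
                       e = factor(1:3), rep = 1:5)
    res$dataset <- factor(with(res, paste(f, e, rep)))   # whole-plot id
    res$error <- runif(nrow(res), 0.1, 0.4)              # placeholder rates

    ## Split-plot ANOVA: R and its interactions are tested in the
    ## within-data-set stratum, the "method stratum" referred to above.
    fit <- aov(error ~ R * f * e + Error(dataset), data = res)
    summary(fit)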
The ANOVA showed that the R main effect (F = 111.8) was highly significant, as were all the R * f(.) interactions and a number of second order interactions, the two most important of which were the R * p * n (F = 8.43) and R * p * f(.) (F = 6.47) interactions. Figures 4.7a and 4.7b compare the n-fold cross-validation and apparent error rates over all combinations of dimension and sample size using equal priors. The two graphs suggest that there was a large difference between the R(CV) and R(A) error rates in CART for smaller samples, indicating the bias of the latter estimator in such situations; this bias was only minimal for LDA. Considering the cross-validation estimates only, CART did best when p = 2 and LDA when p = 6. Figures 4.8a and 4.8b illustrate the situation for the p * f(.) interaction. As in Figures 4.7a and 4.7b, there was a large discrepancy between the R(CV) and R(A) error rates for CART, which did not arise for LDA. Looking at the R(CV) estimates shows that LDA did best for normal data, and for lognormal data when p = 6, while CART had the lowest error rate when p = 2 and the data were lognormal, as distinct from the situation under PPSS, where CART did best for lognormal data no matter what the number of variables in the data set.

[Figures 4.5a and 4.5b: Comparison of the difference between cross-validation group misclassification error rates using LDA and CART with both equal priors and PPSS: (a) normally distributed data; (b) lognormally distributed data.]

[Figures 4.6a and 4.6b: Comparison of the ratio of cross-validation group misclassification error rates using LDA and CART with both equal priors and PPSS: (a) normally distributed data; (b) lognormally distributed data.]

[Figure 4.7a: Comparison of the cross-validation and apparent misclassification error rates using LDA and CART with equal priors: (a) dimension = two.]

[Figure 4.8a: Comparison of the cross-validation and apparent misclassification error rates using LDA and CART with equal priors and different dimension: (a) normally distributed data.]

For a linear (additive) approximation a(x1, x2) to the log-likelihood, evaluated at the four binary points, a(1, 1) = a(0, 1) + a(1, 0) − a(0, 0).
Since a(1, 1) > max{a(1, 0), a(0, 1)} implies a(0, 0) < min{a(1, 0), a(0, 1)}, it follows that if (1, 1) is assigned to Π1, then (0, 0) has to be assigned to Π2. This can lead to gross errors in misclassification. The problem lies in using a monotonically increasing function of x1 and x2 to approximate the log-likelihood, L(x), which is not monotonic. In the two-dimensional case the problem can be solved quite simply by including an interaction term in the model; when there are a large number of variables, however, this approach is infeasible. Others have shown that not only do different correlation structures in the two populations lead to these reversals, but so do moderate and large positive correlations. Krzanowski (1977) considered a mixture of both binary and continuous random variables, with various values of Pij used and rijk set to either 0 or 0.375 (no, or moderate, positive correlation). The results showed that LDA performed well when there was no correlation between the binary variables, but performed poorly if there was a moderate positive correlation among all the binary variables, or if the correlations between the binary and continuous variables differed markedly between the two groups. For both types of data, QDA has rarely been found to perform as well as LDA. Ganeshanandam and Krzanowski (1990) compared a number of different error rate estimators for LDA, as well as the n-fold cross-validation estimate for QDA, for multivariate binary data. Their simulation results showed that the Pij's and sample size all had significant effects on the estimation of the actual error rate, though the rijk factor did not.

5.3 SIMULATION STUDY I

5.3.1 Study Plan

The factors used in this study were the same as those employed in Ganeshanandam and Krzanowski (1990), in order to be able to check the results for LDA and QDA against theirs. The factor p had settings of 5 and 10, and a separate analysis was done for each of the two dimension levels. Factor n had three settings, "small, medium and large" relative to the number of variables: for p = 5, n = 20, 60 and 100, while for p = 10, n = 40, 120 and 200. In all cases π1 = π2 = 0.5, implying that class sample sizes were equal. When p = 5, factor rijk had two levels: all rijk = 0, and all rijk = 0.25. When p = 10, all rijk were set to zero for the first five variables and to 0.25 for the second block of five. The last factor was the values of the Pij's. The levels of the Pij's are shown in Table 5.1, with level 1 corresponding to wide separation between groups and increasing levels leading to narrower separation; level 5 corresponds to identically distributed populations. For p = 10, the Pij's for the first block of five variables were repeated for the second block of five. This gives 45 multinomial learning samples for p = 5 and 10 combined. Three replicates for each multinomial situation were conducted for p = 5 and six replicates for p = 10, giving 180 data sets in total (90 for each dimension size). Generation of the data was a straightforward process using MINITAB macros.

Table 5.1: Values of Pij for each set of five binary variables.

                        Π1                              Π2
Level    x1    x2    x3    x4    x5      x1    x2    x3    x4    x5
  1     0.20  0.20  0.20  0.20  0.20    0.80  0.80  0.80  0.80  0.80
  2     0.25  0.30  0.35  0.40  0.45    0.75  0.70  0.65  0.60  0.55
  3     0.40  0.45  0.50  0.55  0.60    0.60  0.55  0.50  0.45  0.40
  4     0.25  0.30  0.35  0.40  0.45    0.45  0.40  0.35  0.30  0.25
  5     0.30  0.40  0.50  0.60  0.70    0.30  0.40  0.50  0.60  0.70
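The thesis generated these samples with MINITAB macros; the following is a minimal sketch in S-compatible R of the uncorrelated case (rijk = 0), drawing independent binary variables with the success probabilities of one row of Table 5.1. The correlated case requires extra machinery and is not shown; all names here are invented for illustration.

    ## One learning sample: n/2 cases per class, five independent binary
    ## variables, probabilities from level 2 of Table 5.1.
    p1 <- c(0.25, 0.30, 0.35, 0.40, 0.45)   # class 1
    p2 <- c(0.75, 0.70, 0.65, 0.60, 0.55)   # class 2

    make.sample <- function(n, p1, p2) {
      half <- n / 2
      X1 <- sapply(p1, function(p) rbinom(half, 1, p))
      X2 <- sapply(p2, function(p) rbinom(half, 1, p))
      data.frame(rbind(X1, X2), class = factor(rep(1:2, each = half)))
    }

    set.seed(1)
    learn <- make.sample(60, p1, p2)   # a "medium" sample for p = 5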
The methods were compared by means of the n-fold cross-validation error rates, R(CV), except for FACT, where ten-fold cross-validation was used as outlined in Section 4.3. For both CART and FACT, the minimum size below which a node will not be split was set to five, while for CART alone the one standard error rule was used. A split-plot ANOVA was used to identify which factors lead to differences between the methods, as in Chapter 4. Tables of means and the standard deviations of the differences between means are presented for each significant effect.

5.3.2 Results

For the case p = 5, the results showed that the R (method) main effect (F = 10.19) was highly significant and dominated the variation in error; all interactions involving R had F-values less than 1.1. This means that there were differences between the methods when using p = 5 binary variables, and that these differences were not influenced by other factors. In both the analyses for p = 5 and p = 10, the plot of residuals against fitted values showed no real trends, in contrast with the results for the continuous data. The plots of residuals against each method showed roughly equal residual variances for each method. Individual ANOVAs were constructed for each method separately; these confirmed the finding of equal residual variances, so the results of the split-plot ANOVA are strictly valid. The average values for each method, when p = 5, are given in Table 5.2, along with the standard error of the difference between the methods. The results showed that CART was the best method, by some distance, from LDA, FACT and QDA.

Table 5.2: Means and standard error of the differences in means of the cross-validation error rate estimates for each method (p = 5).

        LDA     QDA     CART    FACT
        0.343   0.368   0.304   0.358

Standard error of the difference between means = 0.013.

In the case of p = 10, the split-plot ANOVA showed that the R (method) main effect (F = 20.78) was highly significant, though the R * Pij (method by probability pattern) interaction (F = 4.2) and R * n (method by sample size) interaction (F = 4.91) were also highly significant at the 0.1% level. This showed that for p = 10 there were not only differences between the methods, but that these differences were affected by sample size and, to a slightly lesser extent, by the pattern of probabilities in the parent populations. Table 5.3 shows that increasing sample size had the effect of reducing the error rates for all methods except CART, where sample size had no real effect. The greatest reduction in error rate occurred when going from n = 40 to n = 120 for LDA, QDA and FACT. CART did notably best when n = 40, and slightly better than LDA when n = 120; when n = 200, however, LDA did better than CART. FACT was a poor fourth except when n = 40, where, because of the small ratio of dimension to class sample sizes, QDA did worst.
Table 5.3: Means and standard error of the differences in means of the cross-validation error rate estimates for each classification method with respect to sample size (n) (p = 10).

Level      LDA     QDA     CART    FACT
n = 40     0.329   0.401   0.284   0.366
n = 120    0.289   0.307   0.287   0.345
n = 200    0.274   0.292   0.284   0.333

Standard error of the difference in means = 0.016.

Increasing the level of the probability patterns, Pij, effectively narrowing the distance between populations, had the effect of increasing the error rates for all methods, with the two tree-based methods being less affected than both LDA and QDA. Table 5.4 shows that LDA did best for Pij = 1 and was narrowly better than CART when Pij = 2; for all other levels of Pij, CART had the lowest average error rate. An explanation of why LDA did best for Pij = 1 is that this is an example of a parallel classification problem (see Section 4.3), in that all variables are equally important in determining the classification rules. Levels 2 to 4 of Pij are examples of a sequential classification problem, in that only a subset of the variables is ever used to determine the classification rules. Methods such as LDA and QDA are suited to the former type of problem, while CART is designed for the latter. The good performance of CART for less well separated populations mirrors what was observed for continuous data in Chapter 4. A noteworthy trend observed here was that the average error for CART in the case of identical populations (Pij = 5) was 0.43, so that CART was managing to build a tree from noise; all other methods for Pij = 5 produced error rates in the vicinity of 0.5.

Table 5.4: Means and standard error of the differences in means of the cross-validation error rate estimates for each classification method with respect to probability patterns (Pij) (p = 10).

Level      LDA     QDA     CART    FACT
Pij = 1    0.073   0.097   0.132   0.175
Pij = 2    0.177   0.230   0.189   0.249
Pij = 3    0.346   0.389   0.325   0.410
Pij = 4    0.365   0.426   0.349   0.415
Pij = 5    0.524   0.525   0.430   0.492

Standard error of the difference in means = 0.021.

5.3.3 Summary

For five-dimensional categorical data, it was found that there were differences in the cross-validated error rates of the four methods, but these differences were due only to the methods and not to any other factor such as sample size or probability patterns. CART was clearly the best method, followed by LDA. For ten-dimensional categorical data, the differences in error rates were due not only to method, but also to probability patterns and sample size. CART was found to be the least affected of all methods by changes in either sample size or the pattern of probabilities, and had the lowest error rate for smaller sized samples and poorly separated populations. These results echo very much what was observed for CART in Chapter 4 for continuous data. In all other situations, LDA did best. In accordance with other studies, QDA performed poorly except when sample size was large or the two populations were well separated. As for the continuous data, FACT was a poor fourth in almost all situations.

5.4 SIMULATION STUDY II

5.4.1 Introduction

Section 5.3 compared the accuracy of four classification methods in the setting of categorical data, specifically with p-variate binary data. This section compares the reliability of the cross-validation error rate estimates produced by three of the four methods, along with a comparison of different error rate estimators using both LDA and QDA. Further analyses for FACT were not done, due to the poor predictive capability exhibited by the method throughout the simulation studies.
5.4.2 Study Plan

The first analysis compared the n-fold cross-validation, R(CV), and apparent, R(A), estimators for both LDA and QDA, as well as the rotation, R(ROT), and 0.632, R(0.632), estimators for LDA alone; the latter was calculated as in (4.2.23). For both p = 5 and p = 10, when n was small, the R(ROT) estimates could not be calculated for QDA, as ni ≤ p for each class; therefore, the R(ROT) and R(0.632) estimators for QDA were not included in the analysis. The R(CV) estimators for LDA, QDA and CART were then compared to test their reliability in estimating the actual error rate, R(T), for each data set used in Section 5.3. Test samples of size 5000 were used throughout to calculate the values of R(T). A comparison of different error rate estimators for CART, using categorical data, will be undertaken in a later chapter.
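As a minimal sketch in S-compatible R of the first two estimators (using the hypothetical `learn` data frame generated in the earlier sketch), MASS::lda computes the leave-one-out (n-fold) predictions directly via its CV argument; the QDA analogue uses MASS::qda with the same argument.

    library(MASS)

    ## n-fold (leave-one-out) cross-validation error rate R(CV) for LDA.
    cv <- lda(class ~ ., data = learn, CV = TRUE)
    R.CV <- mean(cv$class != learn$class)

    ## Apparent error rate R(A): refit and reclassify the learning sample.
    fit <- lda(class ~ ., data = learn)
    R.A <- mean(predict(fit)$class != learn$class)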
5.4.3 Results

The results are presented in the same format as Section 5.3. Comparing the error rate estimators for LDA and QDA first: in the case of p = 5, the split-plot ANOVA showed that there was a definite difference between the error rate estimators, R (F = 34.76), and that those differences depended to a large extent on sample size, R * n (F = 7.64), and to a much lesser degree on the pattern of probabilities in the Πi's, R * Pij (F = 2.37). These results very closely follow those exhibited in Ganeshanandam and Krzanowski (1990), where the R * n effect was also more important than the R * Pij effect. Any effect involving rijk had very little effect on the estimation of R(T). For p = 10, it was observed that, in addition to the R, R * n and R * Pij effects being highly significant, the R * Pij * n interaction (F = 3.96) was also significant at the 0.1% level. Ganeshanandam and Krzanowski gave no indication of the importance of this second order interaction, but it must be taken into account in any analysis of means, implying that the error rate estimators for both LDA and QDA were influenced by sample size in conjunction with the pattern of probabilities. It is clear from Table 5.5 that the mean square errors (MSE's) for all estimators decreased as sample size increased; that is, the estimators were more precise for larger rather than smaller sized samples. The R(CV) estimator was the least affected by altering sample size, while the R(A) estimators were the most affected. The R(CV) estimator for LDA did best for the smallest sized samples, the R(ROT) estimator for n = 60, and the R(CV) estimator for QDA for the largest samples. These results were in relatively close agreement with those of Ganeshanandam and Krzanowski (1990), although their results showed sample size to have no effect on the MSE of the R(0.632) estimator. Note that the standard error of the difference was not calculated from either of the apparent estimators, because the variation in MSE's for these two estimators was four to five times larger than that for the other estimators.

Table 5.5: Means and standard error of the differences in means of the MSE's for each error rate estimator with respect to sample size (n) (values ×10⁻⁴, p = 5).

Level            n = 20   n = 60   n = 100
LDA, R(CV)       139      56       46
LDA, R(A)        443      107      41
LDA, R(ROT)      181      47       35
LDA, R(0.632)    200      49       25
QDA, R(CV)       216      71       23
QDA, R(A)        685      252      155

Standard error of the difference between means = 37.2.

Table 5.6 shows that the error rate estimators exhibited different behaviour over the various levels of Pij. The R(CV) estimator for LDA had its lowest MSE for Pij = 1 and its highest for Pij = 3, while the R(0.632) estimator had its lowest MSE for Pij = 3 and almost its highest at Pij = 1. R(CV) did best overall for Pij = 1, 4 and 5, the R(0.632) estimator for Pij = 2, and the R(ROT) estimator for Pij = 3. In the case of p = 5, these results fairly closely matched those of Ganeshanandam and Krzanowski (1990), where R(0.632) was found to perform best for less well separated populations.

Table 5.6: Means and standard error of the differences in means of the MSE's for each error rate estimator with respect to probability patterns (Pij) (values ×10⁻⁴, p = 5).

Level            Pij = 1  Pij = 2  Pij = 3  Pij = 4  Pij = 5
LDA, R(CV)       22       131      85       85       75
LDA, R(A)        74       184      214      268      245
LDA, R(ROT)      86       96       48       120      89
LDA, R(0.632)    118      74       63       118      84
QDA, R(CV)       56       114      109      117      120
QDA, R(A)        172      219      481      485      462

Standard error of the difference between means = 48.

Figures 5.1 to 5.3 show the MSE's for the different error rate estimators over the different levels of Pij, for sample sizes of 40, 120 and 200 respectively. The apparent estimators for both LDA and QDA are not shown, due to their exceedingly high MSE's in most cases. The general trend for the other estimators was an increase in MSE as sample size decreased, while decreasing the distance between populations generally increased the MSE, though there were some exceptions. The graphs show that the R(CV) estimator was the most consistent, and thus reliable, over all combinations of probability pattern and sample size. The R(CV) estimator for QDA was sometimes the most reliable estimator, but in other situations was the least reliable, especially for larger sample sizes. The R(0.632) estimator did best for poorly separated populations and n = 120 or 200, though not for populations which were the same (Pij = 5).

[Figures 5.1 to 5.3: Comparison of the MSE's of four error estimation methods for LDA and QDA for sample sizes of 40, 120 and 200 respectively (p = 10). In each figure: A = LDA, R(CV); B = LDA, R(ROT); C = LDA, R(0.632); D = QDA, R(CV).]

A comparison of the reliability of the n-fold cross-validation error rates for LDA, QDA and CART showed that, when p = 5, the R main effect was not significant; that is, there were no differences in the reliability of the R(CV) estimators between the three methods. For p = 10, the R main effect (F = 3.45) was the only significant result, and then only at the 5% level. Table 5.7 shows that the LDA and QDA R(CV) estimators were roughly equally reliable, while that for CART was comparatively high (roughly twice the magnitude of the LDA estimate).
This discrepancy was due mainly to the overoptimistic estimates of R(T) produced by cross-validation when Pij = 5.

Table 5.7: Means and standard error of the difference in means of the MSE's for the n-fold cross-validation error rate estimates for each classification method (values ×10⁻⁴, p = 10).

        LDA   QDA   CART
        43    48    82

Standard error of the difference between means = 17.

5.4.4 Summary

For five-dimensional categorical data, it was found when comparing the reliability of various error rate estimators for LDA and QDA that sample size and probability patterns for each population were important in determining differences between error rate estimators. The R(CV) estimator for LDA did best for either small samples or large differences between populations, while the R(ROT) and R(0.632) estimators were better for larger samples or smaller differences between populations. For ten-dimensional categorical data, it was found that the interaction of sample size and probability patterns was important in differentiating the estimators, while R(CV) was the most reliable estimator overall. In comparing the n-fold cross-validation estimators for LDA, QDA and CART, it was found that there were no differences present in the five-dimensional case. In the ten-dimensional case, however, LDA and QDA had the most reliable estimates, with CART some distance behind.

5.5 CONCLUSIONS

In this chapter, four classification methods were compared in the setting of p-variate categorical data. In the case of five-dimensional categorical data, the only significant effect was the overall difference between the methods, where CART was found to be the best method. For the ten-dimensional data, the results followed a very similar pattern to those for the continuous data, in that LDA did best when every variable counted an equal amount towards the classification rules, or where there was fairly good separation between groups. CART, on the other hand, did better for less well separated groups, or where only a few variables were important to the creation of the classification rules. As well, CART did better for smaller samples and LDA for larger samples. CART thus appears particularly useful for categorical data. A comparison of different error rate estimators for LDA and QDA showed that the n-fold cross-validation estimator for LDA was the better estimator of the actual error rate. A comparison of the n-fold cross-validation error rate estimators for LDA, QDA and CART showed that the CART estimator was the least reliable of the three. This last finding led to a comparison of the various error rate estimators for CART using categorical data, the results of which are reported in Chapter 6, along with a comparison of the various estimators for continuous data. Another interesting finding was that there was only a minor difference between the third and fourth levels of the probability pattern variable (see Table 5.1). Ganeshanandam and Krzanowski (1990) reported that the error rates for the third level were twelve to seventeen points higher than those for the fourth level; the results here have shown that difference to be only a few points. It is possible that Ganeshanandam and Krzanowski (1990) actually used different probability patterns than those stated, because, theoretically, the error rates for the third level should be closer to those in level 4 rather than level 2, as occurred in their studies.

6. CART SIMULATION STUDY
6.1 INTRODUCTION

In Section 4.6, an investigation was carried out into the performance of various techniques for determining tree size and estimating the actual error rate in CART. Recent studies have suggested, however, that the range of Bayes error rates directly affects the performance of the error rate estimators, especially the 0.632 estimator for CART in the case of continuous data. The first objective of this chapter was to compare the various error rate estimators over a wider range of Bayes error rates, and with reduced sample sizes, compared with those studied in Chapter 4. Numerous studies (Efron, 1983, Gong, 1986 and Crawford, 1989, for instance) have shown that sample size is a crucial factor in determining the performance of an error rate estimator. A second objective was to compare the various error rate estimation techniques for CART over the categorical data sets used in Chapter 5, to determine whether patterns similar to those observed for continuous data could be seen. Thirdly, based on the comments of Feng et al (1993), a comparison between the zero and one standard error rules for selecting the right sized tree was carried out, in order to determine in which situations, if any, one should or should not use the one standard error rule. A final objective was brought about by a study of Fitzmaurice et al (1991), who affirmed that the untransformed error rate scale, bounded by 0 and 1, may not be appropriate for the comparison of different methods. Thus, a number of transformations were carried out on the error rates and the effects of the transformations analysed.

6.2 ERROR RATE ESTIMATION FOR CONTINUOUS DATA IN CART

6.2.1 Previous Studies

This study was motivated by the work of Breiman et al (1984) and Crawford (1989). Breiman et al, in Section 11.7, suggest on the basis of tests on both real and simulated data sets that the bootstrap error rate estimate had lower variance than the cross-validated error rate estimate, but was highly overoptimistic when compared with those based on g-fold cross-validation. When the learning sample is large, they state that the bias effect dominates the variance, so that the g-fold cross-validation estimator is superior to bootstrapping. Breiman et al (1984) suggested that perhaps a modified bootstrap estimator could be used both to determine optimal tree size and to provide an estimate of the actual error rate. Crawford (1989) tested these assertions by comparing the performance of cross-validation, the bootstrap and the 0.632 bootstrap, using R(0.632) = 0.368 R(A) + 0.632 R(E). He found for small data sets (n = 20) that the 0.632 estimator, R(0.632), was the best in terms of having the lowest mean square error (MSE), while for larger samples (n = 100), the cross-validation estimator, R(CV), was best for high values of R(B) and R(0.632) for low values of R(B). Crawford suggested the use of a combined strategy whereby n-fold cross-validation was used to select the right sized tree and R(0.632) to estimate R(T) on the selected tree; he concluded that this combined approach minimised the chance of poor performance when faced with either a high or low value of R(B). In Section 11.5, Breiman et al, p 85, affirm that "... we have not come across any situations where taking [g] larger than 10 gave a significant improvement in accuracy for the tree selected." They suggest that ten-fold cross-validation gives adequate accuracy in most situations, and indeed, this is the default value used within the CART program. As yet, no results have appeared in the literature validating these assertions.
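To fix ideas, here is a minimal sketch in S-compatible R of Crawford's 0.632 calculation for a classification tree, using R's tree package and the iris data as stand-ins (the thesis used the CART program itself); B = 50 resamples is an arbitrary choice.

    library(tree)

    ## Apparent error rate R(A) on the learning sample.
    fit <- tree(Species ~ ., data = iris)
    RA <- mean(predict(fit, iris, type = "class") != iris$Species)

    ## Bootstrap error R(E): average error on the cases left out of
    ## each bootstrap resample ("out-of-bag" cases).
    set.seed(1)
    B <- 50
    RE <- mean(replicate(B, {
      idx <- sample(nrow(iris), replace = TRUE)
      oob <- setdiff(seq_len(nrow(iris)), unique(idx))
      bfit <- tree(Species ~ ., data = iris[idx, ])
      mean(predict(bfit, iris[oob, ], type = "class") != iris$Species[oob])
    }))

    R632 <- 0.368 * RA + 0.632 * RE

Because R(A) is optimistic and R(E) pessimistic, the weighted combination tends to land near the actual error rate, which is the property the comparisons in this chapter put to the test.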
6.2.2 Study Plan

The aim of this study was to use CART to compare the performance of the n-fold and ten-fold cross-validation, rotation and 0.632 estimators, along with the associated apparent error rates, in approximating the actual error rate, R(T), of the sample. As in Section 4.6, R(T) was found by using an independent test sample of size 5000. The objective was to expand on the work of Breiman et al and Crawford (1989), in order to find out which method was best at selecting the most "honest" sized tree. The error rate estimators were as used in Section 4.6. The 0.632 estimator, using R(ROT) in the equation instead of R(E), provides a simple alternative to the combined strategy proposed by Crawford (1989). His proposal involved a double calculation, and hence a large increase in processing time, in that n-fold cross-validation was needed to select the right-sized tree and then B bootstrap samples had to be generated in order to estimate R(T); this version uses the rotation method to select the right-sized tree and then uses that error rate in the equation for R(0.632). In addition, a comparison of the sizes of the trees produced by each method was made, to determine whether there were any differences between methods.

The data were generated from two multivariate normal populations, as it is known that CART is invariant under monotone transformations of the variables. Three factors were varied in a full factorial design: R(B), the Bayes error rate; n, the sample size, with π1 = π2 in all cases; and q, a combination of dimension (p), means and correlations between variables. The values of the first and second factors were:

(i) R(B) = 0.05, 0.15, 0.25 and 0.35;
(ii) n = 20, 100.

The third factor, q, had levels whereby μ1j = 0 for all j, where μ1j is the mean of the first population for variable j and μ2j is the mean of the second population for variable j:

1: p = 2, μ21 = μ22, ρ = 0
2: p = 2, μ21 = μ22, ρ = 0.5
3: p = 4, whereby μ21 = 2μ22 = 6μ23, μ24 = 0, and

        P1 = |  1.00  -0.50   0      0    |      P2 = |  1.00   0.75   0      0    |
             | -0.50   1.00   0      0    |           |  0.75   1.00   0      0    |
             |  0      0      1.00   0    |           |  0      0      1.00   0    |
             |  0      0      0      1.00 |           |  0      0      0      1.00 |

where Pi = [(ρijk)] is the population correlation matrix for Πi. The values for the first and second factors were similar to those used by Crawford (1989) and Fitzmaurice (1991), except that no studies were done for R(B) = 0.45, where, as noted by Fitzmaurice et al, any classification rule producing an error rate in the region of 0.45 would probably not be widely used. The levels of the third factor were chosen following some conclusions by Quinlan (1993) about parallel and sequential classification problems, summarised in Chapter 4. In this study, therefore, the first two levels of factor q correspond to situations which are less favourable to CART, while the third level corresponds to a situation more favourable to CART. Each of the 24 factor combinations was used for 25 simulations. This number of simulations was chosen so as to adequately depict the true trends: the effects of a few 'bad' samples will be minimised by the large number of 'good' samples.
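Under this design, each simulated learning sample can be drawn directly; the following is a minimal sketch in S-compatible R for level q = 2 (p = 2, ρ = 0.5), using MASS::mvrnorm. The mean shift of 2 in each coordinate is an illustrative assumption; the thesis chose the shift so as to achieve each target Bayes error rate R(B).

    library(MASS)
    set.seed(1)

    ## One learning sample: equal priors, bivariate normal populations
    ## differing only in mean, common correlation 0.5.
    Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
    n <- 20
    X <- rbind(mvrnorm(n / 2, mu = c(0, 0), Sigma = Sigma),
               mvrnorm(n / 2, mu = c(2, 2), Sigma = Sigma))
    class <- factor(rep(1:2, each = n / 2))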
Four criteria of performance were used to compare the various error rate estimators. The first was the bias of each technique in estimating R(T), where

    bias = R(t) − R(T),    t = CV, A, ROT, 0.632 or TEN;

the second was the standard deviation of the bias; and the third was the MSE, where

    MSE = E[(R(t) − R(T))²].

A fourth measure used was the COUNT criterion, corresponding to the proportion of samples for each factor combination in which the estimated error rate was less than the actual error rate; it is therefore a measure of the optimism involved in using each estimator. A large value for COUNT, say greater than 75%, would correspond to an overoptimistic estimation, whereas a low value, say less than 25%, would correspond to a pessimistic (under-optimistic) estimation. For all the data sets in this section, the zero standard error rule was employed, while the size below which a node will not be split was set to five. Independent test samples of size 5000 were used throughout to determine R(T).
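The four criteria translate directly into code. A minimal sketch in S-compatible R, where est and RT are hypothetical vectors holding, for one factor combination, the estimated and actual error rates across the 25 simulated samples:

    ## bias, standard deviation of the bias, MSE and COUNT for one
    ## estimator over one factor combination.
    criteria <- function(est, RT) {
      bias <- est - RT
      c(bias  = mean(bias),
        sd    = sd(bias),
        MSE   = mean(bias^2),
        COUNT = mean(est < RT))   # proportion of optimistic samples
    }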
6.2.3 Results

As in Chapters 4 and 5, a split-plot ANOVA was used to assess the relative importance of the experimental factors in influencing the MSE's for the various error estimation techniques. A large number of replicates were carried out in this study, in contrast to Chapters 4 and 5, hence the F-ratio from the ANOVA should not be regarded as a true measure of the significance of each result. The R (method) main effect (F = 87.68) and the R * R(B) (method by Bayes error) interaction (F = 20.18) dominated the other R * factor interactions. All other effects were very small, though, rather surprisingly, the R * R(B) * n * q interaction was the next largest (F = 2.57). Therefore, it was decided to compare the four error estimation techniques over all possible combinations of R(B), n and q. The results for the average bias, standard deviation, MSE's and COUNT's are presented graphically in Figures 6.1 to 6.24. Only the R(CV), R(ROT), R(0.632) and R(TEN) estimators are shown, as the values for the respective apparent estimators were on most occasions highly overoptimistic, leading to extreme values of the above four measures.

Figures 6.1 to 6.6 show the average bias values for each estimator over the ranges of the factors used. The results show that for almost all values of R(B), when n = 20 and q = 1 and 2, R(0.632) had the lowest bias; all other methods were markedly overly pessimistic in the estimation of R(T). The exception to the rule was when q = 2 and R(B) = 0.15. When q = 3, however, a different picture emerged: the R(CV), R(0.632) and R(TEN) estimators all had similar bias, with this bias increasing pessimistically as R(B) increased, while the R(ROT) estimator was consistently more pessimistic than the other three estimators. For larger samples (n = 100), it is noticeable that the performance of the R(0.632) estimator deteriorated as R(B) increased, in that the bias became overly optimistic. The R(CV) estimator was usually the least biased for R(B) = 0.05 and 0.15, but deteriorated for higher R(B); in those situations, R(TEN) produced the lowest errors. For highly separated populations (R(B) = 0.05), the R(0.632) estimate was comparable to or better than the R(CV) estimate. As with the smaller samples, the R(ROT) estimator was consistently pessimistic.

Turning to the standard deviations of the estimates, it can be seen for smaller samples (Figures 6.7 to 6.9) that, for q = 1 and 3, all estimators exhibit a distinctive inverted-U shaped pattern, in that the lowest standard deviations occurred for either the lowest or the highest R(B). For q = 2, the trends were different for all estimators. In terms of performance, the R(0.632) estimator had the lowest variability when R(B) = 0.05 and 0.15, though the highest variability when R(B) = 0.25 and 0.35. The large variability of the R(CV) and R(TEN) estimators is clearly evident.

[Figures 6.1 to 6.3: Comparison of the bias of four error estimation methods for CART with sample sizes of 20 and q = 1, 2 and 3 respectively. In all the figures of this chapter: A = R(CV); B = R(ROT); C = R(0.632); D = R(TEN).]

[Figures 6.4 to 6.6: Comparison of the bias of four error estimation methods for CART with sample sizes of 100 and q = 1, 2 and 3 respectively.]

[Figures 6.7 to 6.9: Comparison of the standard deviations of four error estimation methods for CART with sample sizes of 20 and q = 1, 2 and 3 respectively.]

For larger samples (Figures 6.10 to 6.12), a more linear trend is apparent for all estimators, in that variability increased as R(B) increased.
For q = 1 and 2, the R(0.632) estimator was often the least variable estimator for higher R(B), with roughly similar variability to the cross-validation estimators for lower R(B); for q = 3, it was the best estimator for low R(B) and the worst for high R(B).

In the next analysis, the bias and variation of the estimators were combined into the MSE criterion. For small samples (Figures 6.13 to 6.15), it is clear that for q = 1 and 2 the R(0.632) estimator did best except when R(B) = 0.35, and that the best results occurred at low R(B) and the worst at moderate R(B). It should be noted, though, that the R(0.632) estimator was the least affected of all estimators by changes in the values of R(B). In accordance with the results of Crawford (1989), the R(CV) estimate had high MSE, due mostly to its large variability, as shown in Figures 6.7 and 6.8. For q = 3, a slightly different picture emerged: all estimators had roughly similar MSE except R(ROT) when R(B) <= 0.25. In the case of n = 100, similar trends can be seen in all three graphs (Figures 6.16 to 6.18). Generally, the performance of each estimator deteriorated as R(B) increased. The R(CV) and R(TEN) estimates performed very much the same. The R(0.632) estimator did best overall for q = 1 and 2, while the R(CV) and R(TEN) estimators had the lowest MSE for q = 3.

Figures 6.19 to 6.24 provide another measure of performance, giving the values of the COUNT's of optimism for each estimator. Values closest to 0.5 were the most ideal. For small samples (Figures 6.19 to 6.21), the trends exhibited are very similar to those exhibited for bias. For q = 1 and 2, the R(0.632) estimator produced the most unbiased estimates of error while the other estimators were highly pessimistic. For q = 3, all methods except R(ROT) had similar COUNT's. For R(B) = 0.05 and 0.15, these estimates were around 0.5, but for R(B) = 0.25 and 0.35 the estimates were highly pessimistic. For larger samples (Figures 6.22 to 6.24), it is clear that for q = 1 and 2 the R(0.632) estimator was consistently optimistic. For q = 3, this also occurred when R(B) >= 0.15, but for R(B) = 0.05 the proportion of samples where R(T) was either over or underestimated was roughly 0.5. For q = 1, R(CV) did best, while for q = 2, R(TEN) did best with R(CV) tending to be rather optimistic. For q = 3, both R(CV) and R(TEN) performed equally well. The overly pessimistic nature of the R(ROT) estimator is reinforced by these results; in other words, R(ROT) was much higher, on average, than the actual error rate, R(T).

[Figures 6.10 to 6.12: Comparisons of the standard deviations of the four error estimation methods for CART with sample sizes of 100 and q = 1, 2, 3.]
[Figure pages: Figures 6.13 onwards, comparing the MSE's of the four error estimation methods for CART (sample sizes of 20 and 100, q = 1, 2, 3) and the bias and MSE's of the four methods for the categorical data over the probability pattern levels (p = 5, various sample sizes). Figure 6.31: Comparison of the bias of four error estimation methods for CART for different probability patterns, p = 10. Figure 6.32: Comparison of the MSE's of four error estimation methods for CART for different probability patterns, p = 10.]

Figure 6.32 shows that in terms of MSE, R(0.632) was clearly best for Pij = 1 and 2, while R(ROT) was the most reliable for Pij = 3 and 4. R(CV) was found to be the least reliable estimator in all situations. Interestingly, these results mirror those found for continuous data, in that R(0.632) was shown to be the best estimator for parallel classification problems, which were not suited to CART, as well as for the best separated populations (Pij = 1 and 2). Note, though, that the reliability of the R(0.632) estimate exhibited a distinctive U shaped pattern, doing best for either well-separated or identical populations. Figure 6.33 shows that R(0.632) was the least affected by changing sample size and was the least biased for small samples.
The high optimistic bias of R(CV) for small samples is clearly evident, and it remained optimistic for larger samples. In terms of MSE, Figure 6.34 shows that R(0.632) did best for n = 40. For n = 120, R(0.632) was slightly better than R(ROT), while for n = 200, R(0.632), R(ROT) and R(TEN) did equally well. The R(CV) estimator performed consistently the worst for all sample sizes.

A final analysis in this section compared the sizes of the trees resulting from each of the three error estimation methods for both p = 5 and p = 10 binary variables. For p = 5 variables, the split-plot ANOVA showed there to be a difference between methods (F = 7.69), with R(ROT) producing the smallest sized trees (2.03 terminal nodes) and R(CV) the largest (2.64 terminal nodes). For p = 10, again the method effect (F = 7.7) dominated all others, with the R(ROT) estimator producing the smallest sized trees (2.9 terminal nodes) on average. The average sized trees produced by R(CV) and R(TEN) contained 5.07 and 4.44 terminal nodes respectively, implying that the rotation method produced trees with 1.5 fewer terminal nodes than any other method.

[Figure 6.33: Comparison of the bias of four error estimation methods for CART for different sample sizes, p = 10. Figure 6.34: Comparison of the MSE's of four error estimation methods for CART for different sample sizes, p = 10.]

6.3.3 Summary

This study has shown that for categorical data, the R(0.632) and R(ROT) estimators produced the most reliable classification trees as well as the simplest. It was found that, as for the continuous data in Section 6.2, the R(0.632) estimator was the most reliable for small samples, parallel classification problems and well separated populations. As with the continuous data, and in agreement with other studies, the R(CV) estimator was found to be a poor estimator for small samples and highly optimistic, especially for poorly separated populations. Based on the results of this study, the use of n-fold cross-validation with categorical data is not recommended.

6.4 THE STANDARD ERROR RULE IN CART

6.4.1 Previous Studies

The motivation for doing this study was provided by Breiman et al (1984). They recommended the use of the one standard error (1-SE) rule, firstly, for the sake of accuracy, noting that in most cases the cross-validation estimate of error was overoptimistic. Secondly, they stated that a plot of R(CV) against tree size had the characteristics of an initial sharp decrease followed by a long, flat valley across a wide range of tree sizes and then an increase for very small trees. Inside the long valley, most error rates were found to be within the +/- 1-SE range, and the position of the minimum may be unstable. The 1-SE rule was used to reduce that instability as well as to produce trees which are as simple as possible.
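The rule itself is mechanical enough to express directly. Below is a minimal sketch, not tied to any particular CART implementation: given the cross-validated error and its standard error for each subtree in the pruning sequence, the 0-SE rule takes the subtree with the minimum error, while the 1-SE rule takes the smallest subtree whose error is within one standard error of that minimum. The example pruning sequence is hypothetical.

```python
def select_tree(sizes, cv_errors, std_errors, k=1.0):
    """Pick a subtree from a pruning sequence.

    sizes      : number of terminal nodes for each candidate subtree
    cv_errors  : cross-validated error rate for each subtree
    std_errors : standard error of each cross-validated error rate
    k          : 0.0 gives the 0-SE rule, 1.0 the 1-SE rule
    """
    # Locate the minimum cross-validated error and its standard error.
    i_min = min(range(len(cv_errors)), key=lambda i: cv_errors[i])
    threshold = cv_errors[i_min] + k * std_errors[i_min]
    # Take the smallest tree whose error does not exceed the threshold.
    candidates = [i for i in range(len(sizes)) if cv_errors[i] <= threshold]
    return min(candidates, key=lambda i: sizes[i])

# Hypothetical pruning sequence: a long flat valley, then a rise for tiny trees.
sizes = [12, 9, 7, 5, 3, 2, 1]
cv_errors = [0.21, 0.20, 0.19, 0.20, 0.21, 0.27, 0.40]
std_errors = [0.03] * 7

print(sizes[select_tree(sizes, cv_errors, std_errors, k=0.0)])  # 0-SE: 7 nodes
print(sizes[select_tree(sizes, cv_errors, std_errors, k=1.0)])  # 1-SE: 3 nodes
```

The example illustrates the point made above: the 0-SE rule lands at the unstable minimum of the valley, while the 1-SE rule trades a nominally higher error for a much simpler tree.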
Feng et al (1993) carried out a small-scale empirical study comparing the zero standard error (0-SE) and 1-SE rules using various data sets. They found that trees produced by the 0-SE rule were between two and ten times larger than those constructed using the 1-SE rule, so that the latter were biased towards simplicity. In determining which rule was better they were rather inconclusive: "We believe that there is no single best rule, instead it depends on how much "noise" there is in the data. If there is little noise in the data, then the 0-SE rule should be used. If there is a lot of noise then ... the 1-SE rule should be used." (ibid, p 49.)

6.4.2 Study Plan

In Section 6.2, the performance of various error rate estimators was compared in estimating the actual error rate using the 0-SE rule. In this section, only two estimators, R(CV) and R(0.632), were used, one or other of which had worked best in every factor combination studied in Section 6.2. The two error estimation methods were compared over the factor combinations studied in Section 6.2, involving the Bayes error rate, R(B), sample size, n, and the third factor q, using both the 0-SE and 1-SE rules, with the objective of determining in which situations either of the two rules should be used. Comparisons were made using both the bias and MSE performance criteria. In addition, a comparison of the decrease in tree size produced by using the 1-SE rather than the 0-SE rule was made for q = 3 only.

6.4.3 Results

Figures 6.35 to 6.37 compare the average bias for n = 20. For q = 1 and 2, it is clear that both the R(CV) and R(0.632) estimators using the 0-SE rule were less biased than the corresponding estimates using the 1-SE rule, except for R(B) = 0.05. For q = 3, there was little to choose between the use of the 0-SE or 1-SE rules, except for the R(0.632) estimate at larger R(B), which was excessively pessimistic. Note too that the two R(CV) estimators exhibited very similar trends as functions of R(B) while the two R(0.632) estimates did not. For q = 1 and 2, the average bias decreased as R(B) increased using the 0-SE rule, while the bias increased as R(B) increased using the 1-SE rule.

Figures 6.38 to 6.40 illustrate the cases of n = 100. For q = 1, 2 and 3, the estimates using the 1-SE rule were generally less biased than those using the 0-SE rule, with the latter tending to be overoptimistic. For q = 3, the disparity in bias between the 0-SE and 1-SE rule estimates was less marked for low R(B) than for high R(B). As with n = 20, the R(CV) estimates followed similar patterns while the R(0.632) estimates behaved rather differently.

In terms of MSE, Figures 6.41 to 6.43 illustrate the cases of n = 20. The trends shown are very similar to those for bias. For q = 1 and 2, the 0-SE estimates produced the lowest MSE except when R(B) = 0.05. For q = 3, the difference between methods was marginal.
[Figures 6.35 to 6.43: Comparisons of the bias (for sample sizes of 20 and 100) and the MSE's (for sample sizes of 20) of the two error estimation methods for CART with and without the one standard error rule, q = 1, 2, 3.]

For larger samples, Figures 6.44 to 6.46 show trends slightly different from those for bias. For q = 1 and 2, one of the 0-SE estimates was lowest in terms of MSE for lower R(B), with R(0.632) using the 1-SE rule having a very high MSE, while for higher R(B) (R(B) >= 0.25) the 1-SE estimates were more reliable. For q = 3, the 1-SE estimates were equivalent to those using the 0-SE rule for R(B) = 0.05 and better than the 0-SE estimates for R(B) = 0.15 and 0.25. For R(B) = 0.35, the 0-SE estimates performed best.
These results would tend to suggest that sample size plays an important part in determining the choice between the 0-SE and 1-SE rules for CART, as well as whether or not there is any noise in the data. Based on these results, the recommendation is to use the 0-SE rule with very small data sets for parallel classification problems involving little or no noise, while for sequential classification problems with some noise in the data, the 1-SE rule is preferred on the grounds of simplicity. For larger samples, the 0-SE rule should be used for well separated populations in cases involving parallel classification problems. For sequential classification problems, the 1-SE rule should be used unless the populations are not well separated.

Comparing the tree sizes obtained by using the 1-SE rule instead of the 0-SE rule showed that overall, for both methods, n had a very large effect (F = 12.94) compared with all other effects and interactions. This implies that sample size was a major factor in determining whether or not tree size decreased with the use of the 1-SE rule. In fact, the overall decrease in tree size was one terminal node greater for large n than for small n. Investigation of the probability model stratum of the ANOVA showed there to be no significant method effect or method by factor interactions; therefore the decrease in tree size resulting from using the 1-SE rule was no different for either of the two error estimation methods. On the evidence here it would appear that tree size was little affected by using the 1-SE rule instead of the 0-SE rule, certainly less than suggested by Feng et al (1993), though for larger samples with more variables the decrease may be much greater.

[Figures 6.44 to 6.46: Comparisons of the MSE's of the two error estimation methods for CART with and without the one standard error rule, for sample sizes of 100 and q = 1, 2, 3.]

6.4.4 Summary

The results from this particular study have indicated that when using either the R(CV) or R(0.632) estimates both to select tree size and to calculate an "honest" estimate of R(T), the 1-SE rule should not be used for either small samples or when there is little noise in the data, unless the populations are not well separated.
For situations where there exists a lot of noise in the data, the 1-SE rule is preferred unless the populations are not well separated.

6.5 TRANSFORMATIONS OF ERROR RATES

6.5.1 Study Plan

The previous sections have dealt with the analyses of untransformed error rates, so a difference of 0.1 from R(T) = 0.05 was treated the same as a difference of 0.1 from R(T) = 0.35. This seems a somewhat unfair and inappropriate comparison, as suggested by Fitzmaurice et al (1991), in that the former difference should receive more weight than the latter (see Section 4.5.3). In this section, two transformations of the error rates were tried in an attempt to right this imbalance, namely the logit and proportion transformations. For the logit transformation, R(T) was replaced by LR(T), where

LR(T) = ln[R(T) / (1 - R(T))]

and its estimate, R̂(T), by

LR̂(T) = ln[R̂(T) / (1 - R̂(T))],

while for the proportion transformation, R(T) was replaced by PR(T), where

PR(T) = (R(T) - R(T)) / R(T) = 0

and its estimate, R̂(T), by

PR̂(T) = (R̂(T) - R(T)) / R(T).

For example, if R(T) = 0.05 and R̂(T) = 0.2,

LR̂(T) - LR(T) = ln[0.2 / (1 - 0.2)] - ln[0.05 / (1 - 0.05)] = 1.558
PR̂(T) - PR(T) = (0.2 - 0.05) / 0.05 = 3,

while if R(T) = 0.35 and R̂(T) = 0.5,

LR̂(T) - LR(T) = ln[0.5/0.5] - ln[0.35/0.65] = 0.619
PR̂(T) - PR(T) = (0.5 - 0.35) / 0.35 = 0.429.

From these results it can be seen that the proportion transformation has the greater effect on the error rates, as the ratio of the magnitudes of the differences between (0.2 - 0.05) and (0.5 - 0.35) is 7 for the proportion transformation and 2.517 for the logit transformation. The two transformations were applied to the R(CV), R(ROT), R(0.632) and R(TEN) error rate estimates calculated in Section 6.2, with the intention of determining what differences, if any, appeared in the MSE's for the four estimators.

6.5.2 Results

As with Fitzmaurice et al (1991), the two transformations had very similar effects on the patterns of MSE's; therefore, only the results for the proportion transformation are demonstrated here. The results for n = 20 appear in Figures 6.47 to 6.49 and differ from the untransformed results, given in Figures 6.13 to 6.15, in a number of respects. Firstly, all estimators now exhibit the general trend of an initial sharp decrease in MSE going from R(B) = 0.05 to 0.15, then a gradual decrease from R(B) = 0.15 to 0.35. Note, though, that the MSE's for the R(0.632) estimator were least affected by changes in the values of R(B). For q = 1 and 2, the differences between the R(0.632) and the other estimators for R(B) = 0.05 were accentuated. As with the untransformed scale, the R(0.632) estimator did best when q = 1 and 2, except for high R(B), while no single estimator was best for q = 3.

For larger samples, a slightly different trend from that seen with smaller samples is highlighted in Figures 6.50 to 6.52, with the initial decrease in MSE for all estimators except R(ROT) being not as large as that for small samples, and with the MSE increasing for R(B) = 0.35. As with smaller samples, though, the performance of each estimator is clearly defined for low R(B). R(0.632) did best for moderate R(B) in the cases of q = 1 and 2, and for high R(B) for q = 3.
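As a quick numerical check of the two transformations defined in the study plan above, the short sketch below reproduces the worked differences (1.558 and 3 for the well-separated case, 0.619 and 0.429 for the poorly separated case). It is purely illustrative and assumes nothing beyond the formulas already given.

```python
import math

def logit_diff(est, actual):
    """LR̂(T) - LR(T) on the logit scale."""
    logit = lambda r: math.log(r / (1 - r))
    return logit(est) - logit(actual)

def proportion_diff(est, actual):
    """PR̂(T) - PR(T); PR(T) itself is zero by construction."""
    return (est - actual) / actual

for actual, est in [(0.05, 0.2), (0.35, 0.5)]:
    print(f"R(T) = {actual}, estimate = {est}: "
          f"logit diff = {logit_diff(est, actual):.3f}, "
          f"proportion diff = {proportion_diff(est, actual):.3f}")
# Prints 1.558 and 3.000 for the first pair, 0.619 and 0.429 for the second,
# matching the worked example in the study plan.
```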
6.5.3 Summary

The results reported here were very similar to those given by Fitzmaurice et al (1991) using LDA. A comparison of the MSE's of the four error estimation methods was not greatly affected by the transformations. However, as recorded by Fitzmaurice et al (1991), the methods now performed best for high R(B) and worst for low R(B), in contrast with the untransformed results, where for small samples the MSE was highest for moderate R(B) while for larger samples it was largest for high R(B).

[Figures 6.47 to 6.52: Comparisons of the transformed MSE's of the four error estimation methods for CART with sample sizes of 20 and 100 and q = 1, 2, 3, using the proportion transformation.]

6.6 CONCLUSIONS

In this chapter, simulation study results have shown that the R(0.632) method for estimating the actual error rate when using CART performed well in most situations for both continuous and categorical data. For continuous data, R(0.632) clearly had the lowest MSE for smaller samples and parallel classification problems, and marginally lower MSE for smaller samples and sequential classification problems as well as for larger samples and parallel classification problems. Only when the classification problem was sequential and the distance between populations was moderate to large did other techniques outperform R(0.632). For categorical data, most of the trends noted above were also observed. The R(CV) estimator, as for continuous data, was found to be a poor estimator for small samples and highly optimistic for poorly separated populations. For both continuous and categorical data, the rotation estimator, R(ROT), was found to produce the smallest sized trees.

In studies comparing the use of the zero and one standard error rules in CART, it was found that the one standard error rule should not be used either for small samples or when there is little noise in the data, unless the populations are poorly separated. In all other situations, the one standard error rule is the preferred method. Finally, a transformation of the error rate scale produced results which were not unexpected.
For large differences between populations, that is, low Bayes error rates, the differences between error estimation techniques were accentuated compared with the case of untransformed error rates. Therefore, the technique of using the rotation method to select tree size and then using the 0.632 method to estimate the actual error rate of a data set is recommended as a quick, easy and reliable technique when used with CART decision trees. However, as mentioned by Crawford (1989), the user should not be constrained to using one method to select the right-sized tree, but should instead, with a mixture of common sense and prior knowledge of the domain, make a sensible tree selection.

7. CASE STUDIES

7.1 INTRODUCTION

In this chapter, a number of the classification methods outlined in Chapters 2 and 3 are applied to 24 real-world data sets and compared by means of a number of criteria to be outlined later in this chapter. These data sets are used to either validate or not some of the conclusions reached after the simulation studies undertaken in Chapters 4, 5 and 6. Later in this chapter, a comparison of various tree-based methods is made for one particular data set.

7.2 PREVIOUS STUDIES

Ildiko and Lanteri (1989) compared LDA, QDA, SIMCA (a form of QDA) and CART on four data sets selected from various fields of chemistry. They concluded that no overall method was superior in terms of prediction error. They also recommended that the type of data structure involved should be explored before choosing the optimal rule for that particular type of data. If in doubt, several different methods should be used and compared.

Lynn and Brook (1991) undertook an empirical study comparing the performance of traditional discrimination methods with CART on twelve predominantly multivariate normal classification problems, differing in sample size, dimension and modality. Subsequently, it was found that for only three of the data sets was the assumption of equality of variances valid; hence, it was decided to use only LDA to compare with CART. For the other nine data sets investigated, both QDA and kernel density estimation were carried out. In all cases, comparisons were made by means of n-fold cross-validation. The findings of this paper suggested that CART does not perform as well as discriminant analysis in cases where the data set is small and/or simple, but does perform at least as well as discriminant analysis in most cases where the data set is larger and/or complex (multi-modal, non-normally distributed and/or high-dimensional), especially where the covariance structure is heterogeneous.

Feng et al (1993) reported a number of papers from the literature which have compared various classification methods. However, they noted a number of problems common to the papers referred to, such as applying different methods to data sets which were not the same, and using old versions of some methods while using the latest versions of others. In their study, Feng et al, as mentioned earlier, compared a large number of classification methods for eight data sets involving industrial applications. Generally, the data sets were of much larger sample size and dimension than those used by Lynn and Brook (1991). "In conclusion, it seems that there is no one particular algorithm or one particular method superior to the others on all the data sets. There is indication from our results that which algorithm performs best depends on the characteristics of the data sets.
Our work is, however, incomplete in the analysis of such dependent relationships." (Feng et al, 1993, p 51.)

Brown et al (1993) compared CART with neural networks, a method which uses multiple layers of processors, each producing a weighted non-linear function of the variables. Their comparative studies were carried out for several multi-modal classification problems and found that the two methods produced classification rules with comparable error rates, but that CART is preferred for data sets with a large number of irrelevant or noisy variables and when the ratio of sample size to dimension is small.

7.3 COMPARATIVE STUDIES

7.3.1 Methods and Data Sets

Five classification methods were used in this study, involving two categories of methods:

(1) Traditional discrimination methods, which include LDA, QDA and kernel density estimation.

(2) Decision tree-based methods, which include CART and FACT.

Twenty four data sets, described in Tables 7.1 and 7.2, were chosen for the purpose of comparison. All the data sets were a convenient selection of published data. Twelve of the data sets were used previously in a comparative study undertaken by Lynn and Brook (1991). Those data sets, however, all contained continuous variables, nine of which were approximately normally distributed. The additional twelve data sets used here contain a wider variety of data types, including some data sets involving binary and ordinal categorical variables.

Table 7.1: Description of data sets (block 1)

A. Mammo: This problem involves an attempt to discriminate between women's experiences with mammography (three levels) based on five variables describing their knowledge, attitude and behaviour towards mammography. Source: R J Zapka and Ms D Sports, University of Massachusetts, Division of Public Health.

B. Marks1: This involves discriminating between males and females based on their Grade Point Average at university and five pre-university academic variables. Source: Moore and McCabe (1989).

C. Marks2: This involves discriminating between three groups of students with different majors on the basis of the same six variables in B. Source: Moore and McCabe (1989).

D. Marks3: This involves discriminating between six groups of students with different sex and/or majoring subject on the basis of the same six variables in B. Source: Moore and McCabe (1989).

E. Digit: In this example, the data are generated from a faulty calculator. Each of the seven lights (X1, ..., X7) of the digit display has 0.1 probability of not doing what it is supposed to do. The problem is an attempt to distinguish between the values 1 to 10, which occur with equal probability (a simulation sketch of this mechanism is given after Table 7.3). Source: Breiman et al (1984).

F. Birth: For this set of data, an attempt was made to discriminate between overweight and underweight babies based on various medical and demographic variables relating to the mother. Source: Hosmer and Lemeshow (1989).

G. Family: This problem was analysed by Kumar (1993), containing information on the type of contraceptive device used by 174 Indian couples. An attempt was made to discriminate between the four types of contraceptive device based on the values of twelve variables collected from each couple. Source: Family Planning Association of India.
H. Iris: This is the classic problem posed by Fisher involving discrimination between three allied species of iris based on four measurements relating to the size of the iris. Source: Fisher (1936).

I. Enures: One method of treatment of enuretic children involves an alarm buzzer which wakes the child whenever a bed becomes wet. It was proposed to investigate whether the outcome of the treatment could be predicted from seven measurements, where the possible outcomes are 1 = fail, 2 = relapse after apparent cure and 3 = long term cure. Source: Dr Sylvia Dische (from Hand (1981)).

J. Blood1: In the context of genetic counselling, the question of discriminating between normal and haemophilia A carrying women was considered on the basis of two variables. Source: Habbema, Hermans and Van Den Broek (1974).

K. Pinetree: This data consists of the measurements, in centimetres, of 60 pinetrees which were felled in three different areas of the forest. For each tree, measurements were taken at four positions. The problem involves distinguishing between trees grown in each of the three areas. Source: NZ Forestry Department.

L. Wheat: This problem involves discriminating between two varieties of wheat on the basis of six measurements taken from a sample of the two species. Source: Indian Agricultural Research Institute, India (1972).

M. Biomass1: This problem involves discriminating between three different islands on the basis of the growth of spartina biomass and four different chemicals from each of the three islands. Source: Rawlings (1988).

N. Biomass2: This involves distinguishing between three different types of vegetation cover on the basis of the same five variables in M. Source: Rawlings (1988).

O. Biomass3: This involves distinguishing between nine different location-vegetation types on the basis of the same five variables in M. Source: Rawlings (1988).

P. Comp1: Users of the University of London Computer Centre are divided into non-medical and medical users. An attempt is made to distinguish between the two based on the numbers of units of computing used under two different operating systems. Source: Hand (1981).

Q. Employ: This problem involves discriminating between three groups of countries (North-Western, Southern and Eastern Europe respectively) on the basis of the percentages of the labour force employed in nine different types of industry. Source: Euromonitor (1979).

R. Urinary: This problem involves discriminating between homosexual and heterosexual males on the basis of two chemical measurements taken from urinary samples. Source: Margolese (1970).

Table 7.2: Description of data sets (block 2)

S. Blood2: This is the same problem as J except that the two discriminating variables are logged (base 10).

T. Lingual: This problem involves discrimination between a set of 27 children who had an inborn error of metabolism known as transient neonatal tyrosinemia (TNT) and a control group of 27 normal children, based on the scores of ten psycholingual variables. Source: Peter Mullins.

U. Sparrow: This problem involves discriminating between sparrows that did or did not survive a severe storm off Rhode Island on 1 February 1889, on the basis of five measurements taken on each bird. Source: Bumpus (1898).

V. Comp2: This is the same problem as P except that the two discriminatory variables are logged (natural log).

W. Beetle: In this case,
an attempt is made to distinguish between two allied species of flea beetles that were long confused with one another, on the basis of two joint measurements. Source: Lubischew (1962).

X. Nuclear: This data involves two measurements (population and area) on each of the fifteen largest British cities, excluding London. The fifteen cities used are divided into two classes: those with an estimated fatality rate of 70% or more resulting from a nuclear strike and those with an estimated rate of less than 70%. Source: Laurie (1979).

The data sets were first sorted into two blocks after a Chi-squared test for heterogeneity of variance within groups was carried out for each data set; where heterogeneity was present to a significant degree, those data sets were assigned to the first block (Table 7.1), otherwise to the second (Table 7.2). Within each of the two blocks of comparison methods, the data sets were ordered by sample size. As some data sets within each block were of similar size, those data sets were further ordered by dimension.

Table 7.3 lists some details about each of the data sets, including sample size, dimension, number of classes, the Chi-squared test for the equality of class covariance matrices, equal priors or not, variable types and data structure. Variable type refers to whether the variables in the data set are continuous (normal (N)/skewed (S)), ordinal (O), nominal (C), binary (B) or a mixture of the above five types. Data structure refers to how many of the variables in the data were important for the classification process and how many were irrelevant. A data set could be described as either a parallel classification problem, whereby all of the variables have approximately equal weighting, or as a sequential classification problem where relatively few variables are important. As outlined in Section 4.3, traditional discrimination methods should perform best for parallel classification problems, with tree-based methods doing better for sequential problems. A third category, "mixed", was also used for problems where a particular data set did not fit neatly into either of the above two categories. Both stepwise discriminant analysis and CART's variable ranking technique were employed to determine into which category each of the twenty four data sets should be classed.

For each data set, priors proportional to class sample sizes were used, and for all methods, with the exception of FACT, models were obtained which minimised the misclassification error rate by n-fold cross-validation, although LDA and CART were also compared using the 0.632 error rate (see further on in this section). Tenfold cross-validation was used with FACT for reasons outlined in Section 4.3. With kernel density estimation, a normal kernel was used with smoothing parameter h = 0.5. For both CART and FACT, the size below which a node would not be split was set to 5. It was decided to use the one standard error rule throughout for CART for the purpose of consistency, although simulation results in Section 6.4 had shown that use of the zero standard error rule would give a more reliable estimate of the actual error rate for smaller data sets.

Table 7.3: Data sets and their various characteristics

Data Set   Sample Size   Dimension   Classes   Chi-squared   Equal priors?
A          374           5           3         155.94**      No
B          234           6           2         58.80**       Yes
C          234           6           3         107.62**      Yes
D          234           6           6         294.29**      Yes
E          200           7           10        2105.08**     No
F          188           8           2         42.76**       No
G          174           12          4         1108.47**     No
H          150           4           3         154.42**      Yes
I          112           7           3         2363.75**     No
J          75            2           2         15.90*        No
K          60            4           3         92.12**       Yes
L          54            6           2         43.23**       Yes
M          45            5           3         135.50**      Yes
N          45            5           3         107.81**      Yes
O          45            5           9         400.55**      Yes
P          49            2           2         101.51**      No
Q          26            9           3         193.35**      No
R          26            2           2         11.22*        No
S          75            2           2         5.24          No
T          54            10          2         52.27         Yes
U          49            5           2         0.69          No
V          49            2           2         3.92          No
W          36            2           2         2.15          Yes
X          15            2           2         9.30          No

Chi-squared: test for homogeneity of the within-class covariances
N Normally distributed variables; S Skewed variables; O Ordinal variables; C Nominal variables; B Binary variables
** significant at the 0.01 level; * significant at the 0.05 level
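To make the Digit problem of Table 7.1 concrete, the sketch below simulates Breiman et al's faulty-calculator mechanism as described there: a digit from 1 to 10 is drawn with equal probability, and each of the seven display lights independently fails to do what it should with probability 0.1. The seven-segment truth table used here is the conventional display encoding and is an assumption, as the thesis does not reproduce it.

```python
import random

# Conventional seven-segment patterns (segment order: top, top-right,
# bottom-right, bottom, bottom-left, top-left, middle); class 10 is the digit 0.
SEGMENTS = {
    1: (0, 1, 1, 0, 0, 0, 0),
    2: (1, 1, 0, 1, 1, 0, 1),
    3: (1, 1, 1, 1, 0, 0, 1),
    4: (0, 1, 1, 0, 0, 1, 1),
    5: (1, 0, 1, 1, 0, 1, 1),
    6: (1, 0, 1, 1, 1, 1, 1),
    7: (1, 1, 1, 0, 0, 0, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
    10: (1, 1, 1, 1, 1, 1, 0),
}

def faulty_digit(rng, p_fail=0.1):
    """Draw a digit 1..10 uniformly; flip each light with probability p_fail."""
    digit = rng.randint(1, 10)
    lights = tuple(1 - s if rng.random() < p_fail else s
                   for s in SEGMENTS[digit])
    return lights, digit   # observed (X1, ..., X7) and the true class

rng = random.Random(0)
for x, y in (faulty_digit(rng) for _ in range(5)):
    print(x, "->", y)
```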
The variables selected by stepwise discriminant analysis (SDA) for the family data, in order of entry, were as follows:

Step   Variable    Number in model   Partial R-squared   F       p
1      No_Child    1                 0.446               45.69   0.00
2      Wife_Edu    2                 0.191               13.27   0.00
3      Husb_Edu    3                 0.158               10.48   0.00
4      Age_Baby    4                 0.072               4.31    0.00
5      Income      5                 0.037               2.13    0.09

As there were k = 4 classes in the data set, there were four group classification functions, Lj(x), and six group separation functions, Dij(x), created. The four group classification functions using SDA are shown in Table 7.9. Priors were set proportional to sample size (ppss).

Table 7.9: Group classification functions, Lj(x), using SDA for the family data

            L1(x)     L2(x)      L3(x)      L4(x)
Constant    -8.3951   -11.0312   -11.9301   -12.5852
Husb_Edu    -0.1483   0.4608     -0.2586    -0.3092
Wife_Edu    1.7542    0.4235     1.3594     2.0982
Income      0.0008    0.0013     0.0024     0.0001
No_Child    3.0541    4.7036     3.6580     3.5821
Age_Baby    0.0212    0.0531     0.0360     0.0553

Both the n-fold cross-validation and 0.632 error rates were calculated for the SDA classification rules above, with R(CV) = 0.195 and R(0.632) = 0.177. The corresponding values for LDA using all twelve variables were R(CV) = 0.204 and R(0.632) = 0.194, showing that the stepwise model was more accurate. As numerous authors in the field of stepwise discrimination and the closely related topic of stepwise regression have pointed out, the best q variables found from the original sample may not be the best variables over the whole population of values. For instance, the five selected variables here may not necessarily be the most important variables for couples throughout all India seeking family planning advice.

Visually, the classification model is very hard to interpret. As the SDA group separation functions involve five variables, it is impossible to depict the full classification model. Figure 7.1 shows a three-dimensional graph of the three most important variables found during the stepwise selection process, that is, No_Child, Husb_Edu and Wife_Edu. [Key to Figure 7.1: Club = IUD; Star = Tubectomy; Balloon = foam tablets; Diamond = oral pills.] From Figure 7.1, as well as the table of coefficients in Table 7.9, it is apparent that those who used IUD were characterised as having wives with a higher level of education and a small number of children. Those cases where the wife had little or no schooling led to the use of Tubectomy, while those cases where the wife had a higher level of education and a larger family also used Tubectomy. Distinguishing characteristics for the other two groups are not particularly relevant for these three variables.

As the class sample sizes were drastically different, it was decided to do another analysis using equal priors. The only difference that occurred from Table 7.9 is that the constants changed, as evidenced from Section 4.5. Thus, each classification function changed by only one term after alteration of the priors.
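Table 7.9 makes the SDA rule easy to state in code: score a couple on each Lj(x) and assign the class whose function is largest. Below is a minimal sketch using the tabulated coefficients, assuming (as the class labels used later in the chapter suggest) that L1 to L4 correspond to IUD, Tubectomy, foam tablets and oral pills; the example couple's values are hypothetical.

```python
# Coefficients from Table 7.9, ordered as
# (constant, Husb_Edu, Wife_Edu, Income, No_Child, Age_Baby).
CLASSES = ["IUD", "Tubectomy", "foam tablets", "oral pills"]
COEFS = [
    (-8.3951, -0.1483, 1.7542, 0.0008, 3.0541, 0.0212),   # L1(x)
    (-11.0312, 0.4608, 0.4235, 0.0013, 4.7036, 0.0531),   # L2(x)
    (-11.9301, -0.2586, 1.3594, 0.0024, 3.6580, 0.0360),  # L3(x)
    (-12.5852, -0.3092, 2.0982, 0.0001, 3.5821, 0.0553),  # L4(x)
]

def classify(husb_edu, wife_edu, income, no_child, age_baby):
    """Assign the class whose group classification function Lj(x) is largest."""
    x = (husb_edu, wife_edu, income, no_child, age_baby)
    scores = [c[0] + sum(b * v for b, v in zip(c[1:], x)) for c in COEFS]
    return CLASSES[scores.index(max(scores))], scores

# Hypothetical couple: moderate education, income 1500, two children,
# youngest child 24 months old.
label, scores = classify(husb_edu=4, wife_edu=5, income=1500,
                         no_child=2, age_baby=24)
print(label, [round(s, 2) for s in scores])   # classified as IUD
```

Consistent with the interpretation of Figure 7.1 above, this hypothetical couple, with a better-educated wife and a small family, is assigned to the IUD class.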
7.4.3 CART

In contrast with LDA, CART used all twelve variables to perform the analysis. The CART tree, using ppss, is as shown in Figure 7.2 (see Section 3.2 for a description of the CART tree analysis). To build this tree, the Gini splitting criterion was employed and the minimum node size was set at 5. The 1-SE rule was used to select the "right sized" tree. The twoing splitting criterion was also tried but, in this instance, produced the same tree as that using Gini, though this is not always the case.

From Figure 7.2, it is clear that those cases where the wife had little or no schooling led to the use of Tubectomy. If the wife had a higher level of education and the husband was 32 or younger, then IUD was the predominant device used. Of those cases not already classified, husband's education was the final splitting variable used: those cases where the husband had little or no education used Tubectomy in the main, while those left over were more than likely to use IUD. For this tree, R(CV) = 0.213 and R(0.632) = 0.205.

[Figure 7.1: Relationship of wife's education, husband's education and number of children in the family to contraceptive device used. Club = IUD; Star = Tubectomy; Balloon = foam tablets; Diamond = oral pills.]

[Figure 7.2: CART tree with ppss for the family data. Splits: "Is Wife_Edu < 3.5?" (Tubectomy, n(t) = 94, r(t) = 0.09); "Is Husb_Age < 32.5?" (IUD, n(t) = 51, r(t) = 0.18); "Is Husb_Edu < 3.5?" (Tubectomy, n(t) = 22, r(t) = 0.36; otherwise IUD, n(t) = 7, r(t) = 0.43). Relative importance of variables: No_Child 100, Wife_Edu 78, Husb_Age 63, No_Males 62, Wife_Age 60. n(t) = number of observations in the node; r(t) = resubstitution error rate of the node. Circles represent decision nodes which have to be split on, while rectangles represent terminal nodes which are assigned to the class given below the node.]

It is clear from Figure 7.2 that there were a relatively small number of cases who used either foam tablets or oral pills, but all were classified as using either IUD or Tubectomy. This would not be a very good situation if these women were going to have adverse reactions when using either IUD or Tubectomy. In an attempt to counter this, another CART tree was grown, this time using equal priors; it is given in Figure 7.3. The CART tree shown there has changed markedly from that of Figure 7.2. The first split is similar to that in Figure 7.2, with cases where the wife had less than three years education being classified as using Tubectomy. Cases where wives had three or more years education were next divided on the basis of income. Figure 7.4 graphically depicts what happens in the CART tree: the solid line marks the first split, so that all cases to the left of that line were classified as using Tubectomy; the interval line denotes the position of the second split, with cases above the interval line classified as using foam tablets and those below predicted to be using oral pills. The disturbing feature about this tree, though, was the large number of IUD users in the sample, who were all misclassified, which, as with the CART tree of Figure 7.2, could lead to very serious problems if this classification tree was put into practice.
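Written out as rules, the two CART trees are compact enough to compare directly. The sketch below simply restates the fitted splits reported in Figures 7.2 and 7.3 (the equal-priors tree's income threshold of 1950 is taken from the description of Figure 7.4); it is a transcription of the trees, not the fitting procedure.

```python
def cart_ppss(wife_edu, husb_age, husb_edu):
    """Decision rules of the ppss tree (Figure 7.2)."""
    if wife_edu < 3.5:
        return "Tubectomy"
    if husb_age < 32.5:
        return "IUD"
    return "Tubectomy" if husb_edu < 3.5 else "IUD"

def cart_equal_priors(wife_edu, income):
    """Decision rules of the equal-priors tree (Figure 7.3);
    note that no node predicts IUD."""
    if wife_edu < 2.5:
        return "Tubectomy"
    return "oral pills" if income < 1950 else "foam tablets"

# The same hypothetical couple lands in different classes under the two priors.
print(cart_ppss(wife_edu=5, husb_age=30, husb_edu=4))   # IUD
print(cart_equal_priors(wife_edu=5, income=1500))       # oral pills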
As seen in Table 7.6, the number of misclassified observations from each class with CART was negatively related to sample size when ppss were used. When equal priors were used, only a small number of foam tablet and oral pill users were misclassified but, as mentioned previously, all of those who used IUD were falsely classified.

[Figure 7.3: CART tree with equal priors for the family data. Splits: "Is Wife_Edu < 2.5?" (Tubectomy, n(t) = 85, r(t) = 0.22); "Is Income < 1950?" (oral pills; otherwise foam tablets, n(t) = 69, r(t) = 0.54). Relative importance of variables: Wife_Edu 100, Husb_Age 90, Income 77, Wife_Age 75, No_Child 64.]

[Figure 7.4: Plot of income against wife's education for the family data using CART with equal priors, showing the positions of the two splits.]

7.4.4 FACT

In contrast with CART, which is totally non-parametric, FACT uses F-ratios of between to within class variance to select the partitioning variable, then carries out LDA on the selected coordinate axis to partition the data.

The FACT trees for the family data are given in Figures 7.5 and 7.6. Figure 7.5 illustrates the case of ppss. The FACT tree is very different from the CART tree using ppss shown in Figure 7.2. Interestingly, the two variables used to split on in this FACT tree were, however, two of the variables used in the SDA functions. This shows the link between LDA and FACT. Although FACT represents its output in a decision tree format, it is basically a parametric technique applied iteratively to each descendant subsample of observations. One might then expect the final classification structure to be similar to that produced by LDA, although this has not happened in this case.

As can be seen from Figure 7.5, FACT has failed to correctly classify any of those people who used either foam tablets or oral pills. Those with three or more children were more than likely to have a Tubectomy, while those with less than three children and a youngest child who was three years old or more were also more than likely to have a Tubectomy. Those with less than three children and whose youngest child was less than three years old were more than likely to use IUD. These rules seem straightforward and common sense compared with those found in Figure 7.2, which tended to be more sociological in nature. The FACT tree indicates that those with either larger families, or who had not had any children for a while, were more than likely to use a terminal contraceptive device. For this tree, R(0.632) = 0.197.

Figure 7.6 gives the FACT tree in the case of equal priors. Similarly to CART, and quite differently from LDA, the decision rules changed quite dramatically after alteration of the priors. The tree contains only four splits, but all are multiway rather than binary splits. The end result is a tree with ten terminal nodes, which are representative of all four classes (labelled 1 to 4 on the tree for the sake of space). Actually, the class misclassification error rates were not too dissimilar.
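The variable-selection step that FACT applies at each node, as described above, amounts to a one-way ANOVA F-ratio per variable. Here is a minimal sketch of just that step, assuming a design matrix X and class labels y; the subsequent LDA split on the winning coordinate is omitted.

```python
import numpy as np

def fact_split_variable(X, y):
    """Return the column of X with the largest between- to within-class
    F-ratio, as FACT uses to choose the partitioning variable at a node."""
    classes = np.unique(y)
    n, k = len(y), len(classes)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    f_ratios = (between / (k - 1)) / (within / (n - k))
    return int(f_ratios.argmax()), f_ratios

# Hypothetical two-class data in which only the first variable separates.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.repeat([0, 1], 50)
X[y == 1, 0] += 2.0
print(fact_split_variable(X, y))   # selects column 0
```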
[Figure 7.5: FACT tree with ppss for the family data. Splits: "Is No_Child < 2.4?" (if not, Tubectomy, n(t) = 90, r(t) = 0.08); then "Is Age_Baby < 36.5?" (IUD, n(t) = 68, r(t) = 0.26; otherwise Tubectomy, n(t) = 16, r(t) = 0.33). Relative importance of variables: No_Child 100.0, Wife_Edu 98.7, No_Male 71.9, Wife_Age 49.1, Husb_Age 47.5.]

[Figure 7.6: FACT tree with equal priors for the family data, containing four multiway splits (on Wife_Edu, No_Child, No_Femal and Husb_Edu) and ten terminal nodes. Legend: 1 = IUD, 2 = Tubectomy, 3 = foam tablets, 4 = oral pills. Relative importance of variables: Wife_Edu 100.0, No_Child 97.2, No_Male 81.6, Wife_Age 35.6, Husb_Age 34.8.]

7.4.5 KnowledgeSeeker

KnowledgeSeeker is an example of a tree-based approach that uses, in contrast to CART, a statistical significance testing approach to splitting. Differently from FACT, however, Chi-squared contingency table analysis is used to distinguish between the groups, rather than the use of means and covariances as employed by FACT. The method is based firmly on the refinements to AID carried out by Kass (1980), which resulted in the CHAID program.

The KnowledgeSeeker trees for the family planning data are given in Figures 7.7 to 7.12, using different partitioning methods, significance levels for splitting and splitting variables at the first node. Figures 7.7 to 7.10 show the trees with the first split being carried out on the most important variable. Version 2.1 of KnowledgeSeeker was used for all four trees. An obvious difference is the reduction in tree size when the significance level is decreased. Noticeable too are the slightly smaller trees produced using heuristic splitting, as was suggested in the KnowledgeSeeker User's Guide. Added to this is the increase in speed using heuristic partitioning. One trade-off in using the heuristic partitioning algorithm is that a slightly different tree may be produced every time a new tree is created; this will not occur if exhaustive partitioning is used. One could argue that a tree may be "pruned" by decreasing the significance level used for splitting, but the tree that was originally created will only remain unchanged if exhaustive partitioning is used for tree construction.

Comparing the KnowledgeSeeker trees with those of CART and FACT, it is evident that this method has produced somewhat of a compromise between CART and FACT. No_Child was selected as the most important first splitting variable, as with FACT, and stepwise discriminant analysis also showed this to be the most important discriminating variable. Thus, all methods which use statistical significance to determine the classification rules, whether tree-based or not, chose No_Child as the most important splitting variable. Using Figure 7.9 as the standard KnowledgeSeeker tree, it can be seen that those with either one or two children were more than likely to use IUD, with greater probability if there was only one child rather than two. Of those with three or more children, those with wives having zero to four years education were almost all users of Tubectomy, while those with wives having five or more years education were quite likely to use either foam tablets or oral pills. No technique for handling priors exists in KnowledgeSeeker, though the tree of Figure 7.9 can be seen to have correctly classified at least some of the observations from every class.
The larger trees produced here by KnowledgeSeeker should make the process more robust to large discrepancies in sample size. Reduction in the significance levels may affect group misclassifications to some degree, although this has not occurred for the examples presented here.

[Figures 7.7 to 7.12: KnowledgeSeeker trees for the family data, showing the device breakdown at each node. Figure 7.7: method = heuristic, significance level = 0.05. Figure 7.8: method = exhaustive partitioning, significance level = 0.05. Figure 7.9: method = heuristic, significance level = 0.01. Figure 7.10: method = exhaustive partitioning, significance level = 0.01. Figure 7.11: method = heuristic, significance level = 0.01, second best initial split used. Figure 7.12: method = heuristic, significance level = 0.01, third best initial split used.]

Figures 7.11 and 7.12 feature the KnowledgeSeeker trees with the first split being carried out on the second and third most important variables respectively. In Figure 7.11, the first split was done on Wife_Edu. This tree had an apparent error rate lower than the tree of Figure 7.9 though, as mentioned numerous times before, the apparent error rate is usually a biased estimate of the true performance of any classification rule. This is most particularly true of tree-based methods, which are data intensive, using subsamples of the original data rather than all the data to construct a set of classification rules. Figure 7.12 has the first split carried out on Wife_Age, then on the number of children in the family.

In order to truly test the validity of the KnowledgeSeeker trees, Version 2.0 was used so as to carry out some validation procedures. KnowledgeSeeker does not support the use of any form of cross-validation, so the rotation method was implemented as follows. The data set was first divided into two equal parts. Each half was in turn used as a learning sample and a test sample, with the two test sample error rates being averaged to get the rotation estimate of the error rate. The technique was used on all six KnowledgeSeeker trees previously depicted (Figures 7.7 to 7.12). The rotation, resubstitution and 0.632 error rate estimates for the six trees are given in Table 7.10, with the minimum error rate for each estimate, over all six trees, given in bold.

Table 7.10: Rotation, resubstitution and 0.632 error rates for six KnowledgeSeeker trees for the family data

Tree          Rotation   Resubstitution   0.632
Figure 7.7    0.236      0.144            0.203
Figure 7.8    0.282      0.144            0.221
Figure 7.9    0.219      0.149            0.193
Figure 7.10   0.248      0.213            0.221
Figure 7.11   0.214      0.138            0.186
Figure 7.12   0.184      0.172            0.180

From Table 7.10, it is apparent that the tree in Figure 7.12 had both the lowest rotation and 0.632 error rates, with the tree in Figure 7.11 having the second lowest error rates of these types, showing that the best tree was not necessarily the one which produced the greatest separation of the classes at the first split. The ideal situation is to have the independent estimate of the error rate equal to the apparent error rate, so that one can be confident in the classification rules generated from the learning sample. Using this as a criterion to choose the optimal KnowledgeSeeker tree, the tree in Figure 7.12, using heuristic partitioning with a 1% significance level after initially splitting on Wife_Age, was best. Now Wife_Age was noted as only the third most important variable for splitting at the first node. This perhaps shows that this tree was the most accurate classifier for the family data, backed up by the tree having the lowest overall independent error rates.
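The rotation estimate described above, and the 0.632 figures derived from it, can be reproduced mechanically. The sketch below assumes a generic fit function returning a classifier and is otherwise illustrative; note that the tabulated 0.632 values are consistent with weighting the resubstitution and rotation estimates by 0.368 and 0.632 respectively (for example, 0.368 x 0.149 + 0.632 x 0.219 = 0.193 for the Figure 7.9 tree).

```python
import numpy as np

def rotation_error(X, y, fit):
    """Two-fold rotation estimate: train on each half of the data,
    test on the other half, and average the two test error rates."""
    n = len(y)
    half = np.zeros(n, dtype=bool)
    half[: n // 2] = True                  # first half vs second half
    errors = []
    for train in (half, ~half):
        model = fit(X[train], y[train])    # fit returns a predict function
        pred = model(X[~train])
        errors.append(np.mean(pred != y[~train]))
    return float(np.mean(errors))

def point632_error(apparent, rotation):
    """0.632-weighted combination of the apparent and rotation estimates,
    consistent with the values reported in Table 7.10."""
    return 0.368 * apparent + 0.632 * rotation

# Consistency check against the Figure 7.9 row of Table 7.10.
print(round(point632_error(0.149, 0.219), 3))   # 0.193
```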
[Figure 7.12: KnowledgeSeeker Tree. Method = Heuristic, α = 0.01. Third Best Initial Split Used.]

Figures 7.11 and 7.12 feature the KnowledgeSeeker trees with the first split being carried out on the second and third most important splits respectively. In Figure 7.11, the first split was done on Wife_Edu. This tree had an apparent error rate lower than the tree of Figure 7.9, though, as mentioned numerous times before, the apparent error rate is usually a biased estimate of the true performance of any classification rule. This is most particularly true of tree-based methods, which are data intensive, using subsamples of the original data rather than all the data to construct a set of classification rules. Figure 7.12 has the first split carried out on Wife_Age, then on the number of children in the family.

In order to truly test the validity of the KnowledgeSeeker trees, Version 2.0 was used so as to carry out some validation procedures. KnowledgeSeeker does not support the use of any form of cross-validation, so the rotation method was implemented as follows. The data set was first divided into two equal parts. Each half was in turn used as a learning sample and a test sample, with the two test sample error rates being averaged to get the rotation estimate of the error rate. The technique was used on all six KnowledgeSeeker trees previously depicted (Figures 7.7 to 7.12). The rotation, resubstitution and 0.632 error rate estimates for the six trees are given in Table 7.10; the minimum error rate for each estimate over all six trees is: rotation 0.184 (Figure 7.12), resubstitution 0.138 (Figure 7.11) and 0.632 0.180 (Figure 7.12).

Table 7.10: Rotation, resubstitution and 0.632 error rates for six KnowledgeSeeker trees for the family data

    Tree           Rotation   Resubstitution   0.632
    Figure 7.7     0.236      0.144            0.203
    Figure 7.8     0.282      0.144            0.221
    Figure 7.9     0.219      0.149            0.193
    Figure 7.10    0.248      0.213            0.221
    Figure 7.11    0.214      0.138            0.186
    Figure 7.12    0.184      0.172            0.180

From Table 7.10, it is apparent that the tree in Figure 7.12 had both the lowest rotation and 0.632 error rates, with the tree in Figure 7.11 having the second lowest error rates of these types, showing that the best tree was not necessarily the one which produced the greatest separation of the classes at the first split. The ideal situation is to have the independent estimate of the error rate equal to the apparent error rate, so that one can be confident in the classification rules generated from the learning sample. Using this as a criterion to choose the optimal KnowledgeSeeker tree, the tree in Figure 7.12, using heuristic partitioning with a 1% significance level after initially splitting on Wife_Age, was best. Now Wife_Age was noted as only the third most important variable for splitting at the first node. This perhaps shows that this tree was the most accurate classifier for the family data, backed up by the tree having the lowest overall independent error rates.
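As a rough sketch of the rotation procedure just described, together with the 0.632 combination reported in Table 7.10, consider the following; this is not the software actually used for the thesis, and fit and error_rate are hypothetical stand-ins for any rule-building and scoring routines (for example, a tree grower).

    # A minimal, illustrative implementation of the rotation (twofold
    # cross-validation) estimate and of Efron's 0.632 combination.
    import numpy as np

    def rotation_error_rate(X, y, fit, error_rate, seed=0):
        """Divide the data into two equal parts; use each half in turn as
        the learning sample and the other as the test sample, then average
        the two test-sample error rates."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        first, second = np.array_split(idx, 2)
        rates = []
        for learn, test in ((first, second), (second, first)):
            rule = fit(X[learn], y[learn])        # build the rule on one half
            rates.append(error_rate(rule, X[test], y[test]))  # test on other
        return float(np.mean(rates))

    def error_632(apparent, out_of_sample):
        """Efron's 0.632 estimate: a weighted combination of the apparent
        error rate and an error rate computed on observations not used to
        build the rule."""
        return 0.368 * apparent + 0.632 * out_of_sample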
One of the principal reasons for the initial creation of decision tree techniques was to identify complex interactions among the data that could not be detected by parametric methods, such as LDA, without prespecifying the interaction terms directly. Tree-based methods, in general, appear to be very dependent on the initial split, that is, the largest main effect for the whole sample. If the first split is not particularly good, then it is unlikely that further partitionings of the data set will lead to a robust set of decision rules. If the first split is very good, but does not interact well with the other variables, the decision rules may also be rather weak. The interaction structure produced by the decision tree is going to be very dependent on the association with the first splitting variable. If splitting initially on a variable, xi, does not lead to descendant nodes as pure as those produced by splitting on another variable, xj, it may still produce a more robust tree than initially splitting on xj, if the interaction structure is stronger between xi and the other variables than between xj and the other variables. In this example, Wife_Age was rated only the third most important variable or main effect, but in real terms, the interaction between Wife_Age and the other variables in the data set produced the more accurate set of decision rules using KnowledgeSeeker. Splitting initially on either No_Child or Wife_Edu did not produce as good a decision tree as that found through splitting first on Wife_Age.

7.4.6 Splus Trees()

"The tree modelling interactive environment now available in Splus is to the batch mode program CART as graphical statistical packages such as Splus or JMP are to batch processing SAS for other statistical methods" (Morton, 1992, p 76). The method provides an example of the CART approach to decision tree growth, while incorporating all the advantages of an interactive environment for tree construction, pruning and graphics.

Figure 7.13 contains the fully grown (or overgrown) Splus tree. Similarly to CART, and unlike FACT and KnowledgeSeeker, Wife_Edu was chosen as the initial splitting variable, thus having the largest main effect for the whole sample. The graph is rather messy as there are too many splits, hence terminal nodes, in the tree. The residual deviance for this tree was 0.661 with R(A) = 0.149. Obviously, this error rate is optimistically biased. Figures 7.14 and 7.15 give the deviances for the sequences of subtrees produced by cost-complexity pruning and optimal shrinking respectively. As mentioned earlier (Section 3.11), the deviance always decreases as tree size increases, with the former being a step function because optimal subtrees remain constant between adjacent values of subtree sizes. The rotation method was used to determine the optimal sized tree, that is, the tree with minimum deviance. The average values of the deviance for given tree sizes, using rotational validation after optimal shrinking, are given in Figure 7.16. This plot would tend to indicate that a tree with six terminal nodes would be optimal, but perhaps the use of a standard error rule, such as that used in CART, would tend to suggest that a tree with four terminal nodes would suffice.
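As an illustration of the deviance criterion referred to above (Section 3.11), the following sketch computes the multinomial node deviance and the deviance reduction of a split in the style of Clark and Pregibon (1992); this is not the S source code, and the function names are hypothetical.

    # Node deviance: minus twice the multinomial log-likelihood of the class
    # counts in a node, with fitted probabilities taken as the within-node
    # class proportions. Illustrative only.
    import numpy as np

    def node_deviance(class_counts):
        counts = np.asarray(class_counts, dtype=float)
        probs = counts / counts.sum()          # fitted class probabilities
        nonzero = counts > 0                   # 0 log 0 contributes nothing
        return -2.0 * np.sum(counts[nonzero] * np.log(probs[nonzero]))

    def deviance_reduction(parent, left, right):
        """The quantity a split is chosen to maximise: the deviance of a
        node minus the combined deviance of its two descendant nodes."""
        return node_deviance(parent) - (node_deviance(left)
                                        + node_deviance(right))

Because the deviance of a tree can only decrease as more splits are added, the curves in Figures 7.14 to 7.16 necessarily slope downwards, which is why an independent (here rotational) assessment is needed to choose the tree size.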
The resultant tree chosen by the rotation method is shown in Figure 7.17. The initial split is made on "Wife_Edu < 3.5", as occurred with the CART tree in Figure 7.2. The subgroup whose wives had less than four years education were next split on No_Child, while the other subgroup was split on Husb_Age. For those cases where the husband was 33 or more years of age, a further split was made on "Husb_Edu < 3.5". A final split was made on the node whose husbands had four or more years education, splitting on the number of girls in the family. Those families with either no girls or only one girl were more than likely to use a terminal device (Tubectomy), while those with two or more girls opted for oral pills. The residual deviance for this tree was 0.933 with R(A) = 0.162, R(ROT) = 0.207 and R(0.632) = 0.190. In contrast with the CART tree of Figure 7.2, at least some of those cases who used oral pills were not misclassified. Notice too that the 0.632 error rate is lower for this tree than that in Figure 7.2, due most probably to the much smaller apparent error rate for the tree in Figure 7.17.

[Figure 7.13: Splus Tree Before Pruning or Shrinking.]

[Figure 7.14: Deviance versus Tree Size for Subtrees Produced by Cost-Complexity Pruning.]

[Figure 7.15: Deviance versus Tree Size for Subtrees Produced by Optimal Shrinking.]

[Figure 7.16: Rotational Deviance versus Tree Size for Subtrees Produced by Optimal Shrinking.]

LIST OF SYMBOLS

Symbol            Page   Description
λ, γ              14     regularisation parameters used in regularised discriminant analysis
NDA               14     non-metric discriminant analysis
                  15     number of class m sample points with values less than or equal to x
                  15     cumulative distribution function estimate
                  15     density function estimate
                  15     smoothing weighting function used in kernel density estimation
                  16     kernel density discriminant function
KNN               17     Kth nearest neighbour method
                  17     distance function between xj and x
AID               21     Automatic Interaction Detector
THAID             21     THeta AID
CHAID             21     CHi-squared AID
n, n(t)           26     sample size, sample size at a node
r(t)              26     proportion of observations not from the class with the largest number of observations at each node
t                 31     subset of the data, node
                  31     total sum of squares for node t
                  31     response in node t
                  31     between group sum of squares found after splitting on variable xj
                  31     mean of responses in the first subgroup of node t
                  31     maximum between group sum of squares over all variables
                  31     total sum of squares for the whole data set
                  31     parameters used in AID stopping rules
θ                 34     Theta splitting criterion
                  34     total number of observations at node t
                  34     total number of misclassified observations in the ith split group
δ                 34     Delta splitting criterion
                  34     proportion of observations from class j in node t
                  34     proportion of observations from class j in split group 1
                  35     parameters used in THAID stopping rules
m, n              36     the number of observations from classes 1 and 2 respectively (ID3 only)
I(m, n)           37     expected information needed to classify an object using an ID3 tree
v                 37     distinct categories of a variable
t1, ..., tv       37     descendant nodes of t
I(mi, ni)         37     information required to classify an object using a subtree from ti
E(xj)             37     expected information required for trees partitioned on variable xj at the root node
gain(xj)          37     information gained through branching on xj
p(i|t)            41     probability an object can be assigned to class i at node t
p(j|t)            41     probability an object can be assigned to class j at node t
i(t)              42     estimated probability of misclassification under the Gini index
s                 42     a split
Δi(s, t)          42     Gini splitting criterion
pL, pR            42     proportion of observations at node t sent left or right respectively by the split
i(tL)             42     estimated probability of misclassification for the observations sent left by the split
C1, C2            42     amalgamation of classes, superclasses
                  42     twoing splitting criterion
Tmax              43     a fully grown tree
T                 43     a subtree of Tmax
Rα(T)             43     cost-complexity measure for T
R(T)              43     resubstitution error rate for T
α                 43     cost-complexity parameter
                  43     number of terminal nodes in T
T(α)              43     subtree that minimises Rα(T)
R(I)              44     independent estimate of the error rate, test sample or cross-validation error rate estimate
se(R(I))          44     standard error of R(I)
s*                44     optimal partition of a node t into tL and tR
sj                44     split carried out on variable xj
p(s*, sj)         45     probability that sj sends the cases in t the same way as s*
pLL(s*, sj)       45     probability that both s* and sj send the cases in t to the left
tL                45     set of observations sent left by s
p(t)              45     probability that an observation is in node t
s~j               45     surrogate split on variable xj
IV(xj)            47     correctness of the answer from splitting on xj
K                 48     total number of observations in subtree T
J                 48     total number of misclassified observations in subtree T
L                 48     pessimistic view of the number of misclassified observations in subtree T
se(L)             48     standard error of L
E                 48     number of observations misclassified by the best terminal node within T
yij               56     probability that the ith response falls in the jth class
D(μi, yi)         56     deviance function for an observation yi
D(μ; y)           56     deviance of a node, the sum of the deviances of all observations in the node
D(μL, μR; y)      56     combined deviance of the two descendant nodes
ΔD                56     difference between the deviance of a node and the combined deviance of the two descendant nodes
Dα(T)             57     cost-complexity measure for T using deviances
y(node)           57     fitted value for each node
y~(node)          57     shrunken fitted value for the node
y~(parent)        57     shrunken fitted value for the node's parent
y(parent)         57     fitted value for the node's parent
R(T)              66     actual or true error rate, the expected probability of misclassification when the class conditional density functions are known
R(E)              67     the expected error rate for a learning sample of a given size
R(TS)             68     test sample error rate estimate
R(A)              68     apparent/resubstitution error rate estimate
R(H)              69     holdout error rate estimate
R(ROT)            69     rotation error rate estimate, twofold cross-validation error rate estimate
R(CV)             69     n-fold cross-validation error rate estimate, leave-one-out estimate, U estimate
                  69     apparent error rate for the learning sample with the jth observation omitted
                  70     average jackknife error rate estimate
ωJ                70     jackknife estimate of the bias of the apparent error rate
R(J)              70     jackknife error rate estimate
θi(x)             70     posterior probability that x belongs to class i
R(PP)             70     posterior probability error rate estimate
cj                71     class of observation xj
c^j               71     predicted class of observation xj
Q(cj, c^j)        71     (0, 1) loss function
ω                 71     true bias involved in using the apparent error rate as an estimate of the actual error rate
x1*, ..., xn*     71     random sample of observations drawn with replacement from the learning sample, bootstrap sample
R*(A)             71     apparent error rate of the bootstrap sample
R*(T)             72     actual error rate of the bootstrap sample
pj*               72     resampled proportion of observations in the bootstrap sample
ωb                72     bias involved in using the apparent error rate of the bootstrap sample
ωB                72     bootstrap estimate of the bias of the apparent error rate
R(BOOT)           72     bootstrap error rate estimate
R(0.632)          72     0.632 error rate estimate
R(E)              72     average error rate for all observations not in the bootstrap sample
ω0.632            73     0.632 estimate of the bias of the apparent error rate
R(GCV)            73     g-fold cross-validation error rate estimate
R(τ)              73     weighted estimate of the g-fold cross-validation and apparent error rates
zij               74     standardised distribution with mean zero and standard deviation one
e                 74     combination of prior probabilities and covariance matrices factor
R                 76     classification method factor
ANOVA             76     analysis of variance
seR(CV)           82     standard error of the misclassification cost
PPSS              93     priors proportional to sample size
R(i|j)            94     group/class misclassification error rates, proportion of observations from class i classified as class j
R(TEN)            108    tenfold cross-validation error rate estimate
R(ACV)            108    apparent error rate for CART trees chosen by n-fold cross-validation
R(AR)             108    apparent error rate for CART trees chosen by rotation
R(AT)             108    apparent error rate for CART trees chosen by tenfold cross-validation
R(T)              108    any error rate estimating the actual error rate
MSE               108    expected value of the squared distance between an error rate estimate and the actual error rate
Ci                115    category i in a categorical variable
pij               116    probability of getting a response xj = 1 for class i, probability pattern factor
rijk              116    correlation coefficient between xj and xk for class i, correlation factor
q                 131    factor combination of means, dimension and correlation
ρ                 131    population correlation coefficient
Pi                131    population correlation matrix for class i
bias              132    expected value of the difference between the actual error rate and the error rate estimate
COUNT             132    proportion of samples for each factor combination in which the estimated error rate was less than the actual error rate
LR(T)             158    actual error rate after a logit transformation
                  158    error rate estimate after a logit transformation
PR(T)             158    actual error rate after a proportion transformation
PR(t)             158    error rate estimate after a proportion transformation
m(t)              230    reciprocal entropy index

BIBLIOGRAPHY

Aitchison, J. and Brown, J.A.C. (1957) The lognormal distribution with special reference to its uses in economics, Cambridge University Press.
Aitchison, J., Habbema, J.D.F. and Kay, J.W. (1977) A critical comparison of two methods of statistical discrimination, Applied Statistics, 26, pp 15-25.
Anderson, J.A.
(1966) Some nonparametric multivariate procedures based on statistically equivalent blocks, in "Multivariate Analysis", P.R. Krishnaiah (ed), New York: Academic Press, pp 5-27.
Ashikaga, T. and Chang, P.C. (1981) Robustness of Fisher's linear discriminant function under two-component mixed normal models, Journal of the American Statistical Association, 76, pp 676-680.
Assael, H. (1970) Segmenting markets by group purchasing behaviour: an application of the A.I.D. technique, Journal of Marketing Research, 7, pp 153-158.
Bellman, R.E. (1961) Adaptive Control Processes, Princeton, New Jersey: Princeton University Press.
Belson, W.A. (1959) Matching and prediction on the principle of biological classification, Applied Statistics, 8, pp 65-75.
Biggs, D., de Ville, B. and Suen, E. (1991) A method of choosing multiway partitions for classification and decision trees, Journal of Applied Statistics, 18, pp 49-62.
Bradford, E. (1993) Tree-based models in S, New Zealand Statistician, 28, pp 36-51.
Breiman, L. (1968) Probability, Reading, Massachusetts: Addison-Wesley.
Breiman, L. (1978) Description of chlorine tree development and use, Technical Report, Santa Monica, California: Technology Service Corporation.
Breiman, L. and Friedman, J.H. (1988) Commentary on "Tree-structured classification via generalised discriminant analysis", Journal of the American Statistical Association, 83, pp 725-727.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees, Belmont, California: Wadsworth.
Bumpus, H.C. (1898) The elimination of the unfit as illustrated by the introduced species, Passer domesticus, Biological Lectures, Marine Biology Laboratory, Woods Hole, 11th Lecture.
Buntine, W.L. (1992a) Learning classification trees, Statistics and Computing, 2, pp 63-73.
Buntine, W.L. (1992b) Tree classification software, Presented at the Third National Technology Transfer Conference and Exposition, Baltimore.
Buntine, W.L. (1993) Personal Communication.
Buntine, W.L. and Caruana, R. (1993) Introduction to IND version 2.1 and recursive partitioning, Technical Report, Moffett Field, California: NASA Ames Research Centre.
Buntine, W.L. and Niblett, T. (1992) A further comparison of splitting rules for decision-tree induction, Machine Learning, 8, pp 75-85.
Chernick, M.R., Murthy, V.K. and Nealy, C.D. (1985) Application of bootstrap and other resampling techniques: evaluation of classifier performance, Pattern Recognition Letters, 3, pp 167-178.
Chernick, M.R., Murthy, V.K. and Nealy, C.D. (1986) Correction note to "Application of bootstrap and other resampling techniques: evaluation of classifier performance", Pattern Recognition Letters, 4, pp 133-142.
Chittineni, C.B. (1977) On the estimation of probability of error, Pattern Recognition, 9, pp 191-196.
Chou, P.A. (1991) Optimal partitioning for classification and regression trees, IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, pp 340-354.
Ciampi, A., Hogg, S.A., McKinney, S. and Thiffault, J. (1988) RECPAM: a computer program for recursive partition and amalgamation for censored survival data and other situations frequently occurring in biostatistics, Computer Methods and Programs in Biomedicine, 26, pp 239-256.
Ciampi, A., Lawless, J.F., McKinney, S.M. and Singhal, K.
(1988) Regression and recursive partition strategies in the analysis of medical survival data, Journal of Clinical Epidemiology, 41, pp 737-748.
Ciampi, A., Thiffault, T., Nakache, J-P. and Asselain, B. (1986) Stratification by stepwise regression, correspondence analysis and recursive partition: a comparison of three methods of analysis for survival data with covariates, Computational Statistics and Data Analysis, 4, pp 185-204.
Ciampi, A., Schiffrin, A., Thiffault, J., Quintal, H., Weitzner, G., Poussier, P. and Lalla, D. (1990) Cluster analysis of an insulin-dependent diabetic cohort towards the definition of clinical subtypes, Journal of Clinical Epidemiology, 43, pp 701-715.
Clark, L.A. and Pregibon, D. (1992) Tree-based models, in "Statistical Models in S", J.M. Chambers and T.J. Hastie (eds), Pacific Grove, California: Wadsworth and Brooks/Cole, pp 377-419.
Cochran, W.G. (1968) Commentary on "Estimation of error rates in discriminant analysis", Technometrics, 10, pp 204-205.
Cook, E.F. and Goldman, L. (1984) Empiric comparison of multivariate analysis techniques: advantages and disadvantages of recursive partitioning analysis, Journal of Chronic Diseases, 37, pp 721-731.
Cover, T.M. and Hart, P.E. (1967) Nearest neighbor pattern classification, IEEE Transactions on Information Theory, IT-13, pp 21-27.
Crawford, S.L. (1989) Extensions to the CART algorithm, International Journal of Man-Machine Studies, 31, pp 197-217.
Crawford, S.L. and Souders, S.K. (1990) A comparison of two conceptual clustering algorithms, International Journal of Pattern Recognition and Artificial Intelligence, 4, pp 409-420.
Crawford, S.L., Fung, R.M. and Tse, E. (1989) Data-driven assessment and decision making, in "Expert Systems in Economics, Banking and Management", L.F. Pau et al (eds), Amsterdam: North-Holland, pp 399-408.
de Ville, B. (1990) Applying statistical knowledge to database analysis and knowledge base construction, Proceedings of the Sixth Conference on Artificial Intelligence Applications, pp 30-36.
de Ville, B. (1994) Personal Communication.
Doyle, P. (1973) The use of Automatic Interaction Detector and similar search procedures, Operational Research Quarterly, 24, pp 465-467.
Doyle, P. and Fenwick, I. (1975) The pitfalls of AID analysis, Journal of Marketing Research, 12, pp 408-413.
Efron, B. (1979) Bootstrap methods: another look at the jackknife, Annals of Statistics, 7, pp 1-26.
Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans, Philadelphia: Society for Industrial and Applied Mathematics.
Efron, B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation, Journal of the American Statistical Association, 78, pp 316-331.
Einhorn, H.J. (1972) Alchemy in the behavioural sciences, Public Opinion Quarterly, 36, pp 367-378.
Euromonitor (1979) European Marketing Data and Statistics, London: Euromonitor Publications.
Feng, C., Sutherland, A., King, R., Muggleton, S. and Henery, R. (1993) Comparison of machine learning classifiers to statistics and neural networks, Personal Communication.
Fielding, A. (1977) Binary segmentation: the Automatic Interaction Detector and related techniques for exploring data structure, in "The Analysis of Survey Data, Volume 1", C.A. O'Muircheartaigh and C. Payne (eds), London: John Wiley, pp 221-257.
Fisher, R.A. (1936) The use of multiple measurement in taxonomic problems,
Annals of Eugenics, 7, pp 179-188.
Fitzmaurice, G.M., Krzanowski, W.J. and Hand, D.J. (1991) A Monte Carlo study of the 632 estimator of error rate, Journal of Classification, 8, pp 239-250.
Fix, E. and Hodges, J.L. (1951) Discriminatory analysis, nonparametric discrimination: consistency properties, Report No. 4, Project No. 21-49-004, USAF School of Medicine, Brooks Air Force Base, Randolph Field, Texas.
Friedman, J.H. (1977) A recursive partitioning decision rule for nonparametric classification, IEEE Transactions on Computers, C-26, pp 404-408.
Friedman, J.H. (1989) Regularised discriminant analysis, Journal of the American Statistical Association, 84, pp 165-175.
Fukunaga, K. (1990) Introduction to Statistical Pattern Recognition, Second Edition, Boston: Harcourt Brace Jovanovich.
Ganesalingam, S. and Lynn, R.D. (1991) Posterior probability based estimator for the overall error rate associated with a linear discriminant function, Occasional Publications in Mathematics and Statistics, 23, Massey University.
Ganeshanandam, S. and Krzanowski, W.J. (1990) Error rate estimation in two-group discriminant analysis using the linear discriminant function, Journal of Statistical Computation and Simulation, 36, pp 157-175.
Glick, N. (1978) Additive estimators for probabilities of correct classification, Pattern Recognition, 10, pp 211-222.
Gnanadesikan, R. (1977) Methods for Statistical Data Analysis of Multivariate Observations, New York: John Wiley.
Gong, G. (1986) Cross-validation, the jackknife and the bootstrap: excess error estimation in forward logistic regression, Journal of the American Statistical Association, 81, pp 108-113.
Gordon, L. and Olshen, R.A. (1980) Consistent nonparametric regression from recursive partitioning schemes, Journal of Multivariate Analysis, 10, pp 611-627.
Grajski, K.A., Breiman, L., Viana Di Prisco, G. and Freeman, W.J. (1986) Classification of EEG spatial patterns with tree-structured methodology: CART, IEEE Transactions on Biomedical Engineering, 33, pp 1076-1086.
Habbema, J.D.F., Hermans, J. and van den Broek, K. (1974) A stepwise discriminant analysis program using density estimation, in "Compstat 1974", G. Bruckmann, F. Ferschl and L. Schmetterer (eds), Vienna: Physica-Verlag, pp 101-110.
Hadjimichael, M. and Wasilewska, A. (1993) Interactive inductive learning, International Journal of Man-Machine Studies, 38, pp 147-167.
Hand, D.J. (1981) Discrimination and Classification, New York: John Wiley.
Hand, D.J. (1986) Recent advances in error rate estimation, Pattern Recognition Letters, 4, pp 335-346.
Hanson, S.J. (1990) What connectionist models learn: learning and representation in connectionist neural networks, Behavioral and Brain Sciences, 13, pp 471-511.
Heald, G.I. (1972) The application of the Automatic Interaction Detector (A.I.D.) programme and multiple regression techniques to the assessment of store selection and site performance, Operational Research Quarterly, 23, pp 445-457.
Hosmer, D.W. and Lemeshow, S. (1989) Applied Logistic Regression, New York: Wiley.
Ildiko, E.F. and Lanteri, S. (1989) Classification models: discriminant analysis, SIMCA, CART, Chemometrics and Intelligent Laboratory Systems, 5, pp 247-256.
Jennrich, R.I. (1977) Stepwise discriminant analysis, in "Statistical Methods for Digital Computers, Volume 3", K. Enslein, A. Ralston and H.S. Wilf (eds), New York: John Wiley, pp 76-95.
Kanal, L.
(1974) Patterns in pattern recognition: 1968-1974, IEEE Transactions on Information Theory, IT-20, pp 697-722.
Kass, G.V. (1975) Significance testing in automatic interaction detection (A.I.D.), Applied Statistics, 24, pp 178-189.
Kass, G.V. (1980) An exploratory technique for investigating large quantities of categorical data, Applied Statistics, 29, pp 119-127.
Kendall, M.G. (1966) Discrimination and classification, in "Multivariate Analysis", P.R. Krishnaiah (ed), New York: Academic Press, pp 165-185.
Konishi, S. and Honda, M. (1990) Comparison of procedures for estimation of error rates in discriminant analysis under non-normal populations, Journal of Statistical Computation and Simulation, 36, pp 105-115.
Krzanowski, W.J. (1977) The performance of Fisher's linear discriminant function under non-optimal conditions, Technometrics, 19, pp 191-200.
Kumar, K. (1993) Application of discriminant analysis and Mahalanobis distance to the family planning data, Personal Communication.
Kumar, K. and Srivastava, S. (1989) An analysis of the profile of accepters of family welfare programme in India, Paper presented in the International Union of the Scientific Study of the Population Conference, New Delhi.
Lachenbruch, P.A. (1967) An almost unbiased method of obtaining confidence intervals for the probability of misclassification in discriminant analysis, Biometrics, 23, pp 639-645.
Lachenbruch, P.A. (1975) Discriminant Analysis, New York: Hafner Press.
Lachenbruch, P.A. and Mickey, M.R. (1968) Estimation of error rates in discriminant analysis, Technometrics, 10, pp 1-11.
Lachenbruch, P.A., Sneeringer, C. and Revo, L.T. (1973) Robustness of the linear and quadratic discriminant function to certain types of non-normality, Communications in Statistics, 1, pp 39-56.
Laurie, P. (1979) Beneath the City Streets, London: Granada Publishing.
LeBlanc, M. and Crowley, J. (1993) Survival trees by goodness of split, Journal of the American Statistical Association, 88, pp 457-467.
Lehmann, E.L. (1975) Nonparametrics: Statistical Methods Based on Ranks, San Francisco: Holden-Day, Inc.
Lesaffre, E., Williams, J.L. and Albert, A. (1989) Estimation of error rates in multiple group logistic discrimination analysis. The approximate leaving-one-out method, Communications in Statistics - Simulation and Computation, 18, pp 2989-3007.
Loh, W.Y. and Vanichsetakul, N. (1988) Tree-structured classification via generalised discriminant analysis, Journal of the American Statistical Association, 83, pp 715-725.
Lubischew, A.A. (1962) On the use of discriminant functions in taxonomy, Biometrics, 18, pp 455-477.
Lynn, R.D. and Brook, R.J. (1991) Classification by decision trees and discriminant analysis, New Zealand Statistician, 26, pp 18-26.
Lynn, R.D., Brook, R.J. and Arnold, G.C. (1993) A comparison of four classification methods: linear and quadratic discriminant analysis, CART and FACT, Mathematical and Information Sciences Report, Series B: 1, Massey University.
McLachlan, G.J. (1974) Estimation of the errors of misclassification on the criterion of asymptotic mean square error, Technometrics, 16, pp 255-260.
McLachlan, G.J. (1977) A note on the choice of a weighting function to give an efficient method for estimating the probability of misclassification, Pattern Recognition, 9, pp 147-149.
McLachlan, G.J.
(1986) Assessing the performance of an allocation rule, Computing and Mathematics with Applications, 12A, pp 261-272.
McLachlan, G.J. (1987) Error rate estimation in discriminant analysis: recent advances, in "Advances in Multivariate Statistical Analysis", A.K. Gupta (ed), Amsterdam: Dordrecht, pp 235-252.
McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern Recognition, New York: John Wiley.
Mabbett, A., Stone, M. and Washbrook, J. (1980) Cross-validatory selection of binary variables in differential diagnosis, Applied Statistics, 29, pp 198-204.
Margolese, M.M. (1970) Homosexuality: a new endocrine correlate, Hormones and Behaviour, 1, pp 151-155.
Marks, S. and Dunn, O.J. (1974) Discriminant functions when covariance matrices are unequal, Journal of the American Statistical Association, 69, pp 555-559.
Marshall, R.J. (1986) Partitioning methods for classification and decision making in medicine, Statistics in Medicine, 5, pp 517-526.
Marshall, R.J. (1990) Non-hierarchical partition analysis for classification and for identification of subgroups in medical research, Personal Communication.
Messenger, R.C. and Mandell, M.L. (1972) A modal search technique for predictive nominal scale multivariate analysis, Journal of the American Statistical Association, 67, pp 768-772.
Michie, D. (1986) Current developments in expert systems, in "Applications of Expert Systems, Volume 1", J.R. Quinlan (ed), Wokingham: Addison-Wesley, pp 137-156.
Michie, D. (1989) Problems of computer-aided concept formation, in "Applications of Expert Systems, Volume 2", J.R. Quinlan (ed), Wokingham: Addison-Wesley, pp 310-333.
Mingers, J. (1989) An empirical comparison of selection measures for decision-tree induction, Machine Learning, 3, pp 319-342.
Minitab Inc., Version 9, State College, Pennsylvania.
Moore, D.H. II (1973) Evaluation of five discrimination procedures for binary variables, Journal of the American Statistical Association, 68, pp 399-404.
Moore, D.S. and McCabe, G.P. (1989) Introduction to the Practice of Statistics, San Francisco: Freeman.
Morgan, J.N. (1990) A conditional analysis of movers' housing responses, Journal of Economic Behaviour and Organisation.
Morgan, J.N. (1993) Personal Communication.
Morgan, J.N. and Messenger, R.C. (1973) THAID: a Sequential Search Program for the Analysis of Nominal Scale Dependent Variables, Ann Arbor: Institute for Social Research, University of Michigan.
Morgan, J.N. and Sonquist, J.A. (1963) Problems in the analysis of survey data, and a proposal, Journal of the American Statistical Association, 58, pp 415-434.
Morton, S.C. (1992) Personal crunching: new advances in statistical dendrology, Chance, 5, pp 76-79.
Muxworthy, D.T. (1972) Review of AID III, British Sociological Association Maths, Statistics and Computing Applications Group Newsletter, 9.
Oliver, J.L. (1993) Decision graphs - an extension of decision trees, to appear in Artificial Intelligence and Statistics, 14.
O'Neill, J.L. (1986) Knowledge acquisition for radar classification, in "Applications of Expert Systems, Volume 1", J.R. Quinlan (ed), Wokingham: Addison-Wesley, pp 184-199.
Pawlak, Z., Wong, S.K.M. and Ziarko, W. (1988) Rough sets: probabilistic versus deterministic approach, International Journal of Man-Machine Studies, 29, pp 81-95.
Picard, R.R. and Berk, K.N. (1990) Data splitting, American Statistician, 44, pp 140-147.
Prada Sanchez, J.M. and Otero Cepeda, X.L. (1989) The use of smooth bootstrap techniques for estimating the error rate of a prediction rule, Communications in Statistics - Simulation and Computation, 18, pp 1169-1186.
Quenouille, M. (1949) Approximate tests of correlation in time series, Journal of the Royal Statistical Society Series B, 11, pp 68-84.
Quenouille, M. (1956) Notes on bias estimation, Biometrika, 43, pp 356-360.
Quinlan, J.R. (1979) Discovering rules from large collections of examples: a case study, in "Expert Systems in the Micro Electronic Age", D. Michie (ed), Edinburgh: Edinburgh University Press.
Quinlan, J.R. (1983) Learning efficient classification procedures and their application to chess end games, in "Machine Learning, an Artificial Intelligence Approach, Volume 1", R.S. Michalski, J.G. Carbonell and T.M. Mitchell (eds), Palo Alto, California: Tioga, pp 463-482.
Quinlan, J.R. (1986) Induction of decision trees, Machine Learning, 1, pp 81-106.
Quinlan, J.R. (1987) Simplifying decision trees, International Journal of Man-Machine Studies, 27, pp 221-234.
Quinlan, J.R. (1993) Comparing connectionist and symbolic learning methods, in "Computational Learning Theory and Natural Learning Systems: Constraints and Prospects", S. Hanson, G. Drastal and R. Rivest (eds), Cambridge, Massachusetts: MIT Press.
Quinlan, J.R. and Rivest, R.L. (1989) Inferring decision trees using the minimum description length principle, Information and Computation, 80, pp 227-248.
Quinlan, J.R., Compton, P.J., Horn, K.A. and Lazarus, L. (1986) Inductive knowledge acquisition: a case study, in "Applications of Expert Systems, Volume 1", J.R. Quinlan (ed), Wokingham: Addison-Wesley, pp 157-173.
Raveh, A.A. (1989) A nonmetric approach to linear discriminant analysis, Journal of the American Statistical Association, 84, pp 176-183.
Rawlings, J.O. (1988) Applied Regression: A Research Tool, Belmont, California: Wadsworth.
Rutter, C., Flack, V. and Lachenbruch, P.A. (1991) Bias in error rate estimates in discriminant analysis when stepwise variable selection is employed, Communications in Statistics, 20, pp 1-22.
Safavian, S.R. and Landgrebe, D. (1991) A survey of decision tree classification methodology, IEEE Transactions on Systems, Man and Cybernetics, 21, pp 660-674.
Salahuddin and Hawkes, A.G. (1991) Cross-validation in stepwise regression, Communications in Statistics - Theory and Methods, 20, pp 1163-1182.
Sankar, A. and Mammone, R.J. (1991) Neural tree networks, in "Neural Networks: Theory and Applications", R.J. Mammone and Y.Y. Zeevi (eds), Boston: Harcourt Brace Jovanovich, pp 281-302.
SAS Institute Inc., Version 6.04, Cary, North Carolina.
Schaffer, C. (1993) Selecting a classification method by cross-validation, Personal Communication.
Schwartz, S., Wiles, J., Gough, I. and Phillips, S. (1993) Connectionist, rule-based and Bayesian decision aids: an empirical comparison, in "Artificial Intelligence Frontiers in Statistics", D.J. Hand (ed), London: Chapman & Hall, pp 264-278.
Seber, G.A.F. (1984) Multivariate Observations, New York: John Wiley.
Segal, M.R. (1988) Regression trees for censored data, Biometrics, 44, pp 35-47.
Shlien, S. (1990) Multiple binary decision tree classifiers, Pattern Recognition, 23, pp 757-763.
Snapinn, S.M. and Knoke, J.O.
(1985) An evaluation of smoothed classification error-rate estimators, Technometrics, 27, pp 199-206.
Snapinn, S.M. and Knoke, J.O. (1989) Estimation of error rates in discriminant analysis with selection of variables, Biometrics, 45, pp 289-299.
Sonquist, J.A. (1970) Multivariate Model Building: The Validation of a Search Strategy, Ann Arbor: Institute for Social Research, University of Michigan.
Taylor, P.C. and Silverman, B.W. (1993) Block diagrams and splitting criteria for classification trees, Statistics and Computing, 3, pp 147-161.
Todeschini, R. and Marengo, E. (1992) Linear discriminant classification tree: a user-driven multi-criteria classification method, Chemometrics and Intelligent Laboratory Systems, 16, pp 25-35.
Toussaint, G.T. (1974) Bibliography on estimation of misclassification, IEEE Transactions on Information Theory, 20, pp 472-479.
Wahl, P.W. and Kronmal, R.A. (1977) Discriminant functions when covariances are unequal and sample sizes are moderate, Biometrics, 33, pp 479-484.
Wernecke, K-D. and Kalb, G. (1983) Further results in estimating the classification error in discriminance analysis, Biometrical Journal, 25, pp 247-258.
Wernecke, K-D., Kalb, G. and Sturzebecher, E. (1980) Comparison of various procedures for estimation of the classification error in discriminance analysis, Biometrical Journal, 22, pp 639-649.
Wolberg, W.H., Tanner, M.A., Loh, W.Y. and Vanichsetakul, N. (1987) Statistical approach to fine needle aspiration diagnosis of breast masses, Acta Cytologica, 31, pp 731-741.

ADDENDA

p5, l2: change "is defined to be" to "assigns a random observation x to population πi if".
p5, l-5: change "Let us" to "In the case of two p-dimensional multivariate normal populations (the general multivariate ellipsoidal case is not considered here), let us".
p5, l-1: in formula (2.2.1) change "Σ-1" to "Σi-1".
p6, l11: omit "much".
p7: in the formula after (2.2.9) change "R1(T) = [...]" to "R1(T) = Pr[...]" and change "ln(π1/π2)".
p7, l-1: after "cumulative" add "standardised".
p9, l11: change "Dji < 0 ∀ i > j".
p9, l16: change "D12(x)" to "D13(x)".
p42, l7: change "C1 and C2" to "SC1 and SC2".
p48, l-10: change "L = J + L(T)/2" to "E + 1/2".
p65, l5: after "error rate" add "estimate".
p66, Note: the discussion from (4.2.2) on is restricted to multivariate normal data using LDA.
p69: on a new line after (4.2.7) add "where nij is the number of observations from πj falsely classified to πi, i ≠ j".
p74, l12: after "(e)." add "The above model involves a full factorial design. It may have been better to adopt some form of fractional design, so allowing a wider range of factors to be explored. However, as is almost always the case in simulation studies, the possible range of factors that can be explored is vast, so that a line has to be drawn somewhere."
p74, l-4: change "xij is lognormal" to "log(xij) is normal (0, 1)".
p74, l-3: delete "which is lognormal (0, 1)".
p74, l-1: on a new line add "In summary, univariate normal data, xj, was generated for each dimension j, then transformed to yj = exp(xj).
Finally, the data was standardised giving marginals with mean 0 and standard deviation 1, that is, zj = (yj - ȳj)/s(yj). The ranking of observations has not changed after standardisation, though the zj are independent (uncorrelated) as the xj are independent." (A sketch of this generation scheme is given following these addenda.)
p75, l-3: after "(1990)." add "The two tree-based methods were chosen because of their ready availability and because they represent two different approaches to tree-based classification. LDA and QDA were selected because they are the two most commonly used classification methods."
p76, l-14: after "here." add "It may have been preferable to have used a separate test set instead of the cross-validation method, which does introduce possible error. It was decided to use cross-validation instead of an independent test set as the former is used more often in the real world, as large test sets are usually unavailable."
p82, l8: after "was used." add "This decrease in LDA error rate between normal and lognormal is most probably due to the effects of standardisation, which maintains the theoretical covariance differences, hence distances between populations. However, as the distributions of the two populations are skewed, the lower 75% of the distribution will be bunched together around a high peak, thus closer to the respective class mean than in the case of normally distributed data. The net effect is that fewer observations are misclassified for standardised lognormal data."
p83, l-9: after "transformation." add "In this case, for pure lognormally distributed data, the actual values of δ will be different from those given in Section 4.3.1."
p83, l-7: after "true" add "(pure)".
p83, l-6: change "log[f(.)]" to "ln[f(x)]".
p91, l-2: after "data" add ", though this alters the correlations between variables so that the covariance matrices are different from those in (4.3.2)."
p92, l3: change "covariance" to "variance".
p93, l-5: insert a transpose symbol (') between ")" and "S-1".
p94, l-1: on a new line add "though absolute differences are used in the graphs to make for easier comparison between methods."
p96, l3: after "rates" add "(to make the graph easier to read)".
p97 ff: "e" classifications as on p75.
p116: after l-7 add "with λj > 0, j = 1, 2."
p117, l-13: after "theirs." add "Binary data rather than general categorical data was also used for simplicity. Using categorical variables containing more than two categories would involve creating a large number of binary variables to use in LDA and QDA."
p117, l-11: after "n" add "(total sample size)".
p117, l-8: after "rijk = 0.25" add "∀ j ≠ k".
p119: omit "Level" from Table 5.2.
p121: omit "pij =" from Table 5.4.
p122, l-10: after "of R(T)." add "The mean square error (MSE) criterion was used to compare different error rate estimators for each method (see p108)."
p126: omit "Level" from Table 5.7.
p130, l-7: change "of the sample" to "associated with the classification tree".
p132, l-2: change "the F-ratio should not be regarded as a true measure of the statistical significance of each result." to "a statistically significant F-ratio may not be of substantive significance."
" p 1 6 5 1 I - 9 : change "vari a n ces" to "cova riance matri ces". p1 7 1 1 I - 1 2 : chan g e "vari b l e" to "vari a b l e ... p 1 7 8 : i n Ta b l e 7 . 6 ( a nd a t t h e bottom of p 1 7 8 ) 1 cha nge "R( ilj )" to " L:jR(i/j )" . p 1 8 4 1 1 1 6 : after " s i g n ifica nt" a d d "(with u n i variate tests)" . p 1 8 6 1 I 1 4 : convert "chan ge" to "changed". p 1 8 7 : N ote: H u sb_Edu a n d Wife_Edu a re o rd inal categorica l vari a b l e s with v a l u e s rang i ng from zero to seven and refer to t h e education level o f the husband a n d wife respect i v ely. N o_ C h i l d refers to the numbe r of ch i l d ren in the fa m i l y . p 2 0 9 1 1 -8 : o m i t " e n o u g h " . p2 3 1 1 / 1 1 : c h a n g e " s i g n ifica nt" to " m aj or". p 2 3 6 : on a new l i n e after / 1 2 add " U S 2 3 univariate s p l it". P 2 3 7 I 8 . change "C C " to "SC SC " I ? 1 1 2 1 I 2 ? p2 38: p239: o n a n ew l i ne after / 1 2 add "nii 69 the n u m be r of observati o n s from rri fa l s e l y classified t o IT/' . after " B re i m a n , . . . ( 1 984) . . . " add the reference "B rown, D .E . , C o rru b l e , V. a n d P i ttard , C. L . ( 1 993)" A comparison of dec i s i o n tree class ifi e rs with b a ck? propagation neural n etworks for m u l t i m o d a l c l a s s ifi cation p ro b l e m s , P atte rn Reco g n it ion, 26, pp 9 53-96 1 . " p239 ( 2 ) : Note: The page n u m bers for the B i b l i og raphy a re wrong. Th i s s e?t i o n s h o u l d start o n p24 1 a n d a l l oth e r pages shou l d be put back two pages.