Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author.

A COMPARISON OF TREE-BASED AND TRADITIONAL CLASSIFICATION METHODS

A thesis presented in partial fulfilment of the requirements for the Degree of PhD in Statistics at Massey University.

Robert D Lynn
1994

ABSTRACT

Tree-based discrimination methods provide a way of handling classification and discrimination problems by using decision trees to represent the classification rules. The principal aim of tree-based methods is the segmentation of a data set, in a recursive manner, such that the resulting subgroups are as homogeneous as possible with respect to the categorical response variable.

Problems often arise in the real world involving cases with a number of measurements (variables) taken from them. Traditionally, in such circumstances involving two or more groups or populations, researchers have used parametric discrimination methods, namely linear and quadratic discriminant analysis, as well as the well-known non-parametric kernel density estimation and Kth nearest neighbour rules. In this thesis, all the above types of methods are considered and presented from a methodological point of view. Tree-based methods are summarised in chronological order of introduction, beginning with the Automatic Interaction Detector (AID) method of Morgan and Sonquist (1963) through to the IND method of Buntine (1992).

Given a set of data, the proportion of observations incorrectly classified by a prediction rule is known as the apparent error rate. This error rate is known to underestimate the actual or true error rate associated with the discriminant rule applied to a set of data. Various methods for estimating this actual error rate are considered. Cross-validation is one such method, which involves omitting each observation in turn from the data set, calculating a classification rule based on the remaining (n - 1) observations and classifying the observation that was omitted. This is carried out n times, that is, once for each observation in the data set, and the total number of misclassified observations is used as the estimate of the error rate.

Simulated continuous explanatory data were used to compare the performance of two traditional discrimination methods, linear and quadratic discriminant analysis, with two tree-based methods, Classification and Regression Trees (CART) and Fast Algorithm for Classification Trees (FACT), using cross-validation error rates. The results showed that linear and/or quadratic discriminant analysis are preferred for normal, less complex data and parallel classification problems, while CART is best suited to lognormal, highly complex data and sequential classification problems. Simulation studies using categorical explanatory data also showed linear discriminant analysis to work best for parallel problems and CART for sequential problems, while CART was also preferred for smaller sample sizes. FACT was found to perform poorly for both continuous and categorical data. Simulation studies involving the CART method alone identified certain situations where the 0.632 error rate estimate is preferred to cross-validation, and where the one standard error rule is preferred to the zero standard error rule. Studies undertaken using real data sets showed that most of the conclusions drawn from the continuous and categorical simulation studies were valid.
Some recommendations are made, both from the literature and from personal findings, as to which characteristics of tree-based methods are best in particular situations. Final conclusions are given and some proposals for future research regarding the development of tree-based methods are also discussed. A question worth considering in any future research into this area is the use of non-parametric tests for determining the best splitting variable.

ACKNOWLEDGEMENTS

Firstly, I would like to thank my three supervisors, Associate Professor Dick Brook, Mr Greg Arnold and Dr S Ganesalingam, for their constant support and helpful advice throughout my PhD study. I would also like to thank Mum and Dad, Judith and Robin Lynn, for providing me with cheap board and lodgings over the years as well as encouraging me to persevere to the end. I am indebted to Massey University for the use of their computer facilities, and in particular to the Department of Statistics for providing me with employment over the past five years. Last, and by no means least, I owe a great deal of thanks to Paula McMillan for her efforts in typing this thesis; without her skill in reading my often illegible script this thesis may never have been completed!

ADDITIONAL PUBLICATIONS

Ganesalingam, S and Lynn, R D (1991). Posterior probability based estimator for the overall error rate associated with a linear discriminant function. Occasional Publications in Mathematics and Statistics, 23, Massey University.

Lynn, R D and Brook, R J (1991). Classification by decision trees and discriminant analysis. New Zealand Statistician, 26, pp 18-26.

Lynn, R D, Brook, R J and Arnold, G C (1993). A comparison of four classification methods: linear and quadratic discriminant analysis, CART and FACT. Mathematical and Information Sciences Report, Series B: 1, Massey University.

Table of Contents

1. INTRODUCTION
2. TRADITIONAL DATA DISCRIMINATION METHODS
   2.1 INTRODUCTION
   2.2 LINEAR DISCRIMINANT ANALYSIS
       2.2.1 Stepwise Discriminant Analysis
   2.3 QUADRATIC DISCRIMINANT ANALYSIS
   2.4 THE ROBUSTNESS OF LINEAR AND QUADRATIC DISCRIMINANT ANALYSIS
       2.4.1 Modifications to Linear Discriminant Analysis
   2.5 KERNEL DENSITY ESTIMATION
   2.6 Kth NEAREST NEIGHBOUR METHODS
   2.7 CRITIQUES OF KERNEL DENSITY ESTIMATION AND KTH NEAREST NEIGHBOUR METHODS
3. A TABULAR COMPARISON OF TEN TREE-BASED METHODS
   3.1 ORIGINS OF TREE-BASED METHODS
   3.2 INTRODUCTION
   3.3 AID
   3.4 THAID
   3.5 ID3
   3.6 CHAID
   3.7 CART
   3.8 C4.5
   3.9 FACT
   3.10 KnowledgeSeeker
   3.11 Splus Trees()
   3.12 IND
4. SIMULATION STUDIES INVOLVING CONTINUOUS DATA
   4.1 INTRODUCTION
   4.2 ERROR RATES
   4.3 SIMULATION STUDY I
       4.3.1 Study Plan
       4.3.2 Results
       4.3.3 Summary
   4.4 SIMULATION STUDY II
       4.4.1 Study Plan
       4.4.2 Results
       4.4.3 Summary and Discussion
   4.5 THE EFFECTS OF PRIORS ON ERROR RATES
       4.5.1 Introduction
       4.5.2 Purpose of this Study
       4.5.3 Study Plan
       4.5.4 Results
       4.5.5 Summary
   4.6 SIMULATION STUDY III
       4.6.1 Introduction
       4.6.2 Study Plan
       4.6.3 Results
       4.6.4 Summary
   4.7 CONCLUSIONS
5. SIMULATION STUDIES INVOLVING CATEGORICAL DATA
   5.1 INTRODUCTION
   5.2 PREVIOUS STUDIES
   5.3 SIMULATION STUDY I
       5.3.1 Study Plan
       5.3.2 Results
       5.3.3 Summary
   5.4 SIMULATION STUDY II
       5.4.1 Introduction
       5.4.2 Study Plan
       5.4.3 Results
       5.4.4 Summary
   5.5 CONCLUSIONS
6. CART SIMULATION STUDY
   6.1 INTRODUCTION
   6.2 ERROR RATE ESTIMATION FOR CONTINUOUS DATA IN CART
       6.2.1 Previous Studies
       6.2.2 Study Plan
       6.2.3 Results
       6.2.4 Summary
   6.3 ERROR RATE ESTIMATION FOR CATEGORICAL DATA IN CART
       6.3.1 Study Plan
       6.3.2 Results
       6.3.3 Summary
   6.4 THE STANDARD ERROR RULE IN CART
       6.4.1 Previous Studies
       6.4.2 Study Plan
       6.4.3 Results
       6.4.4 Summary
   6.5 TRANSFORMATIONS OF ERROR RATES
       6.5.1 Study Plan
       6.5.2 Results
       6.5.3 Summary
   6.6 CONCLUSIONS
7. CASE STUDIES
   7.1 INTRODUCTION
   7.2 PREVIOUS STUDIES
   7.3 COMPARATIVE STUDIES
       7.3.1 Methods and Data Sets
       7.3.2 Cross-Validation Error Rate Results
       7.3.3 0.632 Error Rate Results
       7.3.4 Individual Class Error Rates
       7.3.5 The Standard Error Rule in CART
       7.3.6 Splus Trees() versus CART
       7.3.7 Summary
   7.4 ILLUSTRATIVE CASE STUDY
       7.4.1 Methods and Data
       7.4.2 Linear Discriminant Analysis
       7.4.3 CART
       7.4.4 FACT
       7.4.5 KnowledgeSeeker
       7.4.6 Splus Trees()
       7.4.7 Summary
8. WHICH CHARACTERISTICS OF TREE-BASED METHODS ARE PREFERRED
   8.1 INTRODUCTION
   8.2 WHICH CHARACTERISTICS OF TREE-BASED METHODS ARE PREFERRED?
       8.2.1 The Method of Splitting
       8.2.2 Binary versus Multiway Splits
       8.2.3 Univariate versus Linear Combination Splits
       8.2.4 Costs and Priors
       8.2.5 Stopping Rules and Tree Pruning
   8.3 HUMAN COMPREHENSIBILITY AND USER-FRIENDLINESS OF ...
9. CONCLUSIONS AND PROPOSALS FOR THE FUTURE
NOTATION INDEX
BIBLIOGRAPHY

1. INTRODUCTION

Data often arise in the real world involving many objects with a number of measurements (variables) taken from them. These measurements may be quantitative (continuous or discrete) or qualitative (ordered or unordered categories). The latter may, in some cases, be defined by only two categories and are then binary variables. When more than two categories are involved, instances where the categories can be ordered in a meaningful way are known as ordinal variables, while examples where the categories have no natural ordering are defined as nominal variables. For example, plants may be measured for stem length, stem width and plant height. These measurements are all continuous. A medical study would usually contain information on a patient's age, whether he/she smokes or not and whether there is a family history of cancer or not. Age (to the nearest year) is a discrete quantitative variable while the other two variables are binary. A sample survey might ask questions relating to the respondents' educational qualifications, attitudes to race relations and current marital status.
Marital status is a nominal variable while the other two variables are ordinal.

Often, the objective of such studies is to distinguish between several groups or populations based on the measurements collected. A botanist may be interested to know which measurements can best distinguish between two related species of plants. A medical practitioner would like to know which variables are best able to predict whether a person will develop cancer or not. A sociologist could be trying to determine if there is any relationship between a person's religious beliefs and various sociological and demographic variables. In such cases involving two or more groups or populations, a large number of methods are available to the botanist/medical practitioner/sociologist to handle the above types of data. The desired intention is that the methods will produce a set of classification or prediction rules which are both accurate and informative, and which serve as a basis for future decisions. The aim of this thesis is to study and compare the performance of classification methods, both tree-based and more traditional approaches, over a variety of data types, with the main goal being to determine in which situations tree-based methods are the preferred approach.

In Chapter 2, the focus is on traditional discrimination and the four most common methods for estimating the conditional density functions of each population in the data set, thereby approximating the Bayes rule. The four methods investigated are linear and quadratic discriminant analysis, kernel density estimation and Kth nearest neighbour rules. The first two methods are based on parameter estimates while the latter two are wholly non-parametric. A summary table is provided which compares and contrasts each of the above four methods.

In Chapter 3, the focus switches to tree-based classification methods, whose classification rules are portrayed in the form of a decision tree. After surveying the foundations of the tree-based approach to classification, ten tree-based methods are presented from a methodological point of view, examining characteristics such as splitting criteria, stopping rules and interactive and graphical ability, among others, as well as critiques of each method from articles in the literature. To conclude the chapter, a summary table is presented comparing all ten tree-based methods.

In Chapter 4, after surveying the various types of error rate estimates that are used in the field of classification, a number of simulation studies are carried out involving continuous explanatory data. In Section 4.3, a comparison is made between two traditional discrimination methods, linear and quadratic discriminant analysis, and two tree-based methods, CART and FACT, in terms of overall accuracy, over every possible combination of five factors involving dimension, sample size, Mahalanobis distance between populations, distribution and priors-covariance structure. In Section 4.4, the same study plan is used except that one of the distribution types is changed in order to make comparisons with previous studies. Section 4.5 deals with the estimation of individual class error rates for each of the four methods and how these error rates are affected when the prior probabilities of class membership are altered. The final section of this chapter investigates the reliability of various error rate estimators for three of the methods in predicting the correct class of future observations of the same type.
In Chapter 5, a comparative study is undertaken comparing the four methods used in Chapter 4 for categorical explanatory variables, in particular five- and ten-dimensional binary data. After providing a literature review of previous studies comparing classification methods for such data, a simulation study is carried out using overall accuracy as the measure of classifier performance. In Section 5.4, the reliability of various error rate estimators is determined, as carried out for continuous data in Section 4.6.

Chapter 6 concentrates exclusively on the CART method. Firstly, in Section 6.2, the reliability of various error rate estimation techniques is investigated for continuous data. Data sets are of varying distances between populations, sample sizes and data structure. Four performance criteria are used to evaluate the error rate estimators. In Section 6.3, the same error rate estimators are compared for the categorical data sets used in Chapter 5. In Section 6.4, the so-called standard error rule used in CART is analysed, while Section 6.5 explores the effects of transforming the error rates.

Chapter 7 firstly reports the results from an empirical comparison of five classification methods on a number of real-world data sets. In Section 7.4, a case study is carried out using some family planning data from India in order to illustrate the approaches taken by linear discriminant analysis and four tree-based methods.

Chapter 8 firstly compares the approaches taken by tree-based methods to grow a classification tree, through both a survey of the literature and the results of the simulation and case studies undertaken in this thesis. Secondly, a subjective comparison of traditional discrimination and tree-based methods is made. A summary of the literature containing critical assessments of the interpretability of the two approaches is presented. This is followed by a personal assessment of which method(s) provide the most interpretable and humanly comprehensible models, based on the results of the simulation and empirical studies presented in this thesis, as well as personal experience.

In conclusion, a set of recommendations is made, based on the findings of this thesis, as to which methods should be used in which situations. Some proposals for the future development of tree-based programs and research are also presented, after tracing the links and developments of tree-based methods.

2. TRADITIONAL DATA DISCRIMINATION METHODS

2.1 INTRODUCTION

The optimal rule of classification in a p-dimensional, k-class problem is the Bayes rule, which assigns an observation x to class i if

\[
\pi_i f_i(x) = \max_{1 \le j \le k} \pi_j f_j(x) \tag{2.1.1}
\]

where f_i(x) is the conditional density of x, given that x belongs to class i, and π_i is the prior probability that x belongs to class i. The proportion of observations falsely classified under this optimal rule is called the Bayes misclassification error rate. This is calculated as

\[
e_B = 1 - \int \max_i \{\pi_i f_i(x)\}\, dx \tag{2.1.2}
\]

It is very unusual, however, for either the f_i(x) or the π_i to be known. The π_i can easily be estimated by class sample proportions, but the f_i(x) are another matter. This chapter focuses on the four most commonly used methods for estimating the f_i(x), thereby approximating the Bayes rule. The four methods, which attempt to correctly classify a random observation into one of k classes, are linear and quadratic discriminant analysis, kernel density estimation and Kth nearest neighbour rules. The methods are described both algebraically and in words. A table of the assumptions and properties of the four methods is presented in conclusion.
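The Bayes rule in (2.1.1) can be illustrated with a short sketch. The following Python fragment is not part of the thesis; it assumes a hypothetical univariate two-class problem in which the class-conditional densities happen to be known normals, and simply assigns x to the class maximising π_i f_i(x).

```python
import numpy as np
from scipy.stats import norm

# Hypothetical problem: class 1 ~ N(0, 1), class 2 ~ N(2, 1),
# with prior probabilities pi_1 = 0.7 and pi_2 = 0.3.
priors = np.array([0.7, 0.3])
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]

def bayes_classify(x):
    """Assign x to the class i maximising pi_i * f_i(x), as in (2.1.1)."""
    scores = [p * d.pdf(x) for p, d in zip(priors, densities)]
    return int(np.argmax(scores)) + 1   # classes labelled 1, ..., k

print(bayes_classify(0.4))   # close to the class 1 mean -> 1
print(bayes_classify(1.9))   # close to the class 2 mean -> 2
```

In practice the f_i(x) are unknown, so each of the four methods below amounts to replacing `densities` with an estimate.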
2.2 LINEAR DISCRIMINANT ANALYSIS

Suppose that an object is to be allocated to one of two p-dimensional multivariate ellipsoidal populations on the basis of an observation vector x. Let us assume that observations from the first population, Π₁, occur in a proportion π₁ and the remainder are from Π₂ in the proportion π₂ = (1 - π₁). Let f_i(x) be the multivariate density of x in Π_i, with mean μ_i and covariance matrix Σ_i, where

\[
f_i(x) = (2\pi)^{-p/2} |\Sigma_i|^{-1/2} \exp\{-\tfrac{1}{2}(x - \mu_i)' \Sigma_i^{-1}(x - \mu_i)\} \tag{2.2.1}
\]

Suppose that we assign x to Π₁ if x is in some region A₁ and to Π₂ if x is in a region A₂, where A₁ and A₂ form an exhaustive and mutually exclusive partition of the sample space, that is, Pr(A₁ ∩ A₂) = 0 and Pr(A₁ ∪ A₂) = 1. Then the total probability of misclassification, T(A, f), is the proportion of observations from Π₁ that are falsely classified as belonging to Π₂, and vice-versa. Thus

\[
T(A, f) = \pi_1 \int_{A_2} f_1(x)\,dx + \pi_2 \int_{A_1} f_2(x)\,dx
        = \pi_1\Big[1 - \int_{A_1} f_1(x)\,dx\Big] + \pi_2 \int_{A_1} f_2(x)\,dx
        = \pi_1 + \int_{A_1} [\pi_2 f_2(x) - \pi_1 f_1(x)]\,dx \tag{2.2.2}
\]

T(A, f) will be a minimum if π₂f₂(x) - π₁f₁(x) < 0 for all observations in A₁. That is, the minimum error will occur if the product of the class priors and density functions in A₁ is much larger for Π₁ than for Π₂. With the assumption that Σ₁ = Σ₂ = Σ, that is, the covariance matrices are equal, the optimal rule of allocation assigns x to Π₁ if

\[
f_1(x)/f_2(x) > \pi_2/\pi_1 \tag{2.2.3}
\]

otherwise x is assigned to Π₂, where the likelihood ratio f₁(x)/f₂(x) is given by

\[
f_1(x)/f_2(x) = \exp\big\{[x - \tfrac{1}{2}(\mu_1 + \mu_2)]' \Sigma^{-1}(\mu_1 - \mu_2)\big\} \tag{2.2.4}
\]

Taking logarithms produces the rule: assign x to Π₁ if

\[
D(x) = [x - \tfrac{1}{2}(\mu_1 + \mu_2)]' \Sigma^{-1}(\mu_1 - \mu_2) > \ln(\pi_2/\pi_1) \tag{2.2.5}
\]

otherwise to Π₂. The above quantity, D(x), is known as the true discriminant function, and it is a linear function of x. Now if x is multivariate ellipsoidal then D(x) will also be multivariate ellipsoidal, so the means and variances of D(x) can be used to calculate the estimated error rates from using D(x) as the allocation rule. Now E[D(x) | Π₁] is the mean value of D(x) given that x is from Π₁. Thus

\[
E[D(x) \mid \Pi_1] = [\mu_1 - \tfrac{1}{2}(\mu_1 + \mu_2)]' \Sigma^{-1}(\mu_1 - \mu_2)
                   = \tfrac{1}{2}(\mu_1 - \mu_2)' \Sigma^{-1}(\mu_1 - \mu_2) = \tfrac{1}{2}\Delta^2 \tag{2.2.6}
\]

where Δ² is the square of the true Mahalanobis distance between Π₁ and Π₂. Similarly,

\[
E[D(x) \mid \Pi_2] = -\tfrac{1}{2}\Delta^2 \tag{2.2.7}
\]

The common variance can be calculated thus:

\[
E[D(x) - D(\mu_i)]^2 = E\{[(x - \mu_i)' \Sigma^{-1}(\mu_1 - \mu_2)]^2\}
= (\mu_1 - \mu_2)' \Sigma^{-1} E[(x - \mu_i)(x - \mu_i)'] \Sigma^{-1}(\mu_1 - \mu_2)
= (\mu_1 - \mu_2)' \Sigma^{-1}(\mu_1 - \mu_2) = \Delta^2 \tag{2.2.8}
\]

since E[(x - μ_i)(x - μ_i)'] = Σ. Let R₁(T) be the probability of misclassifying an observation from Π₁, so that

\[
R_1(T) = \Pr[D(x) < \ln(\pi_2/\pi_1) \mid x \in \Pi_1] \tag{2.2.9}
\]

Under the assumption of normality, (2.2.9) can be expressed as

\[
R_1(T) = \Pr\left[\frac{D(x) - E(D(x))}{se[D(x)]} < \frac{\ln(\pi_2/\pi_1) - E(D(x))}{se[D(x)]}\right] \tag{2.2.10}
\]

and

\[
R_1(T) = \Phi\left(\frac{\ln(\pi_2/\pi_1) - \tfrac{1}{2}\Delta^2}{\Delta}\right) \tag{2.2.11}
\]

where Φ(·) is the cumulative normal distribution function.

If a sample of size n₁ is drawn from Π₁ and of size n₂ from Π₂, then μ_i can be replaced by the sample estimate x̄_i = Σⱼ x_ij / n_i (i = 1, 2), and Σ by the estimate of the pooled sample variance, S_p, given by

\[
S_p = [(n_1 - 1)S_1 + (n_2 - 1)S_2] / (n_1 + n_2 - 2) \tag{2.2.12}
\]

where S_i are the estimates of Σ_i (i = 1, 2). If these sample estimates are placed into equation (2.2.5), then the optimal sample allocation rule is to assign a random observation x to Π₁ if

\[
\hat D(x) = [x - \tfrac{1}{2}(\bar x_1 + \bar x_2)]' S_p^{-1}(\bar x_1 - \bar x_2) > \ln(\pi_2/\pi_1) \tag{2.2.13}
\]

where D̂(x), the linear discriminant function, is the sample estimate of D(x).
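A minimal sketch of the sample allocation rule (2.2.13), again not from the thesis: it builds the pooled covariance estimate of (2.2.12) from two hypothetical training samples and returns the linear discriminant function D̂(x).

```python
import numpy as np

def lda_discriminant(X1, X2):
    """Return the sample linear discriminant function D(x) built from two
    training samples (rows are observations, columns are variables)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled within-class covariance estimate S_p, as in (2.2.12).
    Sp = ((n1 - 1) * np.cov(X1, rowvar=False) +
          (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    Sp_inv = np.linalg.inv(Sp)
    return lambda x: (x - 0.5 * (m1 + m2)) @ Sp_inv @ (m1 - m2)

# Hypothetical data: assign x to population 1 when D(x) > ln(pi2/pi1).
rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(25, 2))
X2 = rng.normal(2.0, 1.0, size=(25, 2))
D = lda_discriminant(X1, X2)
x = np.array([0.2, 0.1])
pi1 = pi2 = 0.5
print("population 1" if D(x) > np.log(pi2 / pi1) else "population 2")
```

Because the rule needs only the two class means and one pooled covariance matrix, D̂(x) is very cheap to evaluate for new observations, which is one reason LDA is described as very fast in the summary table of Section 2.8.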
This discrimination rule assumes that the cost of misclassifying an observation from Π₁ to Π₂, C(2|1), is the same as that of misclassifying an observation from Π₂ to Π₁, C(1|2). If C(2|1) ≠ C(1|2), then (2.2.13) takes the form given so that an observation x is assigned to Π₁ if

\[
\hat D(x) > \ln\left[\frac{C(1|2)\,\pi_2}{C(2|1)\,\pi_1}\right] \tag{2.2.14}
\]

otherwise to Π₂. In the case of equal a priori probabilities of belonging to a certain class, (2.2.13) simplifies to "classify x to Π₁ if D̂(x) > 0", otherwise to Π₂.

An alternative way of tackling the problem of classification in the linear discriminant analysis context is to use the group classification functions, L̂_i(x), where

\[
\hat L_i(x) = \ln \pi_i + \bar x_i' S_p^{-1}\big(x - \tfrac{1}{2}\bar x_i\big)
            = \ln \pi_i - \tfrac{1}{2}\bar x_i' S_p^{-1}\bar x_i + \bar x_i' S_p^{-1} x \tag{2.2.15}
\]
\[
            = a + b' x \tag{2.2.16}
\]

In the case of k groups, there are k group classification functions, so the rule is to assign x to Π_m if

\[
\hat L_m(x) = \max_i \hat L_i(x), \quad i = 1, \ldots, k \tag{2.2.17}
\]

The above L̂_i(x) can be used to form what has previously been called the linear discriminant function, D̂(x), in the case of k = 2 groups, where

\[
\hat D(x) = \hat L_1(x) - \hat L_2(x) \tag{2.2.18}
\]

In the case of k ≥ 3 groups, the situation becomes more complex. A set of linear discriminant functions, D̂_ij(x), can be defined as

\[
\hat D_{ij}(x) = \hat L_i(x) - \hat L_j(x), \quad i, j = 1, \ldots, k \tag{2.2.19}
\]

Jennrich (1977) calls these types of linear discriminant functions group separation functions. In general, there are C(k, 2) = k!/[2!(k-2)!] = k(k-1)/2 such group separation functions in a sample consisting of k distinct classes. The rule is to assign x to Π_i if

\[
\hat D_{ij}(x) > 0 \;\;\forall\, i < j, \quad \text{and} \quad \hat D_{ij}(x) < 0 \;\;\forall\, i > j \tag{2.2.20}
\]

Otherwise, x is assigned to one of the other (k - 1) classes. For example, in the case of k = 4 classes, the rule is:

Assign to Π₁ if D̂₁ⱼ(x) > 0, j = 2, 3, 4.
Assign to Π₂ if D̂₁₂(x) < 0, D̂₂₃(x) > 0 and D̂₂₄(x) > 0.
Assign to Π₃ if D̂₁₃(x) < 0, D̂₂₃(x) < 0 and D̂₃₄(x) > 0.
Otherwise assign x to Π₄.

Figure 2.1 illustrates this procedure for a set of twenty-six urinary samples (see data set R from Table 7.1) involving two chemical measurements (androsterone and etiocholanolone) taken from eleven healthy heterosexual and fifteen healthy homosexual males. Linear discriminant analysis is ideally suited to this problem, with the linear discriminant function (LDF) providing perfect discrimination between the two classes, showing that for men with the same values of androsterone, homosexuals have higher values of etiocholanolone than heterosexuals. Therefore, the separation is in a linear combination of the two variables.

[Figure 2.1: Plot of Etiocholanolone (mg/24 hours) against Androsterone (mg/24 hours) from Urinary Samples for 26 Healthy Heterosexuals (A) and Homosexuals (B), with the Linear Discriminant Function (LDF) shown.]

2.2.1 Stepwise Discriminant Analysis

A special application of linear discriminant analysis is stepwise discriminant analysis, whereby only a subset of the original p variables is selected to carry out the discriminant analysis. As above, suppose that a sample of dimension p contains n₁ observations from Π₁ and n₂ observations from Π₂. Variables are chosen to either enter or leave the model according to whether the Wilks' Lambda ratio of between- to within-class variance is greater than or less than a pre-specified significance level, while also taking into account the variables that are already in the model.
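The forward step of the selection procedure just described can be sketched as follows. This is a hypothetical, simplified illustration: it greedily adds the variable that most reduces Wilks' Lambda (computed here as the ratio of within-group to total variation for the current subset) and ignores the significance tests that govern entry and removal in the full stepwise procedure.

```python
import numpy as np

def wilks_lambda(X, y, cols):
    """Wilks' Lambda |W|/|T| for the variable subset `cols`
    (smaller values indicate better group separation)."""
    Xs = X[:, cols]
    grand = Xs.mean(axis=0)
    T = (Xs - grand).T @ (Xs - grand)                 # total SSCP matrix
    W = sum((Xs[y == g] - Xs[y == g].mean(axis=0)).T
            @ (Xs[y == g] - Xs[y == g].mean(axis=0))
            for g in np.unique(y))                    # within-group SSCP
    return np.linalg.det(W) / np.linalg.det(T)

def forward_select(X, y, q):
    """Greedily add, q times, the variable that most reduces Wilks' Lambda."""
    chosen = []
    while len(chosen) < q:
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        chosen.append(min(rest, key=lambda j: wilks_lambda(X, y, chosen + [j])))
    return chosen
```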
Alternatively, the partial correlation coefficients between each predictor variable and the class variable can be used to force a variable to either enter or leave the model. In essence, the variables that contribute most to the discriminatory power of the model are selected to carry out the discriminant analysis. However, authors such as Habbema et al (1974) have pointed out that the best q variables selected by stepwise discriminant analysis may not necessarily be the "best" variables for this type of data, just the best for this particular sample. Snapinn and Knoke (1989) give illustrations of why the apparent error rate, the error rate found from resubstituting the original sample, should never be used on a data set which contains only the best q variables selected by stepwise discriminant analysis, though this has been found to hold for most discrimination methods.

2.3 QUADRATIC DISCRIMINANT ANALYSIS

In real-world situations, the requirement of equal covariance matrices is rarely satisfied, though the differences are often too small to cause any deterioration in the performance of linear discriminant analysis. In cases where the covariance matrices are quite different, though, and x is p-dimensional multivariate ellipsoidal, quadratic discriminant analysis is the appropriate method to use, where the discriminant function is

\[
Q(x) = \ln[f_1(x) / f_2(x)] \tag{2.3.1}
\]

and we assign x to Π₁ if Q(x) > ln(π₂/π₁). Expanding (2.3.1) gives

\[
Q(x) = -\tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} - \tfrac{1}{2}(x - \mu_1)' \Sigma_1^{-1}(x - \mu_1) + \tfrac{1}{2}(x - \mu_2)' \Sigma_2^{-1}(x - \mu_2) \tag{2.3.2}
\]

If we again replace μ_i and Σ_i by the sample estimates x̄_i and S_i, then the result is the sample estimate of Q(x), the quadratic discriminant function:

\[
\hat Q(x) = -\tfrac{1}{2}\ln\frac{|S_1|}{|S_2|} - \tfrac{1}{2}(x - \bar x_1)' S_1^{-1}(x - \bar x_1) + \tfrac{1}{2}(x - \bar x_2)' S_2^{-1}(x - \bar x_2) \tag{2.3.3}
\]
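As a companion to the linear sketch above, here is a hedged sketch of the sample quadratic discriminant function (2.3.3), using a separate covariance estimate for each population; the data and any parameter choices are again hypothetical.

```python
import numpy as np

def qda_discriminant(X1, X2):
    """Sample quadratic discriminant function Q(x), as in (2.3.3),
    with a separate covariance estimate for each population."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    # ln(|S1|/|S2|) computed stably via log-determinants.
    log_det_ratio = np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S2)[1]
    def Q(x):
        return (-0.5 * log_det_ratio
                - 0.5 * (x - m1) @ S1_inv @ (x - m1)
                + 0.5 * (x - m2) @ S2_inv @ (x - m2))
    return Q

# Assign x to population 1 when Q(x) > ln(pi2/pi1), as in Section 2.3.
```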
2.4 THE ROBUSTNESS OF LINEAR AND QUADRATIC DISCRIMINANT ANALYSIS

Simulation studies previously undertaken by authors such as Lachenbruch et al (1973), Marks and Dunn (1974), Aitchison et al (1977), Krzanowski (1977) and Wahl and Kronmal (1977), among others, have made many interesting discoveries about the robustness of linear and quadratic discriminant analysis, henceforth called LDA and QDA respectively. Seber (1984) has summarised many of these findings, noting in particular that:

(i) LDA and QDA should perform equally well when covariances are roughly equal and the number of variables, p, is small (p ≤ 6).

(ii) For small samples (n₁, n₂ < 25) and small covariance differences and/or p large, LDA is preferred, but when both covariance differences and p are large, neither method is recommended.

(iii) QDA is better than LDA when both covariance differences and the number of variables are large (p > 6) and when sample sizes are large. It is suggested that n₁ = n₂ = 25 and p = 4 as a minimum, with 25 additional observations per class for every extra two dimensions.

(iv) QDA should not be used in poorly posed situations, that is, where the number of variables is not much less than the class sample sizes, resulting in S_i being a poor estimate of Σ_i. The extreme case occurs where the data is ill-posed, when p > n_i, meaning the inverse of S_i does not exist.

From the above findings and many applications in case studies, both LDA and QDA should be best when each class is multivariate normal with equal covariance matrices and the ratio of class sample size to dimension is large. LDA is fairly robust to any departures from these conditions, while QDA is only robust to differences in the class covariance matrices.

Morgan and Sonquist (1963), while introducing their Automatic Interaction Detector (AID) program, came to the conclusion that the usual parametric methods of classification were often inadequate in analysing survey data, noting in particular that parametric methods were:

(i) Unable to handle interaction effects without adding many extra terms to the model, as interactions may be quite complex, affecting different parts of the data set in different ways.

(ii) Unable to handle variables that do not have linear effects, so that many extra terms need to be created (for example, quadratic, cubic, etc).

(iii) Not good at handling categorical explanatory variables, especially those with many categories. Parametric methods usually treat categorical explanatory variables as a number of binary variables and create linear functions from those variables. As with (i) and (ii) above, the number of variables in the data set could increase dramatically and the data matrix will be sparse.

(iv) Not robust to errors in the variables, such as decimal points in the wrong place. As parametric methods make use of all the data at once, any errors in the measurements will lead to false classification rules.

(v) Affected adversely by intercorrelations among the explanatory variables used in the analysis. These correlations interfere with assessing the importance of individual variables.

2.4.1 Modifications to Linear Discriminant Analysis

Friedman (1989) and Raveh (1989) have tried two modifications to LDA in an attempt to solve the problems mentioned in the last section. Friedman developed a method called regularized discriminant analysis (RDA), especially for ill-posed situations, as outlined earlier. He noted that when the sample covariance is singular, the p - n_i + 1 smallest eigenvalues are estimated to be 0. The net effect of this biasing phenomenon on discriminant analysis is, sometimes, to dramatically exaggerate the importance associated with the low-variance subspace spanned by the eigenvectors corresponding to the eigenvalues near zero. Friedman tackles the problem using regularization, whereby a reduction in the variance of the sample-based estimates is carried out so as to minimise a potentially increased bias. Two regularization parameters, 0 ≤ λ ≤ 1 and 0 ≤ γ ≤ 1, are selected in order to jointly minimise future misclassification errors. The two parameters are incorporated into a variance function that controls the degree of shrinkage of the individual class covariance matrices that contribute to the pooled estimate. Simulation studies showed that RDA was much better than LDA and QDA in cases where the covariances were spherical. In those cases where the covariance matrices were highly ellipsoidal and equal, LDA did best, but when covariance matrices were unequal, RDA did best.

Raveh developed non-metric discriminant analysis (NDA), a method that requires none of the parametric assumptions required by both LDA and QDA (for example, the assumption of multivariate normality). NDA uses a separation measure so that as many observations as possible from Π₁ are greater than or less than the observations from Π₂. Thus NDA is based solely on the ranks of the individual observations and not the actual values. Through simulation studies, Raveh has shown that NDA is error-free for non-overlapping distributions and that NDA outperforms LDA in cases where the distribution of the data is highly non-normal or where covariance matrices are quite different.
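The two-parameter shrinkage at the heart of Friedman's RDA can be sketched as follows. This is a deliberately simplified version: Friedman's actual estimator weights the class and pooled components by their degrees of freedom rather than combining the covariance matrices directly, so the fragment below should be read as illustrating the roles of λ and γ only.

```python
import numpy as np

def rda_covariance(S_class, S_pooled, lam, gamma):
    """Simplified two-parameter regularisation in the spirit of RDA:
    lam shrinks the class covariance toward the pooled covariance;
    gamma then shrinks the result toward a multiple of the identity."""
    p = S_class.shape[0]
    S_lam = (1 - lam) * S_class + lam * S_pooled
    return (1 - gamma) * S_lam + (gamma / p) * np.trace(S_lam) * np.eye(p)

# In this simplified form, lam = 0, gamma = 0 recovers QDA's class
# covariance, while lam = 1, gamma = 0 recovers LDA's pooled covariance.
```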
2.5 KERNEL DENSITY ESTIMATION

Often it occurs that a parametric form cannot be assumed for the f_i(x), so that in order to apply the likelihood-ratio test the f_i(x) have to be estimated using an unstructured approach. An example where this approach is necessary is a sample exhibiting gross non-normality and unequal covariance matrices. Such an approach is called non-parametric estimation. Kernel density estimation is one form of non-parametric estimation. Hand (1981), Seber (1984) and Fukunaga (1990) all give excellent summaries of how kernel density estimation works.

The basic idea behind kernel density estimation is to use the sample data (x_ij, i = 1, ..., k and j = 1, ..., n_i) to estimate each of the f_i(x). Hand (1981) first considers the case p = 1. Suppose that ν(x|Π_m) is the number of sample points belonging to class m, 1 ≤ m ≤ k, with values less than or equal to x, and F̂(x|Π_m) is the estimate of the cumulative distribution, given by

\[
\hat F(x \mid \Pi_m) = \frac{\text{number of class } m \text{ observations} \le x}{\text{total number of class } m \text{ observations}} \tag{2.5.1}
\]

This function cannot be differentiated because the probabilities are not continuous, but an approximation to the derivative of F̂(x|Π_m) can be made:

\[
\hat f(x \mid \Pi_m) = \frac{\hat F(x + h \mid \Pi_m) - \hat F(x - h \mid \Pi_m)}{2h}
                     = \frac{\nu(x + h \mid \Pi_m) - \nu(x - h \mid \Pi_m)}{2h\, n_m} \tag{2.5.2}
\]

This can then be rewritten as

\[
\hat f(x \mid \Pi_m) = \frac{1}{n_m h} \sum_{i=1}^{n_m} k_0\!\left(\frac{x - x_i}{h}\right) \tag{2.5.3}
\]

where x_i, i = 1, 2, ..., n_m are the class m sample observations and

\[
k_0(z) = \begin{cases} \tfrac{1}{2}, & |z| \le 1 \\ 0, & |z| > 1 \end{cases} \tag{2.5.4}
\]

where z = (x - x_i)/h.

The above implies that every point in the interval (x - h, x + h) contributes 1/(2n_m h) to the estimation of the density function at x, while any points that lie outside that interval contribute nothing. It seems wrong that a point near the boundary of (x - h, x + h) carries the same weight as a point very close to x, while a point just outside the interval contributes nothing. To overcome this problem, a smoothing weighting function is used. For instance, let k₀(z) be from the normal distribution with zero mean, so that observations closest to x have the greatest weighting but all observations in the sample contribute to some degree to the calculation of the density function. Another alternative would be to use the uniform distribution as the weighting function, so that every observation is equally weighted. "Any other unimodal density could be used as a kernel." (Seber, 1984, p 322.)

Classification is determined by the likelihood ratio statistic, ln(f̂₁(x)/f̂₂(x)), and whether this value is greater than or less than a threshold value. In the case of two populations and p ≥ 2, the kernel density discrimination function, K(x), is given by

\[
K(x) = \ln\left[\frac{(1/n_1)\sum_{j=1}^{n_1} k_1(x - x_{1j})}{(1/n_2)\sum_{j=1}^{n_2} k_2(x - x_{2j})}\right] \tag{2.5.5}
\]

An observation is assigned to Π₁ if

\[
K(x) > \ln(\pi_2/\pi_1) \tag{2.5.6}
\]

otherwise to Π₂. That is, the density estimates of f_i(x) are based on the number of points from Π_i within the region (x - h, x + h), where h defines a p-dimensional area.
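A sketch of kernel-based classification in the two-population case, using a normal kernel as the smoothing weighting function discussed above. The bandwidth h and the data are hypothetical; the thesis does not prescribe a particular choice of h.

```python
import numpy as np

def kernel_density(X, x, h):
    """Kernel estimate of f(x) from sample X (rows are observations),
    using a p-dimensional normal kernel with common bandwidth h."""
    p = X.shape[1]
    z = (x - X) / h                                   # (n, p) scaled distances
    k = np.exp(-0.5 * np.sum(z**2, axis=1)) / ((2 * np.pi)**(p / 2) * h**p)
    return k.mean()

def kernel_classify(X1, X2, x, h, pi1=0.5, pi2=0.5):
    """Assign x to population 1 when K(x) > ln(pi2/pi1), as in (2.5.6)."""
    K = np.log(kernel_density(X1, x, h) / kernel_density(X2, x, h))
    return 1 if K > np.log(pi2 / pi1) else 2
```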
2.6 Kth NEAREST NEIGHBOUR METHODS

The Kth nearest neighbour method (K-NN) is another tool that is used whenever the class density functions, f_i(x), are unknown. In fact, this was the first non-parametric method for classification, introduced by Fix and Hodges (1951). The idea behind the method is relatively simple. Cover and Hart (1967) define a random observation x_m, x_m ∈ {x₁, ..., x_n}, as the nearest neighbour to x if

\[
\min_j d(x_j, x) = d(x_m, x), \quad j = 1, 2, \ldots, n \tag{2.6.1}
\]

where d(x_j, x) is a distance function. The nearest neighbour rule decides that x belongs to the class Π_i of its neighbour x_m. The above is the single nearest neighbour rule, that is, K = 1, and it applies only to the single nearest neighbour to x; all other observations are ignored. The idea extends naturally to the K nearest neighbours of x.

Lachenbruch (1975) describes the general K-NN rule as follows. Suppose there are n₁ and n₂ sample observations from Π₁ and Π₂ respectively, and that the objective is to classify an observation x to one of Π₁ or Π₂. Using a distance function d(x_ij, x), order the values x_ij. Let K_i be the number of observations from Π_i among the K closest observations to x. The rule is to assign x to Π₁ if

\[
K_1/n_1 > K_2/n_2 \tag{2.6.2}
\]

otherwise to Π₂. In other words, the procedure involves the relatively simple concept of assigning a random observation x to the class having the greater proportion of observations closest to x. As n_i → ∞, it has been found that (2.6.2) tends to the maximum likelihood rule.
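The K-NN rule (2.6.2) is short enough to sketch directly. Euclidean distance is used here as the distance function d; this is one possible metric, and the arbitrariness of that choice is one of the criticisms raised in the next section.

```python
import numpy as np

def knn_classify(X1, X2, x, K):
    """Kth nearest neighbour rule: assign x to population 1 when
    K1/n1 > K2/n2 among the K observations closest to x (2.6.2)."""
    X = np.vstack([X1, X2])
    labels = np.array([1] * len(X1) + [2] * len(X2))
    d = np.linalg.norm(X - x, axis=1)          # Euclidean distances to x
    nearest = labels[np.argsort(d)[:K]]        # labels of the K closest points
    K1, K2 = np.sum(nearest == 1), np.sum(nearest == 2)
    return 1 if K1 / len(X1) > K2 / len(X2) else 2
```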
2.7 CRITIQUES OF KERNEL DENSITY ESTIMATION AND KTH NEAREST NEIGHBOUR METHODS

Simulation studies by various authors, including Habbema et al (1974), have shown that the kernel density method for classification was just as efficient as LDA in the case of normally distributed data, but when non-normality occurred, kernel density estimation was superior. Feng et al (1993) have shown that the K-NN method was often slow in terms of running time, as was kernel density estimation. For most of the case studies tested in that paper, the K-NN method produced a very low apparent error rate, but quite often the test sample error rate, found from applying the classification rules to another set of data that was not used to construct the classifier, was comparatively high. This fact calls into question the reliability of the classification rules proposed by the K-NN method.

Breiman et al (1984), p 17, have criticised both the above non-parametric classification methods on the following grounds:

(i) They are sensitive to the choice of a metric ||x|| and there is usually no intrinsically preferred definition.

(ii) There is no natural or simple way to handle categorical variables and missing data.

(iii) They are computationally expensive as classifiers. The learning sample must be stored, and the inter-point distances and classification rule recomputed for each new observation.

(iv) Most serious, they give very little usable information regarding the structure of the data. That is, neither of the two methods provides a simple and intuitive set of classification rules.

2.8 SUMMARY TABLE OF THE ASSUMPTIONS AND PROPERTIES OF TRADITIONAL DATA DISCRIMINATION METHODS

Optimality (under what conditions is the method optimal?):
- LDA: multivariate ellipsoidality and equal covariance matrices within each group.
- QDA: multivariate ellipsoidality within each group; the covariance matrices need not be equal.
- Kernel density: no assumptions are made about the distribution of the variables.
- K-NN: no assumptions are made about the distribution of the variables.

Types of variables (for what type of explanatory variables is the method suited?):
- LDA, QDA, kernel density and K-NN: quantitative.

Computations (what statistics/values are used in the calculation of the discrimination rules?):
- LDA: class sample means and the pooled covariance matrix of the class sample covariances.
- QDA: class sample means and the individual class sample covariances.
- Kernel density: the individual data values and a weighting function k₀(z).
- K-NN: the K observations that are closest to x.

Discrimination rule (how is an observation x classified to one of the two populations?):
- LDA: assign x to Π₁ if D(x) > ln(π₂/π₁), where D(x) = ln[f₁(x)/f₂(x)]; otherwise assign x to Π₂.
- QDA: assign x to Π₁ if Q(x) > ln(π₂/π₁), where Q(x) = ln[f₁(x)/f₂(x)]; otherwise assign x to Π₂.
- Kernel density: assign x to Π₁ if K(x) > ln(π₂/π₁), where K(x) is the log ratio of the kernel density estimates (2.5.5); otherwise assign x to Π₂.
- K-NN: assign x to Π₁ if K₁/n₁ > K₂/n₂, where K_i is the number of class i observations among the K nearest neighbours to x; otherwise assign x to Π₂.

Critiques of the method (from the literature, what are four key advantages or disadvantages of the method?):
- LDA: very fast; robust to mild non-normality in the variables; unable to properly handle interaction effects; not good at handling categorical predictor variables.
- QDA: robust to departures from equal covariance matrices; not suited when the ratio of dimension to sample size is small; very sensitive to departures from normality in the variables; not good at handling categorical predictor variables.
- Kernel density: not affected by either non-normality or unequal covariance matrices; produces reliable classification rules; gives very little usable information about the data; not good at handling categorical predictor variables.
- K-NN: not affected by either non-normality or unequal covariance matrices; produces unreliable classification rules; gives very little usable information about the data; not good at handling categorical predictor variables.

3. A TABULAR COMPARISON OF TEN TREE-BASED METHODS

3.1 ORIGINS OF TREE-BASED METHODS

Tree-based methods of classification are children of the computer age. The idea of decision trees could only have been dreamed of before the introduction of the computer, as the amount of number crunching required to construct a data-intensive method such as a decision tree classifier would have been far too much for the simple adding machine. The ideas behind decision tree methods were originally developed by Belson (1959). The approach he proposed was the binary segmentation of a data set, in a recursive manner, so that each of the subgroups formed would be as homogeneous as possible with respect to the response variable. At each stage of the analysis, the predictor variable providing the "best" dichotomous partition would be chosen to partition the subgroup into two further subgroups. Belson's proposals form the foundations from which all tree-based methods have been built, being the result of a dissatisfaction with standard statistical techniques. In conclusion, Belson states, "[t]he method as I have described it is, it is true, a movement towards a more empirical way of doing things; but it is just as much a movement away from a sophistication which is too often either baffling or misleading" (Belson, 1959, p 75).

3.2 INTRODUCTION

In this chapter, ten tree-based methods are summarised in chronological order of introduction. The methods are:

(i) AID: Automatic Interaction Detector
(ii) THAID: THeta AID
(iii) ID3
(iv) CHAID: CHi-squared AID
(v) CART: Classification And Regression Trees
(vi) C4.5
(vii) FACT: Fast Algorithm for Classification Trees
(ii) THAID THeta AID (iii) ID3 (iv) CHAID CHi-squared AID (v) CART Classification And Regression Trees (vi) C4.5 (vii) FACT Fast Algo1ithm for Classification Trees 21 22 (viii) KnowledgeSeeker (ix) Splus Trees() (x) IND The ten methods are to be tabulated on the following bases: Author(s) Introduction Classification/Regression Tree Growth Tree Pruning Validation Procedures Interactive Ability Graphical Ability Who developed the method? The year the method was introduced and a short summary of how the author(s) describe(s) the method. What type of response variable is handled? Is the tree grown on all the data or only on a subset of the data? (a) (b) (c) (d) (e) Splitting Method. What rules are used by each method to partition the data? Type of Splits. How does the method partition the data? B inary/Multiway splits on a single variable (US) or a linear combination of variables? Costs/Priors. Are these incorporated into the splitting algorithm? Stopping Rules. What types of stopping rules are employed? Node Classification/Prediction. How are the nodes classified/predicted? What pruning procedure, if any, exists in the method? Is there validation of the decision rules constructed by means of a test sample or cross-validation? Do facili ties exist in the program for the user to easily interact wi th the tree-growing procedures, so as to automatically change the splitting variable, stop splitting, etc? Can the program display the decision tree graphically? One-Stage Optimality Missing Values Criticisms Examples in the literature Does the method only look for the optimal split of the current node? If not, do facilities exist to examine the effects of splits at the next one or two stages of the tree? growing process? How are missing values handled? What is written about the program in the literature? What problems have been identitied? A list of important papers using the method. Finally, a short, summary table comparing all ten tree-based methods is given, over all the attributes described above_ Safavian and Landgrebe ( 1 99 1 ) have conducted a survey of a large number of tree-based methods. Their paper includes a summary table comparing each of the methods in terms of the assumptions each approach makes, their performance criterion and some of the specific requirements for each method. The approach adopted in this chapter is intended to be more than a mere enumeration of material or collated bibliography of tree-based methods. Therefore, only ten such methods have been selected and presented in detail with some attempt at critical comparison. A major difference between the tree-based methods studied in this chapter is the way in which the aims of Belson are carried out, that is, the method of splitting . A decision tree procedure either uses binary splitting where the data is segmented into two groups, or uses multi way splitting , where the data can be split_ ipto more than two groups. These splits can either be carried out on a single variable, called univariate splits (US), or on a l inear combination of variables. Figure 3 . 1 illustrates the method of binary splits for Fisher's Iris data (data set H from Table 7 . 1 ) involving three species of iris (L virginica, L setosa and L versicolor) each with 50 cases and measurements taken on four variables (sepal length, sepal width, petal length and petal width). Although the problem involves four variables, only two variables are used to form the tree using the CART algorithm (see Section 3.7) . 
The first split is on petal length and asks the question as to whether petal length < 1.95; if so, observations are sent to the right. It would be possible for the next split also to be on petal length, but here the next split is whether or not petal width < 1.75, for those cases where petal length > 1.95. The tree produced by CART is shown in Figure 3.2. As only two splits were made, there are three terminal nodes, where a node is defined as a subset of the data and a terminal node is a terminal subset of the data which is assigned to one of k classes. The terminal node at top left consisted of 50 class 1 (I. virginica) flowers and was classified as class 1. The terminal node at bottom left was classified as class 2 (I. setosa) and consisted of 49 class 2 and 5 class 3 (I. versicolor) flowers. The terminal node at bottom right was classified as class 3, consisting of 1 and 45 flowers in classes 2 and 3 respectively. Notice that 6 flowers overall were misclassified by the classification tree.

Figure 3.3 illustrates the method of multiway splitting for the same data. The FACT algorithm (see Section 3.9) was used to partition the data. In this case, only one split is carried out, that being on petal width, but it is a three-way split, dividing the data into three subgroups. The first subgroup corresponds to the case where petal width < 0.787 (split 1a), while the second subgroup corresponds to flowers where 0.787 < petal width < 1.677 (split 1b), with the third subgroup corresponding to all those cases where petal width > 1.677. The FACT tree is shown in Figure 3.4. Basically, the FACT tree has split the data into three homogeneous terminal nodes using only one multiway split, compared with the two splits needed in the binary splits example. As with Figure 3.2, 6 flowers have been misclassified by this classification tree.

The above examples were both carried out using only one variable at a time. Figure 3.5 provides an illustration of a linear combination split, when used with CART. The first split is the same as that of Figure 3.1. The second split, however, involves both petal length and petal width and asks, for those cases where petal length > 1.95, whether 0.209 * petal length + 0.977 * petal width < 2.51. The CART tree for this example is given in Figure 3.6. The major difference from Figure 3.2 is that no class 3 cases were misclassified, and two fewer cases overall were misclassified.

[Figure 3.1: Binary splits example — plot of petal length against petal width for Fisher's iris data using CART, with the two splits marked. A = I. virginica, B = I. setosa, C = I. versicolor.]
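A tree of the same shape can be reproduced with scikit-learn's DecisionTreeClassifier, a present-day CART-style implementation used here purely as a stand-in (it is not the software used in the thesis, and its thresholds may differ slightly from those reported above). A minimal sketch:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    # A depth-2 binary tree, comparable to the two univariate splits described above
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)
    print(export_text(tree, feature_names=iris.feature_names))

Running this prints the two splits (on petal measurements) and the class counts in the three terminal nodes, mirroring the structure of Figure 3.2.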
[Figure 3.2: CART tree for Fisher's iris data. The root node is split by "petal length < 1.95?" and the remaining cases by "petal width < 1.75?", giving three terminal nodes:
    class 1: n = 50; (n1, n2, n3) = (50, 0, 0); r(t) = 0
    class 2: n = 54; (n1, n2, n3) = (0, 49, 5); r(t) = 0.09
    class 3: n = 46; (n1, n2, n3) = (0, 1, 45); r(t) = 0.02
n = sample size at each node; ni = sample size for class i at each node; r(t) = purity measure at each node, that is, the proportion of observations not from the class with the largest number of observations at each node. Circles represent decision nodes, which have to be split on, while rectangles represent terminal nodes, which are assigned to the particular class given below the node.]

[Figure 3.3: Multiway splits example — plot of petal length against petal width for Fisher's iris data using FACT, with the three-way split marked (splits 1a and 1b). A = I. virginica, B = I. setosa, C = I. versicolor.]

[Figure 3.4: FACT tree for Fisher's iris data. A single three-way split on petal width gives three terminal nodes:
    petal width in (0, 0.787): class 1; n = 50; (n1, n2, n3) = (50, 0, 0); r(t) = 0
    petal width in (0.788, 1.677): class 2; n = 52; (n1, n2, n3) = (0, 48, 4); r(t) = 0.08
    petal width in (1.678, ∞): class 3; n = 48; (n1, n2, n3) = (0, 2, 46); r(t) = 0.04]

[Figure 3.5: Linear combination split example — plot of petal length against petal width for Fisher's iris data using CART. A = I. virginica, B = I. setosa, C = I. versicolor.]

[Figure 3.6: CART tree for Fisher's iris data — linear combination split. The root node is split by "petal length < 1.95", the remaining cases by "0.209 * (petal length) + 0.977 * (petal width) < 2.51", giving three terminal nodes:
    class 1: n = 50; (n1, n2, n3) = (50, 0, 0); r(t) = 0
    class 2: n = 46; (n1, n2, n3) = (0, 46, 0); r(t) = 0
    class 3: n = 54; (n1, n2, n3) = (0, 4, 50); r(t) = 0.07]

3.3 AID

Author(s): J N Morgan and J A Sonquist (USA).

Introduction: The Automatic Interaction Detector was published in 1963. The essence of the algorithm is the sequential application of a one-way analysis of variance model (Morgan and Sonquist, 1963). The purpose of the program was to handle interactions and inter-correlations among the data in a more explicit way.

Classification/Regression: Designed to perform regression using a continuous dependent variable, although dichotomous dependent variables can be handled by transforming one of the two categories into a proportion.

Tree Growth: The tree is grown on all the data set.

- Splitting Method: Sonquist (1964) summarises the four steps in the tree-growing procedure as follows:
(i) Choose, for splitting, the node, t, with the largest total sum of squares,
    TSSt = Σ yi² − (Σ yi)²/n.
(ii) Split each variable, xj, into two subgroups such that this division leads to the biggest decrease in the unexplained sum of squares, i.e. maximise
    BSSj = n1·ȳ1² + n2·ȳ2² − n·ȳ².
(iii) Partition variable xm over node t, where BSSm = maxj BSSj.
(iv) Return to step (i).
(A small numerical sketch of this split search is given at the end of this section.)

- Type of Splits: Binary splits are the only method used and are carried out on only one variable at a time.

- Priors/Costs: No.

- Stopping Rules: Direct stopping rules are used. A number of different criteria exist for stopping tree growth:
(i) TSSt < R * TSS1, where R is a parameter, 0 ≤ R ≤ 1, and TSS1 is the total sum of squares for the whole sample.
(ii) BSSj < Q * TSS1 over all xj, 0 ≤ Q ≤ 1.
(iii) Number of unsplit nodes > P.
(iv) Sample size of each unsplit node < L.

- Node Classification/Prediction: All observations within a node are assigned the average value for the response variable in that node.

Tree Pruning: No.

Validation Procedures: No.

Interactive Ability: No.

Graphical Ability: No.

One-Stage Optimality? Fielding (1977) describes the r-step lookahead option used by AID III, the then current version of AID. The earlier version of AID grew the tree on a sequential basis, so that the prediction error was only minimised at each step of the analysis, that is, "stage by stage optimization". The r-step lookahead option is an attempt to improve on the stage by stage procedure. For m predictors there will be m tentative splits under consideration. The best splits for each predictor on these subgroups are then obtained. One now has m² possible two-stage trees under consideration. This process could be continued for r stages, with m^r possible trees. Clearly this lookahead option could involve a tremendous amount of computation and information storage were it not restricted. The current version of AID III limits the lookahead steps to three, including the first split (ibid, p 249). Morgan (1993), in a personal correspondence, noted that repeated use of the lookahead feature failed to find any useful applications or examples, and it was dropped in later versions of the program. One might think it would find offsetting effects, as when young men and old women are more likely to go to the hospital, but according to Morgan the sequential strategy seems to uncover these too.

Missing Values: Missing values are replaced by class means estimated from non-missing values in the learning sample.
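The following is a toy Python/NumPy reconstruction of steps (i)–(iii), not AID itself: it searches a single numeric predictor for the binary cut point that maximises BSSj. Variable names are invented, and ties in x are ignored for simplicity.

    import numpy as np

    def best_aid_split(x, y):
        """Return the cut point on x maximising the between sum of squares of y."""
        order = np.argsort(x)
        x, y = x[order], y[order]
        n, ybar = len(y), y.mean()
        best_bss, best_cut = -np.inf, None
        for i in range(1, n):                 # candidate split after the i-th ordered case
            n1, n2 = i, n - i
            y1, y2 = y[:i].mean(), y[i:].mean()
            # BSS = n1*(ybar1 - ybar)^2 + n2*(ybar2 - ybar)^2
            bss = n1 * (y1 - ybar) ** 2 + n2 * (y2 - ybar) ** 2
            if bss > best_bss:
                best_bss, best_cut = bss, (x[i - 1] + x[i]) / 2
        return best_cut, best_bss

Applying best_aid_split to each predictor and keeping the variable with the largest BSS mimics the choice of xm in step (iii).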
Criticisms: The AID algorithm has been criticised by a number of authors, including Einhorn (1972), Doyle (1973), Kass (1975), Doyle and Fenwick (1975), Kass (1980) and de Ville (1990). The principal reasons for this criticism are:
(i) It requires very large sample sizes, usually > 1000 observations.
(ii) It does not take the intercorrelations among the predictors into account.
(iii) It is not robust to deviations from normality in the variables.
(iv) The tree size is affected too much by noise in the data.
(v) Only binary splits are carried out.
(vi) Most importantly, there is no validation of the prediction rules constructed, either by testing for significance or by using an independent test sample.

Morgan (1993) has responded to these criticisms. He believes that (ii) is wrong, except that once a split is made on one predictor, it may leave groups where a second predictor has lost whatever power it had, but that information is useful to know. For example, in searching for what makes people happy, the program splits first on the quality of the network of friends, then on health, and only then on income! Of criticism (iii), Morgan affirms that this is true of any least squares procedure, though AID alerts the user to isolated cases by splitting these off into a separate subgroup. Problem (v) is irrelevant according to Morgan, since multiple splits on the same predictor are possible, and it is wasteful to start with k subgroups when k−1 will do. The loss of information from grouping data is small, and a very few subgroups contain almost all the information. The last criticism, claims Morgan, is not a function of the program but of the user, who can always grow the tree on three quarters of the sample and see how well the final groups account for the variance in the other quarter. It must be noted, however, that it is wasteful not to use all the data in the tree-growing phase, and test sample estimates of error are highly variable for small samples (see Section 4.2 and Breiman et al, 1984).

Examples in the Literature:
Assael, H (1970). Segmenting markets by group purchasing behaviour: An application of the AID technique, Journal of Marketing Research, 7, pp 153-158.
Heald, J I (1972). The application of the automatic interaction detector programme and multiple regression techniques to assessment of store performance and site selection, Operational Research Quarterly, 23, pp 445-457.
Muxworthy, D T (1972). Review of AID III, British Sociological Association Maths, Statistics and Computing Applications Group Newsletter, 9.

3.4 THAID

Author(s): J N Morgan and R C Messenger (USA).

Introduction: Developed in 1973, THeta AID was designed as an extension of the AID algorithm (Morgan and Sonquist, 1963) to handle categorical dependent variables. It was "... viewed as a simplified version of the present AID" (Messenger and Mandell, 1972, p 18).

Classification/Regression: Designed for classification, specifically for use on nominally scaled dependent variables.

Tree Growth: The tree is grown on all the data set.

- Splitting Method: Two methods for splitting are used (a sketch of both criteria is given below).
(i) The Theta criterion, or what Messenger and Mandell (ibid, p 12) call the "optimal prediction-to-the-mode strategy". The objective is to find the split at the unsplit node t which maximises
    θy/x = Σi (ni − mi) / nt ,
where
    nt = total number of observations in node t,
    ni = total number of observations in the ith split group,
    mi = total number of misclassified observations in the ith split group.
(ii) Otherwise, the Delta criterion can be used. Messenger and Mandell (ibid, p 15) define this as "... based on the simple notion that one should find split groups whose probability distributions differ maximally from the original group and hence from each other". The basic idea is to find the split on the variable which maximises
    δy/x = n1 Σ(j=1 to k) |pj − p1j| + n2 Σ(j=1 to k) |pj − p2j| ,
where
    pj = proportion of observations from class j in node t, j = 1, ..., k,
    p1j = proportion of observations from class j in split group 1 (and similarly p2j for split group 2).
Note that the authors of THAID recommend the Delta criterion for splitting if the ratio of the sample size of the largest group to the second largest group is greater than 2:1 (Messenger and Mandell, 1973).
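The sketch below (Python/NumPy; hypothetical data, not THAID's own code) evaluates both criteria for one candidate binary split, following the definitions above; note that ni − mi is just the count of the modal class in split group i.

    import numpy as np

    def theta_delta(y_left, y_right, classes):
        """Theta and delta criteria for a binary split of a node."""
        y = np.concatenate([y_left, y_right])
        n, n1, n2 = len(y), len(y_left), len(y_right)
        p = np.array([np.mean(y == c) for c in classes])        # node proportions
        p1 = np.array([np.mean(y_left == c) for c in classes])  # split group 1
        p2 = np.array([np.mean(y_right == c) for c in classes]) # split group 2
        # theta: proportion correctly classified when each group predicts its mode
        theta = (n1 * p1.max() + n2 * p2.max()) / n
        # delta: total absolute departure of the group distributions from the node
        delta = n1 * np.abs(p1 - p).sum() + n2 * np.abs(p2 - p).sum()
        return theta, delta

Scanning all candidate splits and keeping the one with the largest theta (or delta) value reproduces the search described above.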
- Type of Splits: Only binary splits are carried out, using only one variable at a time.

- Priors/Costs: No.

- Stopping Rules: Direct stopping rules are used. Stop if:
(i) nt/2 < nmin, where nmin is a preset parameter, and
(ii) either θy/x < θmin or δy/x < δmin.

- Node Classification/Prediction: Assign a terminal node to the class with the largest number of observations in that node.

Tree Pruning: No.

Validation Procedures: No.

Interactive Ability: No.

Graphical Ability: No.

One-Stage Optimality? Yes.

Missing Values: Missing values are replaced by class means estimated from non-missing values in the learning sample.

Criticisms: Most of the criticisms levelled at AID are also valid for THAID. Basically, the method does not know when to stop. Kass (1980), p 120, also states that "[k]nowledge of the theoretical behaviour of the Theta criterion is lacking still". Morgan (1993) has written in, noting that there is a maximum-likelihood χ² splitting option available in the new SEARCH program which has replaced THAID. The procedure is designed to maximise stability and remove the chance of erratic results.

Examples in the Literature:
Morgan, J N (1990). A conditional analysis of movers' housing responses, Journal of Economic Behaviour and Organisation.

3.5 ID3

Author(s): J R Quinlan (Australia).

Introduction: Introduced in 1979, this procedure is in the family of recursive partitioning, tree-based algorithms, although it comes from the machine learning rather than the statistical literature. Quinlan (1983) describes the method as "... recover[ing] valuable information from large masses of low grade data by a process of inductive inference".

Classification/Regression: Handles classification problems only.

Tree Growth: A subset of the original learning sample, called a 'window', is chosen at random and a decision tree formed that correctly classifies all observations in the window. All objects in the learning sample, but not in the window, are then classified using this tree. If the tree gives the correct classification for all objects then this tree is declared optimal; otherwise some more observations are added to the window, with the tree-growing and evaluation process being repeated. The process continues until all cases in the learning sample are correctly classified.

- Splitting Method: Splitting is achieved by means of an information measure (a numerical sketch is given below). An observation is determined to belong to class 1 with probability p1 = m/(n+m) and to class 2 with probability p2 = n/(m+n), where m and n are the number of observations from class 1 and class 2 respectively. The expected information needed to classify an object using a tree is
    I(m, n) = −p1 log2 p1 − p2 log2 p2.
Let a variable, xj, considered for partitioning, contain v distinct categories {A1, ..., Av}. The node t that is to be considered for splitting will be split into v descendant nodes, t1, ..., tv, each described by one particular category of xj. The information required for the subtree with root ti is I(mi, ni), where mi and ni are the number of class 1 and class 2 observations in the ith node. The expected information required for trees partitioned on xj at the root node is
    E(xj) = Σ(i=1 to v) [(mi + ni)/(m + n)] * I(mi, ni),
where the weight of the ith branch is the proportion of objects in t that belong to ti. The information gained by branching on xj is
    gain(xj) = I(m, n) − E(xj).
ID3 examines all variables, xj, j = 1, ..., p, and chooses the xj that maximises gain(xj). This process is continued on the recursively found nodes, t1, ..., tv. It is known as the gain criterion.
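A minimal Python/NumPy sketch of the gain criterion follows (an illustration, not Quinlan's program); the helper generalises I(m, n) to any number of classes via the usual entropy formula.

    import numpy as np

    def entropy(counts):
        """I(m, n) generalised: expected information for a vector of class counts."""
        p = np.array(counts, dtype=float)
        p = p[p > 0] / p.sum()
        return -(p * np.log2(p)).sum()

    def gain(node_counts, child_counts):
        """gain(xj) = I(m, n) - E(xj) for a candidate multiway split."""
        n = sum(node_counts)
        e = sum(sum(c) / n * entropy(c) for c in child_counts)
        return entropy(node_counts) - e

For example, gain([9, 5], [[2, 3], [4, 0], [3, 2]]) evaluates to about 0.25 bits; ID3 would compute this quantity for every variable and branch on the one with the largest gain.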
- Type of Splits: Multiway splits are used; in fact, splitting is carried out using every possible value of a variable. If a predictor variable is continuous, some form of clustering of the values is carried out before splitting. Only univariate splits are carried out.

- Priors/Costs: No.

- Stopping Rules:
(i) Stop when all cases in the learning sample are correctly classified.
(ii) An alternative stopping rule is: use the χ² statistic to determine whether the categories of variable xj are independent of the classes of objects in S. No further testing of the variables (splitting) is done if that variable's irrelevance cannot be rejected at a very high confidence level.

- Node Classification/Prediction: Assign a terminal node to the class with the largest number of observations in that node.

Tree Pruning: No.

Validation Procedures: No.

Interactive Ability: No.

Graphical Ability: No.

One-Stage Optimality? Yes.

Missing Values: Observations can either be discarded from the data set before splitting, or, alternatively, the ratio of class sample sizes multiplied by what Quinlan calls a 'token' can be used to find a predicted value of the variable for which a particular observation is missing.

Criticisms: Many of the problems inherent in AID have also been found to be present in this algorithm. de Ville (1990) discusses the following problems.
(i) It is biased towards the selection of variables with many categories, though they may not be the best predictors.
(ii) It does not know when to stop. ID3 continues splitting on nodes with only a small number of observations. The resultant decision tree would not hold up in the real world, being principally a function of the data at hand.
(iii) Overly large trees are too complex and not easy to understand.
Quinlan et al (1986), p 164, state that "[e]mpirical investigations have found that trees generated from such sets are usually simpler and more accurate than those constructed from random samples". However, in the next paragraph, it is argued that "... decision trees produced by any top-down approach are more complex than can be justified by the data" (ibid, p 164).

Examples in the Literature:
Schwartz, S, Wiles, J, Gough, I and Phillips, S (1993). Connectionist, rule-based and Bayesian decision aids: an empirical comparison, in "Artificial Intelligence Frontiers in Statistics", D J Hand (ed), London: Chapman & Hall, pp 264-278.

3.6 CHAID

Author(s): G V Kass (South Africa).

Introduction: Developed in 1980 as an offshoot of AID for use with categorical response variables. It was designed to tackle the criticisms of AID by "... embedding the partitioning problem in a significance testing framework" (Kass, 1980, p 120). Known as CHi-squared AID.

Classification/Regression: Handles only classification problems.

Tree Growth: The tree is grown on all the data set.

- Splitting Method: According to Kass, the splitting method proceeds as follows (a toy sketch of the merging step (ii) is given at the end of this section):
(i) For each predictor variable in turn, with the dependent variable having k classes, cross-tabulate the categories of the predictor with the dependent variable. Go to step (ii).
(ii) Find the pair of categories of the 2*k subtable that are least significantly different. Merge the two categories into one compound category if the significance does not exceed a critical value. Repeat this step until no more mergers can be found.
(iii) For each compound category consisting of three or more of the original categories, find the most significant binary split into which the merger may be resolved. If the significance exceeds a critical value, implement the split and return to step (ii).
(iv) Calculate the significance level of each optimally merged predictor and isolate the most significant one. If this significance is beyond a criterion value, split the data according to the merged categories of the chosen predictor.
(v) Return to step (i) for each, as yet, unsplit node t.

- Type of Splits: Multiway splitting can be used with this method, for example three-way, four-way or larger splits. Single variable splits only are carried out.

- Costs/Priors: No.

- Stopping Rules: Stop if:
(i) ns < nmin.
(ii) The χ² statistic for the split on the optimally merged predictor < χ²(g−1),α on the g compound categories, for a preset value of α.

Tree Pruning: No.

Validation Procedures: No.

Interactive Ability: A current version of CHAID runs on SPSS for Windows. A series of menus with a mouse button allows the user to set values for the parameters used in the tree-growing process and begin the analysis. The tree-growing process can be interrupted at any point and the values of the parameters altered.

Graphical Ability: The graphical display of the tree works in unison with the tree-building process. As a node is split into a number of sub-nodes, the results are displayed immediately on screen by means of a decision tree. The level of detail about each node and each split carried out can also be altered.

One-Stage Optimality? Yes.

Missing Values: These are excluded from the tree-growing process.

Criticisms: No criticism of CHAID directly has been discovered in the literature. The weakness of the method, however, would appear to be the lack of any procedure for validating the results. As Einhorn (1972), p 368, stated, eight years before CHAID appeared, "... the results should be subjected to a more rigorous criterion than statistical significance or some other statistical criterion". He noted, "replication is the backbone of science and when techniques capitalise on chance fluctuations in the particular sample at hand, it is imperative to replicate (or cross-validate) the results on a new set of cases" (ibid, p 368). Another possible criticism of CHAID is the use of a direct stopping rule. The authors of CART, Breiman et al (1984), criticised this type of stopping rule on the following basis. If the significance level is set too high (large p-values), then there is too much splitting, so that the tree is too large and just a reflection of the sample data. If the significance level is too low, then one may cease splitting too early and declare a node terminal when there still existed splits with large decreases in impurity.

Examples in the Literature:
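A toy version of the category-merging step (ii) might look as follows; it is a loose sketch rather than Kass's algorithm, and it assumes SciPy's chi2_contingency (the stopping threshold alpha and the data layout are illustrative). The routine repeatedly merges the pair of predictor categories whose 2 x k subtable is least significantly different, stopping once every remaining pair differs significantly.

    import numpy as np
    from itertools import combinations
    from scipy.stats import chi2_contingency

    def merge_categories(rows, categories, alpha=0.05):
        """rows[i] holds the class counts (length k) for predictor category i."""
        table = {c: np.asarray(r) for c, r in zip(categories, rows)}
        while len(table) > 2:
            best_p, best_pair = -1.0, None
            for a, b in combinations(table, 2):
                # p-value for the 2 x k subtable of categories a and b
                _, p, _, _ = chi2_contingency(np.vstack([table[a], table[b]]))
                if p > best_p:
                    best_p, best_pair = p, (a, b)
            if best_p <= alpha:          # all pairs significantly different: stop
                break
            a, b = best_pair
            table[f"{a}+{b}"] = table.pop(a) + table.pop(b)
        return table

The compound categories returned would then be candidates for the resolution and splitting tests of steps (iii) and (iv).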
3.7 CART

Author(s): L Breiman, J H Friedman, R A Olshen and C J Stone (USA).

Introduction: Breiman et al began work on recursive partitioning in the 1970s. Their work was completed with the publication of the CART monograph (Breiman et al, 1984). The algorithm had the dual goals of providing a set of accurate decision rules, in the form of a tree, that were easily interpretable, while seeking to solve the problems inherent in some of the earlier methods of the above type, such as AID and THAID.

Classification/Regression: Standing for Classification and Regression Trees, CART handles both numeric and categorical response variables. For comparison with the other methods, and for simplicity, everything henceforth will be described in the classification context only.

Tree Growth: The tree is grown on the whole data set. If, however, the data set is overly large, a tree can be grown on only a subsample of the data.

- Splitting Method: At the root node of the tree, the splitting variable is chosen to maximise the class purity of the two descendant nodes (that is, as many observations as possible are from the same class), these being the two sets of points that went either left or right when the chosen variable was split, as well as aiding the future growth of the tree. Two different criteria are available in CART to achieve the above two aims, namely the Gini and twoing splitting criteria. Breiman et al (1984) have found that the final classification tree generated is fairly insensitive to the choice of a splitting rule. (A small numerical sketch of the Gini criterion is given below.)

The Gini splitting criterion works in the following way. Suppose at a node t an object can be assigned to class i with estimated probability p(i|t), while the estimated probability that the object is in class j is p(j|t); then the estimated probability of misclassification under the Gini index is
    i(t) = Σ(i ≠ j) p(i|t) p(j|t).
The Gini criterion seeks to maximise the function
    Δi(s, t) = i(t) − pL i(tL) − pR i(tR),
where pL and pR are the proportions of observations at node t sent left and right respectively by the split.

The twoing criterion seeks to amalgamate the set of k classes into two superclasses, C1 and C2. The measure of goodness-of-split, Δi(s, t), is computed as if it were a two-class problem. "The idea is then, at every node, to select the conglomeration of classes into two superclasses so that considered as a two-class problem, the greatest decrease in node impurity is realised" (Breiman et al, 1984, p 105). The twoing splitting rule thus maximises
    Φ(s|t) = pL Σj |p(j|tL) − p(j|t)| + pR Σj |p(j|tR) − p(j|t)|.
In general, Breiman et al state that Gini tends to split into one small pure node and one large impure node, that is, to separate the classes out one at a time. Breiman et al call this end cut preference. In contrast, twoing favours splits that tend to make the two descendant nodes as pure as possible in the two superclasses. "It gives strategic splits and informs the user of class similarities .... [It] attempts to group together large numbers of classes that are similar in some characteristic [near the top of the tree] ... [and] attempts to isolate single classes [near the bottom of the tree]" (ibid, p 105). The twoing criterion is in fact the same as the delta criterion used in THAID (see Morgan and Messenger, 1973).
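The Gini index and the goodness-of-split it induces are easy to compute directly; the sketch below (Python/NumPy, illustrative only) uses the identity Σ(i ≠ j) p(i|t) p(j|t) = 1 − Σj p(j|t)².

    import numpy as np

    def gini(counts):
        """Gini index i(t) = 1 - sum_j p(j|t)^2 for a vector of class counts."""
        p = np.asarray(counts, dtype=float)
        p = p / p.sum()
        return 1.0 - (p ** 2).sum()

    def gini_decrease(parent, left, right):
        """Delta-i(s, t) = i(t) - pL*i(tL) - pR*i(tR) for a candidate binary split."""
        n = sum(parent)
        pL, pR = sum(left) / n, sum(right) / n
        return gini(parent) - pL * gini(left) - pR * gini(right)

For instance, gini_decrease([50, 50, 50], [50, 0, 0], [0, 50, 50]) measures how much the first iris split above purifies the data; CART evaluates this quantity for every candidate split and takes the maximum.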
- Type of Splits: CART is a binary recursive partitioning algorithm that sends a case either left or right. Univariate splits using both ordered and categorical variables can be carried out, as well as linear combination splits for ordered (quantitative) variables only.

- Costs/Priors: Both priors and misclassification costs can be varied and incorporated into the CART tree-building process. Priors can be varied to take account of samples that are not representative of the populations from which they came, that is, where the class sample proportions differ from the class proportions in the population. Misclassification costs can also be varied to take into account instances where it is more serious to misclassify some class(es) than other class(es). At a node t, the estimated probability of misclassification using the Gini index is then
    Σ(i, j) C(i|j) p(i|t) p(j|t),
where C(i|j) is the cost of classifying a class j object into class i.

- Stopping Rules: The process is terminated only when all nodes that have not yet been split are pure, or if the node size for all unsplit nodes falls below a specified value.

- Node Classification/Prediction: A node is assigned to the class with the largest number of observations in that node, in the case of priors proportional to sample size and unit costs. If either priors or costs are varied, then the classification of a node must also take into account the values of the costs of misclassification and the class priors.

Tree Pruning: It is clear that CART's stopping rule produces a tree that could be very large, with an overly optimistic error rate, and with many of the splits near the bottom of the tree occurring only because of noise in the data. To guard against this possibility, CART employs a backwards recursive node recombination or pruning algorithm on the completed tree. The algorithm proceeds as follows. For any subtree T of Tmax, where Tmax is the fully grown tree, define the cost-complexity measure Rα(T) as
    Rα(T) = R(T) + α L(T),
where R(T) is the resubstitution error estimate of the subtree, L(T) is the number of terminal nodes in T and α ≥ 0 is the complexity parameter. For each value of α, find the subtree Tα that minimises Rα(T) above. In practice, each successive pair of descendant nodes is recombined and an estimate of accuracy is compared for the split/unsplit situations using the cost-complexity function above; a larger penalty is assigned to a larger sized tree. If there is no improvement in accuracy then the two descendant nodes are recombined and tested for accuracy in the same manner. If there is some improvement from splitting, this subtree is retained and the process continues, trying to find the next smallest subtree which produces an increase in accuracy through splitting. The end result is a sequence of trees with decreasing size and increasing resubstitution error rate.
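The same minimal cost-complexity idea is exposed in the present-day scikit-learn library (used here only as a stand-in for CART). Assuming the iris data as illustration, the sketch below lists the sequence of effective α values and the progressively smaller trees they select.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    full = DecisionTreeClassifier(random_state=0)
    # Each alpha on the path defines the subtree minimising R(T) + alpha * L(T)
    path = full.cost_complexity_pruning_path(X, y)
    for alpha in path.ccp_alphas:
        pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
        print(f"alpha = {alpha:.4f}: {pruned.get_n_leaves()} terminal nodes")

Printing the leaf counts makes the decreasing-size, increasing-error sequence described above visible, ready for test sample or cross-validated selection.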
An 'honest' sized tree can be obtained either by running an independent test sample down this sequence of subtrees and selecting the tree having the minimum error rate, or by using g-fold cross-validation, 2 ≤ g ≤ n. With cross-validation in the CART context, a tree of maximum size is grown in each of the g sets of size (n − n/g) and the pruning algorithm is carried out as for the original sample. Then, each of the g sets of omitted observations (of size n/g) can be used as an independent test sample for the sequence of subtrees created by the pruning algorithm. The sizes of each of the g trees are averaged, and the tree from the learning sample that is closest in size to the average of the chosen cross-validated trees is then selected. Breiman et al recommend using a measure of error as well, whereby
    se(R(T)) = [R(T)(1 − R(T))/n]^(1/2)
is the standard error estimate for the test sample or cross-validated error rate R(T). The idea is to choose the smallest tree within one standard error of R(T). This was recommended to reduce the size of the decision tree created, as it was found that independent error rate estimates were fairly constant over quite a wide range of tree sizes.

Validation Procedures: As seen above, test sample validation and cross-validation are used by CART to select the "right-sized" tree. These two techniques are also used to estimate the true error rate of the prediction rules obtained.

Interactive Ability: CART by Systat provides an enhanced version of CART that is more user-friendly than the original. There is some interaction with a menu system; however, a whole tree must be grown before any alterations can be made. CART, though, does not allow the user to change the splitting variable at an intermediate stage of the tree-growing process.

Graphical Ability: CART by Systat produces files that can be displayed by an independent graphics program after the CART analysis has been carried out. This is not possible in the original CART.

One-Stage Optimality? Yes.

Missing Values: Missing values are handled by what Breiman et al call surrogate splits, defined as follows. Suppose that s* is the optimal partition of a node t into tL and tR. If a split, sj, is carried out on a variable, xj, then the probability that sj sends the cases in t the same way as s* is
Using the zero standard error rule, however, can mean that trees are too large if there is noise in the data. Grajski, K A, Breiman, L, Viano Di Prisco, G and Freeman, W J ( i 986). Classification of EEG spatial patterns with tree? structured methodology: CART, IEEE Transactions on B iomedical Engineering, 33, pp 1076- 1 086. Ildiko, E F and Lanteri, S ( 1 989) . Classification models: discriminant analysis, SIMCA, CART, Chemometrics and Intelligent Laboratory Systems, 5, pp 247-256. Crawford, S L and Souders. S K ( 1 990). A comparison of two conceptual clustering algorithms, International Journal of Pattern Recognition and Artificial Intelligence, 4, pp 409-420. 3.8 C4.5 Author(s) J R Quinlan (Australia). Introduction Published in 1 986, C4.5 is a descendant of ID3 (Quinlan, 1 979). It is described by Quinlan et al ( 1 986), p 1 57 , as ". . . a new inductive inference tool that is capable of dealing with large volumes of messy, real-world data". Unlike ID3 , though, the tree-growing process is followed by a number of pruning procedures. In addition, a larger range of options and parameter settings is available in C4.5 than ID3. Classification/Regression Handles classification problems only. Tree Growth: The first stage of the process is very similar to ID3. According to Quinlan et al ( 1 986) , a subset (approximately 1 0% ) of the learning sample is chosen at random. This subset is known as a working set. A decision tree is grown on the working set. The remaining 90% or so of cases in the learning sample are classified using this decision tree. If all the observations from the learning sample are correctly classified, then the process stops and the decision tree is satisfactory. Otherwise, another set of observations from the learning sample is added to the working set and a completely new tree is grown. - Splitting Method The gain criterion, as used by ID3, can also be used in C4.5. An alternative is the gain ratio criterion. If a variable, xj , has v distinct values then v possible descendant nodes can be found from splitting on xi . The information measure or 'correctness of the answer' , IV(xi) from splitting on xj is found by V IV(x?) = - I. mi + ni log mi + ni .1 ? 1 m + n 2 m + n I= where m and n are number of observations from class 1 and class 2 respectively, while mi and ni are the number of class 1 and class 2 observations in the ith node. Let the expected information content from a split on Xj be defined as v m? + n? E(x?) = I. 1 1 IV(x?) . .1 i= l m + n J 47 48 - Type of Splits - Priors/Costs - Stopping Rules The gain in splitting on xj is: The gain ratio criterion chooses the variable with the maximum ratio of gain(xj)/IV(xj) to split on, subject to a number of minor constants. Multiway splitting, as in ID3 , is carried out. Splitting is done on every distinct value of the splitting variable. Only univariate splits are carried out. No. The process is halted, when after a few iterations of the tree? growing process, no decrease in misclassification error rate has been observed. This would usually occur if there were inconsistencies in the learning sample. - Node Classification/Prediction Tree Pruning A node is assigned to the class with the maximum number of observations in that node. In contrast with Breiman et al ( 1 984), C4.5 uses pessimistic prun ing, to decide tree size. Quinlan ( 1 987), defines the method as follows. 
- Type of Splits: Multiway splitting, as in ID3, is carried out. Splitting is done on every distinct value of the splitting variable. Only univariate splits are carried out.

- Priors/Costs: No.

- Stopping Rules: The process is halted when, after a few iterations of the tree-growing process, no decrease in the misclassification error rate has been observed. This would usually occur if there were inconsistencies in the learning sample.

- Node Classification/Prediction: A node is assigned to the class with the maximum number of observations in that node.

Tree Pruning: In contrast with Breiman et al (1984), C4.5 uses pessimistic pruning to decide tree size. Quinlan (1987) defines the method as follows. Let T be a subtree of the tree Tmax, containing L(T) terminal nodes, and let K and J be the total number of observations and the number of misclassified observations respectively in subtree T. A pessimistic view of T is that it will misclassify
    L = J + L(T)/2
out of the K unseen cases, with standard error
    se(L) = [L(K − L)/K]^(1/2).
The above involves using the continuity correction as in binomial probabilities. Let E be the number of observations misclassified by the best terminal node within T. The pessimistic pruning algorithm replaces T by the best terminal node whenever E is within the limits of L + se(L). A number of repetitions of the tree-growing and pruning process are carried out. As the initial working set is selected purely at random, the same learning sample can give rise to completely different trees, as completely different parts of the learning sample may be chosen. The pruning process selects the best trees based on a combination of low misclassification error and small tree size. (A sketch of the pessimistic test is given after this section.) Quinlan (1987) states the following as the two prime advantages of pessimistic pruning:
(i) It is faster than other pruning methods.
(ii) It does not need a test sample distinct from the learning sample.
The validity of the second advantage is questionable, as some form of bootstrapping or cross-validation could be used in the pruning process.

Validation Procedures: Validation of the rules is carried out after the creation of the decision trees. Validation by an independent test sample is done by dividing the data into a test sample and a learning sample before analysis.

Interactive Ability: No.

Graphical Ability: No.

One-Stage Optimality? Yes.

Missing Values: Either omit observations with missing values from the analysis, or use class sample sizes multiplied by some parameter to estimate the missing values.

Criticisms: Recent studies by Feng et al (1993) and Schwartz et al (1993) have compared C4.5 with other classification methods. Schwartz et al found that C4.5 was very robust to noise in the data, produced sensibly sized trees and provided new insight into a particular set of data by uncovering important relationships among the variables. C4.5, however, most probably due to its creation in the machine learning environment, has no mechanism for incorporating a cost structure. Schwartz et al note that C4.5 produced large differences in group misclassification error rates. Feng et al (1993) carried out a more thorough study than Schwartz et al. Their results tentatively showed that C4.5 produced larger trees than CART, hence produced rules that were biased towards the learning sample. They note "[r]eliability is negatively related to the difference between [learning] and testing accuracy" (Feng et al, 1993, p 48). As a tree increases in size, node sample sizes decrease, so rules are being generated from smaller and smaller samples, which makes those rules less reliable. They conclude by saying "[as] our tests show, even simply introduc[ing] a cost-handling mechanism in the testing phase (in CART) can make a visible improvement compared to, say, C4.5".

Examples in the Literature:
Quinlan, J R, Compton, P J, Horn, K A and Lazarus, L (1986). Inductive knowledge acquisition: a case study, in "Applications of Expert Systems, Volume 1", J R Quinlan (ed), Wokingham: Addison-Wesley, pp 157-183.
Schaffer, C (1993). Selecting a classification method by cross-validation, Personal Communication.
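A minimal sketch of the pessimistic pruning test described in Section 3.8 follows (Python; a loose reading of the account above, with the leaf's error count given the same half-unit continuity correction). The function names and calling convention are invented for illustration.

    import math

    def should_prune(K, J, n_leaves, leaf_errors):
        """Pessimistic test: replace subtree T by its best terminal node when
        that node's (corrected) errors fall within L + se(L), L = J + L(T)/2."""
        L = J + n_leaves / 2.0                 # continuity-corrected subtree errors
        se = math.sqrt(L * (K - L) / K)        # binomial-style standard error
        return leaf_errors + 0.5 <= L + se     # corrected leaf error count

For example, should_prune(K=100, J=10, n_leaves=6, leaf_errors=12) tests whether a six-leaf subtree misclassifying 10 of its 100 cases should collapse to a single node misclassifying 12.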
3.9 FACT

Author(s): W Y Loh (USA) and N Vanichsetakul (Thailand).

Introduction: Published by Loh and Vanichsetakul in 1988 at the University of Wisconsin, the full title of the program is Fast Algorithm for Classification Trees. The goal of the procedure is an algorithm sharing the best features of LDA and CART, namely the speed of linear techniques and the readily comprehensible structure of decision trees.

Classification/Regression: FACT deals with categorical dependent variables only.

Tree Growth: The tree is grown on the whole data set.

- Splitting Method: Three splitting algorithms are used by FACT. The first deals with univariate splits. Univariate F-ratios for variable selection are used at each node to obtain the variable with the highest F-ratio for splitting, and LDA is then carried out on the selected variable to partition the co-ordinate axis (a small sketch of this variable selection is given below). If the largest F-ratio of between- to within-class variance is less than a specified threshold, F0, no split is formed and the node is declared terminal. Linear combination splits can also be generated by FACT, using principal component analysis of the correlation matrix at each node; LDA is then carried out on the scores of the m largest principal components, with m depending on user input. Loh and Vanichsetakul (1988) prefer linear combination splits over univariate splits. A third method of splitting can be used whenever spherical symmetry is detected in a node, where univariate and linear combination splits would be ineffective. Polar coordinate splits solve this problem, which involves transforming the best splitting variable to polar coordinates, after subtracting the node mean from each observation yij, and splitting on the resulting transformation, where yij is the ith principal component score for the jth observation.
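The univariate F-ratio selection is easily sketched with SciPy's one-way ANOVA routine (a stand-in used for illustration, not FACT itself); X is an n x p data matrix and y the class labels.

    import numpy as np
    from scipy.stats import f_oneway

    def best_f_ratio_variable(X, y):
        """Return the column of X with the largest between- to within-class F-ratio."""
        classes = np.unique(y)
        f_stats = [f_oneway(*(X[y == c, j] for c in classes)).statistic
                   for j in range(X.shape[1])]
        return int(np.argmax(f_stats)), max(f_stats)

If the returned F-ratio falls below the threshold F0, the node would be declared terminal, as described above; otherwise LDA would be applied to the selected variable to place the split points.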
- Type of Splits: Multiway splitting can be used by FACT, dividing a node into two, three or more descendant nodes. As mentioned previously, splitting can be carried out using only one variable at a time, or a linear combination of the variables.

- Priors/Costs: As FACT uses LDA to split each node, both different priors and cost matrices can be incorporated into the tree-building process.

- Stopping Rules: A direct stopping rule is used to determine tree size. Splitting is stopped if the error rate found from resubstituting the original sample does not decrease with splitting, or if the node size falls below a certain value.

- Node Classification/Prediction: A node is assigned to the class with the largest number of observations in that node, unless priors and/or costs are altered.

Tree Pruning: No.

Validation Procedures: The final decision tree generated by FACT can be validated by g-fold cross-validation, 2 ≤ g ≤ 25.

Interactive Ability: No.

Graphical Ability: Draws trees using Splus functions.

One-Stage Optimality? Yes.

Missing Values: Missing values are replaced by class means estimated from non-missing values in the learning sample.

Criticisms: The principal criticisms of FACT appeared in Breiman and Friedman (1988), who regard FACT as a step back in the evolution of binary decision trees. They criticise FACT on the following grounds:
(i) The principal motivation for FACT is computational; accuracy is sacrificed for running time and simplicity.
(ii) Linear combination splits are not better than univariate splits. In most cases where recursive partitioning has performed better than traditional parametric methods, it has been through univariate splits.
(iii) Top-down stopping rules, as used in AID, THAID, etc, were one of the main reasons why those methods were never really recognised. "The optimal-complexity tree pruning algorithm (based on cross-validatory choice) implemented in CART is probably the most important contribution of Breiman et al (1984)" (Breiman and Friedman, 1988, p 726).
(iv) FACT cannot handle categorical variables in a clean and elegant way.
(v) FACT is not invariant under transformations of the variables.
(vi) There are no surrogate variables to handle missing values.

Examples in the Literature:
Wolberg, W H, Tanner, M A, Loh, W Y and Vanichsetakul, N (1987). Statistical approach to fine needle aspiration diagnosis of breast masses, Acta Cytologica, 31, pp 731-741.

3.10 KnowledgeSeeker

Author(s): B de Ville, E Suen and D Biggs (Canada).

Introduction: Released commercially in 1989, KnowledgeSeeker is a decision tree package that, according to its authors, "... mine[s] a database for its critical decision-making and problem-solving information" (de Ville, 1990, p 30). The results are presented in a graphical display with an easy-to-use interactive ability which "... provides both end users and specialists with high levels of interaction of accurate, illuminating and reliable decision-making information and knowledge based rules" (ibid, p 30).

Classification/Regression: Both numeric and categorical response variables can be handled. For comparison with the other methods and simplicity, everything henceforth will be described in the classification context only.

Tree Growth: The tree is grown on the whole data set.

- Splitting Method: KnowledgeSeeker seeks to overcome the problems inherent in AID and ID3 by using a significance testing approach to splitting, as used in the CHAID program (Kass, 1980). Two alternatives exist for splitting using the significance testing approach (a sketch of the test statistic involved is given below). The first uses exhaustive partitioning, as in CART, searching over all possible combinations of values of every variable to find the split which maximises the χ² statistic with respect to the class variable. This method is guaranteed to find the optimal split for the data at hand based on statistical inference. The second approach is to use a heuristic clustering technique. Values of a particular variable are grouped with one another on the basis of their similarity in the response variable. This merging of values continues until no further merging is significant at a specified level of significance. Once values are merged, they can be split again using a more stringent level of significance. This approach is not optimal, but de Ville (1990) regards it as intuitive and appealing.
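The χ² statistic that both approaches maximise can be computed for any candidate partition of a node, as in this minimal sketch (Python with SciPy; the function and data layout are hypothetical).

    import numpy as np
    from scipy.stats import chi2_contingency

    def chi2_for_partition(y, groups):
        """Chi-squared statistic for a candidate partition of a node.
        `groups` assigns each case to a branch; `y` holds the class labels."""
        branches = np.unique(groups)
        classes = np.unique(y)
        # Branch-by-class contingency table for the candidate partition
        table = np.array([[np.sum((groups == b) & (y == c)) for c in classes]
                          for b in branches])
        stat, p, dof, _ = chi2_contingency(table)
        return stat, p

Evaluating chi2_for_partition over all candidate groupings of a predictor's values and keeping the largest statistic mimics the exhaustive search described above.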
- Type of Splits: Since the most alike values of a particular predictor are clustered together, multiple branches can accrue from the same node. That is, multiway partitioning is used by KnowledgeSeeker. Splits are carried out on only one variable at a time.

- Costs/Priors: No.

- Stopping Rules: Splitting is stopped if either:
(i) the node size falls below a certain value, or
(ii) the optimal split on a predictor at a particular node does not exceed a specified significance level.

- Node Classification/Prediction: A node is assigned to the class with the largest number of observations in that node.

Tree Pruning: According to de Ville (1990), KnowledgeSeeker supports both validation and tree pruning methods, which either verify the decision tree or rate the quality of new branches on the decision tree and truncate them if their quality fails to pass a certain threshold value. In the literature of decision tree methods, the above is NOT a pruning method, but rather a set of tests used in the tree-growing process. Thus KnowledgeSeeker, like CHAID, its closest ancestor, does not have a tree pruning method.

Validation Procedures: Version 2.0 of KnowledgeSeeker does incorporate a validation procedure whereby part of the data is used to grow the tree while the other part of the data is used to test the rules created. It is relatively quick and easy to divide the data in two and use each half in turn as a learning sample and a test sample. Unfortunately, the latest version (2.1) of KnowledgeSeeker does not contain any validation procedure.

Interactive Ability: With the click of a mouse button, the user can investigate other possible partitionings of the decision tree at every step of the tree-growing process. KnowledgeSeeker automatically calculates the best alternate splits at every node, so the user can examine the effects on the tree of changing from the best split to the best alternate/second best alternate split, etc, so as to "... correspondingly mould the creation of the decision tree or rule base to support their understanding of the problem area and decision making task at hand" (de Ville, 1990, p 30). In addition, the user has the choice of either growing the tree automatically or growing it on a node-by-node (stepwise) basis. All operations are started by the use of a pull-down menu.

Graphical Ability: The graphical display of the tree works in unison with the tree-building process. As a node is split into a number of sub-nodes, the results are displayed immediately on the screen by means of a decision tree. The user can interrupt the process to investigate the current state of the tree and then continue from the point where splitting ceased. The level of detail about each node and each split carried out can also be altered.

One-Stage Optimality? KnowledgeSeeker is, in theory, one-stage optimal. In practice, however, it is relatively quick and easy to investigate the effects on the tree of changing the partition of a particular node to one of the other significant partitions.

Missing Values: KnowledgeSeeker handles missing values in two different ways:
(i) They are excluded from the decision tree growing process.
(ii) They are treated as an additional category of a variable, and so can be combined with the categories that they most resemble.

Criticisms: Very few, if any, reviews of KnowledgeSeeker have appeared in the literature. Biggs et al (1991) conducted some simulation studies with different significance levels. They found that KnowledgeSeeker can confidently be used with either small or large data sets involving categorical predictors and a categorical response. They also found that the same confidence applied with continuous responses, provided that the response was approximately normally distributed with roughly equal variances. One major criticism that could be made of KnowledgeSeeker is that it moves away from any form of validation of the results by independent test samples.
As Breiman et al (1984) state, the use of the resubstitution estimate of error rate as an estimate of the true error rate can give an overly optimistic picture of the set of decision rules constructed. The omission of a validation procedure goes against current statistical practice in the decision tree field and even contradicts what was written in de Ville (1990). In a personal communication, de Ville (1994) affirms that the forthcoming version of KnowledgeSeeker does support a hold-back sample and validation facility.

Examples in the Literature:

3.11 Splus trees()

Author(s): L Clark (USA) and D Pregibon (Canada).

Introduction: Developed in 1991, using the Splus language, to carry out a CART-like decision tree modelling method. Of all the decision tree-based methods, the Splus tree routines are the closest to those used by CART, with binary recursive partitioning, pruning and cross-validation.

Classification/Regression: Splus trees() handles both categorical and numeric dependent variables, hence can be used for classification and regression problems. For comparison with the other methods and simplicity, everything henceforth will be described in the classification context only.

Tree Growth: The tree is grown on the whole data set.

- Splitting Method: The deviance function for an observation yi is defined as (Clark and Pregibon, 1992)
    D(μi; yi) = −2 Σ(j=1 to k) yij log(μij),
that is, negative two times the log-likelihood function, where μij denotes the probability that the ith response falls in the jth class. At a given node, the mean parameter μ is constant for all observations. The maximum likelihood (or minimum deviance) estimate of μ is given by the node proportions. The deviance of a node is defined as the sum of the deviances of all the observations in the node,
    D(μ; y) = Σi D(μ; yi).
A node where all the observations belong to the same class will have a deviance of zero. Splitting is achieved by comparing the deviance of the current node to that of the two descendant nodes, where the combined deviance of the two descendant nodes is D(μL; yL) + D(μR; yR), and the split that maximises the reduction in deviance,
    D(μ; y) − D(μL; yL) − D(μR; yR),
is the split used at the given node. (A small numerical sketch of this criterion is given below.)
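With the node proportions as the estimate of μ, the node deviance reduces to −2 Σj nj log(pj), which the following Python/NumPy sketch computes (an illustration of the criterion, not the Splus code itself).

    import numpy as np

    def node_deviance(counts):
        """D(mu; y) for a node: -2 * sum_j n_j * log(p_j), p_j = node proportions."""
        n = np.asarray(counts, dtype=float)
        p = n / n.sum()
        nz = n > 0                      # classes absent from the node contribute zero
        return -2.0 * (n[nz] * np.log(p[nz])).sum()

    def deviance_reduction(parent, left, right):
        """Reduction in deviance achieved by splitting the parent node."""
        return node_deviance(parent) - node_deviance(left) - node_deviance(right)

A pure node gives node_deviance([n, 0, 0]) = 0, matching the statement above, and the split chosen at each node is the one maximising deviance_reduction.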
- Type of Splits: Only binary splits are used, and only on one variable at a time.

- Priors/Costs: No.

- Stopping Rules: Splitting is stopped by one of two different rules. The first sets a minimum node size below which splitting cannot be done, while the second stops splitting if the ratio of deviances between a tree with r terminal nodes and the root node is less than some threshold value.

- Node Classification/Prediction: A node is assigned to the class with the largest number of observations in that node.

Tree Pruning: As with CART, an overly large tree, biased towards the learning sample, can be grown by Splus trees(). The next step is to apply a pruning procedure that determines a nested sequence of subtrees of the original tree by cutting off branches containing relatively unimportant splits. This is achieved by means of a cost-complexity measure
    Dα(T) = D(T) + α L(T),
where D(T) is the deviance of the subtree T, L(T) is the number of terminal nodes contained in T and α is the cost-complexity parameter. By default, the procedure produces a sequence of subtrees that minimise the cost-complexity measure. Note that this algorithm is very similar to the one used by CART, except that CART uses a measure of misclassification error rate rather than deviance.

A similar procedure used by Splus trees() is the shrink.tree() function, which determines a sequence of subtrees from the original tree that differ in their fitted values. The function uses the recursion relation
    ỹ(node) = α ŷ(node) + (1 − α) ỹ(parent),
where ŷ(node) is the usual fitted value for each node and ỹ(parent) is the shrunken fitted value for the node's parent, which was in turn obtained in the same way. The technique basically uses a parameterization of α that optimally shrinks the descendant nodes to their parent nodes, based on the magnitude of the difference between ŷ(node) and ỹ(parent). A plot of deviance against size for the subtrees found by shrinking gives a smooth decreasing curve which flattens out as the size of the subtrees increases.

A second approach Splus trees() uses to test the sequence of subtrees produced by either pruning or shrinking is g-fold cross-validation, 2 ≤ g ≤ n, selecting the tree with the minimum cross-validated deviance rather than the minimum error rate. No standard error rule is used by Splus trees(). Thus, like CART, the decision tree that is produced should be relatively robust, giving a set of rules that remain valid when applied to another set of data from the same population.

Interactive Ability: Splus trees() has an option to allow the user to examine the goodness of split for each variable at a particular node. This information is conveyed by means of a scatter plot for categorical variables and a high density bar graph for numeric predictors. The user is able to see which variable (and which value(s) of that variable) is the best discriminator of the class variable. In order to change the splitting variable from that chosen by the splitting algorithm, however, the edit.tree() function is called, with the variable to be split and its value explicitly stated.

Graphical Ability: Trees can be drawn and labelled through Splus graphics. Either the mouse or Splus commands can be used to edit the tree, examine splits and examine the distribution of the classes in the terminal nodes.

One-Stage Optimality? The method is one-stage optimal, examining only the best splits at the current node. As seen above, though, the splitting variable at a particular node can easily be changed, but a new tree cannot be grown after changing the splitting variable.

Missing Values: The function na.tree.replace() is used to handle missing values in Splus trees(). The function creates a new level, coded 'NA', for any variable containing missing values. Numeric predictors are first grouped into c categories. Clark and Pregibon (1992) describe how missing values are predicted as follows: "The approach we adopt is that once an NA is detected while dropping a (new) observation down a fitted tree, the observation 'stops' at that point where the observation is required to continue the path down the tree. This is equivalent to sending an observation down both sides of any split requiring the missing value and taking the weighted average of the vector of predictions in the resulting set of terminal nodes."

Criticisms: Many of the criticisms levelled at CART would also apply to Splus trees(). However, "[t]he S computing language ... is currently one of the most developed interactive programming environments for data analysis and graphics" (Le Blanc and Crowley, 1993, p 466). The interactive facilities that Splus trees() has appear to give it a distinct advantage over CART. As Clark and Pregibon (1992), p 415, state, "[o]ur recommended approach to tree building is far less automatic than that provided by other software for the same purpose, as the unbundling of procedures for growing, displaying and challenging trees requires user initiation in all phases". Perhaps one other criticism of Splus trees() is that the method requires an adequate knowledge of the Splus language, which is neither menu-driven nor user-friendly.
As Clark and Pregibon (1992), p 415, state: "[o]ur recommended approach to tree building is far less automatic than that provided by other software for the same purpose, as the unbundling of procedures for growing, displaying and challenging trees requires user initiation in all phases". Perhaps one other criticism of Splus trees() is that the method requires an adequate knowledge of the Splus language, which is neither menu-driven nor user-friendly.

Examples in the Literature:
Bradford, E (1993). Tree-based models in S, New Zealand Statistician, 28, pp 36-51.
Morton, S C (1992). Personal crunching: new advances in statistical dendrology, Chance, 5, pp 76-79.

3.12 IND

Author(s): W Buntine (Australia).

Introduction: Introduced in 1991, the technique tries to combine the simplicity of decision tree rules with the power of Bayesian methods, which are used for splitting, smoothing and tree averaging. "IND provides a potentially bewildering number of options to allow the user to precisely control how data is interpreted, how trees are grown and tested, and how results are displayed" (Buntine and Caruana, 1993, p 1-4). As well, IND has the ability to simulate the CART and C4.5 tree-based methods, to follow a minimum message length idea such as that used in Wallace and Patrick (1993), or indeed to follow the newly developed decision graph approach of Oliver (1993).

Classification/Regression: Used only for classification.

Tree Growth: The tree is grown on the whole data set.

Splitting Method: IND can choose from a number of different criteria when evaluating the quality of different splits or tests. For example, the Gini and twoing splitting criteria (see Section 3.7) can be used, or the gain ratio criterion as used with C4.5 (see Section 3.8). Buntine recommends the use of Bayesian splitting, which evaluates each possible split of a particular node into several sub-nodes: the Bayesian estimate of the posterior probability of each possible split being correct is evaluated, and the split producing the maximum posterior probability is carried out.

Type of Splits: Multiway splitting is used, splitting on each distinct value of a variable. Only univariate splits are carried out.

Priors/Costs: In the Bayesian mode, "class priors" in the CART sense of the word do not exist. The algorithm incorporates priors, but these are Bayesian prior probabilities of the class probabilities in each terminal node, prior to seeing any data. IND does have cost structures, and as the technique is Bayesian, their incorporation is trivial: the tree returns a class probability vector, which is combined with the cost vector so that a minimum cost decision can be made. It is a simple add-on to the tree interpretation routine.

Stopping Rules: Splitting stops when the quality measure for the best split at a particular node fails to exceed a prespecified criterion.

Node Classification/Prediction: If classification is not determined by cost, each terminal node is assigned to the class with the maximum probability within the terminal node. If costs are incorporated into the tree growing process, the terminal node is assigned to the minimum cost class.

Tree Pruning and Validation Procedures: IND adopts a Bayesian approach to pruning, which uses a smoothing technique. The usual approach is to find the class probabilities of observations in the terminal nodes.
The smoothing approach also takes into account the class probability vectors for all the intermediate (decision) nodes en route to the terminal nodes. The resultant final tree may have widely differing class probabilities for the terminal nodes. If the class probabilities of two or more terminal nodes are similar, and come from the same parent, they can be pruned upwards. One has to divide the data into a learning sample and a test sample before the tree growing process is begun; a routine exists in IND to carry this out. In a personal communication, however, Buntine (1993) recommends growing the tree on the full data set, believing that if one wants to produce the best predictions possible, all of the data should be used to grow the tree, as per standard Bayesian theory. As several different class probability trees are grown for each data set, the weighted average of the class probability vectors each tree assigns to an observation can be taken. This is the Bayesian averaging approach, which Buntine favours instead of learning/test samples.

Interactive Ability: Advanced features allow the user to interactively search for the best splits and control the tree-growing process.

Graphical Ability: IND contains a graphical display routine that is used to display the tree classifiers in various forms.

One-Stage Optimality?: No. The method provides an N-ply lookahead facility, whereby not only are the best splits examined at the current node, but also how the resulting descendant nodes and their sons should be split. For example, a 2-ply lookahead scheme would search for the best split at the current node as well as the best splits at the resulting son nodes. This may lead the user to find that the so-called 'best' split of the current node was not the optimal split in terms of future tree development, as this variable may not interact with any other variables in the data set. A 'lesser' split may in fact lead to a greater reduction in misclassification at the next stage of tree development. The N-ply lookahead facility allows the user to uncover such structure in the data. (A rough sketch of a 2-ply search is given at the end of this section.)

Missing Values: Missing values are handled using Quinlan's preferred strategy; that is, IND sends a case down each branch in the proportion found in the learning sample at that node. In effect, each case with missing variables is split into a number of parts, with the largest part going down the branch where most other cases have gone. Otherwise, a routine exists to send a case down the branch of the tree most commonly taken by other examples.

Criticisms: IND is a relatively new method, so there has yet to be an article either criticising or praising it in the literature. It would perhaps be criticised for having no mechanism to incorporate "class priors" in the CART sense of the word. Buntine (1993) feels that the main criticism of IND is that it has not been optimised to handle important real-world requirements such as real-valued splits; these can be handled, but without much refinement.
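The following is a minimal sketch, in S-compatible R, of what a 2-ply lookahead search might look like. It reuses the hypothetical node.deviance() and split.gain() functions from the Splus trees() sketch above, so deviance reduction stands in for IND's Bayesian quality measure; the function names are invented for illustration.

    ## Best one-step (1-ply) deviance reduction over all variables and cuts.
    best.gain <- function(X, y) {
      best <- 0
      for (v in seq_len(ncol(X))) {
        for (cut in unique(X[, v]))
          best <- max(best, split.gain(X[, v], y, cut))
      }
      best
    }

    ## 2-ply lookahead score for the candidate split X[, v] < cut:
    ## the immediate gain plus the best gains attainable within each of
    ## the two resulting son nodes.
    two.ply.score <- function(X, y, v, cut) {
      left <- X[, v] < cut
      if (!any(left) || all(left)) return(0)
      split.gain(X[, v], y, cut) +
        best.gain(X[left, , drop = FALSE], y[left]) +
        best.gain(X[!left, , drop = FALSE], y[!left])
    }

Ranking candidate root splits by two.ply.score() rather than split.gain() can promote a variable whose immediate gain is modest but whose descendants then split cleanly, which is exactly the structure the lookahead facility is intended to uncover.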
3.13 SUMMARY TABLE COMPARING THE TEN TREE-BASED METHODS

Table 3.13: Summary comparison of the ten tree-based methods.

Attribute | AID | THAID | ID3 | CHAID | CART
Author(s) | J N Morgan, J A Sonquist | J N Morgan, R C Messenger | J R Quinlan | G V Kass | L Breiman, J H Friedman, R A Olshen, C J Stone
Introduction | 1963 | 1973 | 1979 | 1980 | 1984
Classification/Regression | Regression | Classification | Classification | Classification | Both
Tree Growth | Uses all the data | Uses all the data | Uses a subsample of the data | Uses all the data | Uses either all or a subsample of the data
Splitting Method | Maximum between- to within-group variance | Theta or delta splitting criterion | Maximisation of the gain criterion | Maximising the χ² statistic of grouped categories | Gini or twoing splitting criterion
Type of Splits | Binary/US | Binary/US | Multiway/US | Multiway/US | Binary/US or LC
Costs/Priors | No | No | No | No | Yes
Stopping Rules | Direct stopping | Direct stopping | All cases in the learning sample are correctly classified | No significant splits | All nodes are pure to one class
Node Classification/Prediction | Average value of the cases in the terminal node | Class with largest number of observations in the node | As for THAID | As for THAID | As for THAID, after accounting for costs and priors
Tree Pruning | No | No | No | No | Cost-complexity
Validation Procedures | No | No | No | No | Yes
Interactive Ability | No | No | No | Yes | Yes/No
Graphical Ability | No | No | No | Yes | Yes
One-Stage Optimal? | Yes | Yes | Yes | Yes | Yes
Missing Values | Estimated using class means | Estimated using class means | Estimated using class proportions | Omitted | Estimated using surrogate splits
Criticisms | Produces overly large trees; too dependent on sizes of stopping rules; no validation of results | Produces overly large trees; too dependent on sizes of stopping rules; no validation of results | Does not know when to stop; no validation of the results | No validation of the results | Instability of cost-complexity pruning; tree size affected too much by the standard error rule

Table 3.13 (cont'd)

Attribute | C4.5 | FACT | KnowledgeSeeker | Splus trees() | IND
Author(s) | J R Quinlan | W Y Loh, N Vanichsetakul | B de Ville, E Suen | L Clark, D Pregibon | W Buntine, D Diggs
Introduction | 1986 | 1988 | 1989 | 1991 | 1991
Classification/Regression | Classification | Classification | Both | Both | Classification
Tree Growth | Uses a subsample of the data | Uses all the data | Uses all the data | Uses all the data | Uses all the data
Splitting Method | Maximisation of the gain ratio criterion | Discriminant analysis | Maximising the χ² statistic of grouped categories | Likelihood ratio (deviance) statistic | Quality measure
Type of Splits | Multiway/US | Multiway/US or LC | Multiway/US | Binary/US | Binary/US
Costs/Priors | No | Yes | No | No | Yes/No
Stopping Rules | Direct stopping | Direct stopping | No significant splits | Deviance below a certain value | Quality measure below a certain value
Node Classification/Prediction | As for THAID | As for CART | As for THAID | As for THAID | Maximum class probabilities after accounting for cost
Tree Pruning | Pessimistic | No | No | Cost-complexity with deviances | Bayesian
Validation Procedures | Yes | Yes | No | Yes | Yes
Interactive Ability | No | No | Yes | Yes | Yes
Graphical Ability | No | Yes | Yes | Yes | Yes
One-Stage Optimal? | Yes | Yes | Yes/No | Yes | No
Missing Values | Estimated using class proportions | Estimated using class means | Creation of a new category | Creation of a new category | Estimated using class proportions
Criticisms | No mechanism for incorporating a cost structure; tends to produce overly large trees | Decision rules not simple and accurate; direct stopping rules are used; not robust to non-normality | No validation of the results | Instability of cost-complexity pruning | No mechanism for incorporating a priors structure; not designed for real-valued splits

4. SIMULATION STUDIES INVOLVING CONTINUOUS DATA

4.1 INTRODUCTION

Data sets involving distinct groups or populations arise in many disciplines, including the social sciences, business and medicine.
Multivariate data sets are often not easy to analyse, so it is important that the method used should be both powerful and easy to understand. In the classification context, the prediction rule should be accurate, that is, have a low, unbiased error rate, yet be as easy to interpret as possible. In Section 4.2, through a literature survey, an investigation is carried out into the various types of error rate estimates used in the field of classification. Next, in Sections 4.3 and 4.4, a comparison is made of four classification methods from the domains of traditional discrimination and tree-based methods. LDA and QDA are the two classification methods most commonly used in statistics to handle this type of data. These two parametric techniques are compared and contrasted with two tree-based methods, CART and FACT. (See Sections 2.2, 2.3, 3.7 and 3.9 for details on LDA, QDA, CART and FACT respectively.) This chapter compares both the accuracy and the reliability of these four classification methods in classifying individuals into two multivariate populations under certain combinations of parameters. Three types of data distribution will be investigated: normal, lognormal and standardised lognormal. The robustness of each method to a change in the value of the a priori probabilities of class membership will also be determined.

4.2 ERROR RATES

A great deal has been written on the subject of error rates in classification analysis over the past three or more decades. In this section, a review of the literature on error rates is given, as well as a formal definition of each of the error rate estimators, including those to be used later in this chapter. Extensive reviews of error rate estimation may be found in Kanal (1974), Toussaint (1974), Lachenbruch (1975), Efron (1982, 1983), Hand (1986) and McLachlan (1986, 1987), among others.

In any classification problem, the object is to assign a random observation x to one of k populations, Π1, ..., Πk. Paraphrasing Toussaint (1974), one of the most important problems with the use of any classification method is estimating the probability of misclassification. The optimal error rate of any classifier is the Bayes error rate, defined as

    R(B) = 1 − ∫ maxi [fi(x) πi] dx    (4.2.1)

where fi(x) is the class conditional density function for Πi and πi is the a priori probability of belonging to Πi. According to Hand (1986), p 335, "[this] is the minimum possible error rate given a set of [variables]." This is the error rate that would result if the class conditional density functions were known. The actual or true error rate, R(T), is defined as the expected probability of misclassification when the class conditional density functions are known and the Bayes rule is not used. In practice, this is the error rate that would result when applying the classifier to an infinite test sample. In notation, with Ai denoting the region of allocation to Πi, and assuming that π1 = π2,

    R(T) = ½ ∫A2 f1(x) dx + ½ ∫A1 f2(x) dx    (4.2.2)

The probability that x falls in A1, given that x ∈ Π2, is given by (4.2.3), where D(x) is as defined in (2.2.5); it follows that Pr(D(x) < 0 | Π1) is the corresponding probability of misallocating an observation from Π1.
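A minimal sketch in S-compatible R of what (4.2.1) means in practice, using a univariate two-class normal problem for simplicity (the studies in this chapter are multivariate; the means and priors here are illustrative assumptions):

    ## Monte Carlo check of the Bayes error rate R(B) for two univariate
    ## normal populations with equal priors.
    set.seed(1)
    pi1 <- pi2 <- 0.5
    x <- c(rnorm(50000, mean = 0), rnorm(50000, mean = 2))  # mixture draws
    truth <- rep(1:2, each = 50000)

    ## Bayes rule: allocate to the class with the larger pi_i * f_i(x).
    bayes <- ifelse(pi1 * dnorm(x, 0) >= pi2 * dnorm(x, 2), 1, 2)
    mean(bayes != truth)      # approximates R(B) = 1 - pnorm(1), about 0.159

Here the true densities are known, so the rule attains R(B); a rule built from estimated densities has a long-run error equal to the actual rate R(T), which is necessarily at least R(B).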
The second analysis compared LDA, QDA, CART and FACT using the ratio of the two group error rates. The results of the split-plot ANOVA show that, as in the previous analysis, the R main effect was by far the most important effect (F = 164.12), with the R * f(.) interaction (F = 71.8) also being important. As with the previous analysis, the R * f(.) * e interaction (F = 21.14) was significant, so results will be presented in terms of these three factors. Analysis of the residuals showed a rather funnel-like pattern, implying that the variation of the residuals was not constant: increasing the fitted value increased the variation of the residuals. Box plots of the residuals for each method showed, as in the previous analysis, that the mean square error for FACT was much larger than that for the other three methods. Figures 4.2a and 4.2b give the mean ratios of the group error rates using the adjustment factor mentioned earlier. The ideal situation in this case is where R(1/2)/R(2/1) = 1. The trends observed are similar to, yet somewhat different from, those in Figures 4.1a and 4.1b. Figure 4.2a shows that, in the case of f(.) being normally distributed, the LDA and QDA error rates were the most stable of all the methods over the given priors-covariance levels. CART performed very badly in the unequal priors case, while the results for FACT when e = 3 and e = 4 were omitted from the graph due to the excessively high values recorded. Figure 4.2b shows that, in the case of lognormal data, QDA was the best method. It must be remembered, as noted in Section 4.3.1, that the variance for the second class, where x2 > 0, will be substantially larger than that for the first class, where x1 = 0; therefore, the quadratic rules should be expected to work better than the linear rules. LDA again performed badly but, as in Figures 4.1a and 4.1b, it did so consistently, whereas the ratio of priors for CART was affected mostly by sample size. The results for FACT were not really worth quoting, due to the excessively high ratios of error rates.

In the third analysis, it was decided to compare the mean difference between the cross-validation and apparent group error rates using CART and LDA, but this time using equal priors. The results from the method stratum again show the R main effect (F = 222.46) to be most important, with the R * f(.) interaction (F = 130.8) also being very important. The fourth analysis compared the CART and LDA apparent and cross-validation group error rates using the ratio of group error rates as the measure of performance. The results from the ANOVA show rather different trends from those exhibited in the previous analysis, with the magnitude of the F-ratios having dramatically decreased. For instance, for the R main effect, the F-ratio was only 2.43, which is not significant even at the 5% level. This indicates that differences between the group misclassification error rates did indeed increase with the total error rate, so that the transformations were not really necessary here, though the analysis is included for completeness. Both analyses showed that the R * f(.) * e second order interaction was highly significant, so results are presented in terms of these three factors, as in the first two analyses. Figure 4.3a shows that, in the case of the normally distributed data sets, the apparent group error rates for CART were very similar, but there was a large difference between the cross-validated group error rates, with the class having the larger sample size having the substantially larger error rate. In contrast, the apparent and cross-validated error rates for LDA were very close together, being affected most by the change in covariance structure rather than by sample size.
This is illustrated by the error rate for class 2 being larger than that for class 1 when e = 4 and 5. Figure 4.3b shows a similar pattern to Figure 4.3a, except that the differences in group error rates for LDA have greatly increased using lognormally rather than normally distributed data sets. These patterns are further exemplified in Figures 4.4a and 4.4b. It has thus been shown, in the case of equal priors, that class sample size is the main factor in determining group error rates for CART when cross-validation is done, though this problem does not manifest itself when the group error rates are calculated from the learning sample. Sample size, however, as expected, had no influence on the group error rates for LDA in the case of equal priors; these were instead influenced by the covariance structure and distribution of the data set.

[Figures 4.3a and 4.3b: Comparison of the difference between cross-validation and apparent group misclassification error rates using LDA and CART with equal priors: (a) normally distributed data; (b) lognormally distributed data.]

[Figures 4.4a and 4.4b: Comparison of the ratio of cross-validation and apparent group misclassification error rates using LDA and CART with equal priors: (a) normally distributed data; (b) lognormally distributed data.]

The fifth analysis compared the CART and LDA cross-validated group error rates using the difference between error rates as the measure of performance, while the sixth analysis used the ratio of error rates, in the case of both equal priors and PPSS. As in the previous analyses, the R * f(.) * e interaction was very important, so the results will be presented in terms of those three factors. Figures 4.5a and 4.5b show that the choice of priors did not affect the absolute difference in group error rates, no matter what the distribution of the data set. As in the previous analyses, the group error rates for LDA were most affected by change in the covariance structure of the data set. With CART, a different pattern emerges: the use of equal priors rather than priors proportional to sample size meant a reduction in the difference between group error rates, and the greatest influence on individual group error rates for CART was sample size. Figures 4.6a and 4.6b confirm these trends. The next step in the analysis was to compare the overall LDA and CART error rates using equal priors. As with the group misclassification error rates, the results were analysed using a split-plot ANOVA.
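For readers unfamiliar with the layout, the following is a minimal sketch in S-compatible R of how such a split-plot ANOVA might be set up: each simulated data set is a whole plot, and the methods compared (the R factor) are the sub-plot treatments applied to that same data set. The data frame res and its error rates are placeholders, not the thesis's results.

    ## Hypothetical long-format results: one row per (data set, method) pair.
    set.seed(1)
    res <- expand.grid(R = c("LDA", "QDA", "CART", "FACT"),
                       f = c("normal", "lognormal"),
                       e = factor(1:3), rep = 1:5)
    res$dataset <- factor(with(res, paste(f, e, rep)))   # whole-plot id
    res$error <- runif(nrow(res), 0.1, 0.4)              # placeholder rates

    ## Split-plot ANOVA: R and its interactions are tested in the
    ## within-data-set stratum, the "method stratum" referred to above.
    fit <- aov(error ~ R * f * e + Error(dataset), data = res)
    summary(fit)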
The ANOVA showed that the R main effect (F = 111.8) was highly significant, as were all the R * f(.) interactions and a number of second order interactions, the two most important of which were the R * p * n (F = 8.43) and R * p * f(.) (F = 6.47) interactions. Figures 4.7a and 4.7b compare the n-fold cross-validation and apparent error rates over all combinations of dimension and sample size using equal priors. The two graphs suggest that there was a large difference between the R(CV) and R(A) error rates in CART for smaller samples, indicating the bias of the latter estimator in such situations; this bias was only minimal for LDA. Considering the cross-validation estimates only, CART did best when p = 2 and LDA when p = 6. Figures 4.8a and 4.8b illustrate the situation for the p * f(.) interaction. As in Figures 4.7a and 4.7b, there was a large discrepancy between the R(CV) and R(A) error rates for CART, which did not arise for LDA. Looking at the R(CV) estimates shows that LDA did best for normal data, and for lognormal data when p = 6, while CART had the lowest error rate when p = 2 and the data were lognormal, as distinct from the situation under PPSS, where CART did best for lognormal data no matter what the number of variables in the data set.

[Figures 4.5a and 4.5b: Comparison of the difference between cross-validation group misclassification error rates using LDA and CART with both equal priors and PPSS: (a) normally distributed data; (b) lognormally distributed data.]

[Figures 4.6a and 4.6b: Comparison of the ratio of cross-validation group misclassification error rates using LDA and CART with both equal priors and PPSS: (a) normally distributed data; (b) lognormally distributed data.]

[Figure 4.7a: Comparison of the cross-validation and apparent misclassification error rates using LDA and CART with equal priors: (a) dimension = two.]

[Figure 4.8a: Comparison of the cross-validation and apparent misclassification error rates using LDA and CART with equal priors and different dimension: (a) normally distributed data.]

For a linear (additive) approximation a(x1, x2) to the log-likelihood, evaluated at the four binary points, a(1, 1) = a(0, 1) + a(1, 0) − a(0, 0).
Since a(1, 1) > max{a(1, 0), a(0, 1)} implies a(0, 0) < min{a(1, 0), a(0, 1)}, it follows that if (1, 1) is assigned to Π1, then (0, 0) has to be assigned to Π2. This can lead to gross errors in misclassification. The problem lies in using a monotonically increasing function of x1 and x2 to approximate the log-likelihood, L(x), which is not monotonic. In the two-dimensional case the problem can be solved quite simply by including an interaction term in the model; when there are a large number of variables, however, this approach is infeasible. Others have shown that not only do different correlation structures in the two populations lead to these reversals, but so do moderate and large positive correlations. Krzanowski (1977) considered a mixture of both binary and continuous random variables, with various values of Pij used and rijk set to either 0 or 0.375 (no, or moderate, positive correlation). The results showed that LDA performed well when there was no correlation between the binary variables, but performed poorly if there was a moderate positive correlation among all the binary variables, or if the correlations between the binary and continuous variables differed markedly between the two groups. For both types of data, QDA has rarely been found to perform as well as LDA. Ganeshanandam and Krzanowski (1990) compared a number of different error rate estimators for LDA, as well as the n-fold cross-validation estimate for QDA, for multivariate binary data. Their simulation results showed that the Pij's and sample size all had significant effects on the estimation of the actual error rate, though the rijk factor did not.

5.3 SIMULATION STUDY I

5.3.1 Study Plan

The factors used in this study were the same as those employed in Ganeshanandam and Krzanowski (1990), in order to be able to check the results for LDA and QDA against theirs. The factor p had settings of 5 and 10, and a separate analysis was done for each of the two dimension levels. Factor n had three settings, "small, medium and large" relative to the number of variables: for p = 5, n = 20, 60 and 100, while for p = 10, n = 40, 120 and 200. In all cases π1 = π2 = 0.5, implying that class sample sizes were equal. When p = 5, factor rijk had two levels: all rijk = 0, and all rijk = 0.25. When p = 10, all rijk were set to zero for the first five variables and to 0.25 for the second block of five. The last factor was the values of the Pij's. The levels of the Pij's are shown in Table 5.1, with level 1 corresponding to wide separation between groups and increasing levels leading to narrower separation; level 5 corresponds to identically distributed populations. For p = 10, the Pij's for the first block of five variables were repeated for the second block of five. This gives 45 multinomial learning samples for p = 5 and 10 combined. Three replicates for each multinomial situation were conducted for p = 5 and six replicates for p = 10, giving 180 data sets in total (90 for each dimension size). Generation of the data was a straightforward process using MINITAB macros.

Table 5.1: Values of Pij for each set of five binary variables.

                        Π1                              Π2
Level    x1    x2    x3    x4    x5      x1    x2    x3    x4    x5
  1     0.20  0.20  0.20  0.20  0.20    0.80  0.80  0.80  0.80  0.80
  2     0.25  0.30  0.35  0.40  0.45    0.75  0.70  0.65  0.60  0.55
  3     0.40  0.45  0.50  0.55  0.60    0.60  0.55  0.50  0.45  0.40
  4     0.25  0.30  0.35  0.40  0.45    0.45  0.40  0.35  0.30  0.25
  5     0.30  0.40  0.50  0.60  0.70    0.30  0.40  0.50  0.60  0.70
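The thesis generated these samples with MINITAB macros; the following is a minimal sketch in S-compatible R of the uncorrelated case (rijk = 0), drawing independent binary variables with the success probabilities of one row of Table 5.1. The correlated case requires extra machinery and is not shown; all names here are invented for illustration.

    ## One learning sample: n/2 cases per class, five independent binary
    ## variables, probabilities from level 2 of Table 5.1.
    p1 <- c(0.25, 0.30, 0.35, 0.40, 0.45)   # class 1
    p2 <- c(0.75, 0.70, 0.65, 0.60, 0.55)   # class 2

    make.sample <- function(n, p1, p2) {
      half <- n / 2
      X1 <- sapply(p1, function(p) rbinom(half, 1, p))
      X2 <- sapply(p2, function(p) rbinom(half, 1, p))
      data.frame(rbind(X1, X2), class = factor(rep(1:2, each = half)))
    }

    set.seed(1)
    learn <- make.sample(60, p1, p2)   # a "medium" sample for p = 5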
The methods were compared by means of the n-fold cross-validation error rates, R(CV), except for FACT, where ten-fold cross-validation was used as outlined in Section 4.3. For both CART and FACT, the minimum size below which a node will not be split was set to five, while for CART alone the one standard error rule was used. A split-plot ANOVA was used to identify which factors lead to differences between the methods, as in Chapter 4. Tables of means and the standard deviations of the differences between means are presented for each significant effect.

5.3.2 Results

For the case p = 5, the results showed that the R (method) main effect (F = 10.19) was highly significant and dominated the variation in error; all interactions involving R had F-values less than 1.1. This means that there were differences between the methods when using p = 5 binary variables, and that these differences were not influenced by other factors. In both the analyses for p = 5 and p = 10, the plot of residuals against fitted values showed no real trends, in contrast with the results for the continuous data. The plots of residuals against each method showed roughly equal residual variances for each method. Individual ANOVAs were constructed for each method separately; these confirmed the finding of equal residual variances, so the results of the split-plot ANOVA are strictly valid. The average values for each method, when p = 5, are given in Table 5.2, along with the standard error of the difference between the methods. The results showed that CART was the best method, by some distance, from LDA, FACT and QDA.

Table 5.2: Means and standard error of the differences in means of the cross-validation error rate estimates for each method (p = 5).

        LDA     QDA     CART    FACT
        0.343   0.368   0.304   0.358

Standard error of the difference between means = 0.013.

In the case of p = 10, the split-plot ANOVA showed that the R (method) main effect (F = 20.78) was highly significant, though the R * Pij (method by probability pattern) interaction (F = 4.2) and R * n (method by sample size) interaction (F = 4.91) were also highly significant at the 0.1% level. This showed that for p = 10 there were not only differences between the methods, but that these differences were affected by sample size and, to a slightly lesser extent, by the pattern of probabilities in the parent populations. Table 5.3 shows that increasing sample size had the effect of reducing the error rates for all methods except CART, where sample size had no real effect. The greatest reduction in error rate occurred when going from n = 40 to n = 120 for LDA, QDA and FACT. CART did notably best when n = 40, and slightly better than LDA when n = 120; when n = 200, however, LDA did better than CART. FACT was a poor fourth except when n = 40, where, because of the small ratio of dimension to class sample sizes, QDA did worst.
Table 5.3: Means and standard error of the differences in means of the cross-validation error rate estimates for each classification method with respect to sample size (n) (p = 10).

Level      LDA     QDA     CART    FACT
n = 40     0.329   0.401   0.284   0.366
n = 120    0.289   0.307   0.287   0.345
n = 200    0.274   0.292   0.284   0.333

Standard error of the difference in means = 0.016.

Increasing the level of the probability patterns, Pij, effectively narrowing the distance between populations, had the effect of increasing the error rates for all methods, with the two tree-based methods being less affected than both LDA and QDA. Table 5.4 shows that LDA did best for Pij = 1 and was narrowly better than CART when Pij = 2; for all other levels of Pij, CART had the lowest average error rate. An explanation of why LDA did best for Pij = 1 is that this is an example of a parallel classification problem (see Section 4.3), in that all variables are equally important in determining the classification rules. Levels 2 to 4 of Pij are examples of a sequential classification problem, in that only a subset of the variables is ever used to determine the classification rules. Methods such as LDA and QDA are suited to the former type of problem, while CART is designed for the latter. The good performance of CART for less well separated populations mirrors what was observed for continuous data in Chapter 4. A noteworthy trend observed here was that the average error for CART in the case of identical populations (Pij = 5) was 0.43, so that CART was managing to build a tree from noise; all other methods for Pij = 5 produced error rates in the vicinity of 0.5.

Table 5.4: Means and standard error of the differences in means of the cross-validation error rate estimates for each classification method with respect to probability patterns (Pij) (p = 10).

Level      LDA     QDA     CART    FACT
Pij = 1    0.073   0.097   0.132   0.175
Pij = 2    0.177   0.230   0.189   0.249
Pij = 3    0.346   0.389   0.325   0.410
Pij = 4    0.365   0.426   0.349   0.415
Pij = 5    0.524   0.525   0.430   0.492

Standard error of the difference in means = 0.021.

5.3.3 Summary

For five-dimensional categorical data, it was found that there were differences in the cross-validated error rates of the four methods, but these differences were due only to the methods and not to any other factor such as sample size or probability patterns. CART was clearly the best method, followed by LDA. For ten-dimensional categorical data, the differences in error rates were due not only to method, but also to probability patterns and sample size. CART was found to be the least affected of all methods by changes in either sample size or the pattern of probabilities, and had the lowest error rate for smaller sized samples and poorly separated populations. These results echo very much what was observed for CART in Chapter 4 for continuous data. In all other situations, LDA did best. In accordance with other studies, QDA performed poorly except when sample size was large or the two populations were well separated. As for the continuous data, FACT was a poor fourth in almost all situations.

5.4 SIMULATION STUDY II

5.4.1 Introduction

Section 5.3 compared the accuracy of four classification methods in the setting of categorical data, specifically with p-variate binary data. This section compares the reliability of the cross-validation error rate estimates produced by three of the four methods, along with a comparison of different error rate estimators using both LDA and QDA. Further analyses for FACT were not done, due to the poor predictive capability exhibited by the method throughout the simulation studies.
5.4.2 Study Plan

The first analysis compared the n-fold cross-validation, R(CV), and apparent, R(A), estimators for both LDA and QDA, as well as the rotation, R(ROT), and 0.632, R(0.632), estimators for LDA alone; the latter was calculated as in (4.2.23). For both p = 5 and p = 10, when n was small, the R(ROT) estimates could not be calculated for QDA, as ni ≤ p for each class; therefore, the R(ROT) and R(0.632) estimators for QDA were not included in the analysis. The R(CV) estimators for LDA, QDA and CART were then compared to test their reliability in estimating the actual error rate, R(T), for each data set used in Section 5.3. Test samples of size 5000 were used throughout to calculate the values of R(T). A comparison of different error rate estimators for CART, using categorical data, will be undertaken in a later chapter.
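As a minimal sketch in S-compatible R of the first two estimators (using the hypothetical `learn` data frame generated in the earlier sketch), MASS::lda computes the leave-one-out (n-fold) predictions directly via its CV argument; the QDA analogue uses MASS::qda with the same argument.

    library(MASS)

    ## n-fold (leave-one-out) cross-validation error rate R(CV) for LDA.
    cv <- lda(class ~ ., data = learn, CV = TRUE)
    R.CV <- mean(cv$class != learn$class)

    ## Apparent error rate R(A): refit and reclassify the learning sample.
    fit <- lda(class ~ ., data = learn)
    R.A <- mean(predict(fit)$class != learn$class)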
5.4.3 Results

The results are presented in the same format as Section 5.3. Comparing the error rate estimators for LDA and QDA first: in the case of p = 5, the split-plot ANOVA showed that there was a definite difference between the error rate estimators, R (F = 34.76), and that those differences depended to a large extent on sample size, R * n (F = 7.64), and to a much lesser degree on the pattern of probabilities in the Πi's, R * Pij (F = 2.37). These results very closely follow those exhibited in Ganeshanandam and Krzanowski (1990), where the R * n effect was also more important than the R * Pij effect. Any effect involving rijk had very little effect on the estimation of R(T). For p = 10, it was observed that, in addition to the R, R * n and R * Pij effects being highly significant, the R * Pij * n interaction (F = 3.96) was also significant at the 0.1% level. Ganeshanandam and Krzanowski gave no indication of the importance of this second order interaction, but it must be taken into account in any analysis of means, implying that the error rate estimators for both LDA and QDA were influenced by sample size in conjunction with the pattern of probabilities. It is clear from Table 5.5 that the mean square errors (MSE's) for all estimators decreased as sample size increased; that is, the estimators were more precise for larger rather than smaller sized samples. The R(CV) estimator was the least affected by altering sample size, while the R(A) estimators were the most affected. The R(CV) estimator for LDA did best for the smallest sized samples, the R(ROT) estimator for n = 60, and the R(CV) estimator for QDA for the largest samples. These results were in relatively close agreement with those of Ganeshanandam and Krzanowski (1990), although their results showed sample size to have no effect on the MSE of the R(0.632) estimator. Note that the standard error of the difference was not calculated from either of the apparent estimators, because the variation in MSE's for these two estimators was four to five times larger than that for the other estimators.

Table 5.5: Means and standard error of the differences in means of the MSE's for each error rate estimator with respect to sample size (n) (values ×10⁻⁴, p = 5).

Level            n = 20   n = 60   n = 100
LDA, R(CV)       139      56       46
LDA, R(A)        443      107      41
LDA, R(ROT)      181      47       35
LDA, R(0.632)    200      49       25
QDA, R(CV)       216      71       23
QDA, R(A)        685      252      155

Standard error of the difference between means = 37.2.

Table 5.6 shows that the error rate estimators exhibited different behaviour over the various levels of Pij. The R(CV) estimator for LDA had its lowest MSE for Pij = 1 and its highest for Pij = 3, while the R(0.632) estimator had its lowest MSE for Pij = 3 and almost its highest at Pij = 1. R(CV) did best overall for Pij = 1, 4 and 5, the R(0.632) estimator for Pij = 2, and the R(ROT) estimator for Pij = 3. In the case of p = 5, these results fairly closely matched those of Ganeshanandam and Krzanowski (1990), where R(0.632) was found to perform best for less well separated populations.

Table 5.6: Means and standard error of the differences in means of the MSE's for each error rate estimator with respect to probability patterns (Pij) (values ×10⁻⁴, p = 5).

Level            Pij = 1  Pij = 2  Pij = 3  Pij = 4  Pij = 5
LDA, R(CV)       22       131      85       85       75
LDA, R(A)        74       184      214      268      245
LDA, R(ROT)      86       96       48       120      89
LDA, R(0.632)    118      74       63       118      84
QDA, R(CV)       56       114      109      117      120
QDA, R(A)        172      219      481      485      462

Standard error of the difference between means = 48.

Figures 5.1 to 5.3 show the MSE's for the different error rate estimators over the different levels of Pij, for sample sizes of 40, 120 and 200 respectively. The apparent estimators for both LDA and QDA are not shown, due to their exceedingly high MSE's in most cases. The general trend for the other estimators was an increase in MSE as sample size decreased, while decreasing the distance between populations generally increased the MSE, though there were some exceptions. The graphs show that the R(CV) estimator was the most consistent, and thus reliable, over all combinations of probability pattern and sample size. The R(CV) estimator for QDA was sometimes the most reliable estimator, but in other situations was the least reliable, especially for larger sample sizes. The R(0.632) estimator did best for poorly separated populations and n = 120 or 200, though not for populations which were the same (Pij = 5).

[Figures 5.1 to 5.3: Comparison of the MSE's of four error estimation methods for LDA and QDA for sample sizes of 40, 120 and 200 respectively (p = 10). In each figure: A = LDA, R(CV); B = LDA, R(ROT); C = LDA, R(0.632); D = QDA, R(CV).]

A comparison of the reliability of the n-fold cross-validation error rates for LDA, QDA and CART showed that, when p = 5, the R main effect was not significant; that is, there were no differences in the reliability of the R(CV) estimators between the three methods. For p = 10, the R main effect (F = 3.45) was the only significant result, and then only at the 5% level. Table 5.7 shows that the LDA and QDA R(CV) estimators were roughly equally reliable, while that for CART was comparatively high (roughly twice the magnitude of the LDA estimate).
This discrepancy was due mainly to the overoptimistic estimates of R(T) produced by cross-validation when Pij = 5.

Table 5.7: Means and standard error of the difference in means of the MSE's for the n-fold cross-validation error rate estimates for each classification method (values ×10⁻⁴, p = 10).

        LDA   QDA   CART
        43    48    82

Standard error of the difference between means = 17.

5.4.4 Summary

For five-dimensional categorical data, it was found when comparing the reliability of various error rate estimators for LDA and QDA that sample size and probability patterns for each population were important in determining differences between error rate estimators. The R(CV) estimator for LDA did best for either small samples or large differences between populations, while the R(ROT) and R(0.632) estimators were better for larger samples or smaller differences between populations. For ten-dimensional categorical data, it was found that the interaction of sample size and probability patterns was important in differentiating the estimators, while R(CV) was the most reliable estimator overall. In comparing the n-fold cross-validation estimators for LDA, QDA and CART, it was found that there were no differences present in the five-dimensional case. In the ten-dimensional case, however, LDA and QDA had the most reliable estimates, with CART some distance behind.

5.5 CONCLUSIONS

In this chapter, four classification methods were compared in the setting of p-variate categorical data. In the case of five-dimensional categorical data, the only significant effect was the overall difference between the methods, where CART was found to be the best method. For the ten-dimensional data, the results followed a very similar pattern to those for the continuous data, in that LDA did best when every variable counted an equal amount towards the classification rules, or where there was fairly good separation between groups. CART, on the other hand, did better for less well separated groups, or where only a few variables were important to the creation of the classification rules. As well, CART did better for smaller samples and LDA for larger samples. CART thus appears particularly useful for categorical data. A comparison of different error rate estimators for LDA and QDA showed that the n-fold cross-validation estimator for LDA was the better estimator of the actual error rate. A comparison of the n-fold cross-validation error rate estimators for LDA, QDA and CART showed that the CART estimator was the least reliable of the three. This last finding led to a comparison of the various error rate estimators for CART using categorical data, the results of which are reported in Chapter 6, along with a comparison of the various estimators for continuous data. Another interesting finding was that there was only a minor difference between the third and fourth levels of the probability pattern variable (see Table 5.1). Ganeshanandam and Krzanowski (1990) reported that the error rates for the third level were twelve to seventeen points higher than those for the fourth level; the results here have shown that difference to be only a few points. It is possible that Ganeshanandam and Krzanowski (1990) actually used different probability patterns than those stated, because, theoretically, the error rates for the third level should be closer to those in level 4 rather than level 2, as occurred in their studies.

6. CART SIMULATION STUDY
6.1 INTRODUCTION

In Section 4.6, an investigation was carried out into the performance of various techniques for determining tree size and estimating the actual error rate in CART. Recent studies have suggested, however, that the range of Bayes error rates directly affects the performance of the error rate estimators, especially the 0.632 estimator for CART in the case of continuous data. The first objective of this chapter was to compare the various error rate estimators over a wider range of Bayes error rates, and with reduced sample sizes, compared with those studied in Chapter 4. Numerous studies (Efron, 1983, Gong, 1986 and Crawford, 1989, for instance) have shown that sample size is a crucial factor in determining the performance of an error rate estimator. A second objective was to compare the various error rate estimation techniques for CART over the categorical data sets used in Chapter 5, to determine whether patterns similar to those observed for continuous data could be seen. Thirdly, based on the comments of Feng et al (1993), a comparison between the zero and one standard error rules for selecting the right sized tree was carried out, in order to determine in which situations, if any, one should or should not use the one standard error rule. A final objective was brought about by a study of Fitzmaurice et al (1991), who affirmed that the untransformed error rate scale, bounded by 0 and 1, may not be appropriate for the comparison of different methods. Thus, a number of transformations were carried out on the error rates and the effects of the transformations analysed.

6.2 ERROR RATE ESTIMATION FOR CONTINUOUS DATA IN CART

6.2.1 Previous Studies

This study was motivated by the work of Breiman et al (1984) and Crawford (1989). Breiman et al, in Section 11.7, suggest on the basis of tests on both real and simulated data sets that the bootstrap error rate estimate had lower variance than the cross-validated error rate estimate, but was highly overoptimistic when compared with those based on g-fold cross-validation. When the learning sample is large, they state that the bias effect dominates the variance, so that the g-fold cross-validation estimator is superior to bootstrapping. Breiman et al (1984) suggested that perhaps a modified bootstrap estimator could be used both to determine optimal tree size and to provide an estimate of the actual error rate. Crawford (1989) tested these assertions by comparing the performance of cross-validation, the bootstrap and the 0.632 bootstrap, using R(0.632) = 0.368 R(A) + 0.632 R(E). He found for small data sets (n = 20) that the 0.632 estimator, R(0.632), was the best in terms of having the lowest mean square error (MSE), while for larger samples (n = 100), the cross-validation estimator, R(CV), was best for high values of R(B) and R(0.632) for low values of R(B). Crawford suggested the use of a combined strategy whereby n-fold cross-validation was used to select the right sized tree and R(0.632) to estimate R(T) on the selected tree; he concluded that this combined approach minimised the chance of poor performance when faced with either a high or low value of R(B). In Section 11.5, Breiman et al, p 85, affirm that "... we have not come across any situations where taking [g] larger than 10 gave a significant improvement in accuracy for the tree selected." They suggest that ten-fold cross-validation gives adequate accuracy in most situations, and indeed, this is the default value used within the CART program. As yet, no results have appeared in the literature validating these assertions.
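To fix ideas, here is a minimal sketch in S-compatible R of Crawford's 0.632 calculation for a classification tree, using R's tree package and the iris data as stand-ins (the thesis used the CART program itself); B = 50 resamples is an arbitrary choice.

    library(tree)

    ## Apparent error rate R(A) on the learning sample.
    fit <- tree(Species ~ ., data = iris)
    RA <- mean(predict(fit, iris, type = "class") != iris$Species)

    ## Bootstrap error R(E): average error on the cases left out of
    ## each bootstrap resample ("out-of-bag" cases).
    set.seed(1)
    B <- 50
    RE <- mean(replicate(B, {
      idx <- sample(nrow(iris), replace = TRUE)
      oob <- setdiff(seq_len(nrow(iris)), unique(idx))
      bfit <- tree(Species ~ ., data = iris[idx, ])
      mean(predict(bfit, iris[oob, ], type = "class") != iris$Species[oob])
    }))

    R632 <- 0.368 * RA + 0.632 * RE

Because R(A) is optimistic and R(E) pessimistic, the weighted combination tends to land near the actual error rate, which is the property the comparisons in this chapter put to the test.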
6.2.2 Study Plan

The aim of this study was to use CART to compare the performance of the n-fold and ten-fold cross-validation, rotation and 0.632 estimators, along with the associated apparent error rates, in approximating the actual error rate, R(T), of the sample. As in Section 4.6, R(T) was found by using an independent test sample of size 5000. The objective was to expand on the work of Breiman et al and Crawford (1989), in order to find out which method was best at selecting the most "honest" sized tree. The error rate estimators were as used in Section 4.6. The 0.632 estimator, using R(ROT) in the equation instead of R(E), provides a simple alternative to the combined strategy proposed by Crawford (1989). His proposal involved a double calculation, and hence a large increase in processing time, in that n-fold cross-validation was needed to select the right-sized tree and then B bootstrap samples had to be generated in order to estimate R(T); this version uses the rotation method to select the right-sized tree and then uses that error rate in the equation for R(0.632). In addition, a comparison of the sizes of the trees produced by each method was made, to determine whether there were any differences between methods.

The data were generated from two multivariate normal populations, as it is known that CART is invariant under monotone transformations of the variables. Three factors were varied in a full factorial design: R(B), the Bayes error rate; n, the sample size, with π1 = π2 in all cases; and q, a combination of dimension (p), means and correlations between variables. The values of the first and second factors were:

(i) R(B) = 0.05, 0.15, 0.25 and 0.35;
(ii) n = 20, 100.

The third factor, q, had levels whereby μ1j = 0 for all j, where μ1j is the mean of the first population for variable j and μ2j is the mean of the second population for variable j:

1: p = 2, μ21 = μ22, ρ = 0
2: p = 2, μ21 = μ22, ρ = 0.5
3: p = 4, whereby μ21 = 2μ22 = 6μ23, μ24 = 0, and

        P1 = |  1.00  -0.50   0      0    |      P2 = |  1.00   0.75   0      0    |
             | -0.50   1.00   0      0    |           |  0.75   1.00   0      0    |
             |  0      0      1.00   0    |           |  0      0      1.00   0    |
             |  0      0      0      1.00 |           |  0      0      0      1.00 |

where Pi = [(ρijk)] is the population correlation matrix for Πi. The values for the first and second factors were similar to those used by Crawford (1989) and Fitzmaurice (1991), except that no studies were done for R(B) = 0.45, where, as noted by Fitzmaurice et al, any classification rule producing an error rate in the region of 0.45 would probably not be widely used. The levels of the third factor were chosen following some conclusions by Quinlan (1993) about parallel and sequential classification problems, summarised in Chapter 4. In this study, therefore, the first two levels of factor q correspond to situations which are less favourable to CART, while the third level corresponds to a situation more favourable to CART. Each of the 24 factor combinations was used for 25 simulations. This number of simulations was chosen so as to adequately depict the true trends: the effects of a few 'bad' samples will be minimised by the large number of 'good' samples.
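Under this design, each simulated learning sample can be drawn directly; the following is a minimal sketch in S-compatible R for level q = 2 (p = 2, ρ = 0.5), using MASS::mvrnorm. The mean shift of 2 in each coordinate is an illustrative assumption; the thesis chose the shift so as to achieve each target Bayes error rate R(B).

    library(MASS)
    set.seed(1)

    ## One learning sample: equal priors, bivariate normal populations
    ## differing only in mean, common correlation 0.5.
    Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
    n <- 20
    X <- rbind(mvrnorm(n / 2, mu = c(0, 0), Sigma = Sigma),
               mvrnorm(n / 2, mu = c(2, 2), Sigma = Sigma))
    class <- factor(rep(1:2, each = n / 2))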
Four criteria of performance were used to compare the various error rate estimators. The first was the bias of each technique in estimating R(T), where

    bias = R(t) − R(T),    t = CV, A, ROT, 0.632 or TEN;

the second was the standard deviation of the bias; and the third was the MSE, where

    MSE = E[(R(t) − R(T))²].

A fourth measure used was the COUNT criterion, corresponding to the proportion of samples for each factor combination in which the estimated error rate was less than the actual error rate; it is therefore a measure of the optimism involved in using each estimator. A large value for COUNT, say greater than 75%, would correspond to an overoptimistic estimation, whereas a low value, say less than 25%, would correspond to a pessimistic (under-optimistic) estimation. For all the data sets in this section, the zero standard error rule was employed, while the size below which a node will not be split was set to five. Independent test samples of size 5000 were used throughout to determine R(T).
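The four criteria translate directly into code. A minimal sketch in S-compatible R, where est and RT are hypothetical vectors holding, for one factor combination, the estimated and actual error rates across the 25 simulated samples:

    ## bias, standard deviation of the bias, MSE and COUNT for one
    ## estimator over one factor combination.
    criteria <- function(est, RT) {
      bias <- est - RT
      c(bias  = mean(bias),
        sd    = sd(bias),
        MSE   = mean(bias^2),
        COUNT = mean(est < RT))   # proportion of optimistic samples
    }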
6.2.3 Results

As in Chapters 4 and 5, a split-plot ANOVA was used to assess the relative importance of the experimental factors in influencing the MSE's for the various error estimation techniques. A large number of replicates were carried out in this study, in contrast to Chapters 4 and 5, hence the F-ratio from the ANOVA should not be regarded as a true measure of the significance of each result. The R (method) main effect (F = 87.68) and the R * R(B) (method by Bayes error) interaction (F = 20.18) dominated the other R * factor interactions. All other effects were very small, though, rather surprisingly, the R * R(B) * n * q interaction was the next largest (F = 2.57). Therefore, it was decided to compare the four error estimation techniques over all possible combinations of R(B), n and q. The results for the average bias, standard deviation, MSE's and COUNT's are presented graphically in Figures 6.1 to 6.24. Only the R(CV), R(ROT), R(0.632) and R(TEN) estimators are shown, as the values for the respective apparent estimators were on most occasions highly overoptimistic, leading to extreme values of the above four measures.

Figures 6.1 to 6.6 show the average bias values for each estimator over the ranges of the factors used. The results show that for almost all values of R(B), when n = 20 and q = 1 and 2, R(0.632) had the lowest bias; all other methods were markedly overly pessimistic in the estimation of R(T). The exception to the rule was when q = 2 and R(B) = 0.15. When q = 3, however, a different picture emerged: the R(CV), R(0.632) and R(TEN) estimators all had similar bias, with this bias increasing pessimistically as R(B) increased, while the R(ROT) estimator was consistently more pessimistic than the other three estimators. For larger samples (n = 100), it is noticeable that the performance of the R(0.632) estimator deteriorated as R(B) increased, in that the bias became overly optimistic. The R(CV) estimator was usually the least biased for R(B) = 0.05 and 0.15, but deteriorated for higher R(B); in those situations, R(TEN) produced the lowest errors. For highly separated populations (R(B) = 0.05), the R(0.632) estimate was comparable to or better than the R(CV) estimate. As with the smaller samples, the R(ROT) estimator was consistently pessimistic.

Turning to the standard deviations of the estimates, it can be seen for smaller samples (Figures 6.7 to 6.9) that, for q = 1 and 3, all estimators exhibit a distinctive inverted-U shaped pattern, in that the lowest standard deviations occurred for either the lowest or the highest R(B). For q = 2, the trends were different for all estimators. In terms of performance, the R(0.632) estimator had the lowest variability when R(B) = 0.05 and 0.15, though the highest variability when R(B) = 0.25 and 0.35. The large variability of the R(CV) and R(TEN) estimators is clearly evident.

[Figures 6.1 to 6.3: Comparison of the bias of four error estimation methods for CART with sample sizes of 20 and q = 1, 2 and 3 respectively. In all the figures of this chapter: A = R(CV); B = R(ROT); C = R(0.632); D = R(TEN).]

[Figures 6.4 to 6.6: Comparison of the bias of four error estimation methods for CART with sample sizes of 100 and q = 1, 2 and 3 respectively.]

[Figures 6.7 to 6.9: Comparison of the standard deviations of four error estimation methods for CART with sample sizes of 20 and q = 1, 2 and 3 respectively.]

For larger samples (Figures 6.10 to 6.12), a more linear trend is apparent for all estimators, in that variability increased as R(B) increased.
For q = 1 and 2, the R(0.632) estimator was often the least variable estimator for higher R(B), with roughly similar variability to the cross-validation estimators for lower R(B); for q = 3, it was the best estimator for low R(B) and the worst for high R(B).

In the next analysis, the bias and variation of the estimators were combined into the MSE criterion. For small samples (Figures 6.13 to 6.15), it is clear that for q = 1 and 2 the R(0.632) estimator did best except when R(B) = 0.35, and that the best results occurred at low R(B) and the worst at moderate R(B). It should be noted, though, that the R(0.632) estimator was the least affected of all estimators by changes in the values of R(B). In accordance with the results of Crawford (1989), the R(CV) estimate had high MSE, due mostly to its large variability, as shown in Figures 6.7 and 6.8. For q = 3, a slightly different picture emerged: all estimators had roughly similar MSE except R(ROT) when R(B) <= 0.25. In the case of n = 100, similar trends can be seen in all three graphs (Figures 6.16 to 6.18). Generally, the performance of each estimator deteriorated as R(B) increased. The R(CV) and R(TEN) estimates performed very much the same. The R(0.632) estimator did best overall for q = 1 and 2, while the R(CV) and R(TEN) estimators had the lowest MSE for q = 3.

Figures 6.19 to 6.24 provide another measure of performance, giving the values of the COUNT's of optimism for each estimator. Values closest to 0.5 were the most ideal. For small samples (Figures 6.19 to 6.21), the trends exhibited are very similar to those exhibited for bias. For q = 1 and 2, the R(0.632) estimator produced the most unbiased estimates of error while the other estimators were highly pessimistic. For q = 3, all methods except R(ROT) had similar COUNT's. For R(B) = 0.05 and 0.15, these estimates were around 0.5, but for R(B) = 0.25 and 0.35 the estimates were highly pessimistic. For larger samples (Figures 6.22 to 6.24), it is clear that for q = 1 and 2 the R(0.632) estimator was consistently optimistic. For q = 3, this also occurred when R(B) >= 0.15, but for R(B) = 0.05 the proportion of samples where R(T) was either over or underestimated was roughly 0.5. For q = 1, R(CV) did best, while for q = 2, R(TEN) did best with R(CV) tending to be rather optimistic. For q = 3, both R(CV) and R(TEN) performed equally well. The overly pessimistic nature of the R(ROT) estimator is reinforced by these results; in other words, R(ROT) was much higher, on average, than the actual error rate, R(T).

[Figures 6.10 to 6.12: Comparisons of the standard deviations of the four error estimation methods for CART with sample sizes of 100 and q = 1, 2, 3.]
[Figure pages: Figures 6.13 onwards, comparing the MSE's of the four error estimation methods for CART (sample sizes of 20 and 100, q = 1, 2, 3) and the bias and MSE's of the four methods for the categorical data over the probability pattern levels (p = 5, various sample sizes). Figure 6.31: Comparison of the bias of four error estimation methods for CART for different probability patterns, p = 10. Figure 6.32: Comparison of the MSE's of four error estimation methods for CART for different probability patterns, p = 10.]

Figure 6.32 shows that in terms of MSE, R(0.632) was clearly best for Pij = 1 and 2, while R(ROT) was the most reliable for Pij = 3 and 4. R(CV) was found to be the least reliable estimator in all situations. Interestingly, these results mirror those found for continuous data, in that R(0.632) was shown to be the best estimator for parallel classification problems, which were not suited to CART, as well as for the best separated populations (Pij = 1 and 2). Note, though, that the reliability of the R(0.632) estimate exhibited a distinctive U shaped pattern, doing best for either well-separated or identical populations. Figure 6.33 shows that R(0.632) was the least affected by changing sample size and was the least biased for small samples.
The high optimistic bias of R(CV) for small samples is clearly evident, and it remained optimistic for larger samples. In terms of MSE, Figure 6.34 shows that R(0.632) did best for n = 40. For n = 120, R(0.632) was slightly better than R(ROT), while for n = 200, R(0.632), R(ROT) and R(TEN) did equally well. The R(CV) estimator performed consistently the worst for all sample sizes.

A final analysis in this section compared the sizes of the trees resulting from each of the three error estimation methods for both p = 5 and p = 10 binary variables. For p = 5 variables, the split-plot ANOVA showed there to be a difference between methods (F = 7.69), with R(ROT) producing the smallest sized trees (2.03 terminal nodes) and R(CV) the largest (2.64 terminal nodes). For p = 10, again the method effect (F = 7.7) dominated all others, with the R(ROT) estimator producing the smallest sized trees (2.9 terminal nodes) on average. The average sized trees produced by R(CV) and R(TEN) contained 5.07 and 4.44 terminal nodes respectively, implying that the rotation method produced trees with 1.5 fewer terminal nodes than any other method.

[Figure 6.33: Comparison of the bias of four error estimation methods for CART for different sample sizes, p = 10. Figure 6.34: Comparison of the MSE's of four error estimation methods for CART for different sample sizes, p = 10.]

6.3.3 Summary

This study has shown that for categorical data, the R(0.632) and R(ROT) estimators produced the most reliable classification trees as well as the simplest. It was found that, as for the continuous data in Section 6.2, the R(0.632) estimator was the most reliable for small samples, parallel classification problems and well separated populations. As with the continuous data, and in agreement with other studies, the R(CV) estimator was found to be a poor estimator for small samples and highly optimistic, especially for poorly separated populations. Based on the results of this study, the use of n-fold cross-validation with categorical data is not recommended.

6.4 THE STANDARD ERROR RULE IN CART

6.4.1 Previous Studies

The motivation for doing this study was provided by Breiman et al (1984). They recommended the use of the one standard error (1-SE) rule, firstly, for the sake of accuracy, noting that in most cases the cross-validation estimate of error was overoptimistic. Secondly, they stated that a plot of R(CV) against tree size had the characteristics of an initial sharp decrease followed by a long, flat valley across a wide range of tree sizes and then an increase for very small trees. Inside the long valley, most error rates were found to be within the +/- 1-SE range, and the position of the minimum may be unstable. The 1-SE rule was used to reduce that instability as well as to produce trees which are as simple as possible.
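The rule itself is mechanical enough to express directly. Below is a minimal sketch, not tied to any particular CART implementation: given the cross-validated error and its standard error for each subtree in the pruning sequence, the 0-SE rule takes the subtree with the minimum error, while the 1-SE rule takes the smallest subtree whose error is within one standard error of that minimum. The example pruning sequence is hypothetical.

```python
def select_tree(sizes, cv_errors, std_errors, k=1.0):
    """Pick a subtree from a pruning sequence.

    sizes      : number of terminal nodes for each candidate subtree
    cv_errors  : cross-validated error rate for each subtree
    std_errors : standard error of each cross-validated error rate
    k          : 0.0 gives the 0-SE rule, 1.0 the 1-SE rule
    """
    # Locate the minimum cross-validated error and its standard error.
    i_min = min(range(len(cv_errors)), key=lambda i: cv_errors[i])
    threshold = cv_errors[i_min] + k * std_errors[i_min]
    # Take the smallest tree whose error does not exceed the threshold.
    candidates = [i for i in range(len(sizes)) if cv_errors[i] <= threshold]
    return min(candidates, key=lambda i: sizes[i])

# Hypothetical pruning sequence: a long flat valley, then a rise for tiny trees.
sizes = [12, 9, 7, 5, 3, 2, 1]
cv_errors = [0.21, 0.20, 0.19, 0.20, 0.21, 0.27, 0.40]
std_errors = [0.03] * 7

print(sizes[select_tree(sizes, cv_errors, std_errors, k=0.0)])  # 0-SE: 7 nodes
print(sizes[select_tree(sizes, cv_errors, std_errors, k=1.0)])  # 1-SE: 3 nodes
```

The example illustrates the point made above: the 0-SE rule lands at the unstable minimum of the valley, while the 1-SE rule trades a nominally higher error for a much simpler tree.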
Feng et al (1993) carried out a small-scale empirical study comparing the zero standard error (0-SE) and 1-SE rules using various data sets. They found that trees produced by the 0-SE rule were between two and ten times larger than those constructed using the 1-SE rule, so that the latter were biased towards simplicity. In determining which rule was better they were rather inconclusive: "We believe that there is no single best rule, instead it depends on how much "noise" there is in the data. If there is little noise in the data, then the 0-SE rule should be used. If there is a lot of noise then ... the 1-SE rule should be used." (ibid, p 49.)

6.4.2 Study Plan

In Section 6.2, the performance of various error rate estimators was compared in estimating the actual error rate using the 0-SE rule. In this section, only two estimators, R(CV) and R(0.632), were used, one or other of which had worked best in every factor combination studied in Section 6.2. The two error estimation methods were compared over the factor combinations studied in Section 6.2, involving the Bayes error rate, R(B), sample size, n, and the third factor q, using both the 0-SE and 1-SE rules, with the objective of determining in which situations either of the two rules should be used. Comparisons were made using both the bias and MSE performance criteria. In addition, a comparison of the decrease in tree size produced by using the 1-SE rather than the 0-SE rule was made for q = 3 only.

6.4.3 Results

Figures 6.35 to 6.37 compare the average bias for n = 20. For q = 1 and 2, it is clear that both the R(CV) and R(0.632) estimators using the 0-SE rule were less biased than the corresponding estimates using the 1-SE rule, except for R(B) = 0.05. For q = 3, there was little to choose between the use of the 0-SE or 1-SE rules, except for the R(0.632) estimate at larger R(B), which was excessively pessimistic. Note too that the two R(CV) estimators exhibited very similar trends as functions of R(B) while the two R(0.632) estimates did not. For q = 1 and 2, the average bias decreased as R(B) increased using the 0-SE rule, while the bias increased as R(B) increased using the 1-SE rule.

Figures 6.38 to 6.40 illustrate the cases of n = 100. For q = 1, 2 and 3, the estimates using the 1-SE rule were generally less biased than those using the 0-SE rule, with the latter tending to be overoptimistic. For q = 3, the disparity in bias between the 0-SE and 1-SE rule estimates was less marked for low R(B) than for high R(B). As with n = 20, the R(CV) estimates followed similar patterns while the R(0.632) estimates behaved rather differently.

In terms of MSE, Figures 6.41 to 6.43 illustrate the cases of n = 20. The trends shown are very similar to those for bias. For q = 1 and 2, the 0-SE estimates produced the lowest MSE except when R(B) = 0.05. For q = 3, the difference between methods was marginal.
[Figures 6.35 to 6.43: Comparisons of the bias (for sample sizes of 20 and 100) and the MSE's (for sample sizes of 20) of the two error estimation methods for CART with and without the one standard error rule, q = 1, 2, 3.]

For larger samples, Figures 6.44 to 6.46 show trends slightly different from those for bias. For q = 1 and 2, one of the 0-SE estimates was lowest in terms of MSE for lower R(B), with R(0.632) using the 1-SE rule having a very high MSE, while for higher R(B) (R(B) >= 0.25) the 1-SE estimates were more reliable. For q = 3, the 1-SE estimates were equivalent to those using the 0-SE rule for R(B) = 0.05 and better than the 0-SE estimates for R(B) = 0.15 and 0.25. For R(B) = 0.35, the 0-SE estimates performed best.
These results would tend to suggest that sample size plays an important part in determining the choice between the 0-SE and 1-SE rules for CART, as well as whether or not there is any noise in the data. Based on these results, the recommendation is to use the 0-SE rule with very small data sets for parallel classification problems involving little or no noise, while for sequential classification problems with some noise in the data, the 1-SE rule is preferred on the grounds of simplicity. For larger samples, the 0-SE rule should be used for well separated populations in cases involving parallel classification problems. For sequential classification problems, the 1-SE rule should be used unless the populations are not well separated.

Comparing the tree sizes obtained by using the 1-SE rule instead of the 0-SE rule showed that overall, for both methods, n had a very large effect (F = 12.94) compared with all other effects and interactions. This implies that sample size was a major factor in determining whether or not tree size decreased with the use of the 1-SE rule. In fact, the overall decrease in tree size was one terminal node greater for large n than for small n. Investigation of the probability model stratum of the ANOVA showed there to be no significant method effect or method by factor interactions; therefore the decrease in tree size resulting from using the 1-SE rule was no different for either of the two error estimation methods. On the evidence here it would appear that tree size was little affected by using the 1-SE rule instead of the 0-SE rule, certainly less than suggested by Feng et al (1993), though for larger samples with more variables the decrease may be much greater.

[Figures 6.44 to 6.46: Comparisons of the MSE's of the two error estimation methods for CART with and without the one standard error rule, for sample sizes of 100 and q = 1, 2, 3.]

6.4.4 Summary

The results from this particular study have indicated that when using either the R(CV) or R(0.632) estimates both to select tree size and to calculate an "honest" estimate of R(T), the 1-SE rule should not be used for either small samples or when there is little noise in the data, unless the populations are not well separated.
For situations where there exists a lot of noise in the data, the 1-SE rule is preferred unless the populations are not well separated.

6.5 TRANSFORMATIONS OF ERROR RATES

6.5.1 Study Plan

The previous sections have dealt with the analyses of untransformed error rates, so a difference of 0.1 from R(T) = 0.05 was treated the same as a difference of 0.1 from R(T) = 0.35. This seems a somewhat unfair and inappropriate comparison, as suggested by Fitzmaurice et al (1991), in that the former difference should receive more weight than the latter (see Section 4.5.3). In this section, two transformations of the error rates were tried in an attempt to right this imbalance, namely the logit and proportion transformations. For the logit transformation, R(T) was replaced by LR(T), where

LR(T) = ln[R(T) / (1 - R(T))]

and its estimate, R̂(T), by

LR̂(T) = ln[R̂(T) / (1 - R̂(T))],

while for the proportion transformation, R(T) was replaced by PR(T), where

PR(T) = (R(T) - R(T)) / R(T) = 0

and its estimate, R̂(T), by

PR̂(T) = (R̂(T) - R(T)) / R(T).

For example, if R(T) = 0.05 and R̂(T) = 0.2,

LR̂(T) - LR(T) = ln[0.2 / (1 - 0.2)] - ln[0.05 / (1 - 0.05)] = 1.558
PR̂(T) - PR(T) = (0.2 - 0.05) / 0.05 = 3,

while if R(T) = 0.35 and R̂(T) = 0.5,

LR̂(T) - LR(T) = ln[0.5/0.5] - ln[0.35/0.65] = 0.619
PR̂(T) - PR(T) = (0.5 - 0.35) / 0.35 = 0.429.

From these results it can be seen that the proportion transformation has the greater effect on the error rates, as the ratio of the magnitudes of the differences between (0.2 - 0.05) and (0.5 - 0.35) is 7 for the proportion transformation and 2.517 for the logit transformation. The two transformations were applied to the R(CV), R(ROT), R(0.632) and R(TEN) error rate estimates calculated in Section 6.2, with the intention of determining what differences, if any, appeared in the MSE's for the four estimators.

6.5.2 Results

As with Fitzmaurice et al (1991), the two transformations had very similar effects on the patterns of MSE's; therefore, only the results for the proportion transformation are demonstrated here. The results for n = 20 appear in Figures 6.47 to 6.49 and differ from the untransformed results, given in Figures 6.13 to 6.15, in a number of respects. Firstly, all estimators now exhibit the general trend of an initial sharp decrease in MSE going from R(B) = 0.05 to 0.15, then a gradual decrease from R(B) = 0.15 to 0.35. Note, though, that the MSE's for the R(0.632) estimator were least affected by changes in the values of R(B). For q = 1 and 2, the differences between the R(0.632) and the other estimators for R(B) = 0.05 were accentuated. As with the untransformed scale, the R(0.632) estimator did best when q = 1 and 2, except for high R(B), while no single estimator was best for q = 3.

For larger samples, a slightly different trend from that seen with smaller samples is highlighted in Figures 6.50 to 6.52, with the initial decrease in MSE for all estimators except R(ROT) being not as large as that for small samples, and with the MSE increasing for R(B) = 0.35. As with smaller samples, though, the performance of each estimator is clearly defined for low R(B). R(0.632) did best for moderate R(B) in the cases of q = 1 and 2, and for high R(B) for q = 3.
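As a quick numerical check of the two transformations defined in the study plan above, the short sketch below reproduces the worked differences (1.558 and 3 for the well-separated case, 0.619 and 0.429 for the poorly separated case). It is purely illustrative and assumes nothing beyond the formulas already given.

```python
import math

def logit_diff(est, actual):
    """LR̂(T) - LR(T) on the logit scale."""
    logit = lambda r: math.log(r / (1 - r))
    return logit(est) - logit(actual)

def proportion_diff(est, actual):
    """PR̂(T) - PR(T); PR(T) itself is zero by construction."""
    return (est - actual) / actual

for actual, est in [(0.05, 0.2), (0.35, 0.5)]:
    print(f"R(T) = {actual}, estimate = {est}: "
          f"logit diff = {logit_diff(est, actual):.3f}, "
          f"proportion diff = {proportion_diff(est, actual):.3f}")
# Prints 1.558 and 3.000 for the first pair, 0.619 and 0.429 for the second,
# matching the worked example in the study plan.
```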
6.5.3 Summary

The results reported here were very similar to those given by Fitzmaurice et al (1991) using LDA. A comparison of the MSE's of the four error estimation methods was not greatly affected by the transformations. However, as recorded by Fitzmaurice et al (1991), the methods now performed best for high R(B) and worst for low R(B), in contrast with the untransformed results, where for small samples the MSE was highest for moderate R(B) while for larger samples it was largest for high R(B).

[Figures 6.47 to 6.52: Comparisons of the transformed MSE's of the four error estimation methods for CART with sample sizes of 20 and 100 and q = 1, 2, 3, using the proportion transformation.]

6.6 CONCLUSIONS

In this chapter, simulation study results have shown that the R(0.632) method for estimating the actual error rate when using CART performed well in most situations for both continuous and categorical data. For continuous data, R(0.632) clearly had the lowest MSE for smaller samples and parallel classification problems, and marginally lower MSE for smaller samples and sequential classification problems as well as for larger samples and parallel classification problems. Only when the classification problem was sequential and the distance between populations was moderate to large did other techniques outperform R(0.632). For categorical data, most of the trends noted above were also observed. The R(CV) estimator, as for continuous data, was found to be a poor estimator for small samples and highly optimistic for poorly separated populations. For both continuous and categorical data, the rotation estimator, R(ROT), was found to produce the smallest sized trees.

In studies comparing the use of the zero and one standard error rules in CART, it was found that the one standard error rule should not be used either for small samples or when there is little noise in the data, unless the populations are poorly separated. In all other situations, the one standard error rule is the preferred method. Finally, a transformation of the error rate scale produced results which were not unexpected.
For large differences between populations, that is, low Bayes error rates, the differences between error estimation techniques were accentuated compared with the case of untransformed error rates. Therefore, the technique of using the rotation method to select tree size and then using the 0.632 method to estimate the actual error rate of a data set is recommended as a quick, easy and reliable technique when used with CART decision trees. However, as mentioned by Crawford (1989), the user should not be constrained to using one method to select the right-sized tree, but should instead, with a mixture of common sense and prior knowledge of the domain, make a sensible tree selection.

7. CASE STUDIES

7.1 INTRODUCTION

In this chapter, a number of the classification methods outlined in Chapters 2 and 3 are applied to 24 real-world data sets and compared by means of a number of criteria to be outlined later in this chapter. These data sets are used to either validate or not some of the conclusions reached after the simulation studies undertaken in Chapters 4, 5 and 6. Later in this chapter, a comparison of various tree-based methods is made for one particular data set.

7.2 PREVIOUS STUDIES

Ildiko and Lanteri (1989) compared LDA, QDA, SIMCA (a form of QDA) and CART on four data sets selected from various fields of chemistry. They concluded that no overall method was superior in terms of prediction error. They also recommended that the type of data structure involved should be explored before choosing the optimal rule for that particular type of data. If in doubt, several different methods should be used and compared.

Lynn and Brook (1991) undertook an empirical study comparing the performance of traditional discrimination methods with CART on twelve predominantly multivariate normal classification problems, differing in sample size, dimension and modality. Subsequently, it was found that for only three of the data sets was the assumption of equality of variances valid; hence, it was decided to use only LDA to compare with CART. For the other nine data sets investigated, both QDA and kernel density estimation were carried out. In all cases, comparisons were made by means of n-fold cross-validation. The findings of this paper suggested that CART does not perform as well as discriminant analysis in cases where the data set is small and/or simple, but does perform at least as well as discriminant analysis in most cases where the data set is larger and/or complex (multi-modal, non-normally distributed and/or high-dimensional), especially where the covariance structure is heterogeneous.

Feng et al (1993) reported a number of papers from the literature which have compared various classification methods. However, they noted a number of problems common to the papers referred to, such as applying different methods to data sets which were not the same, and using old versions of some methods while using the latest versions of others. In their study, Feng et al, as mentioned earlier, compared a large number of classification methods for eight data sets involving industrial applications. Generally, the data sets were of much larger sample size and dimension than those used by Lynn and Brook (1991). "In conclusion, it seems that there is no one particular algorithm or one particular method superior to the others on all the data sets. There is indication from our results that which algorithm performs best depends on the characteristics of the data sets.
Our work is, however, incomplete in the analysis of such dependent relationships." (Feng et al, 1993, p 51.)

Brown et al (1993) compared CART with neural networks, a method which uses multiple layers of processors, each producing a weighted non-linear function of the variables. Their comparative studies were carried out for several multi-modal classification problems and found that the two methods produced classification rules with comparable error rates, but that CART is preferred for data sets with a large number of irrelevant or noisy variables and when the ratio of sample size to dimension is small.

7.3 COMPARATIVE STUDIES

7.3.1 Methods and Data Sets

Five classification methods were used in this study, involving two categories of methods:

(1) Traditional discrimination methods, which include LDA, QDA and kernel density estimation.

(2) Decision tree-based methods, which include CART and FACT.

Twenty four data sets, described in Tables 7.1 and 7.2, were chosen for the purpose of comparison. All the data sets were a convenient selection of published data. Twelve of the data sets were used previously in a comparative study undertaken by Lynn and Brook (1991). Those data sets, however, all contained continuous variables, nine of which were approximately normally distributed. The additional twelve data sets used here contain a wider variety of data types, including some data sets involving binary and ordinal categorical variables.

Table 7.1: Description of data sets (block 1)

A. Mammo: This problem involves an attempt to discriminate between women's experiences with mammography (three levels) based on five variables describing their knowledge, attitude and behaviour towards mammography. Source: R J Zapka and Ms D Sports, University of Massachusetts, Division of Public Health.

B. Marks1: This involves discriminating between males and females based on their Grade Point Average at university and five pre-university academic variables. Source: Moore and McCabe (1989).

C. Marks2: This involves discriminating between three groups of students with different majors on the basis of the same six variables in B. Source: Moore and McCabe (1989).

D. Marks3: This involves discriminating between six groups of students with different sex and/or majoring subject on the basis of the same six variables in B. Source: Moore and McCabe (1989).

E. Digit: In this example, the data are generated from a faulty calculator. Each of the seven lights (X1, ..., X7) of the digit display has 0.1 probability of not doing what it is supposed to do. The problem is an attempt to distinguish between the values 1 to 10, which occur with equal probability (a simulation sketch of this mechanism is given after Table 7.3). Source: Breiman et al (1984).

F. Birth: For this set of data, an attempt was made to discriminate between overweight and underweight babies based on various medical and demographic variables relating to the mother. Source: Hosmer and Lemeshow (1989).

G. Family: This problem was analysed by Kumar (1993), containing information on the type of contraceptive device used by 174 Indian couples. An attempt was made to discriminate between the four types of contraceptive device based on the values of twelve variables collected from each couple. Source: Family Planning Association of India.
H. Iris: This is the classic problem posed by Fisher involving discrimination between three allied species of iris based on four measurements relating to the size of the iris. Source: Fisher (1936).

I. Enures: One method of treatment of enuretic children involves an alarm buzzer which wakes the child whenever a bed becomes wet. It was proposed to investigate whether the outcome of the treatment could be predicted from seven measurements, where the possible outcomes are 1 = fail, 2 = relapse after apparent cure and 3 = long term cure. Source: Dr Sylvia Dische (from Hand (1981)).

J. Blood1: In the context of genetic counselling, the question of discriminating between normal and haemophilia A carrying women was considered on the basis of two variables. Source: Habbema, Hermans and Van Den Broek (1974).

K. Pinetree: This data consists of the measurements, in centimetres, of 60 pinetrees which were felled in three different areas of the forest. For each tree, measurements were taken at four positions. The problem involves distinguishing between trees grown in each of the three areas. Source: NZ Forestry Department.

L. Wheat: This problem involves discriminating between two varieties of wheat on the basis of six measurements taken from a sample of the two species. Source: Indian Agricultural Research Institute, India (1972).

M. Biomass1: This problem involves discriminating between three different islands on the basis of the growth of spartina biomass and four different chemicals from each of the three islands. Source: Rawlings (1988).

N. Biomass2: This involves distinguishing between three different types of vegetation cover on the basis of the same five variables in M. Source: Rawlings (1988).

O. Biomass3: This involves distinguishing between nine different location-vegetation types on the basis of the same five variables in M. Source: Rawlings (1988).

P. Comp1: Users of the University of London Computer Centre are divided into non-medical and medical users. An attempt is made to distinguish between the two based on the numbers of units of computing used under two different operating systems. Source: Hand (1981).

Q. Employ: This problem involves discriminating between three groups of countries (North-Western, Southern and Eastern Europe respectively) on the basis of the percentages of the labour force employed in nine different types of industry. Source: Euromonitor (1979).

R. Urinary: This problem involves discriminating between homosexual and heterosexual males on the basis of two chemical measurements taken from urinary samples. Source: Margolese (1970).

Table 7.2: Description of data sets (block 2)

S. Blood2: This is the same problem as J except that the two discriminating variables are logged (base 10).

T. Lingual: This problem involves discrimination between a set of 27 children who had an inborn error of metabolism known as transient neonatal tyrosinemia (TNT) and a control group of 27 normal children, based on the scores of ten psycholingual variables. Source: Peter Mullins.

U. Sparrow: This problem involves discriminating between sparrows that did or did not survive a severe storm off Rhode Island on 1 February 1889, on the basis of five measurements taken on each bird. Source: Bumpus (1898).

V. Comp2: This is the same problem as P except that the two discriminatory variables are logged (natural log).

W. Beetle: In this case,
an attempt is made to distinguish between two allied species of flea beetles that were long confused with one another, on the basis of two joint measurements. Source: Lubischew (1962).

X. Nuclear: This data involves two measurements (population and area) on each of the fifteen largest British cities, excluding London. The fifteen cities used are divided into two classes: those with an estimated fatality rate of 70% or more resulting from a nuclear strike and those with an estimated rate of less than 70%. Source: Laurie (1979).

The data sets were first sorted into two blocks after a Chi-squared test for heterogeneity of variance within groups was carried out for each data set; where heterogeneity was present to a significant degree, those data sets were assigned to the first block (Table 7.1), otherwise to the second (Table 7.2). Within each of the two blocks of comparison methods, the data sets were ordered by sample size. As some data sets within each block were of similar size, those data sets were further ordered by dimension.

Table 7.3 lists some details about each of the data sets, including sample size, dimension, number of classes, the Chi-squared test for the equality of class covariance matrices, equal priors or not, variable types and data structure. Variable type refers to whether the variables in the data set are continuous (normal (N)/skewed (S)), ordinal (O), nominal (C), binary (B) or a mixture of the above five types. Data structure refers to how many of the variables in the data were important for the classification process and how many were irrelevant. A data set could be described as either a parallel classification problem, whereby all of the variables have approximately equal weighting, or as a sequential classification problem where relatively few variables are important. As outlined in Section 4.3, traditional discrimination methods should perform best for parallel classification problems, with tree-based methods doing better for sequential problems. A third category, "mixed", was also used for problems where a particular data set did not fit neatly into either of the above two categories. Both stepwise discriminant analysis and CART's variable ranking technique were employed to determine into which category each of the twenty four data sets should be classed.

For each data set, priors proportional to class sample sizes were used, and for all methods, with the exception of FACT, models were obtained which minimised the misclassification error rate by n-fold cross-validation, although LDA and CART were also compared using the 0.632 error rate (see further on in this section). Tenfold cross-validation was used with FACT for reasons outlined in Section 4.3. With kernel density estimation, a normal kernel was used with smoothing parameter h = 0.5. For both CART and FACT, the size below which a node would not be split was set to 5. It was decided to use the one standard error rule throughout for CART for the purpose of consistency, although simulation results in Section 6.4 had shown that use of the zero standard error rule would give a more reliable estimate of the actual error rate for smaller data sets.

Table 7.3: Data sets and their various characteristics

Data Set   Sample Size   Dimension   Classes   Chi-squared   Equal priors?
A          374           5           3         155.94**      No
B          234           6           2         58.80**       Yes
C          234           6           3         107.62**      Yes
D          234           6           6         294.29**      Yes
E          200           7           10        2105.08**     No
F          188           8           2         42.76**       No
G          174           12          4         1108.47**     No
H          150           4           3         154.42**      Yes
I          112           7           3         2363.75**     No
J          75            2           2         15.90*        No
K          60            4           3         92.12**       Yes
L          54            6           2         43.23**       Yes
M          45            5           3         135.50**      Yes
N          45            5           3         107.81**      Yes
O          45            5           9         400.55**      Yes
P          49            2           2         101.51**      No
Q          26            9           3         193.35**      No
R          26            2           2         11.22*        No
S          75            2           2         5.24          No
T          54            10          2         52.27         Yes
U          49            5           2         0.69          No
V          49            2           2         3.92          No
W          36            2           2         2.15          Yes
X          15            2           2         9.30          No

Chi-squared: test for homogeneity of the within-class covariances
N Normally distributed variables; S Skewed variables; O Ordinal variables; C Nominal variables; B Binary variables
** significant at the 0.01 level; * significant at the 0.05 level
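To make the Digit problem of Table 7.1 concrete, the sketch below simulates Breiman et al's faulty-calculator mechanism as described there: a digit from 1 to 10 is drawn with equal probability, and each of the seven display lights independently fails to do what it should with probability 0.1. The seven-segment truth table used here is the conventional display encoding and is an assumption, as the thesis does not reproduce it.

```python
import random

# Conventional seven-segment patterns (segment order: top, top-right,
# bottom-right, bottom, bottom-left, top-left, middle); class 10 is the digit 0.
SEGMENTS = {
    1: (0, 1, 1, 0, 0, 0, 0),
    2: (1, 1, 0, 1, 1, 0, 1),
    3: (1, 1, 1, 1, 0, 0, 1),
    4: (0, 1, 1, 0, 0, 1, 1),
    5: (1, 0, 1, 1, 0, 1, 1),
    6: (1, 0, 1, 1, 1, 1, 1),
    7: (1, 1, 1, 0, 0, 0, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
    10: (1, 1, 1, 1, 1, 1, 0),
}

def faulty_digit(rng, p_fail=0.1):
    """Draw a digit 1..10 uniformly; flip each light with probability p_fail."""
    digit = rng.randint(1, 10)
    lights = tuple(1 - s if rng.random() < p_fail else s
                   for s in SEGMENTS[digit])
    return lights, digit   # observed (X1, ..., X7) and the true class

rng = random.Random(0)
for x, y in (faulty_digit(rng) for _ in range(5)):
    print(x, "->", y)
```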
The variables selected by stepwise discriminant analysis (SDA) for the family data, in order of entry, were as follows:

Step   Variable    Number in model   Partial R-squared   F       p
1      No_Child    1                 0.446               45.69   0.00
2      Wife_Edu    2                 0.191               13.27   0.00
3      Husb_Edu    3                 0.158               10.48   0.00
4      Age_Baby    4                 0.072               4.31    0.00
5      Income      5                 0.037               2.13    0.09

As there were k = 4 classes in the data set, there were four group classification functions, Lj(x), and six group separation functions, Dij(x), created. The four group classification functions using SDA are shown in Table 7.9. Priors were set proportional to sample size (ppss).

Table 7.9: Group classification functions, Lj(x), using SDA for the family data

            L1(x)     L2(x)      L3(x)      L4(x)
Constant    -8.3951   -11.0312   -11.9301   -12.5852
Husb_Edu    -0.1483   0.4608     -0.2586    -0.3092
Wife_Edu    1.7542    0.4235     1.3594     2.0982
Income      0.0008    0.0013     0.0024     0.0001
No_Child    3.0541    4.7036     3.6580     3.5821
Age_Baby    0.0212    0.0531     0.0360     0.0553

Both the n-fold cross-validation and 0.632 error rates were calculated for the SDA classification rules above, with R(CV) = 0.195 and R(0.632) = 0.177. The corresponding values for LDA using all twelve variables were R(CV) = 0.204 and R(0.632) = 0.194, showing that the stepwise model was more accurate. As numerous authors in the field of stepwise discrimination and the closely related topic of stepwise regression have pointed out, the best q variables found from the original sample may not be the best variables over the whole population of values. For instance, the five selected variables here may not necessarily be the most important variables for couples throughout all India seeking family planning advice.

Visually, the classification model is very hard to interpret. As the SDA group separation functions involve five variables, it is impossible to depict the full classification model. Figure 7.1 shows a three-dimensional graph of the three most important variables found during the stepwise selection process, that is, No_Child, Husb_Edu and Wife_Edu. [Key to Figure 7.1: Club = IUD; Star = Tubectomy; Balloon = foam tablets; Diamond = oral pills.] From Figure 7.1, as well as the table of coefficients in Table 7.9, it is apparent that those who used IUD were characterised as having wives with a higher level of education and a small number of children. Those cases where the wife had little or no schooling led to the use of Tubectomy, while those cases where the wife had a higher level of education and a larger family also used Tubectomy. Distinguishing characteristics for the other two groups are not particularly relevant for these three variables.

As the class sample sizes were drastically different, it was decided to do another analysis using equal priors. The only difference that occurred from Table 7.9 is that the constants changed, as evidenced from Section 4.5. Thus, each classification function changed by only one term after alteration of the priors.
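Table 7.9 makes the SDA rule easy to state in code: score a couple on each Lj(x) and assign the class whose function is largest. Below is a minimal sketch using the tabulated coefficients, assuming (as the class labels used later in the chapter suggest) that L1 to L4 correspond to IUD, Tubectomy, foam tablets and oral pills; the example couple's values are hypothetical.

```python
# Coefficients from Table 7.9, ordered as
# (constant, Husb_Edu, Wife_Edu, Income, No_Child, Age_Baby).
CLASSES = ["IUD", "Tubectomy", "foam tablets", "oral pills"]
COEFS = [
    (-8.3951, -0.1483, 1.7542, 0.0008, 3.0541, 0.0212),   # L1(x)
    (-11.0312, 0.4608, 0.4235, 0.0013, 4.7036, 0.0531),   # L2(x)
    (-11.9301, -0.2586, 1.3594, 0.0024, 3.6580, 0.0360),  # L3(x)
    (-12.5852, -0.3092, 2.0982, 0.0001, 3.5821, 0.0553),  # L4(x)
]

def classify(husb_edu, wife_edu, income, no_child, age_baby):
    """Assign the class whose group classification function Lj(x) is largest."""
    x = (husb_edu, wife_edu, income, no_child, age_baby)
    scores = [c[0] + sum(b * v for b, v in zip(c[1:], x)) for c in COEFS]
    return CLASSES[scores.index(max(scores))], scores

# Hypothetical couple: moderate education, income 1500, two children,
# youngest child 24 months old.
label, scores = classify(husb_edu=4, wife_edu=5, income=1500,
                         no_child=2, age_baby=24)
print(label, [round(s, 2) for s in scores])   # classified as IUD
```

Consistent with the interpretation of Figure 7.1 above, this hypothetical couple, with a better-educated wife and a small family, is assigned to the IUD class.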
7.4.3 CART

In contrast with LDA, CART used all twelve variables to perform the analysis. The CART tree, using ppss, is as shown in Figure 7.2 (see Section 3.2 for a description of the CART tree analysis). To build this tree, the Gini splitting criterion was employed and the minimum node size was set at 5. The 1-SE rule was used to select the "right sized" tree. The twoing splitting criterion was also tried but, in this instance, produced the same tree as that using Gini, though this is not always the case.

From Figure 7.2, it is clear that those cases where the wife had little or no schooling led to the use of Tubectomy. If the wife had a higher level of education and the husband was 32 or younger, then IUD was the predominant device used. Of those cases not already classified, husband's education was the final splitting variable used: those cases where the husband had little or no education used Tubectomy in the main, while those left over were more than likely to use IUD. For this tree, R(CV) = 0.213 and R(0.632) = 0.205.

[Figure 7.1: Relationship of wife's education, husband's education and number of children in the family to contraceptive device used. Club = IUD; Star = Tubectomy; Balloon = foam tablets; Diamond = oral pills.]

[Figure 7.2: CART tree with ppss for the family data. Splits: "Is Wife_Edu < 3.5?" (Tubectomy, n(t) = 94, r(t) = 0.09); "Is Husb_Age < 32.5?" (IUD, n(t) = 51, r(t) = 0.18); "Is Husb_Edu < 3.5?" (Tubectomy, n(t) = 22, r(t) = 0.36; otherwise IUD, n(t) = 7, r(t) = 0.43). Relative importance of variables: No_Child 100, Wife_Edu 78, Husb_Age 63, No_Males 62, Wife_Age 60. n(t) = number of observations in the node; r(t) = resubstitution error rate of the node. Circles represent decision nodes which have to be split on, while rectangles represent terminal nodes which are assigned to the class given below the node.]

It is clear from Figure 7.2 that there were a relatively small number of cases who used either foam tablets or oral pills, but all were classified as using either IUD or Tubectomy. This would not be a very good situation if these women were going to have adverse reactions when using either IUD or Tubectomy. In an attempt to counter this, another CART tree was grown, this time using equal priors; it is given in Figure 7.3. The CART tree shown there has changed markedly from that of Figure 7.2. The first split is similar to that in Figure 7.2, with cases where the wife had less than three years education being classified as using Tubectomy. Cases where wives had three or more years education were next divided on the basis of income. Figure 7.4 graphically depicts what happens in the CART tree: the solid line marks the first split, so that all cases to the left of that line were classified as using Tubectomy; the interval line denotes the position of the second split, with cases above the interval line classified as using foam tablets and those below predicted to be using oral pills. The disturbing feature about this tree, though, was the large number of IUD users in the sample, who were all misclassified, which, as with the CART tree of Figure 7.2, could lead to very serious problems if this classification tree was put into practice.
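Written out as rules, the two CART trees are compact enough to compare directly. The sketch below simply restates the fitted splits reported in Figures 7.2 and 7.3 (the equal-priors tree's income threshold of 1950 is taken from the description of Figure 7.4); it is a transcription of the trees, not the fitting procedure.

```python
def cart_ppss(wife_edu, husb_age, husb_edu):
    """Decision rules of the ppss tree (Figure 7.2)."""
    if wife_edu < 3.5:
        return "Tubectomy"
    if husb_age < 32.5:
        return "IUD"
    return "Tubectomy" if husb_edu < 3.5 else "IUD"

def cart_equal_priors(wife_edu, income):
    """Decision rules of the equal-priors tree (Figure 7.3);
    note that no node predicts IUD."""
    if wife_edu < 2.5:
        return "Tubectomy"
    return "oral pills" if income < 1950 else "foam tablets"

# The same hypothetical couple lands in different classes under the two priors.
print(cart_ppss(wife_edu=5, husb_age=30, husb_edu=4))   # IUD
print(cart_equal_priors(wife_edu=5, income=1500))       # oral pills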
As seen in Table 7.6, the number of misclassified observations from each class with CART was negatively related to sample size when ppss were used. When equal priors were used, only a small number of foam tablet and oral pill users were misclassified but, as mentioned previously, all of those who used IUD were falsely classified.

[Figure 7.3: CART tree with equal priors for the family data. Splits: "Is Wife_Edu < 2.5?" (Tubectomy, n(t) = 85, r(t) = 0.22); "Is Income < 1950?" (oral pills; otherwise foam tablets, n(t) = 69, r(t) = 0.54). Relative importance of variables: Wife_Edu 100, Husb_Age 90, Income 77, Wife_Age 75, No_Child 64.]

[Figure 7.4: Plot of income against wife's education for the family data using CART with equal priors, showing the positions of the two splits.]

7.4.4 FACT

In contrast with CART, which is totally non-parametric, FACT uses F-ratios of between to within class variance to select the partitioning variable, then carries out LDA on the selected coordinate axis to partition the data.

The FACT trees for the family data are given in Figures 7.5 and 7.6. Figure 7.5 illustrates the case of ppss. The FACT tree is very different from the CART tree using ppss shown in Figure 7.2. Interestingly, the two variables used to split on in this FACT tree were, however, two of the variables used in the SDA functions. This shows the link between LDA and FACT. Although FACT represents its output in a decision tree format, it is basically a parametric technique applied iteratively to each descendant subsample of observations. One might then expect the final classification structure to be similar to that produced by LDA, although this has not happened in this case.

As can be seen from Figure 7.5, FACT has failed to correctly classify any of those people who used either foam tablets or oral pills. Those with three or more children were more than likely to have a Tubectomy, while those with less than three children and a youngest child who was three years old or more were also more than likely to have a Tubectomy. Those with less than three children and whose youngest child was less than three years old were more than likely to use IUD. These rules seem straightforward and common sense compared with those found in Figure 7.2, which tended to be more sociological in nature. The FACT tree indicates that those with either larger families, or who had not had any children for a while, were more than likely to use a terminal contraceptive device. For this tree, R(0.632) = 0.197.

Figure 7.6 gives the FACT tree in the case of equal priors. Similarly to CART, and quite differently from LDA, the decision rules changed quite dramatically after alteration of the priors. The tree contains only four splits, but all are multiway rather than binary splits. The end result is a tree with ten terminal nodes, which are representative of all four classes (labelled 1 to 4 on the tree for the sake of space). Actually, the class misclassification error rates were not too dissimilar.
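The variable-selection step that FACT applies at each node, as described above, amounts to a one-way ANOVA F-ratio per variable. Here is a minimal sketch of just that step, assuming a design matrix X and class labels y; the subsequent LDA split on the winning coordinate is omitted.

```python
import numpy as np

def fact_split_variable(X, y):
    """Return the column of X with the largest between- to within-class
    F-ratio, as FACT uses to choose the partitioning variable at a node."""
    classes = np.unique(y)
    n, k = len(y), len(classes)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    f_ratios = (between / (k - 1)) / (within / (n - k))
    return int(f_ratios.argmax()), f_ratios

# Hypothetical two-class data in which only the first variable separates.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.repeat([0, 1], 50)
X[y == 1, 0] += 2.0
print(fact_split_variable(X, y))   # selects column 0
```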
[Figure 7.5: FACT tree with ppss for the family data. Splits: "Is No_Child < 2.4?" (if not, Tubectomy, n(t) = 90, r(t) = 0.08); then "Is Age_Baby < 36.5?" (IUD, n(t) = 68, r(t) = 0.26; otherwise Tubectomy, n(t) = 16, r(t) = 0.33). Relative importance of variables: No_Child 100.0, Wife_Edu 98.7, No_Male 71.9, Wife_Age 49.1, Husb_Age 47.5.]

[Figure 7.6: FACT tree with equal priors for the family data, containing four multiway splits (on Wife_Edu, No_Child, No_Femal and Husb_Edu) and ten terminal nodes. Legend: 1 = IUD, 2 = Tubectomy, 3 = foam tablets, 4 = oral pills. Relative importance of variables: Wife_Edu 100.0, No_Child 97.2, No_Male 81.6, Wife_Age 35.6, Husb_Age 34.8.]

7.4.5 KnowledgeSeeker

KnowledgeSeeker is an example of a tree-based approach that uses, in contrast to CART, a statistical significance testing approach to splitting. Differently from FACT, however, Chi-squared contingency table analysis is used to distinguish between the groups, rather than the use of means and covariances as employed by FACT. The method is based firmly on the refinements to AID carried out by Kass (1980), which resulted in the CHAID program.

The KnowledgeSeeker trees for the family planning data are given in Figures 7.7 to 7.12, using different partitioning methods, significance levels for splitting and splitting variables at the first node. Figures 7.7 to 7.10 show the trees with the first split being carried out on the most important variable. Version 2.1 of KnowledgeSeeker was used for all four trees. An obvious difference is the reduction in tree size when the significance level is decreased. Noticeable too are the slightly smaller trees produced using heuristic splitting, as was suggested in the KnowledgeSeeker User's Guide. Added to this is the increase in speed using heuristic partitioning. One trade-off in using the heuristic partitioning algorithm is that a slightly different tree may be produced every time a new tree is created; this will not occur if exhaustive partitioning is used. One could argue that a tree may be "pruned" by decreasing the significance level used for splitting, but the tree that was originally created will only remain unchanged if exhaustive partitioning is used for tree construction.

Comparing the KnowledgeSeeker trees with those of CART and FACT, it is evident that this method has produced somewhat of a compromise between CART and FACT. No_Child was selected as the most important first splitting variable, as with FACT, and stepwise discriminant analysis also showed this to be the most important discriminating variable. Thus, all methods which use statistical significance to determine the classification rules, whether tree-based or not, chose No_Child as the most important splitting variable. Using Figure 7.9 as the standard KnowledgeSeeker tree, it can be seen that those with either one or two children were more than likely to use IUD, with greater probability if there was only one child rather than two. Of those with three or more children, those with wives having zero to four years education were almost all users of Tubectomy, while those with wives having five or more years education were quite likely to use either foam tablets or oral pills. No technique for handling priors exists in KnowledgeSeeker, though the tree of Figure 7.9 can be seen to have correctly classified at least some of the observations from every class.
The larger trees produced here by KnowledgeSeeker should make the process more robust to large discrepancies in sample size. Reduction in the significance levels may affect group misclassifications to some degree, although this has not occurred for the examples presented here.

[Figures 7.7 to 7.12: KnowledgeSeeker trees for the family data, showing the device breakdown at each node. Figure 7.7: method = heuristic, significance level = 0.05. Figure 7.8: method = exhaustive partitioning, significance level = 0.05. Figure 7.9: method = heuristic, significance level = 0.01. Figure 7.10: method = exhaustive partitioning, significance level = 0.01. Figure 7.11: method = heuristic, significance level = 0.01, second best initial split used. Figure 7.12: method = heuristic, significance level = 0.01, third best initial split used.]

Figures 7.11 and 7.12 feature the KnowledgeSeeker trees with the first split being carried out on the second and third most important variables respectively. In Figure 7.11, the first split was done on Wife_Edu. This tree had an apparent error rate lower than the tree of Figure 7.9 though, as mentioned numerous times before, the apparent error rate is usually a biased estimate of the true performance of any classification rule. This is most particularly true of tree-based methods, which are data intensive, using subsamples of the original data rather than all the data to construct a set of classification rules. Figure 7.12 has the first split carried out on Wife_Age, then on the number of children in the family.

In order to truly test the validity of the KnowledgeSeeker trees, Version 2.0 was used so as to carry out some validation procedures. KnowledgeSeeker does not support the use of any form of cross-validation, so the rotation method was implemented as follows. The data set was first divided into two equal parts. Each half was in turn used as a learning sample and a test sample, with the two test sample error rates being averaged to get the rotation estimate of the error rate. The technique was used on all six KnowledgeSeeker trees previously depicted (Figures 7.7 to 7.12). The rotation, resubstitution and 0.632 error rate estimates for the six trees are given in Table 7.10, with the minimum error rate for each estimate, over all six trees, given in bold.

Table 7.10: Rotation, resubstitution and 0.632 error rates for six KnowledgeSeeker trees for the family data

Tree          Rotation   Resubstitution   0.632
Figure 7.7    0.236      0.144            0.203
Figure 7.8    0.282      0.144            0.221
Figure 7.9    0.219      0.149            0.193
Figure 7.10   0.248      0.213            0.221
Figure 7.11   0.214      0.138            0.186
Figure 7.12   0.184      0.172            0.180

From Table 7.10, it is apparent that the tree in Figure 7.12 had both the lowest rotation and 0.632 error rates, with the tree in Figure 7.11 having the second lowest error rates of these types, showing that the best tree was not necessarily the one which produced the greatest separation of the classes at the first split. The ideal situation is to have the independent estimate of the error rate equal to the apparent error rate, so that one can be confident in the classification rules generated from the learning sample. Using this as a criterion to choose the optimal KnowledgeSeeker tree, the tree in Figure 7.12, using heuristic partitioning with a 1% significance level after initially splitting on Wife_Age, was best. Now Wife_Age was noted as only the third most important variable for splitting at the first node. This perhaps shows that this tree was the most accurate classifier for the family data, backed up by the tree having the lowest overall independent error rates.
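The rotation estimate described above, and the 0.632 figures derived from it, can be reproduced mechanically. The sketch below assumes a generic fit function returning a classifier and is otherwise illustrative; note that the tabulated 0.632 values are consistent with weighting the resubstitution and rotation estimates by 0.368 and 0.632 respectively (for example, 0.368 x 0.149 + 0.632 x 0.219 = 0.193 for the Figure 7.9 tree).

```python
import numpy as np

def rotation_error(X, y, fit):
    """Two-fold rotation estimate: train on each half of the data,
    test on the other half, and average the two test error rates."""
    n = len(y)
    half = np.zeros(n, dtype=bool)
    half[: n // 2] = True                  # first half vs second half
    errors = []
    for train in (half, ~half):
        model = fit(X[train], y[train])    # fit returns a predict function
        pred = model(X[~train])
        errors.append(np.mean(pred != y[~train]))
    return float(np.mean(errors))

def point632_error(apparent, rotation):
    """0.632-weighted combination of the apparent and rotation estimates,
    consistent with the values reported in Table 7.10."""
    return 0.368 * apparent + 0.632 * rotation

# Consistency check against the Figure 7.9 row of Table 7.10.
print(round(point632_error(0.149, 0.219), 3))   # 0.193
```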
[Figure 7.12: KnowledgeSeeker Tree. Method = Heuristic, α = 0.01. Third Best Initial Split Used.]

Figures 7.11 and 7.12 feature the KnowledgeSeeker trees with the first split being carried out on the second and third most important splits respectively. In Figure 7.11, the first split was done on Wife_Edu. This tree had an apparent error rate lower than the tree of Figure 7.9, though, as mentioned numerous times before, the apparent error rate is usually a biased estimate of the true performance of any classification rule. This is most particularly true of tree-based methods, which are data intensive, using subsamples of the original data rather than all the data to construct a set of classification rules. Figure 7.12 has the first split carried out on Wife_Age, then on the number of children in the family.

In order to truly test the validity of the KnowledgeSeeker trees, Version 2.0 was used so as to carry out some validation procedures. KnowledgeSeeker does not support the use of any form of cross-validation, so the rotation method was implemented as follows. The data set was first divided into two equal parts. Each half was in turn used as a learning sample and a test sample, with the two test sample error rates being averaged to get the rotation estimate of the error rate. The technique was used on all six KnowledgeSeeker trees previously depicted (Figures 7.7 to 7.12). The rotation, resubstitution and 0.632 error rate estimates for the six trees are given in Table 7.10; the minimum error rate for each estimate over all six trees is: rotation 0.184 (Figure 7.12), resubstitution 0.138 (Figure 7.11) and 0.632 0.180 (Figure 7.12).

Table 7.10: Rotation, resubstitution and 0.632 error rates for six KnowledgeSeeker trees for the family data

    Tree           Rotation   Resubstitution   0.632
    Figure 7.7     0.236      0.144            0.203
    Figure 7.8     0.282      0.144            0.221
    Figure 7.9     0.219      0.149            0.193
    Figure 7.10    0.248      0.213            0.221
    Figure 7.11    0.214      0.138            0.186
    Figure 7.12    0.184      0.172            0.180

From Table 7.10, it is apparent that the tree in Figure 7.12 had both the lowest rotation and 0.632 error rates, with the tree in Figure 7.11 having the second lowest error rates of these types, showing that the best tree was not necessarily the one which produced the greatest separation of the classes at the first split. The ideal situation is to have the independent estimate of the error rate equal to the apparent error rate, so that one can be confident in the classification rules generated from the learning sample. Using this as a criterion to choose the optimal KnowledgeSeeker tree, the tree in Figure 7.12, using heuristic partitioning with a 1% significance level after initially splitting on Wife_Age, was best. Now Wife_Age was noted as only the third most important variable for splitting at the first node. This perhaps shows that this tree was the most accurate classifier for the family data, backed up by the tree having the lowest overall independent error rates.
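As a rough sketch of the rotation procedure just described, together with the 0.632 combination reported in Table 7.10, consider the following; this is not the software actually used for the thesis, and fit and error_rate are hypothetical stand-ins for any rule-building and scoring routines (for example, a tree grower).

    # A minimal, illustrative implementation of the rotation (twofold
    # cross-validation) estimate and of Efron's 0.632 combination.
    import numpy as np

    def rotation_error_rate(X, y, fit, error_rate, seed=0):
        """Divide the data into two equal parts; use each half in turn as
        the learning sample and the other as the test sample, then average
        the two test-sample error rates."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        first, second = np.array_split(idx, 2)
        rates = []
        for learn, test in ((first, second), (second, first)):
            rule = fit(X[learn], y[learn])        # build the rule on one half
            rates.append(error_rate(rule, X[test], y[test]))  # test on other
        return float(np.mean(rates))

    def error_632(apparent, out_of_sample):
        """Efron's 0.632 estimate: a weighted combination of the apparent
        error rate and an error rate computed on observations not used to
        build the rule."""
        return 0.368 * apparent + 0.632 * out_of_sample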
One of the principal reasons for the initial creation of decision tree techniques was to identify complex interactions among the data that could not be detected by parametric methods, such as LDA, without prespecifying the interaction terms directly. Tree-based methods, in general, appear to be very dependent on the initial split, that is, the largest main effect for the whole sample. If the first split is not particularly good, then it is unlikely that further partitionings of the data set will lead to a robust set of decision rules. If the first split is very good, but does not interact well with the other variables, the decision rules may also be rather weak. The interaction structure produced by the decision tree is going to be very dependent on the association with the first splitting variable. If splitting initially on a variable, xi, does not lead to descendant nodes as pure as those produced by splitting on another variable, xj, it may still produce a more robust tree than initially splitting on xj, if the interaction structure is stronger between xi and the other variables than between xj and the other variables. In this example, Wife_Age was rated only the third most important variable or main effect, but in real terms, the interaction between Wife_Age and the other variables in the data set produced the more accurate set of decision rules using KnowledgeSeeker. Splitting initially on either No_Child or Wife_Edu did not produce as good a decision tree as that found through splitting first on Wife_Age.

7.4.6 Splus Trees()

"The tree modelling interactive environment now available in Splus is to the batch mode program CART as graphical statistical packages such as Splus or JMP are to batch processing SAS for other statistical methods" (Morton, 1992, p 76). The method provides an example of the CART approach to decision tree growth, while incorporating all the advantages of an interactive environment for tree construction, pruning and graphics.

Figure 7.13 contains the fully grown (or overgrown) Splus tree. Similarly to CART, and unlike FACT and KnowledgeSeeker, Wife_Edu was chosen as the initial splitting variable, thus having the largest main effect for the whole sample. The graph is rather messy as there are too many splits, hence terminal nodes, in the tree. The residual deviance for this tree was 0.661 with R(A) = 0.149. Obviously, this error rate is optimistically biased. Figures 7.14 and 7.15 give the deviances for the sequences of subtrees produced by cost-complexity pruning and optimal shrinking respectively. As mentioned earlier (Section 3.11), the deviance always decreases as tree size increases, with the former being a step function because optimal subtrees remain constant between adjacent values of subtree sizes. The rotation method was used to determine the optimal sized tree, that is, the tree with minimum deviance. The average values of the deviance for given tree sizes, using rotational validation after optimal shrinking, are given in Figure 7.16. This plot would tend to indicate that a tree with six terminal nodes would be optimal, but perhaps the use of a standard error rule, such as that used in CART, would tend to suggest that a tree with four terminal nodes would suffice.
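As an illustration of the deviance criterion referred to above (Section 3.11), the following sketch computes the multinomial node deviance and the deviance reduction of a split in the style of Clark and Pregibon (1992); this is not the S source code, and the function names are hypothetical.

    # Node deviance: minus twice the multinomial log-likelihood of the class
    # counts in a node, with fitted probabilities taken as the within-node
    # class proportions. Illustrative only.
    import numpy as np

    def node_deviance(class_counts):
        counts = np.asarray(class_counts, dtype=float)
        probs = counts / counts.sum()          # fitted class probabilities
        nonzero = counts > 0                   # 0 log 0 contributes nothing
        return -2.0 * np.sum(counts[nonzero] * np.log(probs[nonzero]))

    def deviance_reduction(parent, left, right):
        """The quantity a split is chosen to maximise: the deviance of a
        node minus the combined deviance of its two descendant nodes."""
        return node_deviance(parent) - (node_deviance(left)
                                        + node_deviance(right))

Because the deviance of a tree can only decrease as more splits are added, the curves in Figures 7.14 to 7.16 necessarily slope downwards, which is why an independent (here rotational) assessment is needed to choose the tree size.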
The resultant tree chosen by the rotation method is shown in Figure 7.17. The initial split is made on "Wife_Edu < 3.5", as occurred with the CART tree in Figure 7.2. The subgroup whose wives had less than four years education were next split on No_Child, while the other subgroup was split on Husb_Age. For those cases where the husband was 33 or more years of age, a further split was made on "Husb_Edu < 3.5". A final split was made on the node whose husbands had four or more years education, splitting on the number of girls in the family. Those families with either no girls or only one girl were more than likely to use a terminal device (Tubectomy), while those with two or more girls opted for oral pills. The residual deviance for this tree was 0.933 with R(A) = 0.162, R(ROT) = 0.207 and R(0.632) = 0.190. In contrast with the CART tree of Figure 7.2, at least some of those cases who used oral pills were not misclassified. Notice too that the 0.632 error rate is lower for this tree than that in Figure 7.2, due most probably to the much smaller apparent error rate for the tree in Figure 7.17.

[Figure 7.13: Splus Tree Before Pruning or Shrinking.]

[Figure 7.14: Deviance versus Tree Size for Subtrees Produced by Cost-Complexity Pruning.]

[Figure 7.15: Deviance versus Tree Size for Subtrees Produced by Optimal Shrinking.]

[Figure 7.16: Rotational Deviance versus Tree Size for Subtrees Produced by Optimal Shrinking.]

LIST OF SYMBOLS

Symbol            Page   Description
λ, γ              14     regularisation parameters used in regularised discriminant analysis
NDA               14     non-metric discriminant analysis
                  15     number of class m sample points with values less than or equal to x
                  15     cumulative distribution function estimate
                  15     density function estimate
                  15     smoothing weighting function used in kernel density estimation
                  16     kernel density discriminant function
KNN               17     Kth nearest neighbour method
                  17     distance function between xj and x
AID               21     Automatic Interaction Detector
THAID             21     THeta AID
CHAID             21     CHi-squared AID
n, n(t)           26     sample size, sample size at a node
r(t)              26     proportion of observations not from the class with the largest number of observations at each node
t                 31     subset of the data, node
                  31     total sum of squares for node t
                  31     response in node t
                  31     between group sum of squares found after splitting on variable xj
                  31     mean of responses in the first subgroup of node t
                  31     maximum between group sum of squares over all variables
                  31     total sum of squares for the whole data set
                  31     parameters used in AID stopping rules
θ                 34     Theta splitting criterion
                  34     total number of observations at node t
                  34     total number of misclassified observations in the ith split group
δ                 34     Delta splitting criterion
                  34     proportion of observations from class j in node t
                  34     proportion of observations from class j in split group 1
                  35     parameters used in THAID stopping rules
m, n              36     the number of observations from classes 1 and 2 respectively (ID3 only)
I(m, n)           37     expected information needed to classify an object using an ID3 tree
v                 37     distinct categories of a variable
t1, ..., tv       37     descendant nodes of t
I(mi, ni)         37     information required to classify an object using a subtree from ti
E(xj)             37     expected information required for trees partitioned on variable xj at the root node
gain(xj)          37     information gained through branching on xj
p(i|t)            41     probability an object can be assigned to class i at node t
p(j|t)            41     probability an object can be assigned to class j at node t
i(t)              42     estimated probability of misclassification under the Gini index
s                 42     a split
Δi(s, t)          42     Gini splitting criterion
pL, pR            42     proportion of observations at node t sent left or right respectively by the split
i(tL)             42     estimated probability of misclassification for the observations sent left by the split
C1, C2            42     amalgamation of classes, superclasses
                  42     twoing splitting criterion
Tmax              43     a fully grown tree
T                 43     a subtree of Tmax
Rα(T)             43     cost-complexity measure for T
R(T)              43     resubstitution error rate for T
α                 43     cost-complexity parameter
                  43     number of terminal nodes in T
T(α)              43     subtree that minimises Rα(T)
R(I)              44     independent estimate of the error rate, test sample or cross-validation error rate estimate
se(R(I))          44     standard error of R(I)
s*                44     optimal partition of a node t into tL and tR
sj                44     split carried out on variable xj
p(s*, sj)         45     probability that sj sends the cases in t the same way as s*
pLL(s*, sj)       45     probability that both s* and sj send the cases in t to the left
tL                45     set of observations sent left by s
p(t)              45     probability that an observation is in node t
s~j               45     surrogate split on variable xj
IV(xj)            47     correctness of the answer from splitting on xj
K                 48     total number of observations in subtree T
J                 48     total number of misclassified observations in subtree T
L                 48     pessimistic view of the number of misclassified observations in subtree T
se(L)             48     standard error of L
E                 48     number of observations misclassified by the best terminal node within T
yij               56     probability that the ith response falls in the jth class
D(μi, yi)         56     deviance function for an observation yi
D(μ; y)           56     deviance of a node, the sum of the deviances of all observations in the node
D(μL, μR; y)      56     combined deviance of the two descendant nodes
ΔD                56     difference between the deviance of a node and the combined deviance of the two descendant nodes
Dα(T)             57     cost-complexity measure for T using deviances
y(node)           57     fitted value for each node
y~(node)          57     shrunken fitted value for the node
y~(parent)        57     shrunken fitted value for the node's parent
y(parent)         57     fitted value for the node's parent
R(T)              66     actual or true error rate, the expected probability of misclassification when the class conditional density functions are known
R(E)              67     the expected error rate for a learning sample of a given size
R(TS)             68     test sample error rate estimate
R(A)              68     apparent/resubstitution error rate estimate
R(H)              69     holdout error rate estimate
R(ROT)            69     rotation error rate estimate, twofold cross-validation error rate estimate
R(CV)             69     n-fold cross-validation error rate estimate, leave-one-out estimate, U estimate
                  69     apparent error rate for the learning sample with the jth observation omitted
                  70     average jackknife error rate estimate
ωJ                70     jackknife estimate of the bias of the apparent error rate
R(J)              70     jackknife error rate estimate
θi(x)             70     posterior probability that x belongs to class i
R(PP)             70     posterior probability error rate estimate
cj                71     class of observation xj
c^j               71     predicted class of observation xj
Q(cj, c^j)        71     (0, 1) loss function
ω                 71     true bias involved in using the apparent error rate as an estimate of the actual error rate
x1*, ..., xn*     71     random sample of observations drawn with replacement from the learning sample, bootstrap sample
R*(A)             71     apparent error rate of the bootstrap sample
R*(T)             72     actual error rate of the bootstrap sample
pj*               72     resampled proportion of observations in the bootstrap sample
ωb                72     bias involved in using the apparent error rate of the bootstrap sample
ωB                72     bootstrap estimate of the bias of the apparent error rate
R(BOOT)           72     bootstrap error rate estimate
R(0.632)          72     0.632 error rate estimate
R(E)              72     average error rate for all observations not in the bootstrap sample
ω0.632            73     0.632 estimate of the bias of the apparent error rate
R(GCV)            73     g-fold cross-validation error rate estimate
R(τ)              73     weighted estimate of the g-fold cross-validation and apparent error rates
zij               74     standardised distribution with mean zero and standard deviation one
e                 74     combination of prior probabilities and covariance matrices factor
R                 76     classification method factor
ANOVA             76     analysis of variance
seR(CV)           82     standard error of the misclassification cost
PPSS              93     priors proportional to sample size
R(i|j)            94     group/class misclassification error rates, proportion of observations from class i classified as class j
R(TEN)            108    tenfold cross-validation error rate estimate
R(ACV)            108    apparent error rate for CART trees chosen by n-fold cross-validation
R(AR)             108    apparent error rate for CART trees chosen by rotation
R(AT)             108    apparent error rate for CART trees chosen by tenfold cross-validation
R(T)              108    any error rate estimating the actual error rate
MSE               108    expected value of the squared distance between an error rate estimate and the actual error rate
Ci                115    category i in a categorical variable
pij               116    probability of getting a response xj = 1 for class i, probability pattern factor
rijk              116    correlation coefficient between xj and xk for class i, correlation factor
q                 131    factor combination of means, dimension and correlation
ρ                 131    population correlation coefficient
Pi                131    population correlation matrix for class i
bias              132    expected value of the difference between the actual error rate and the error rate estimate
COUNT             132    proportion of samples for each factor combination in which the estimated error rate was less than the actual error rate
LR(T)             158    actual error rate after a logit transformation
                  158    error rate estimate after a logit transformation
PR(T)             158    actual error rate after a proportion transformation
PR(t)             158    error rate estimate after a proportion transformation
m(t)              230    reciprocal entropy index

BIBLIOGRAPHY

Aitchison, J. and Brown, J.A.C. (1957) The lognormal distribution with special reference to its uses in economics, Cambridge University Press.
Aitchison, J., Habbema, J.D.F. and Kay, J.W. (1977) A critical comparison of two methods of statistical discrimination, Applied Statistics, 26, pp 15-25.
Anderson, J.A.
(1966) Some nonparametric multivariate procedures based on statistically equivalent blocks, in "Multivariate Analysis", P.R. Krishnaiah (ed), New York: Academic Press, pp 5-27.
Ashikaga, T. and Chang, P.C. (1981) Robustness of Fisher's linear discriminant function under two-component mixed normal models, Journal of the American Statistical Association, 76, pp 676-680.
Assael, H. (1970) Segmenting markets by group purchasing behaviour: an application of the A.I.D. technique, Journal of Marketing Research, 7, pp 153-158.
Bellman, R.E. (1961) Adaptive Control Processes, Princeton, New Jersey: Princeton University Press.
Belson, W.A. (1959) Matching and prediction on the principle of biological classification, Applied Statistics, 8, pp 65-75.
Biggs, D., de Ville, B. and Suen, E. (1991) A method of choosing multiway partitions for classification and decision trees, Journal of Applied Statistics, 18, pp 49-62.
Bradford, E. (1993) Tree-based models in S, New Zealand Statistician, 28, pp 36-51.
Breiman, L. (1968) Probability, Reading, Massachusetts: Addison-Wesley.
Breiman, L. (1978) Description of chlorine tree development and use, Technical Report, Santa Monica, California: Technology Service Corporation.
Breiman, L. and Friedman, J.H. (1988) Commentary on "Tree-structured classification via generalised discriminant analysis", Journal of the American Statistical Association, 83, pp 725-727.
Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984) Classification and Regression Trees, Belmont, California: Wadsworth.
Bumpus, H.C. (1898) The elimination of the unfit as illustrated by the introduced species, Passer domesticus, Biological Lectures, Marine Biology Laboratory, Woods Hole, 11th Lecture.
Buntine, W.L. (1992a) Learning classification trees, Statistics and Computing, 2, pp 63-73.
Buntine, W.L. (1992b) Tree classification software, Presented at the Third National Technology Transfer Conference and Exposition, Baltimore.
Buntine, W.L. (1993) Personal Communication.
Buntine, W.L. and Caruana, R. (1993) Introduction to IND version 2.1 and recursive partitioning, Technical Report, Moffett Field, California: NASA Ames Research Centre.
Buntine, W.L. and Niblett, T. (1992) A further comparison of splitting rules for decision-tree induction, Machine Learning, 8, pp 75-85.
Chernick, M.R., Murthy, V.K. and Nealy, C.D. (1985) Application of bootstrap and other resampling techniques: evaluation of classifier performance, Pattern Recognition Letters, 3, pp 167-178.
Chernick, M.R., Murthy, V.K. and Nealy, C.D. (1986) Correction note to "Application of bootstrap and other resampling techniques: evaluation of classifier performance", Pattern Recognition Letters, 4, pp 133-142.
Chittineni, C.B. (1977) On the estimation of probability of error, Pattern Recognition, 9, pp 191-196.
Chou, P.A. (1991) Optimal partitioning for classification and regression trees, IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, pp 340-354.
Ciampi, A., Hogg, S.A., McKinney, S. and Thiffault, J. (1988) RECPAM: a computer program for recursive partition and amalgamation for censored survival data and other situations frequently occurring in biostatistics, Computer Methods and Programs in Biomedicine, 26, pp 239-256.
Ciampi, A., Lawless, J.F., McKinney, S.M. and Singhal, K.
(1988) Regression and recursive partition strategies in the analysis of medical survival data, Journal of Clinical Epidemiology, 41, pp 737-748.
Ciampi, A., Thiffault, T., Nakache, J-P. and Asselain, B. (1986) Stratification by stepwise regression, correspondence analysis and recursive partition: a comparison of three methods of analysis for survival data with covariates, Computational Statistics and Data Analysis, 4, pp 185-204.
Ciampi, A., Schiffrin, A., Thiffault, J., Quintal, H., Weitzner, G., Poussier, P. and Lalla, D. (1990) Cluster analysis of an insulin-dependent diabetic cohort towards the definition of clinical subtypes, Journal of Clinical Epidemiology, 43, pp 701-715.
Clark, L.A. and Pregibon, D. (1992) Tree-based models, in "Statistical Models in S", J.M. Chambers and T.J. Hastie (eds), Pacific Grove, California: Wadsworth and Brooks/Cole, pp 377-419.
Cochran, W.G. (1968) Commentary on "Estimation of error rates in discriminant analysis", Technometrics, 10, pp 204-205.
Cook, E.F. and Goldman, L. (1984) Empiric comparison of multivariate analysis techniques: advantages and disadvantages of recursive partitioning analysis, Journal of Chronic Diseases, 37, pp 721-731.
Cover, T.M. and Hart, P.E. (1967) Nearest neighbor pattern classification, IEEE Transactions on Information Theory, IT-13, pp 21-27.
Crawford, S.L. (1989) Extensions to the CART algorithm, International Journal of Man-Machine Studies, 31, pp 197-217.
Crawford, S.L. and Souders, S.K. (1990) A comparison of two conceptual clustering algorithms, International Journal of Pattern Recognition and Artificial Intelligence, 4, pp 409-420.
Crawford, S.L., Fung, R.M. and Tse, E. (1989) Data-driven assessment and decision making, in "Expert Systems in Economics, Banking and Management", L.F. Pau et al (eds), Amsterdam: North-Holland, pp 399-408.
de Ville, B. (1990) Applying statistical knowledge to database analysis and knowledge base construction, Proceedings of the Sixth Conference on Artificial Intelligence Applications, pp 30-36.
de Ville, B. (1994) Personal Communication.
Doyle, P. (1973) The use of Automatic Interaction Detector and similar search procedures, Operational Research Quarterly, 24, pp 465-467.
Doyle, P. and Fenwick, I. (1975) The pitfalls of AID analysis, Journal of Marketing Research, 12, pp 408-413.
Efron, B. (1979) Bootstrap methods: another look at the jackknife, Annals of Statistics, 7, pp 1-26.
Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans, Philadelphia: Society for Industrial and Applied Mathematics.
Efron, B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation, Journal of the American Statistical Association, 78, pp 316-331.
Einhorn, H.J. (1972) Alchemy in the behavioural sciences, Public Opinion Quarterly, 36, pp 367-378.
Euromonitor (1979) European Marketing Data and Statistics, London: Euromonitor Publications.
Feng, C., Sutherland, A., King, R., Muggleton, S. and Henery, R. (1993) Comparison of machine learning classifiers to statistics and neural networks, Personal Communication.
Fielding, A. (1977) Binary segmentation: the Automatic Interaction Detector and related techniques for exploring data structure, in "The Analysis of Survey Data, Volume 1", C.A. O'Muircheartaigh and C. Payne (eds), London: John Wiley, pp 221-257.
Fisher, R.A. (1936) The use of multiple measurement in taxonomic problems,
Annals of Eugenics, 7, pp 179-188.
Fitzmaurice, G.M., Krzanowski, W.J. and Hand, D.J. (1991) A Monte Carlo study of the 632 estimator of error rate, Journal of Classification, 8, pp 239-250.
Fix, E. and Hodges, J.L. (1951) Discriminatory analysis, nonparametric discrimination: consistency properties, Report No. 4, Project No. 21-49-004, USAF School of Medicine, Brooks Air Force Base, Randolph Field, Texas.
Friedman, J.H. (1977) A recursive partitioning decision rule for nonparametric classification, IEEE Transactions on Computers, C-26, pp 404-408.
Friedman, J.H. (1989) Regularised discriminant analysis, Journal of the American Statistical Association, 84, pp 165-175.
Fukunaga, K. (1990) Introduction to Statistical Pattern Recognition, Second Edition, Boston: Harcourt Brace Jovanovich.
Ganesalingam, S. and Lynn, R.D. (1991) Posterior probability based estimator for the overall error rate associated with a linear discriminant function, Occasional Publications in Mathematics and Statistics, 23, Massey University.
Ganeshanandam, S. and Krzanowski, W.J. (1990) Error rate estimation in two-group discriminant analysis using the linear discriminant function, Journal of Statistical Computation and Simulation, 36, pp 157-175.
Glick, N. (1978) Additive estimators for probabilities of correct classification, Pattern Recognition, 10, pp 211-222.
Gnanadesikan, R. (1977) Methods for Statistical Data Analysis of Multivariate Observations, New York: John Wiley.
Gong, G. (1986) Cross-validation, the jackknife and the bootstrap: excess error estimation in forward logistic regression, Journal of the American Statistical Association, 81, pp 108-113.
Gordon, L. and Olshen, R.A. (1980) Consistent nonparametric regression from recursive partitioning schemes, Journal of Multivariate Analysis, 10, pp 611-627.
Grajski, K.A., Breiman, L., Viana Di Prisco, G. and Freeman, W.J. (1986) Classification of EEG spatial patterns with tree-structured methodology: CART, IEEE Transactions on Biomedical Engineering, 33, pp 1076-1086.
Habbema, J.D.F., Hermans, J. and van den Broek, K. (1974) A stepwise discriminant analysis program using density estimation, in "Compstat 1974", G. Bruckmann, F. Ferschl and L. Schmetterer (eds), Vienna: Physica-Verlag, pp 101-110.
Hadjimichael, M. and Wasilewska, A. (1993) Interactive inductive learning, International Journal of Man-Machine Studies, 38, pp 147-167.
Hand, D.J. (1981) Discrimination and Classification, New York: John Wiley.
Hand, D.J. (1986) Recent advances in error rate estimation, Pattern Recognition Letters, 4, pp 335-346.
Hanson, S.J. (1990) What connectionist models learn: learning and representation in connectionist neural networks, Behavioral and Brain Sciences, 13, pp 471-511.
Heald, G.I. (1972) The application of the Automatic Interaction Detector (A.I.D.) programme and multiple regression techniques to the assessment of store selection and site performance, Operational Research Quarterly, 23, pp 445-457.
Hosmer, D.W. and Lemeshow, S. (1989) Applied Logistic Regression, New York: Wiley.
Ildiko, E.F. and Lanteri, S. (1989) Classification models: discriminant analysis, SIMCA, CART, Chemometrics and Intelligent Laboratory Systems, 5, pp 247-256.
Jennrich, R.I. (1977) Stepwise discriminant analysis, in "Statistical Methods for Digital Computers, Volume 3", K. Enslein, A. Ralston and H.S. Wilf (eds), New York: John Wiley, pp 76-95.
Kanal, L.
(1974) Patterns in pattern recognition: 1968-1974, IEEE Transactions on Information Theory, IT-20, pp 697-722.
Kass, G.V. (1975) Significance testing in automatic interaction detection (A.I.D.), Applied Statistics, 24, pp 178-189.
Kass, G.V. (1980) An exploratory technique for investigating large quantities of categorical data, Applied Statistics, 29, pp 119-127.
Kendall, M.G. (1966) Discrimination and classification, in "Multivariate Analysis", P.R. Krishnaiah (ed), New York: Academic Press, pp 165-185.
Konishi, S. and Honda, M. (1990) Comparison of procedures for estimation of error rates in discriminant analysis under non-normal populations, Journal of Statistical Computation and Simulation, 36, pp 105-115.
Krzanowski, W.J. (1977) The performance of Fisher's linear discriminant function under non-optimal conditions, Technometrics, 19, pp 191-200.
Kumar, K. (1993) Application of discriminant analysis and Mahalanobis distance to the family planning data, Personal Communication.
Kumar, K. and Srivastava, S. (1989) An analysis of the profile of accepters of family welfare programme in India, Paper presented in the International Union of the Scientific Study of the Population Conference, New Delhi.
Lachenbruch, P.A. (1967) An almost unbiased method of obtaining confidence intervals for the probability of misclassification in discriminant analysis, Biometrics, 23, pp 639-645.
Lachenbruch, P.A. (1975) Discriminant Analysis, New York: Hafner Press.
Lachenbruch, P.A. and Mickey, M.R. (1968) Estimation of error rates in discriminant analysis, Technometrics, 10, pp 1-11.
Lachenbruch, P.A., Sneeringer, C. and Revo, L.T. (1973) Robustness of the linear and quadratic discriminant function to certain types of non-normality, Communications in Statistics, 1, pp 39-56.
Laurie, P. (1979) Beneath the City Streets, London: Granada Publishing.
LeBlanc, M. and Crowley, J. (1993) Survival trees by goodness of split, Journal of the American Statistical Association, 88, pp 457-467.
Lehmann, E.L. (1975) Nonparametrics: Statistical Methods Based on Ranks, San Francisco: Holden-Day, Inc.
Lesaffre, E., Williams, J.L. and Albert, A. (1989) Estimation of error rates in multiple group logistic discrimination analysis. The approximate leaving-one-out method, Communications in Statistics - Simulation and Computation, 18, pp 2989-3007.
Loh, W.Y. and Vanichsetakul, N. (1988) Tree-structured classification via generalised discriminant analysis, Journal of the American Statistical Association, 83, pp 715-725.
Lubischew, A.A. (1962) On the use of discriminant functions in taxonomy, Biometrics, 18, pp 455-477.
Lynn, R.D. and Brook, R.J. (1991) Classification by decision trees and discriminant analysis, New Zealand Statistician, 26, pp 18-26.
Lynn, R.D., Brook, R.J. and Arnold, G.C. (1993) A comparison of four classification methods: linear and quadratic discriminant analysis, CART and FACT, Mathematical and Information Sciences Report, Series B: 1, Massey University.
McLachlan, G.J. (1974) Estimation of the errors of misclassification on the criterion of asymptotic mean square error, Technometrics, 16, pp 255-260.
McLachlan, G.J. (1977) A note on the choice of a weighting function to give an efficient method for estimating the probability of misclassification, Pattern Recognition, 9, pp 147-149.
McLachlan, G.J.
(1986) Assessing the performance of an allocation rule, Computing and Mathematics with Applications, 12A, pp 261-272.
McLachlan, G.J. (1987) Error rate estimation in discriminant analysis: recent advances, in "Advances in Multivariate Statistical Analysis", A.K. Gupta (ed), Amsterdam: Dordrecht, pp 235-252.
McLachlan, G.J. (1992) Discriminant Analysis and Statistical Pattern Recognition, New York: John Wiley.
Mabbett, A., Stone, M. and Washbrook, J. (1980) Cross-validatory selection of binary variables in differential diagnosis, Applied Statistics, 29, pp 198-204.
Margolese, M.M. (1970) Homosexuality: a new endocrine correlate, Hormones and Behaviour, 1, pp 151-155.
Marks, S. and Dunn, O.J. (1974) Discriminant functions when covariance matrices are unequal, Journal of the American Statistical Association, 69, pp 555-559.
Marshall, R.J. (1986) Partitioning methods for classification and decision making in medicine, Statistics in Medicine, 5, pp 517-526.
Marshall, R.J. (1990) Non-hierarchical partition analysis for classification and for identification of subgroups in medical research, Personal Communication.
Messenger, R.C. and Mandell, M.L. (1972) A modal search technique for predictive nominal scale multivariate analysis, Journal of the American Statistical Association, 67, pp 768-772.
Michie, D. (1986) Current developments in expert systems, in "Applications of Expert Systems, Volume 1", J.R. Quinlan (ed), Wokingham: Addison-Wesley, pp 137-156.
Michie, D. (1989) Problems of computer-aided concept formation, in "Applications of Expert Systems, Volume 2", J.R. Quinlan (ed), Wokingham: Addison-Wesley, pp 310-333.
Mingers, J. (1989) An empirical comparison of selection measures for decision-tree induction, Machine Learning, 3, pp 319-342.
Minitab Inc., Version 9, State College, Pennsylvania.
Moore, D.H. II (1973) Evaluation of five discrimination procedures for binary variables, Journal of the American Statistical Association, 68, pp 399-404.
Moore, D.S. and McCabe, G.P. (1989) Introduction to the Practice of Statistics, San Francisco: Freeman.
Morgan, J.N. (1990) A conditional analysis of movers' housing responses, Journal of Economic Behaviour and Organisation.
Morgan, J.N. (1993) Personal Communication.
Morgan, J.N. and Messenger, R.C. (1973) THAID: a Sequential Search Program for the Analysis of Nominal Scale Dependent Variables, Ann Arbor: Institute for Social Research, University of Michigan.
Morgan, J.N. and Sonquist, J.A. (1963) Problems in the analysis of survey data, and a proposal, Journal of the American Statistical Association, 58, pp 415-434.
Morton, S.C. (1992) Personal crunching: new advances in statistical dendrology, Chance, 5, pp 76-79.
Muxworthy, D.T. (1972) Review of AID III, British Sociological Association Maths, Statistics and Computing Applications Group Newsletter, 9.
Oliver, J.L. (1993) Decision graphs - an extension of decision trees, to appear in Artificial Intelligence and Statistics, 14.
O'Neill, J.L. (1986) Knowledge acquisition for radar classification, in "Applications of Expert Systems, Volume 1", J.R. Quinlan (ed), Wokingham: Addison-Wesley, pp 184-199.
Pawlak, Z., Wong, S.K.M. and Ziarko, W. (1988) Rough sets: probabilistic versus deterministic approach, International Journal of Man-Machine Studies, 29, pp 81-95.
Picard, R.R. and Berk, K.N. (1990) Data splitting, American Statistician, 44, pp 140-147.
Prada Sanchez, J.M. and Otero Cepeda, X.L. (1989) The use of smooth bootstrap techniques for estimating the error rate of a prediction rule, Communications in Statistics - Simulation and Computation, 18, pp 1169-1186.
Quenouille, M. (1949) Approximate tests of correlation in time series, Journal of the Royal Statistical Society Series B, 11, pp 68-84.
Quenouille, M. (1956) Notes on bias estimation, Biometrika, 43, pp 356-360.
Quinlan, J.R. (1979) Discovering rules from large collections of examples: a case study, in "Expert Systems in the Micro Electronic Age", D. Michie (ed), Edinburgh: Edinburgh University Press.
Quinlan, J.R. (1983) Learning efficient classification procedures and their application to chess end games, in "Machine Learning, an Artificial Intelligence Approach, Volume 1", R.S. Michalski, J.G. Carbonell and T.M. Mitchell (eds), Palo Alto, California: Tioga, pp 463-482.
Quinlan, J.R. (1986) Induction of decision trees, Machine Learning, 1, pp 81-106.
Quinlan, J.R. (1987) Simplifying decision trees, International Journal of Man-Machine Studies, 27, pp 221-234.
Quinlan, J.R. (1993) Comparing connectionist and symbolic learning methods, in "Computational Learning Theory and Natural Learning Systems: Constraints and Prospects", S. Hanson, G. Drastal and R. Rivest (eds), Cambridge, Massachusetts: MIT Press.
Quinlan, J.R. and Rivest, R.L. (1989) Inferring decision trees using the minimum description length principle, Information and Computation, 80, pp 227-248.
Quinlan, J.R., Compton, P.J., Horn, K.A. and Lazarus, L. (1986) Inductive knowledge acquisition: a case study, in "Applications of Expert Systems, Volume 1", J.R. Quinlan (ed), Wokingham: Addison-Wesley, pp 157-173.
Raveh, A.A. (1989) A nonmetric approach to linear discriminant analysis, Journal of the American Statistical Association, 84, pp 176-183.
Rawlings, J.O. (1988) Applied Regression: A Research Tool, Belmont, California: Wadsworth.
Rutter, C., Flack, V. and Lachenbruch, P.A. (1991) Bias in error rate estimates in discriminant analysis when stepwise variable selection is employed, Communications in Statistics, 20, pp 1-22.
Safavian, S.R. and Landgrebe, D. (1991) A survey of decision tree classification methodology, IEEE Transactions on Systems, Man and Cybernetics, 21, pp 660-674.
Salahuddin and Hawkes, A.G. (1991) Cross-validation in stepwise regression, Communications in Statistics - Theory and Methods, 20, pp 1163-1182.
Sankar, A. and Mammone, R.J. (1991) Neural tree networks, in "Neural Networks: Theory and Applications", R.J. Mammone and Y.Y. Zeevi (eds), Boston: Harcourt Brace Jovanovich, pp 281-302.
SAS Institute Inc., Version 6.04, Cary, North Carolina.
Schaffer, C. (1993) Selecting a classification method by cross-validation, Personal Communication.
Schwartz, S., Wiles, J., Gough, I. and Phillips, S. (1993) Connectionist, rule-based and Bayesian decision aids: an empirical comparison, in "Artificial Intelligence Frontiers in Statistics", D.J. Hand (ed), London: Chapman & Hall, pp 264-278.
Seber, G.A.F. (1984) Multivariate Observations, New York: John Wiley.
Segal, M.R. (1988) Regression trees for censored data, Biometrics, 44, pp 35-47.
Shlien, S. (1990) Multiple binary decision tree classifiers, Pattern Recognition, 23, pp 757-763.
Snapinn, S.M. and Knoke, J.O.
(1985) An evaluation of smoothed classification error-rate estimators, Technometrics, 27, pp 199-206.
Snapinn, S.M. and Knoke, J.O. (1989) Estimation of error rates in discriminant analysis with selection of variables, Biometrics, 45, pp 289-299.
Sonquist, J.A. (1970) Multivariate Model Building: The Validation of a Search Strategy, Ann Arbor: Institute for Social Research, University of Michigan.
Taylor, P.C. and Silverman, B.W. (1993) Block diagrams and splitting criteria for classification trees, Statistics and Computing, 3, pp 147-161.
Todeschini, R. and Marengo, E. (1992) Linear discriminant classification tree: a user-driven multi-criteria classification method, Chemometrics and Intelligent Laboratory Systems, 16, pp 25-35.
Toussaint, G.T. (1974) Bibliography on estimation of misclassification, IEEE Transactions on Information Theory, 20, pp 472-479.
Wahl, P.W. and Kronmal, R.A. (1977) Discriminant functions when covariances are unequal and sample sizes are moderate, Biometrics, 33, pp 479-484.
Wernecke, K-D. and Kalb, G. (1983) Further results in estimating the classification error in discriminance analysis, Biometrical Journal, 25, pp 247-258.
Wernecke, K-D., Kalb, G. and Sturzebecher, E. (1980) Comparison of various procedures for estimation of the classification error in discriminance analysis, Biometrical Journal, 22, pp 639-649.
Wolberg, W.H., Tanner, M.A., Loh, W.Y. and Vanichsetakul, N. (1987) Statistical approach to fine needle aspiration diagnosis of breast masses, Acta Cytologica, 31, pp 731-741.

ADDENDA

p5, l2: change "is defined to be" to "assigns a random observation x to population πi if".
p5, l-5: change "Let us" to "In the case of two p-dimensional multivariate normal populations (the general multivariate ellipsoidal case is not considered here), let us".
p5, l-1: in formula (2.2.1) change "Σ-1" to "Σi-1".
p6, l11: omit "much".
p7: in the formula after (2.2.9) change "R1(T) = [...]" to "R1(T) = Pr[...]" and change "ln(π1/π2)".
p7, l-1: after "cumulative" add "standardised".
p9, l11: change "Dji < 0 ∀ i > j".
p9, l16: change "D12(x)" to "D13(x)".
p42, l7: change "C1 and C2" to "SC1 and SC2".
p48, l-10: change "L = J + L(T)/2" to "E + 1/2".
p65, l5: after "error rate" add "estimate".
p66, Note: the discussion from (4.2.2) on is restricted to multivariate normal data using LDA.
p69: on a new line after (4.2.7) add "where nij is the number of observations from πj falsely classified to πi, i ≠ j".
p74, l12: after "(e)." add "The above model involves a full factorial design. It may have been better to adopt some form of fractional design, so allowing a wider range of factors to be explored. However, as is almost always the case in simulation studies, the possible range of factors that can be explored is vast, so that a line has to be drawn somewhere."
p74, l-4: change "xij is lognormal" to "log(xij) is normal (0, 1)".
p74, l-3: delete "which is lognormal (0, 1)".
p74, l-1: on a new line add "In summary, univariate normal data, xj, was generated for each dimension j, then transformed to yj = exp(xj).
Finally, the data was standardised giving marginals with mean 0 and standard deviation 1, that is, zj = (yj - ȳj)/s(yj). The ranking of observations has not changed after standardisation, though the zj are independent (uncorrelated) as the xj are independent." (A sketch of this generation scheme is given following these addenda.)
p75, l-3: after "(1990)." add "The two tree-based methods were chosen because of their ready availability and because they represent two different approaches to tree-based classification. LDA and QDA were selected because they are the two most commonly used classification methods."
p76, l-14: after "here." add "It may have been preferable to have used a separate test set instead of the cross-validation method, which does introduce possible error. It was decided to use cross-validation instead of an independent test set as the former is used more often in the real world, as large test sets are usually unavailable."
p82, l8: after "was used." add "This decrease in LDA error rate between normal and lognormal is most probably due to the effects of standardisation, which maintains the theoretical covariance differences, hence distances between populations. However, as the distributions of the two populations are skewed, the lower 75% of the distribution will be bunched together around a high peak, thus closer to the respective class mean than in the case of normally distributed data. The net effect is that fewer observations are misclassified for standardised lognormal data."
p83, l-9: after "transformation." add "In this case, for pure lognormally distributed data, the actual values of δ will be different from those given in Section 4.3.1."
p83, l-7: after "true" add "(pure)".
p83, l-6: change "log[f(.)]" to "ln[f(x)]".
p91, l-2: after "data" add ", though this alters the correlations between variables so that the covariance matrices are different from those in (4.3.2)."
p92, l3: change "covariance" to "variance".
p93, l-5: insert a transpose symbol (') between ")" and "S-1".
p94, l-1: on a new line add "though absolute differences are used in the graphs to make for easier comparison between methods."
p96, l3: after "rates" add "(to make the graph easier to read)".
p97 ff: "e" classifications as on p75.
p116: after l-7 add "with λj > 0, j = 1, 2."
p117, l-13: after "theirs." add "Binary data rather than general categorical data was also used for simplicity. Using categorical variables containing more than two categories would involve creating a large number of binary variables to use in LDA and QDA."
p117, l-11: after "n" add "(total sample size)".
p117, l-8: after "rijk = 0.25" add "∀ j ≠ k".
p119: omit "Level" from Table 5.2.
p121: omit "pij =" from Table 5.4.
p122, l-10: after "of R(T)." add "The mean square error (MSE) criterion was used to compare different error rate estimators for each method (see p108)."
p126: omit "Level" from Table 5.7.
p130, l-7: change "of the sample" to "associated with the classification tree".
p132, l-2: change "the F-ratio should not be regarded as a true measure of the statistical significance of each result." to "a statistically significant F-ratio may not be of substantive significance."
" p 1 6 5 1 I - 9 : change "vari a n ces" to "cova riance matri ces". p1 7 1 1 I - 1 2 : chan g e "vari b l e" to "vari a b l e ... p 1 7 8 : i n Ta b l e 7 . 6 ( a nd a t t h e bottom of p 1 7 8 ) 1 cha nge "R( ilj )" to " L:jR(i/j )" . p 1 8 4 1 1 1 6 : after " s i g n ifica nt" a d d "(with u n i variate tests)" . p 1 8 6 1 I 1 4 : convert "chan ge" to "changed". p 1 8 7 : N ote: H u sb_Edu a n d Wife_Edu a re o rd inal categorica l vari a b l e s with v a l u e s rang i ng from zero to seven and refer to t h e education level o f the husband a n d wife respect i v ely. N o_ C h i l d refers to the numbe r of ch i l d ren in the fa m i l y . p 2 0 9 1 1 -8 : o m i t " e n o u g h " . p2 3 1 1 / 1 1 : c h a n g e " s i g n ifica nt" to " m aj or". p 2 3 6 : on a new l i n e after / 1 2 add " U S 2 3 univariate s p l it". P 2 3 7 I 8 . change "C C " to "SC SC " I ? 1 1 2 1 I 2 ? p2 38: p239: o n a n ew l i ne after / 1 2 add "nii 69 the n u m be r of observati o n s from rri fa l s e l y classified t o IT/' . after " B re i m a n , . . . ( 1 984) . . . " add the reference "B rown, D .E . , C o rru b l e , V. a n d P i ttard , C. L . ( 1 993)" A comparison of dec i s i o n tree class ifi e rs with b a ck? propagation neural n etworks for m u l t i m o d a l c l a s s ifi cation p ro b l e m s , P atte rn Reco g n it ion, 26, pp 9 53-96 1 . " p239 ( 2 ) : Note: The page n u m bers for the B i b l i og raphy a re wrong. Th i s s e?t i o n s h o u l d start o n p24 1 a n d a l l oth e r pages shou l d be put back two pages.