
    A comparison of tree-based and traditional classification methods : a thesis presented in partial fulfilment of the requirements for the degree of PhD in Statistics at Massey University

    View/Open Full Text
    02_whole.pdf (7.522Mb)
    01_front.pdf (1.347Mb)
    Abstract
    Tree-based discrimination methods provide a way of handling classification and discrimination problems by using decision trees to represent the classification rules. The principal aim of tree-based methods is to segment a data set recursively so that the resulting subgroups are as homogeneous as possible with respect to the categorical response variable. Real-world problems often involve cases on which a number of measurements (variables) have been taken. Traditionally, in such circumstances involving two or more groups or populations, researchers have used parametric discrimination methods, namely linear and quadratic discriminant analysis, as well as the well-known non-parametric kernel density estimation and kth nearest neighbour rules. In this thesis, all of the above methods are considered and presented from a methodological point of view. Tree-based methods are summarised in chronological order of introduction, beginning with the Automatic Interaction Detector (AID) method of Morgan and Sonquist (1963) through to the IND method of Buntine (1992).

    Given a set of data, the proportion of observations incorrectly classified by a prediction rule is known as the apparent error rate. This rate is known to underestimate the actual or true error rate associated with the discriminant rule. Various methods for estimating the actual error rate are considered. Cross-validation is one such method: each observation in turn is omitted from the data set, a classification rule is calculated from the remaining (n-1) observations, and the omitted observation is then classified. This is carried out n times, once for each observation, and the total number of misclassified observations is used to estimate the error rate.
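    The leave-one-out cross-validation procedure described above can be sketched in a few lines. The thesis gives no code, so this is a minimal illustration, not the author's implementation; it pairs the procedure with a 1-nearest-neighbour rule, one of the traditional non-parametric methods the abstract mentions.

    ```python
    # Leave-one-out cross-validation: each observation is held out in turn,
    # a rule is fit on the remaining n-1 observations, and the held-out
    # case is classified. The total misclassification proportion estimates
    # the true error rate. A 1-nearest-neighbour rule stands in for the
    # classifier; any discriminant rule could be substituted.

    import math

    def nearest_neighbour_predict(train, query):
        """Classify `query` by the label of its nearest training point."""
        best_label, best_dist = None, math.inf
        for features, label in train:
            d = math.dist(features, query)
            if d < best_dist:
                best_dist, best_label = d, label
        return best_label

    def loocv_error_rate(data):
        """Proportion of observations misclassified under leave-one-out CV."""
        errors = 0
        for i, (features, label) in enumerate(data):
            remaining = data[:i] + data[i + 1:]   # omit observation i
            if nearest_neighbour_predict(remaining, features) != label:
                errors += 1
        return errors / len(data)

    # Two well-separated groups: every held-out case is classified correctly.
    data = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
            ((5.0, 5.0), "B"), ((5.1, 5.2), "B"), ((5.2, 5.1), "B")]
    print(loocv_error_rate(data))  # → 0.0
    ```

    Note that the rule is refit n times, so for expensive classifiers (such as tree growing with pruning) the cost of leave-one-out cross-validation scales linearly with the sample size.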
Simulated continuous explanatory data were used to compare the performance of two traditional discrimination methods, linear and quadratic discriminant analysis, with two tree-based methods, Classification and Regression Trees (CART) and the Fast Algorithm for Classification Trees (FACT), using cross-validation error rates. The results showed that linear and/or quadratic discriminant analysis is preferred for normal, less complex data and parallel classification problems, while CART is best suited to lognormal, highly complex data and sequential classification problems. Simulation studies using categorical explanatory data also showed linear discriminant analysis to work best for parallel problems and CART for sequential problems; CART was also preferred for smaller sample sizes. FACT was found to perform poorly for both continuous and categorical data. Simulation studies involving the CART method alone identified situations in which the 0.632 error rate estimate is preferred to cross-validation, and the one standard error rule to the zero standard error rule. Studies undertaken using real data sets showed that most of the conclusions drawn from the continuous and categorical simulation studies were valid. Recommendations are made, both from the literature and from personal findings, as to which characteristics of tree-based methods are best suited to particular situations. Final conclusions are given, and proposals for future research on the development of tree-based methods are discussed. A question worth considering in any future research in this area is the use of non-parametric tests for determining the best splitting variable.
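    The 0.632 error rate estimate compared against cross-validation above is Efron's bootstrap-based estimator, which blends the optimistic apparent error with the error on bootstrap out-of-sample cases. The abstract does not print the formula; the weights below follow the standard definition, given here as background rather than as text from the thesis.

    ```python
    # Efron's 0.632 estimator: a weighted blend of the apparent error
    # (rule tested on its own training data, optimistically biased) and
    # the error on observations left out of the bootstrap samples
    # (pessimistically biased). The 0.632 weight reflects the expected
    # fraction of distinct observations appearing in a bootstrap sample.

    def err_632(apparent_error, bootstrap_oob_error):
        """0.632 estimate: 0.368 * apparent + 0.632 * out-of-sample error."""
        return 0.368 * apparent_error + 0.632 * bootstrap_oob_error

    # e.g. apparent rate 5%, bootstrap out-of-sample rate 20%
    print(err_632(0.05, 0.20))
    ```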
    Date
    1994
    Author
    Lynn, Robert D
    Rights
    The Author
    Publisher
    Massey University
    URI
    http://hdl.handle.net/10179/3915
    Collections
    • Theses and Dissertations

    Copyright © Massey University
    | Contact Us | Feedback | Copyright Take Down Request | Massey University Privacy Statement
    DSpace software copyright © Duraspace
    v5.7-2023.7-7