Complexity measurement for dealing with class imbalance problems in classification modelling : a thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, Massey University, 2012

dc.contributor.author: Anwar, Muhammad Nafees
dc.date.accessioned: 2013-04-16T01:53:59Z
dc.date.available: 2013-04-16T01:53:59Z
dc.date.issued: 2012
dc.description.abstract: The class imbalance problem is a challenge in the statistical, machine learning and data mining domains. Examples include fraud/intrusion detection, medical diagnosis/monitoring, bioinformatics, text categorization, insurance claims, and target marketing. The problem with imbalanced datasets is that conventional classifiers (both statistical and machine learning algorithms) aim at maximizing overall accuracy, which is often achieved by allocating all, or almost all, cases to the majority class. Thus there tends to be bias against the minority class in class imbalance situations. Despite numerous algorithms and re-sampling techniques proposed in the last few decades to tackle imbalanced classification problems, there is no consistent winning strategy for all datasets (in terms of either sampling or learning algorithm). Special attention needs to be paid to the data at hand. In doing so, one should take into account several factors simultaneously: the imbalance rate, the data complexity, the algorithms and their associated parameters. As suggested in the literature, mining such datasets can only be improved by algorithms tailored to data characteristics; therefore it is important and necessary to carry out exploratory data analysis before deciding on a learning algorithm or re-sampling technique. In this study, we have developed a framework, "Complexity Measurement" (CM), to explore the connection between the imbalanced data problem and data complexity. Our study shows that CM is an ideal candidate to be recognized as a "goodness criterion" for various classifiers, re-sampling and feature selection techniques in the class imbalance framework. We have used CM as a meta-learner to choose the classifier and under-sampling strategy that best fits the situation. We design a systematic over-sampling technique, Over-sampling using Complexity Measurement (OSCM), for dealing with class overlap.
Using OSCM, we do not need to search for an optimal class distribution in order to get favorable accuracy for the minority class, since the amount of over-sampling is determined by the complexity; ideally, CM would detect fine structural differences (class overlap and small disjuncts) between different classes. Existing feature selection techniques were never meant for class-imbalanced data. We propose Feature Selection using Complexity Measurement (FSCM), which can focus specifically on the minority class, so that features (and multivariate interactions between predictors) that form a better model for the minority class can be selected. The methods developed have been applied to real datasets. The results from imbalanced datasets show that CM, OSCM and FSCM are effective as a systematic way of correcting class imbalance/overlap and improving classifier performance. Highly predictive models were built, discriminating patterns were discovered, and automated optimization was proposed. The methodology proposed and knowledge discovered will benefit exploratory data analysis for imbalanced datasets. It may be taken as a judging criterion for new algorithms and re-sampling techniques. Moreover, new datasets may be evaluated using our CM criterion in order to build a sensible model. [en]
dc.identifier.uri: http://hdl.handle.net/10179/4287
dc.language.iso: en [en]
dc.publisher: Massey University [en_US]
dc.rights: The Author [en_US]
dc.subject: Computational complexity [en]
dc.subject: Class imbalance [en]
dc.subject: Classification [en]
dc.subject: Sampling [en]
dc.subject: Statistics [en]
dc.title: Complexity measurement for dealing with class imbalance problems in classification modelling : a thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, Massey University, 2012 [en]
dc.type: Thesis [en]
massey.contributor.author: Anwar, Muhammad Nafees [en]
thesis.degree.discipline: Statistics [en]
thesis.degree.grantor: Massey University [en]
thesis.degree.level: Doctoral [en]
thesis.degree.name: Doctor of Philosophy (Ph.D.) [en]
Files
Original bundle
Name: 02_whole.pdf, Size: 945.28 KB, Format: Adobe Portable Document Format
Name: 01_front.pdf, Size: 79.9 KB, Format: Adobe Portable Document Format
License bundle
Name: license.txt, Size: 804 B, Format: Item-specific license agreed upon to submission