A study of frequent pattern mining in transaction datasets : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Palmerston North, New Zealand
Within data mining, the efficient discovery of frequent patterns—sets of items
that occur together in a dataset—is an important task, particularly in transaction
datasets. This thesis develops effective and efficient algorithms for frequent
pattern mining, and considers the related problem of how to learn, and utilise, the
characteristics of the particular datasets being investigated.
The first problem considered is how to mine frequent closed patterns in dynamic
datasets, where updates to the dataset are performed. The standard approach to
this problem is to use a standard pattern mining algorithm and simply rerun it
on the updated dataset. An alternative method is proposed in this thesis that is
significantly more efficient provided that the size of the updates is relatively small.
Following this is an investigation of the pattern support distribution of transaction
datasets, which measures the numbers of times each pattern appears within the
dataset. The evidence for the pattern support distribution of real retail datasets
obeying a power law is investigated using qualitative appraisals and statistical
goodness-of-fit tests, and the power law is found to be a good model. Based on
this, the thesis demonstrates how to efficiently estimate the pattern support distribution
based on sampling techniques, reducing the computational cost of finding
The last major contribution of the thesis is to consider novel ways to set the main
user-specified parameters of frequent pattern mining, the minimum support, which
defines how many times a pattern needs to be seen before it is ‘frequent’. This is a
critical parameter, and very hard to set without a lot of knowledge of the dataset.
A method to enable the user to specify rather looser requirements for what they
require from the mining is proposed based on the assumption of a power-law-based
pattern support distribution and fuzzy logic techniques.