Research on adjacent matrix for K-means clustering : a thesis presented for the degree of Master of Computer Science in School of Natural and Computational Sciences at Massey University, Auckland, New Zealand
Machine learning is playing a vital role in our modern world. Depending
on whether the data has labels or not, machine learning mainly contains
three categories, i.e., unsupervised learning, supervised learning, and
semi-supervised learning. As labels are usually difficult and expensive to
be obtained, unsupervised learning is more popular, compared to supervised
learning and semi-supervised learning. Moreover, k-means clustering
is very popular in the domain of unsupervised learning. Hence, this
thesis focuses on the improvement of previous k-means clustering.
K-means clustering has been widely applied in real applications due
to its linear time complexity and ease of implementation. However, kmeans
clustering is limited to its applicability due to the issues, such as
identification of the cluster number k, initialisation of centroids, as well
as the definition of similarity measurements for evaluating the similarity
between two data points. Hence, k-means clustering is still a hot research
topic in unsupervised learning. In this thesis, we propose to improve traditional
k-means clustering by designing two different similarity matrices
to represent the original data points.
The first method first constructs a new representation (i.e., an adjacent
matrix) to replace the original representation of data points, and then runs
k-means clustering on the resulted adjacent matrix. In this way, our proposed
method benefits from the high-order similarity among data points
to capture the complex structure inherent in data points as well as avoids
the time-consuming process of eigenvectors decomposition in spectral
The second method takes into account the weights of the features to
improve the former method, based on the assumption that different features
contain different contributions to the construction of the clustering
models. As a result, it makes the clustering model more robust, compared
to the first method as well as previous clustering methods.
Finally, we tested our proposed clustering methods on public UCI datasets.
Experimental results showed the clustering results of our proposed
methods significantly outperformed the comparison methods in terms of
three evaluation metrics.