Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author.

Improved K-means Clustering Algorithms

A thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University

Tong Liu
2020

Abstract

The K-means clustering algorithm divides samples into subsets with the goal of maximizing intra-subset similarity and inter-subset dissimilarity, where similarity measures the relationship between two samples. As an unsupervised learning technique, K-means is considered one of the most widely used clustering algorithms and has been applied in a variety of areas such as artificial intelligence, data mining, biology, psychology, marketing, and medicine. However, K-means is not robust, and its clustering result depends on the initialization, the similarity measure, and the predefined cluster number. Previous research has focused on solving some of these issues individually, but not on solving them in a unified framework; fixing one issue alone does not guarantee the best performance. Improving K-means, one of the most famous and widely used clustering algorithms, by solving its issues simultaneously is therefore both challenging and significant. This thesis conducts extensive research on the K-means clustering algorithm with the aim of improving it. First, we propose the Initialization-Similarity (IS) clustering algorithm to solve the initialization and similarity-measure issues of K-means in a unified way. Specifically, we propose to fix the initialization of the clustering by using sum-of-norms (SON) regularization, which outputs a new representation of the original samples, and to learn the similarity matrix based on the data distribution.
Furthermore, the derived new representation is used to conduct K-means clustering. Second, we propose a Joint Feature Selection with Dynamic Spectral (FSDS) clustering algorithm to solve the issues of cluster number determination, the similarity measure, and the robustness of the clustering by selecting effective features and reducing the influence of outliers simultaneously. Specifically, we propose to learn the similarity matrix based on the data distribution and to add a rank constraint on the Laplacian matrix of the learned similarity matrix so that the cluster number is output automatically. Furthermore, the proposed algorithm employs the L2,1-norm as a sparsity constraint on the regularization term and on the loss function to remove redundant features and to reduce the influence of outliers, respectively. Third, we propose a Joint Robust Multi-view (JRM) spectral clustering algorithm that clusters multi-view data while solving the initialization issue, cluster number determination, similarity measure learning, removal of redundant features, and reduction of outlier influence in a unified way. Finally, the proposed algorithms outperformed state-of-the-art clustering algorithms on real data sets. Moreover, we theoretically prove the convergence of the proposed optimization methods for the proposed objective functions.

Dedication

In memory of my mother and father

Acknowledgments

I would like to express my sincere gratitude to my supervisors, Associate Professor XiaoFeng Zhu and Distinguished Professor Gaven Martin, for all your guidance, encouragement, and kindness. I am deeply thankful for the support and encouragement from Professor Dianne Brunton, Associate Professor Alona Ben-Tal, Associate Professor Evelyn Sattlegger, and Ms. Linh Mills.

Publications Arising from this Thesis

Publications included in this thesis:

• Zhou, J., Liu, T., & Zhu, J. Weighted adjacent matrix for K-means clustering.
Multimedia Tools and Applications, 2019. 78: p. 33415–33434, corresponding author: Liu, T. Incorporated as Chapters 1 and 2.
• Liu, T., Zhu, J., Zhou, J., Zhu, Y., & Zhu, X. Initialization-similarity clustering algorithm. Multimedia Tools and Applications, 2019. 78: p. 33279–33296. Incorporated as Chapter 3.
• Liu, T., & Martin, G. Joint Feature Selection with Dynamic Spectral Clustering. Revised version submitted to Neural Processing Letters (Springer). DOI: 10.1007/s11063-020-10216-9. Incorporated as Chapter 4.
• Liu, T., Martin, G., Zhu, Y., Peng, L., & Li, L. Joint Robust Multi-view Spectral Clustering. Submitted to Neural Processing Letters (Springer). DOI: 10.1007/s11063-020-10257-0. Incorporated as Chapter 5.

Table of Contents

Table of Contents
List of Tables
List of Figures
List of Notations
Chapter 1 Introduction
  1.1 Motivation
  1.2 Research Objectives
  1.3 Thesis Structure
Chapter 2 Literature Review
  2.1 Clustering Algorithms
  2.2 Feature Selection
  2.3 Outlier Reduction
  2.4 Evaluation Measure
  2.5 Summary
Chapter 3 Initialization-Similarity Clustering Algorithm
  3.1 Introduction
  3.2 Motivation
  3.3 Proposed Algorithm
  3.4 Optimization
  3.5 Convergence Analysis
  3.6 Experiments
    3.6.1 Data Sets
    3.6.2 Comparison Algorithms
    3.6.3 Experiment Setup
    3.6.4 Experimental Results Analysis
    3.6.5 Parameters' Sensitivity
    3.6.6 Convergence
  3.7 Conclusion
Chapter 4 Joint Feature Selection with Dynamic Spectral Clustering
  4.1 Introduction
  4.2 Motivation
  4.3 Proposed Algorithm
  4.4 Optimization
  4.5 Convergence Analysis
  4.6 Experiments
    4.6.1 Data Sets
    4.6.2 Comparison Algorithms
    4.6.3 Experiment Setup
    4.6.4 Experimental Results Analysis
    4.6.5 Parameters' Sensitivity
    4.6.6 Convergence
  4.7 Conclusion
Chapter 5 Joint Robust Multi-view Spectral Clustering
  5.1 Introduction
  5.2 Motivation
  5.3 Proposed Algorithm
  5.4 Optimization
  5.5 Convergence Analysis
  5.6 Experiments
    5.6.1 Data Sets
    5.6.2 Comparison Algorithms
    5.6.3 Experiment Setup
    5.6.4 Experimental Results Analysis
    5.6.5 Parameters' Sensitivity
    5.6.6 Convergence
  5.7 Conclusion
Chapter 6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Directions
References

List of Tables

Table 3.1 The pseudo code for K-means clustering algorithm
Table 3.2 The pseudo code for the spectral clustering algorithm
Table 3.3 Description of ten benchmark data sets
Table 3.4 ACC results of IS algorithm on ten benchmark data sets
Table 3.5 NMI results of IS algorithm on ten benchmark data sets
Table 3.6 Purity results of IS algorithm on ten benchmark data sets
Table 4.1 Description of benchmark datasets
Table 4.2 ACC results of FSDS algorithm on eight benchmark data sets
Table 4.3 Purity results of FSDS algorithm on eight benchmark data sets
Table 5.1 The six multi-view benchmark data sets
Table 5.2 ACC results of JRM algorithm on six multi-view data sets
Table 5.3 Purity results of JRM algorithm on six multi-view data sets

List of Figures

Figure 1.1 K-means Flowchart
Figure 1.2 Research Framework
Figure 3.1 ACC results of IS algorithm on ten benchmark data sets
Figure 3.2 NMI results of IS algorithm on ten benchmark data sets
Figure 3.3 Purity results of IS algorithm on ten benchmark data sets
Figure 3.4 ACC results of IS algorithm with respect to different parameter settings
Figure 3.5 NMI results of IS algorithm with respect to different parameter settings
Figure 3.6 Purity results of IS algorithm with respect to different parameter settings
Figure 3.7 Objective function values (OFVs) versus iterations for IS algorithm
Figure 4.1 ACC results of FSDS algorithm on eight benchmark data sets
Figure 4.2 Purity results of FSDS algorithm on eight benchmark data sets
Figure 4.3 ACC results of FSDS algorithm with respect to different parameter settings
Figure 4.4 Purity results of FSDS algorithm with respect to different parameter settings
Figure 4.5 Objective function values (OFVs) versus iterations for FSDS algorithm
Figure 5.1 ACC results of JRM algorithm on six real data sets
Figure 5.2 Purity results of JRM algorithm on four real data sets
Figure 5.3 ACC results of JRM algorithm with respect to different parameter settings
Figure 5.4 Purity results of JRM algorithm with respect to different parameter settings
Figure 5.5 Objective function values (OFVs) versus iterations for JRM algorithm

List of Notations

Symbol      Description
X           Data matrix
x           A vector of X
x_i         The i-th row of X
x_{i,j}     The element in the i-th row and j-th column of X
‖x‖_2       L2-norm of x
‖X‖_{2,1}   L2,1-norm of X
‖X‖_F       The Frobenius norm (Euclidean norm) of X
X^T         The transpose of X
K           Cluster number
V           Number of views
v           View index
X^v         Data matrix in the v-th view

Chapter 1 Introduction

Machine learning is a subfield of artificial intelligence that gives machines the ability to learn and improve automatically without being explicitly programmed to do so.
If the machine learns to label the data automatically, without knowing the pattern of the data beforehand, this type of machine learning is called unsupervised learning. Unsupervised learning is important because it is often difficult to know the pattern of the data in advance [1, 2]. As an unsupervised learning technique, clustering divides a given data set into groups with the goal of maximizing the similarity of data points in the same group and the dissimilarity of data points in different groups [3]. Clustering has been widely applied in scientific data analysis, data mining, biology, psychology, marketing, medicine, insurance, and other fields [4-9]. A search via Google Scholar on Dec 14, 2019 found over 4.1 million entries for the keyword "clustering". The K-means clustering algorithm is one of the most popular and widely used clustering algorithms, and it has been used as a component of many other algorithms because it is simple, reliable, and mathematically tractable [10, 11].

1.1 Motivation

The K-means clustering algorithm operates in the following steps. First, it initializes the cluster centers by randomly selecting K data points as the K cluster centers. Second, it assigns each data point to its nearest cluster center according to a similarity measure, e.g., Euclidean distance. Third, it revises each of the K cluster centers as the mean of its assigned data points. K-means keeps repeating the last two steps until the algorithm converges [12]. The flow chart of the K-means clustering algorithm is shown in Figure 1.1. K-means is considered one of the most used clustering algorithms and has been successfully applied to broad areas. Previous research has addressed some of the issues of K-means, but it has not addressed the limitations of the algorithm in a unified manner.
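The three-step procedure described above can be sketched in code. The following is a minimal NumPy illustration of standard K-means (Lloyd's algorithm), not any of the improved algorithms proposed in this thesis; the two-blob data set and the empty-cluster guard are illustrative assumptions:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm), following the three steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select K data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: revise each center as the mean of its assigned points
        # (keep the old center if a cluster happens to become empty).
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):  # repeat until convergence
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs; K-means should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels, centers = kmeans(X, K=2)
```

Note that the result depends on which K points the random initialization happens to pick, which is exactly the initialization issue discussed in this chapter.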
Addressing the limitations of the K-means clustering algorithm in a unified way is challenging and significant.

Figure 1.1 K-means Flowchart (start; set K initial cluster centers randomly; assign data points to the closest cluster centers; recalculate the new cluster centers; repeat until convergence; end)

First, the clustering result of K-means depends on the initialization of the cluster centers, and randomly choosing the cluster centers may not lead to an optimal result. The randomness of the initialization also makes clustering results difficult to reproduce. Many current clustering algorithms have addressed the initialization problem of K-means [4, 13-15]. For example, Duan et al. developed an algorithm that uses density estimates to select the initial cluster centers [13], and Lakshmi et al. proposed to use nearest neighbors and feature means to decide the initial cluster centers [14].

Second, the clustering result of K-means depends on the similarity measure. K-means assigns each data point to its closest cluster center based on a similarity measure, and Euclidean distance is often used to determine the similarity by calculating the distance between two data points. However, the Euclidean distance measure does not account for factors such as cluster sizes, dependent features, or density [16, 17]. Thus K-means performs poorly on indistinct or poorly separated data sets [18]. Several works have addressed the similarity-measure problem of K-means [19-24]. For example, the spectral clustering algorithm uses a spectral representation in place of the original data points and then conducts K-means clustering. To do this, spectral clustering first generates a similarity matrix and then conducts an eigenvalue decomposition on it to obtain the spectral representation.
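Together with the final K-means step, this spectral pipeline can be sketched as follows. This is a generic illustration, not the algorithms proposed in this thesis: the Gaussian (RBF) kernel, the kernel width `sigma`, and the use of SciPy's `kmeans2` for the final step are all assumptions made for the sketch.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, K, sigma=1.0):
    """Sketch: similarity matrix -> Laplacian eigendecomposition -> K-means."""
    # 1. Build a similarity matrix with a Gaussian (RBF) kernel.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-sq / (2 * sigma ** 2))
    # 2. Normalized Laplacian L = I - D^{-1/2} S D^{-1/2}; its K smallest
    #    eigenvectors form the spectral representation.
    d = S.sum(axis=1)
    L = np.eye(len(X)) - S / np.sqrt(np.outer(d, d))
    _, eigvecs = np.linalg.eigh(L)                 # eigenvalues ascending
    U = eigvecs[:, :K]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    # 3. Conduct K-means on the spectral representation.
    np.random.seed(0)                              # for reproducibility
    _, labels = kmeans2(U, K, minit='++')
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(8, 1, (40, 2))])
labels = spectral_clustering(X, K=2)
```

The quality of the result still hinges on how the similarity matrix is built (here, on the assumed `sigma`), which motivates learning the similarity from the data distribution as proposed later in this thesis.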
Finally, K-means clustering is conducted on the spectral representation.

Third, K-means relies on a given cluster number K. As an unsupervised algorithm, K-means is supposed to be used on unlabelled data, and without labels the cluster number may not be known beforehand. The robust continuous clustering algorithm is able to determine the cluster number automatically [4]; however, it needs a well-constructed similarity matrix as an input in order to produce a good clustering outcome.

Previous clustering algorithms fixed only some of the issues of K-means. A clustering algorithm that addresses these problems separately is easily trapped in sub-optimal results, i.e., it is hard to obtain a globally optimal solution: even if the best initial value or the best similarity matrix is found in isolation, the final result may still not be optimal, because the output of each individual step is not produced according to the requirements of the next step. It would therefore be challenging and significant if a new clustering algorithm could fix the initialization, cluster number determination, and similarity-measure problems of K-means in a unified framework, in which each aspect is reflected in the others so as to achieve globally optimal results.

Real-world data sets often contain high-dimensional features, some of which are insignificant for clustering. Data with high-dimensional features, i.e., high-dimensional data, increases the computation cost and suffers from the "curse of dimensionality". In this circumstance, K-means clustering using Euclidean distance to measure similarity is not robust to data with high-dimensional features [25].
Hence, reducing redundant features is necessary when conducting clustering analysis on high-dimensional data.

Data almost invariably contains noise, outliers, and errors due to inadequate measurement, collection, or processing, or simply inherent variability. Outliers can distort the distribution of the data set. For example, K-means uses the mean of all data points in one cluster to decide the new cluster center, which makes sense when all the data points lie within a normal distance of one another. However, outliers can strongly affect the mean of the whole cluster, pushing the cluster center toward the outlier and thereby distorting the final cluster configuration. Hence, to achieve robust clustering performance, it is necessary to reduce the influence of outliers.

Nowadays, data can be collected from multiple sources or different aspects. For example, images shared on photo-sharing sites such as Instagram or Flickr carry complementary information such as descriptions, tags, location, and video. Data collected from multiple views is called multi-view data, and each view has its own properties that contribute to the understanding of the subject matter. K-means was designed for clustering single-view data; the naive solution for clustering multi-view data with K-means is to concatenate the features across all views. However, such a simple concatenation treats different views equally, even though different views have their own specific properties. Hence, it is essential to improve the K-means clustering algorithm for multi-view data clustering while also solving the aforementioned issues.
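The effect of a single outlier on a cluster mean, discussed above, can be made concrete with a small numeric illustration (the numbers are invented for demonstration):

```python
import numpy as np

# One tight cluster of values around 10, plus a single gross outlier.
cluster = np.array([9.5, 10.0, 10.2, 9.8, 10.5])
with_outlier = np.append(cluster, 100.0)

mean_clean = cluster.mean()             # 10.0: a sensible cluster center
mean_dirty = with_outlier.mean()        # 25.0: dragged far toward the outlier
median_dirty = np.median(with_outlier)  # 10.1: the median barely moves
```

One extreme value moves the mean from 10.0 to 25.0, while a robust statistic such as the median is barely affected; this is the intuition behind using robust losses (such as the L2,1-norm employed later in this thesis) to dampen the influence of outliers.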
1.2 Research Objectives

The aim of this thesis is to design and evaluate new clustering algorithms that overcome the issues of the K-means clustering algorithm. The thesis framework is shown in Figure 1.2. The specific objectives of this thesis are as follows:
• Objective 1: To solve the issues of the initialization and the similarity measure of the K-means clustering algorithm in a unified way.
• Objective 2: To solve the issues of cluster number determination and the similarity measure, and to improve the robustness of clustering by selecting effective features and reducing the influence of outliers, in a unified way.
• Objective 3: To develop a multi-view clustering algorithm that solves the issues of initialization, cluster number determination, the similarity measure, feature selection, and outlier reduction in a unified way.

1.3 Thesis Structure

This thesis is structured as follows.
• Chapter 2 presents a literature review covering clustering analysis, feature selection, outlier reduction, and evaluation measures.
• Chapter 3 presents the Initialization-Similarity (IS) clustering algorithm, which solves the issues of initialization and similarity measure of K-means in a unified way, fulfilling Objective 1. The proposed IS clustering algorithm outperformed both the classical K-means clustering algorithm and the well-known spectral clustering algorithm.
• Chapter 4 presents the Joint Feature Selection with Dynamic Spectral (FSDS) clustering algorithm, which solves the issues of cluster number determination and the similarity measure, and improves the robustness of clustering by selecting useful features and reducing the influence of outliers in a unified way, fulfilling Objective 2.
The proposed FSDS clustering algorithm outperformed the classical K-means clustering algorithm, the well-known spectral clustering algorithm, Clustering and projected clustering with adaptive neighbors (CAN) [24], and Robust continuous clustering (RCC) [4].
• Chapter 5 presents the Joint Robust Multi-view (JRM) spectral clustering algorithm, which solves the initialization, cluster number determination, similarity measure, feature selection, and outlier reduction issues for multi-view data in a unified way, fulfilling Objective 3. The proposed JRM clustering algorithm outperformed the classical K-means clustering algorithm, the Graph-Based System (GBS) [26], Adaptively Weighted Procrustes (AWP) [27], and Multi-view low-rank sparse subspace clustering (MLRSSC) [28].
• Chapter 6 presents the conclusions and future work.

Figure 1.2 Research Framework (K-means issues: initialization, similarity measure, cluster number, robustness. IS in Chapter 3 addresses initialization and the similarity measure; FSDS in Chapter 4 addresses the similarity measure, the cluster number, and robustness via feature selection and outlier reduction; JRM in Chapter 5 addresses initialization, the similarity measure, the cluster number, robustness, and multi-view clustering.)

Chapter 2 Literature Review

Clustering is an unsupervised learning technique that divides a given data set into groups with the goal of maximizing intra-group similarity and inter-group dissimilarity. This chapter reviews the research topics related to this thesis, including clustering algorithms, feature selection techniques, outlier reduction methods, and evaluation metrics.

2.1 Clustering Algorithms

Clustering algorithms can be classified as single-view or multi-view clustering algorithms, based on whether they aim to cluster single-view or multi-view data.
2.1.1 Single-view Clustering

Clustering algorithms can also be broadly categorized into non-graph-based and graph-based approaches, based on whether the algorithm constructs a similarity graph.

A. Non-Graph-Based Algorithms

Non-graph-based algorithms conduct clustering directly on the original data without constructing a similarity graph. They can be further grouped into categories such as partitioning-based, hierarchical-based, distribution-based, density-based, and nature-based algorithms.

Partitioning-based clustering algorithms, also known as centroid-based or distance-based clustering, divide the data at one level into a number of partitions, where each partition represents a cluster. The center of the data points in each partition is regarded as the cluster center of the corresponding cluster. The K-means clustering algorithm is one of the most famous representatives of this kind of clustering algorithm [29]. Specifically, K-means first randomly selects K data points as the K cluster centers, then assigns each data point to its nearest cluster center according to Euclidean distance, and keeps recalculating the cluster centers and reassigning data points until the algorithm converges [12]. However, K-means needs the cluster number as input, so it is not suitable for a data set with an unknown cluster number. It is also sensitive to the initialization of the cluster centers, because the random choice of centers may produce different clustering results on different runs of the algorithm [29]. Furthermore, K-means measures similarity using Euclidean distance, which gives the same importance to all data points without considering factors such as the density, dependent features, shape, patterns, or scale of the data points [30, 31].
For example, it is difficult for K-means to separate non-convex clusters. There are a number of other partitioning-based algorithms, e.g., K-medoids, COTCLUS, and Tabu-search-based clustering. K-medoids chooses data points located near their cluster centers to represent the clusters; the remaining data points are then assigned to the representative centers to which they are most similar, based on the minimal sum of dissimilarities between data points and their corresponding cluster center points [32]. Instead of using only one center for each class, COTCLUS, an improved centroid-based clustering algorithm, uses suitable centroids from another clustering: it finds two centroids from one clustering and replaces them with two centroids from the other clustering in such a way as to maximally decrease the mean square error of the first clustering. It thus constructs a clustering from two suboptimal clustering results, based on the belief that each suboptimal clustering has the benefit of containing some of the correct clusters [33]. After modifying the centroids, it applies K-means for final fine-tuning [33]. A Tabu-based clustering algorithm employs the center-driven approach of K-means under the guidance of Tabu search, a local (neighborhood) search algorithm that accepts worsening moves when no improving move is available and discourages the search from returning to previously visited solutions [34]. K-medoids, COTCLUS, and Tabu-based clustering, like other partitioning-based algorithms, need the cluster number K to be specified before execution.

In comparison with partitioning-based clustering, which divides the data into un-nested clusters, hierarchical-based clustering builds a tree of nested clusters.
Hierarchical-based algorithms, also known as connectivity-based clustering, build a hierarchical relationship among data points to conduct clustering. A hierarchical clustering is usually represented by a tree structure, where each data point is a leaf and each internal node is a cluster. Division and agglomeration are the two common approaches. In the divisive, or top-down, approach, all data points are initially in one cluster and are then divided into smaller clusters recursively. Conversely, the agglomerative, or bottom-up, approach treats each data point as a cluster at the start and then continuously agglomerates pairs of clusters until all clusters have been merged into a single cluster that contains all data points [35, 36]. For example, the hierarchical clustering algorithm for binary data based on cosine similarity (HABOC) uses an agglomerative hierarchical procedure [37]. HABOC assesses similarity between data points, computes the similarity of sets containing multiple data points using cosine similarity, and then exploits a hierarchical clustering method that compresses the data and merges clusters based on the cosine feature vector of a set and its additivity [37]. HABOC needs the cluster number as an initial parameter. Other hierarchical clustering examples include robust clustering using links (ROCK) and clustering using representatives (CURE) [38-44]. The ROCK clustering algorithm draws a number of data points randomly from the original data set as inputs, along with the desired cluster number K. Instead of using distances to conduct clustering, ROCK uses the number of links, defined as the number of common neighbors, as the similarity measure [42].
The reasoning is that data points belonging to the same cluster most likely have a large number of common neighbours, and thus more links. Hence, the larger the number of links between data points, the greater the likelihood that they belong to the same cluster. However, ROCK ignores the possible differences in the similarity measure of different clusters inside the same data set. CURE selects well-scattered points from each cluster as its representatives, and then shrinks them toward the cluster center [40]. It chooses more than one representative point from each cluster by using single-linkage approaches, in which the similarity of two clusters is determined by the similarity of their most similar data points. Finally, the clusters with the closest representative points are merged together. CURE uses random sampling and partitioning to speed up clustering [40]. However, it is limited by choosing a fixed number of scattered data points to represent a cluster, and by applying a constant factor to shrink those representatives towards their cluster centers [45]. CURE also ignores the information about the aggregate interconnectivity of data points in two clusters. Hierarchical clustering algorithms are particularly good when the data has an underlying hierarchical structure [35]. However, the efficiency of hierarchical clustering algorithms is relatively low compared with the linear complexity of partitioning clustering algorithms. Closely related to statistics, the distribution-based clustering algorithms assume that data generated from the same distribution belongs to the same cluster. However, not all data sets follow such distributions, and the distribution parameters have a strong impact on the clustering results [29].
Examples of distribution-based clustering algorithms include the incremental local distribution-based clustering algorithm with the Bayesian adaptive resonance theory (ILBART) [46], the Gaussian mixture model (GMM) [47], and balanced iterative reducing and clustering using hierarchies (BIRCH) [48, 49]. ILBART first obtains some data patterns with snapshots. Then, each data pattern is clustered in three stages: cluster choosing, matching test, and updating learning. The variation of the covariance determinant, the combining threshold, and the choice function are simultaneously considered in determining the local distribution of the winning cluster [46]. ILBART is sensitive to the data order and its computational stability needs to be improved [46]. GMM uses a probabilistic approach and describes each cluster by its cluster center, covariance, and size. It randomly initializes a fixed number of Gaussian distributions over the data and iteratively optimizes the parameters of each Gaussian distribution, such as the mean, variance, and weight of each cluster. Finally, it calculates the probabilities of data points belonging to each of the clusters [47]. However, many real data sets may not follow Gaussian distributions. Besides the Gaussian distribution assumption, choosing the initial number of Gaussian distributions and the random initialization are also issues of the Gaussian mixture model [50]. BIRCH clustering algorithm summarizes the data in a way that retains as much distribution information as possible, and then conducts the clustering on the data summary. Specifically, BIRCH clustering algorithm takes the original data set and the desired cluster number as inputs, and then conducts clustering in four phases. It first computes the clustering feature tree. Second, it builds a smaller clustering feature tree and regroups crowded sub-clusters into larger ones.
Third, it computes the cluster centers of each cluster and uses an adaptation of agglomerative clustering to cluster all the leaves of the clustering feature tree. Fourth, it uses the cluster centers to conduct the final clustering. BIRCH is sensitive to the data order and to non-spherical clusters [39]. Density-based clustering algorithms partition data points into clusters defined as dense regions of data points separated by low-density regions. Examples of density-based clustering algorithms include density-based spatial clustering of applications with noise (DBSCAN) [51, 52], an attempt at improving density-based clustering algorithms (AIDCA) [53], and the two-phase clustering algorithm with a density exploring distance measure (TADEDM) [54]. DBSCAN recognizes each cluster by finding a distinctive density of points that is notably higher than outside the cluster. Minimum points, core points, border points, and neighbourhood are important concepts in DBSCAN. Minimum points define the minimum number of points required to form a cluster. A core point is a point which has at least the minimum number of points within the neighbourhood distance from itself. A border point is a point that has at least one core point within the neighbourhood distance. The neighbourhood value defines the cut-off distance from a core point within which a data point is clustered as part of the cluster. A point is density-reachable if it is within the neighbourhood distance from a core point. A core point and all the points within its neighbourhood distance form a core set. All the overlapping core sets are grouped together to form a cluster. A point that is neither a core nor a border point, and has fewer than the minimum number of points within the neighbourhood distance from itself, is a noise point. DBSCAN is not entirely deterministic because some border points could be reachable from more than one cluster.
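The core/border/noise distinction described above can be illustrated with a short sketch (the function name, thresholds, and data are invented for this example; it classifies points but does not perform the full DBSCAN cluster expansion):

```python
import numpy as np

def dbscan_point_types(X, eps, min_pts):
    """Classify each point as 'core', 'border', or 'noise' following the
    DBSCAN definitions above (a sketch, not the full DBSCAN algorithm)."""
    n = len(X)
    # pairwise distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # a point's eps-neighbourhood (the point itself is included)
    neighbours = d <= eps
    # core point: at least min_pts points within the neighbourhood distance
    is_core = neighbours.sum(axis=1) >= min_pts
    types = []
    for i in range(n):
        if is_core[i]:
            types.append('core')
        elif any(is_core[j] for j in range(n) if neighbours[i, j] and j != i):
            # border point: within eps of at least one core point
            types.append('border')
        else:
            # neither core nor border: noise
            types.append('noise')
    return types

# 1-D toy data: a dense run of four points and one isolated point
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])
types = dbscan_point_types(X, eps=0.15, min_pts=3)
```

The full algorithm would then group overlapping core sets into clusters; here only the point classification step is shown.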
DBSCAN depends on the distance threshold estimation and it cannot handle data sets with largely varying densities [51]. AIDCA creates adaptive grids on the data and then merges cells based on local density to form clusters [53]. It considers each axis of the grid space separately and creates a number of initial bins for each axis. These uniform bins have a size equal to the range of the data on the axis divided by the number of bins. The density is the sum of the data points in each bin, resulting in a histogram. AIDCA goes through each bin and compares its density with the neighboring one. If the density is less than the set merge-value, the two bins are merged and will be part of the same grid square in the final adaptive grids. As a result, neighboring grids are likely to have differing densities. The distribution of grids should result in a small number of denser grids that contain cluster centers, surrounded by a number of less dense grid cells that constitute the edges of the cluster and are merged into the core [53]. AIDCA needs to know the number of bins to create and the set merge-value, and it is difficult to determine the correct set of parameters. TADEDM is a two-phase clustering method, which applies K-means clustering algorithm in the first phase to obtain the initial clusters that are used as inputs to the second phase [54]. In the second phase, all the data points are clustered using K-means clustering with a density exploring distance measure, under which data points close in distance have high affinity with each other, as do data points located in the same cluster [54]. Because it uses the K-means clustering algorithm, TADEDM requires the cluster number as a prior and also suffers from the initialization problem. The density-based algorithms are based on the assumption that the data points in a high-density region belong to the same cluster.
However, the results of density-based algorithms will suffer if the densities of data points differ greatly. Moreover, most density-based algorithms are also sensitive to the parameter estimation [55]. Imitating the behavior of natural and biological systems, some nature-inspired optimization algorithms have been developed [56]. The nature-inspired optimization algorithms are combined with clustering algorithms to obtain the global optimum solution. The crow search algorithm (CSA) combines the K-means clustering algorithm with the intelligent behaviour of crows to obtain the global optimum solution. CSA requires the cluster number to conduct the clustering [57]. The krill herd algorithm (KHA) models the behaviour of individual krill within a larger krill swarm to find the cluster center [58]. It randomly initializes the data structure representing a single krill, then iteratively generates the fitness function for each krill (data point) of the population, which is similar to calculating the optimized functions for the coordinates of the krill's position [58]. The flower pollination algorithm (FPA) is another example of nature-inspired optimization procedures, inspired by the process of flower pollination. Specifically, to mimic this behavior, FPA employs the Levy flight distribution, a random walk in which the step lengths have heavier tails than the exponential distribution [58, 59]. CSA, KHA, and FPA, like most current nature-inspired algorithms, lack clear mathematical and theoretical proofs of convergence [60]. B. Graph-Based Algorithms Instead of conducting clustering directly on the original data points, most graph-based clustering algorithms first construct a graph and then apply a clustering algorithm to partition the graph. A graph representation captures the high-order relationships among data points, which makes it easier to interpret the complex relationships inherent in the
data points than interpreting them from the original data points directly. A graph is a set of nodes or vertices connected by edges which have weights associated with them. A node or vertex of the graph represents a data point and an edge represents the relationship between data points. The similarity graph represents the similarities between data points. The similarity graph is represented by the similarity matrix, a square symmetric adjacency matrix, where the row and column indices represent the data points, and the entries indicate whether pairs of data points are connected or not. Two vertices are connected if the similarity between the corresponding data points is larger than a certain threshold. The edges within a cluster should have high weight values because data points within the same cluster are similar to each other. The edges between clusters should have low weight values because data points in different clusters are dissimilar from each other. The clustering problem is then transformed into a graph cutting problem. The graph is cut into subgraphs, each subgraph being a cluster. The nodes in a cluster are well connected to nodes in the same cluster but not to nodes outside the cluster. Spectral clustering algorithm is a typical example of graph-based algorithms and has become increasingly popular. Spectral clustering algorithm first creates a similarity matrix and a diagonal degree matrix, whose diagonal entries are the sums of all the weights on each row of the similarity matrix. Then it defines a feature representation by computing the first K eigenvectors of the Laplacian matrix, which is the degree matrix minus the similarity matrix. Finally, it runs K-means clustering on these features to separate objects into K clusters [61]. Spectral clustering algorithm is a multi-step algorithm and it requires the cluster number to be predefined.
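The pipeline described above (similarity matrix, degree matrix, Laplacian, first K eigenvectors) can be sketched as follows. This is a hedged NumPy illustration with an assumed Gaussian similarity and invented data; the resulting rows would normally be fed to K-means:

```python
import numpy as np

def spectral_embedding(X, k, sigma=1.0):
    """Sketch of the spectral clustering pipeline: similarity matrix ->
    degree matrix -> unnormalised Laplacian -> first k eigenvectors."""
    # Gaussian similarity between every pair of points (an assumed choice)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # degree matrix: diagonal of row sums of the similarity matrix
    D = np.diag(W.sum(axis=1))
    # Laplacian: degree matrix minus similarity matrix
    L = D - W
    # eigenvectors of the k smallest eigenvalues give the new representation
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :k]

# Two well-separated pairs of points (made-up data)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
U = spectral_embedding(X, k=2)
```

For well-separated clusters the graph is nearly disconnected, so rows of `U` belonging to the same cluster are almost identical, which is why running K-means on this embedding separates the clusters easily.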
While some graph-based algorithms construct coefficient vectors of two data points to analyse the similarity between them [62], some construct a hypergraph to represent a set of spatial data [63, 64]. For example, low-rank representation (LRR) identifies the subspace structures from data points and then finds the lowest-rank representation among data points to represent the original data points [65]. The low-rank kernel learning graph-based clustering (LKLGC) algorithm is based on multiple kernel learning with the assumption that the consensus kernel matrix is a low-rank matrix and lies in the neighbourhood of the combined kernel matrix [66]. The spectral clustering algorithm is applied to get the final clustering results for the LKLGC algorithm, hence the cluster number needs to be predefined [66]. A hybrid clustering algorithm based on the minimum spanning tree of natural core points (NCP) first adaptively obtains the neighborhood parameter and finds all the natural core points of the data set; it then breaks the data set into subsets and constructs the minimum spanning tree of the natural core points. Finally, it cuts the maximum edge of the minimum spanning tree of natural core points iteratively until the desired cluster number is obtained [67]. NCP needs the cluster number for its final step of clustering. The hierarchical clustering using dynamic modeling (CHAMELEON) uses a graph partitioning algorithm to initially divide the data points into several relatively small sub-clusters, and then finds the genuine clusters by repeatedly combining these sub-clusters if they are close together and their interconnectivity is high [41]. CHAMELEON is a graph-based two-phase hierarchical clustering algorithm and it requires a predefined cluster number.
The clustering and projected clustering with adaptive neighbors algorithm (CAN) learns the data similarity matrix and then imposes a rank constraint on the Laplacian matrix of the data similarity matrix [24]. At the end of the process, the connected components in the resulting similarity matrix represent the clusters of the original data points [24]. CAN learns the data similarity matrix and the clustering structure simultaneously, but it needs to know the cluster number beforehand. Robust continuous clustering (RCC) continuously optimizes a robust objective based on robust estimation [4]. RCC optimizes clustering and its new representation learning jointly [68]. In the RCC algorithm, each data point has a dedicated representative, which is located at the data point initially. Throughout the clustering process, the representatives move and combine into clusters. Although the objective function of RCC is not convex, the optimization is performed by using standard linear least squares solvers [4]. RCC does not need prior knowledge of the cluster number; however, it needs the similarity matrix beforehand. Graph-based clustering algorithms improve on non-graph-based clustering algorithms by generating a new representation of the original data points. However, current graph-based clustering algorithms use a multi-stage strategy which learns the similarity matrix, the new representation, or the clustering structure separately. The first-stage goal of learning a similarity matrix does not always match the second-stage goal of achieving the optimal new representation, and thus they are not guaranteed to always outperform non-graph-based clustering algorithms. Moreover, most graph-based clustering algorithms still use non-graph-based clustering algorithms in the final stage and thus do not simultaneously solve the initialization, similarity measure, or cluster number issues of non-graph-based clustering algorithms.
2.1.2 Multi-view Clustering The existing multi-view clustering algorithms can be broadly categorized into concatenation-based, distribution-based, and centralization-based approaches. A concatenation-based multi-view algorithm conducts clustering on the concatenated feature vectors of all views. Examples of concatenation-based algorithms include concatenation K-means clustering and feature concatenation multi-view subspace clustering [69]. K-means clustering algorithm was developed for single-view data sets. For multi-view data sets, K-means clustering algorithm conducts clustering on the concatenated features across all views. This simple concatenation approach does not consider the unique nature of different views, even though different views have their own specific properties for their features. Furthermore, it may lead to the critical issue of the "curse of dimensionality", which refers to a fixed number of data points becoming increasingly "sparse" as the dimensionality increases. The "curse of dimensionality" affects the clustering results [70]. A distribution-based multi-view algorithm conducts clustering on every view of a multi-view data set individually, and then synthesizes the results from the individual views for final clustering. For example, the co-regularized spectral clustering algorithm uses a single objective function for each individual view and combines the spectral graphs from different views for final K-means clustering [71]. The low-rank multi-view matrix completion (lrMMC) algorithm first seeks a low-dimensional representation in which the common subspace is constrained to be low rank, together with combination weights which are learned to explore the complementarity between different views [72]. The mutual kernel completion algorithm applies different predefined kernels to different views; these kernels are then combined into a unified kernel [73].
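The concatenation-based approach described above amounts to stacking the per-view feature vectors side by side and clustering the result. A minimal sketch (view contents, the helper `kmeans_lite`, and its parameters are all invented for illustration):

```python
import numpy as np

# Two hypothetical views of the same four samples, e.g. colour features
# and texture features (made-up data for illustration).
view1 = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])  # 2 features
view2 = np.array([[1.0], [1.1], [9.0], [9.1]])                      # 1 feature

# Concatenation-based approach: stack the feature vectors of each view.
# The dimensionality grows with the number of views, which is the root
# of the "curse of dimensionality" issue noted above.
X = np.hstack([view1, view2])          # shape: (4 samples, 3 features)

def kmeans_lite(X, k, n_iter=20, seed=0):
    """Minimal Lloyd-iteration K-means run on the concatenated features."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its nearest center
        labels = np.linalg.norm(
            X[:, None] - centers[None, :], axis=2).argmin(axis=1)
        # recompute each center as the mean of its assigned samples
        centers = np.array([X[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

labels = kmeans_lite(X, k=2)
```

Note that the concatenation treats all features uniformly, illustrating why the view-specific properties mentioned above are lost.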
An ensemble approach to multi-view multi-instance learning builds models on multiple heterogeneous data views by combining view learners and pursuing consensus among the weighted classes [74]. However, distribution-based multi-view algorithms do not fully use the information across views and thus may fail to produce reasonable clustering results [75]. Compared with concatenation-based and distribution-based approaches, a centralization-based approach achieves better performance since it takes information from all views of a multi-view data set to conduct clustering [76]. A weighted hybrid fusion method constructs an objective function with a rank consistency constraint [77]. The graph-based system (GBS) automatically weights the constructed graph of each view, and then generates a unified graph matrix [26]. Although it dynamically generates the weight of each graph matrix, GBS needs the number of neighbors as a prior. Furthermore, the learning of the unified graph and the construction of the individual graphs are performed in two separate stages. Adaptively weighted Procrustes (AWP) weights each view according to its clustering capacity and forms a weighted Procrustes average problem accordingly [27]. AWP requires the spectral embedding matrix to be calculated beforehand as an input. The goal of computing the spectral embedding matrix is different from that of the second stage of multi-view clustering, and thus optimal performance is not guaranteed. Multi-view low-rank sparse subspace clustering (MLRSSC) jointly learns an affinity matrix constrained by sparsity and low rank, while at the same time balancing the agreements across different views [28]. MLRSSC learns the joint affinity matrix first, and then uses the spectral clustering algorithm to complete the final clustering. The learning of the affinity matrix and the final spectral clustering are in two separate stages. Thus, optimal clustering results cannot be guaranteed.
2.2 Feature Selection Real-world data sets are rich in information and often contain high-dimensional features. However, not all features are effective for clustering algorithms. High-dimensional features not only increase the computational time for machine learning, but also increase the risk of overfitting. Dimensionality reduction aims to reduce the dimensions of data by obtaining a set of principal features or removing redundant and dependent features [78]. It transforms the features from a high-dimensional space to a low-dimensional space. It can be applied to reduce complexity, avoid overfitting, and reduce the influence of outliers. Feature selection is one of the dimensionality reduction approaches. Feature selection selects useful features from the original features or filters irrelevant or redundant features from the original data set. Feature selection techniques can be broadly categorized into three types: filter methods, wrapper methods, and embedded methods. The filter feature selection methods filter out unimportant or redundant features from the original data set based on certain criteria [79]. Some use mutual information or correlation to select the most relevant features [80]. The feature selection for multi-labeled variables method selects features via maximizing the conditional dependency between features [79]. An unsupervised filter feature selection method for mixed data (USFSM) evaluates the relevance of features by their contributions to defining good cluster structures, analysing the changes in the spectrum of the normalized Laplacian matrix when a feature is excluded [81]. The filter techniques have the advantages of speed and scalability [82, 83], and they are useful for selecting a generic set of features for all machine learning models.
However, in some cases, features selected through filter methods may not be the most optimal set of features for some specific algorithms. The wrapper feature selection methods are used to select the most optimal features for the specified algorithms [84, 85]. There are different wrapper approaches. A meta-heuristic wrapper method uses random encircling and the imitative behavior of the Kestrel bird for the optimal selection of features [86]. The sequential approach adds or removes features sequentially; the bio-inspired approach introduces randomness into the process to reach the global optimum; the iterative approach converts the feature selection problem into an estimation problem [81]. A sequential method outputs both a ranking of relevant features and an optimal partition by using the Mahalanobis metric (a multivariate distance metric which measures the distance between a data point and a distribution) and the K-means clustering algorithm [84]. Localized feature selection (LFS), an iterative algorithm, uses a randomized rounding approach when the weights of regions are fixed [85]. However, traditional wrapper methods usually have poor generalization ability, high complexity, low computational efficiency, and high computational cost [82, 87]. In embedded approaches, feature selection is an integrated part of the learning algorithm. The embedded approaches can be generally divided into two types: decision tree algorithms and regularization techniques. Decision tree algorithms select features recursively during the tree growth process [88, 89].
The tree growth process is also the process of feature selection. Some feature selection methods are based on the bee colony algorithm and gradient boosting decision trees [88]. Others use classification and regression tree (CART) decision tree algorithms to select features for 3D depth video [89]. The L1-norm, L2-norm, or L2,1-norm have been used for feature selection in regularization-based algorithms, whose objective function is the minimization of a regularized cost. The key difference between the regularization techniques is the regularization term or penalty term. In L1-norm regularization, the absolute value of the magnitude of the coefficients is the penalty term. In L2-norm regularization, the squared magnitude of the coefficients is the penalty term. In L2,1-norm regularization, the penalty term is the non-squared magnitude of the coefficients. For a matrix $\mathbf{M} \in \mathbb{R}^{n \times m}$, the L1-norm, L2-norm, and L2,1-norm are defined in Eq. 2.1, Eq. 2.2 and Eq. 2.3 respectively [90]:

$\|\mathbf{M}\|_1 = \sum_{i=1}^{n} \sum_{j=1}^{m} |m_{i,j}|$   (2.1)

$\|\mathbf{M}\|_2 = \left( \sum_{i=1}^{n} \sum_{j=1}^{m} m_{i,j}^2 \right)^{1/2}$   (2.2)

$\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{n} \left( \sum_{j=1}^{m} m_{i,j}^2 \right)^{1/2}$   (2.3)

Specifically, the L1-norm generates element-wise sparsity while the L2,1-norm generates row-wise sparsity. That is, an L2,1-norm penalty on the regularization term makes some rows of the generated projection matrix zero. The redundant features are filtered out, as the unrepresentative features corresponding to the zero rows do not participate in the clustering process; consequently, L2,1-norm-based algorithms have better interpretability than L1-norm-based algorithms in feature selection models [91, 92]. The L2-norm cannot generate sparsity, which makes it less effective in feature selection models [93]. The L2,1-norm-based approaches are also more robust than the L2-norm-based approaches, and recently the L2,1-norm has been used to improve the robustness of feature selection algorithms [94, 95].
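The three norms defined in Eqs. 2.1–2.3 can be computed directly; the short sketch below evaluates them on a small made-up matrix whose second row is entirely zero, mimicking the row-wise sparsity that an L2,1 penalty encourages:

```python
import numpy as np

# Small matrix with an all-zero second row (made-up for illustration).
M = np.array([[1.0, -2.0],
              [0.0,  0.0],
              [3.0,  4.0]])

l1  = np.abs(M).sum()                       # Eq. (2.1): sum of |entries|
l2  = np.sqrt((M ** 2).sum())               # Eq. (2.2): sqrt of sum of squares
l21 = np.sqrt((M ** 2).sum(axis=1)).sum()   # Eq. (2.3): sum of row L2 norms
```

Note how the zero row contributes nothing to the L2,1-norm: penalising the sum of row norms makes it cheap to zero out entire rows, which is exactly the row-wise sparsity used to discard redundant features.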
For instance, an L2,1-norm regularization term can be imposed on the objective function to achieve feature selection and capture the discriminative structure information [94]. The L2,1-norm has also been used on both the reconstruction error and the sparse constraint term to extract representative 2D image features [95]. 2.3 Outlier Reduction Real data often contains outliers, which are data points inconsistent with most of the other data points in a given data set [96, 97]. Outliers can result from an inadequate procedure of data measurement, collection, or handling, or from inherent variability in the underlying data domain. Outliers can significantly affect the clustering results. Outlier detection and robust clustering algorithms are often used to tackle the outlier problem. Outlier detection algorithms detect outliers, i.e., data points that deviate from most of the other data points. Most of the existing outlier detection studies focus on unsupervised outlier detection [98]. Examples of outlier detection algorithms include distance-based outlier detection [99], dimension-based outlier detection [100], density-based outlier detection [101], frequent-pattern-based outlier detection [102, 103], and cluster-based outlier detection [104]. To minimize the impact of outliers, robust clustering has been investigated in different areas.
Some algorithms learn a robust metric to measure the similarity between points by taking the outliers into account [105, 106]; some use the L1-norm or L2,1-norm to remove the outliers [107, 108]; some assign different weights to the data and the outliers during the clustering process [109]; some decompose outliers into a low-rank part [66, 110]; and some conduct ensemble or fusion-based clustering, combining different partition results to deliver a more robust result [111, 112]. The L1-norm, L2-norm, or L2,1-norm have been used in the regularization terms of the objective functions of clustering algorithms [113]. A non-convex multi-task generalization of the L2,1-norm regularization is used to learn a few features common across multiple tasks [114]. The L2,1-norm regularized regression model has been used for joint feature selection from multiple tasks [115]. L2,1-norm regularization encourages multiple predictors to share similar sparsity patterns [115]. Formally, let $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) \in \mathbb{R}^{p \times n}$, $\mathbf{V} = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n) \in \mathbb{R}^{k \times n}$, and $\mathbf{U} \in \mathbb{R}^{p \times k}$. In L1-norm-based robust clustering algorithms, the absolute value of the magnitude of the coefficients is used in the loss function:

$\min_{\mathbf{U},\mathbf{V}} E_1(\mathbf{U},\mathbf{V}) = \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_1$   (2.4)

Specifically, the L1-norm generates element-wise sparsity. As outliers correspond to row-wise sparsity instead of element-wise sparsity, L1-norm-based algorithms do not have good interpretability in outlier reduction. In L2-norm-based robust clustering algorithms, the squared magnitude of the coefficients is the penalty term:

$\min_{\mathbf{U},\mathbf{V}} E_2(\mathbf{U},\mathbf{V}) = \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_F^2$   (2.5)

The L2-norm is calculated as the square root of the sum of the squared vector values. For an outlier, the residual $\|\mathbf{x}_i - \mathbf{U}\mathbf{v}_i\|$ is larger than the residuals of non-outliers. After squaring, the residual of the outlier could dominate the loss function.
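The domination effect of squared residuals can be made concrete with a small numeric sketch. The residual matrix below is made up: four well-fitted samples and one outlier sample with a large residual (samples are columns, following the $\mathbf{X} \in \mathbb{R}^{p \times n}$ convention above):

```python
import numpy as np

# Hypothetical residual matrix X - UV: five samples (columns), the last
# being an outlier with a much larger residual.
R = np.array([[0.1, 0.2, 0.1, 0.2, 10.0],
              [0.2, 0.1, 0.2, 0.1, 10.0]])

col_norms = np.linalg.norm(R, axis=0)   # per-sample residual ||x_i - U v_i||

l2_loss  = (col_norms ** 2).sum()       # squared residuals, as in Eq. (2.5)
l21_loss = col_norms.sum()              # unsquared residuals, as in Eq. (2.6)

# Share of the total loss contributed by the outlier (last sample)
l2_share  = col_norms[-1] ** 2 / l2_loss
l21_share = col_norms[-1] / l21_loss
```

Under the squared (L2-style) loss the single outlier accounts for essentially the whole objective, whereas under the unsquared (L2,1-style) loss its share is noticeably smaller, which is the robustness argument made in the text.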
The L2-norm is also used to calculate the Euclidean distance of a vector from the origin of the vector space. Euclidean distance is often used in clustering algorithms to calculate similarity; the L2-norm is therefore also called the Euclidean norm. The L2,1-norm-based loss is defined in the following equation:

$\min_{\mathbf{U},\mathbf{V}} E_{2,1}(\mathbf{U},\mathbf{V}) = \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_{2,1}$   (2.6)

The L2,1-norm generates row-wise sparsity. As the outliers corresponding to row-wise sparsity do not participate in the clustering process, L2,1-norm-based algorithms have better interpretability than L1-norm-based algorithms in outlier removal [91, 92]. The residual $\|\mathbf{x}_i - \mathbf{U}\mathbf{v}_i\|$ of an outlier is not squared, which reduces the influence of the outlier compared to the L2-norm-based loss function. Thus, L2,1-norm-based algorithms can achieve more robust clustering results than L2-norm-based algorithms. The L2,1-norm performs more robustly and stably than the L2-norm when outliers exist [116]. According to the structure of the constraints, structural sparsity is often obtained by the L2,1-norm. L2,1-norm regularization encourages multiple predictors to share similar sparsity patterns [115]. L2,1-norm-based functions are robust to outliers [117, 118]. 2.4 Evaluation Measure To assess the performance of the proposed algorithms against related algorithms, we adopted three popular evaluation metrics for clustering algorithms: accuracy (ACC), normalized mutual information (NMI), and Purity [119]. ACC measures the percentage of samples correctly clustered. NMI measures the pairwise similarity between two partitions. Purity measures the percentage of each cluster containing correctly clustered samples [11, 120]. The definitions of these three evaluation metrics are given below.

$\mathrm{ACC} = N_{correct} / N$   (2.7)

where $N_{correct}$ represents the number of correctly clustered samples, and $N$ represents the total number of samples.
$\mathrm{NMI}(A, B) = \dfrac{\sum_{i=1}^{C_A} \sum_{j=1}^{C_B} n_{ij} \log \left( \frac{n_{ij}\, n}{n_i^A n_j^B} \right)}{\sqrt{\left( \sum_{i=1}^{C_A} n_i^A \log \frac{n_i^A}{n} \right) \left( \sum_{j=1}^{C_B} n_j^B \log \frac{n_j^B}{n} \right)}}$   (2.8)

where $A$ and $B$ represent two partitions of $n$ samples into $C_A$ and $C_B$ clusters respectively, $n_{ij}$ is the number of samples shared by the i-th cluster of $A$ and the j-th cluster of $B$, and $n_i^A$ and $n_j^B$ are the numbers of samples in the corresponding clusters.

$\mathrm{Purity} = \sum_{i=1}^{k} (S_i / n)\, P_i$   (2.9)

where $k$ represents the number of clusters and $n$ represents the total number of samples. $S_i$ represents the number of samples in the i-th cluster, and $P_i$ represents the fraction of correctly clustered samples in the i-th cluster. To rank the performance of different algorithms, we used dense ranking, in which the highest accuracy rate receives number 1 and the next accuracy rate receives the immediately following ranking number; equal accuracy rates receive the same ranking number. Thus, if A ranks ahead of B and C (which compare equal), which are both ranked ahead of D, then A gets ranking number 1 ("first"), B gets ranking number 2 ("joint second"), C also gets ranking number 2 ("joint second"), and D gets ranking number 3 ("third"). 2.5 Summary As one of the most famous and widely used clustering algorithms, K-means clustering algorithm still has its limitations. It is difficult to determine the cluster number K to obtain a good clustering result without prior knowledge. Different initializations may produce completely different clustering results. Using Euclidean distance as the similarity measure is limiting for real-world data. Real-world data contains redundant features and outliers, and without reducing the influence of redundant features and outliers it is hard to achieve optimal results. Existing methods have solved only some of these problems. All these issues of K-means clustering algorithm need to be addressed to improve it.
Chapter 3

Initialization-Similarity Clustering Algorithm

3.1 Introduction

Due to its random initialization and its use of Euclidean distance as the similarity measure, K-means clustering algorithm is not guaranteed to produce optimal and stable results. Much of the literature has addressed parts of this problem [4, 13-15, 121, 122]; however, previous research focused on solving individual issues and has not solved the initialization and the similarity measure in a unified framework. As an innovative clustering method, spectral clustering has been widely applied in fields such as data mining, computer vision, machine learning, and pattern recognition in recent years [123, 124]. To fix the similarity measure issue of K-means clustering algorithm, spectral clustering generates a similarity matrix, then obtains the spectral representation, and finally applies K-means clustering to get the final clustering results. Fixing only one of the two issues does not guarantee the best performance. Solving the similarity and initialization issues of K-means clustering algorithm simultaneously can be considered an improvement over existing algorithms because it can lead to better outputs.

The proposed Initialization-Similarity (IS) clustering algorithm aims to solve the initialization and similarity measure issues simultaneously. Specifically, we fix the initialization of the clustering by using sum-of-norms (SON) regularization [125]. Moreover, the SON regularization outputs a new representation of the original samples. The proposed IS clustering algorithm then learns the similarity matrix based on the data distribution: the similarity is high if the distance between the new representations of two data points is small.
Furthermore, the derived new representation is used to conduct K-means clustering. Finally, we employ an alternating strategy to solve the proposed objective function. Experimental results on real-world benchmark data sets demonstrate that IS clustering algorithm outperforms the comparison clustering algorithms in terms of three evaluation metrics for clustering: accuracy (ACC), normalized mutual information (NMI), and Purity. We briefly summarize the contributions of the proposed IS clustering algorithm as follows:

• IS clustering algorithm fixes the initialization by using the sum-of-norms regularization, which makes the clustering robust and reproducible. In contrast, previous clustering algorithms use randomly selected cluster centers to initialize K-means clustering and thus output unstable or varying clustering results [126].

• Previous spectral clustering algorithms use the spectral representation in place of the original representation when conducting K-means clustering. To do this, a spectral clustering algorithm first generates the similarity matrix and then conducts eigenvalue decomposition on the Laplacian matrix of the similarity matrix to obtain the spectral representation. This is a two-step strategy in which the goal of the first step does not guarantee the best clustering result. In contrast, IS clustering algorithm learns the similarity matrix and the new representation simultaneously; the performance is more promising when the two steps are combined in a unified way.

• Experimental results on ten public data sets show that the proposed IS clustering algorithm outperforms both K-means clustering and spectral clustering algorithms, which implies that simultaneously addressing the two issues of K-means clustering algorithm is feasible and effective.

This section has laid the background of the research inquiry.
The remainder of this chapter is organized as follows: Section 3.2 discusses the motivation behind the development of IS clustering algorithm. Section 3.3 introduces the proposed Initialization-Similarity (IS) algorithm. Section 3.4 provides the optimization process. Section 3.5 provides the convergence analysis. Section 3.6 discusses the experiments we conducted and presents their results. The conclusions, limitations and future research directions are presented in Section 3.7.

3.2 Motivation

To discover how other algorithms improve K-means clustering algorithm, we investigated both K-means clustering algorithm and spectral clustering algorithm, another widely used clustering algorithm, in detail. K-means clustering algorithm aims at minimizing the total intra-cluster variance, represented by an objective function known as the squared error function shown in Eq. (3.1):

Σ_{j=1}^{K} Σ_{i=1}^{d_j} ||x_i − h_j||^2 (3.1)

where K is the cluster number, d_j is the number of data points in the j-th cluster, x_i is the i-th data point of cluster j, h_j is the cluster center of the j-th cluster, and ||x_i − h_j||^2 is the squared Euclidean distance between x_i and h_j.

K-means clustering algorithm can be reformulated as a nonnegative matrix factorization, as in Eq. (3.2) [127]:

min_{H,F} ||X − FH||_F^2 (3.2)

where F ∈ ℝ^{n×K} is the cluster indicator matrix of X ∈ ℝ^{n×d} and H ∈ ℝ^{K×d} is the cluster center matrix. K-means clustering algorithm randomly chooses the initial cluster centers. Based on both Eq. (3.1) and Eq. (3.2), it is obvious that different initializations may have different effects on the clustering results [128, 129], which implies that it is difficult to reproduce the clustering results. Some algorithms were developed to address this issue.
For example, the novel centroid selection approaches for K-means-clustering-based recommender systems first select one random data point as the initial cluster center, then select each subsequent cluster center with a probability that favours points far from the existing centers, until all K cluster centers are found; the first cluster center is still selected randomly, which affects the clustering results [128]. Random swap-based algorithms, such as the efficiency-of-random-swap clustering algorithm, first select the cluster centers randomly, then repeatedly remove one randomly selected cluster center and replace it with a randomly selected data point. This is a trial-and-error approach without a clear number of iterations [130].

Moreover, Eq. (3.2) also shows that the outcome of the K-means clustering objective function depends only on the Euclidean distance between the data points and the cluster centers, which is how K-means clustering algorithm defines the similarity measure between two data points: the smaller the distance between two data points, the more similar they are, and the larger the distance, the more dissimilar they are. Euclidean distance does not reveal other underlying factors such as cluster sizes, shape, dependent features or density [18, 30]. Thus the similarity measure is an issue of K-means clustering algorithm.

To address the similarity measure issue of K-means clustering algorithm, spectral clustering algorithm uses the spectral representation to replace the original representation. To achieve this, spectral clustering algorithm first builds a similarity matrix and conducts eigenvalue decomposition on its Laplacian matrix to obtain the spectral representation. The pseudo code for K-means clustering algorithm is listed in Table 3.1.

Table 3.1 The pseudo code for K-means clustering algorithm
Input: X (data matrix), K (the cluster number)
Output: K cluster centers and the cluster indicator of each data point
Initialization: randomly select K cluster centers h_1, h_2, …, h_K;
Repeat:
1. Assign each data point x_i to its nearest cluster center using Euclidean distance;
2. Recalculate the new cluster centers h_1, h_2, …, h_K;
Until convergence (the cluster indicator of each data point is unchanged).

A spectral clustering algorithm creates a similarity matrix first and then defines feature vectors; it then runs the K-means clustering algorithm to conduct the clustering [61]. Thus, a spectral clustering algorithm finds the data similarity matrix and the spectral representation in separate stages, and its use of the K-means clustering algorithm requires the cluster number beforehand. Other algorithms, e.g. CAN, learn the data similarity matrix and the clustering structure simultaneously, but again need to know the cluster number beforehand. In the algorithm RCC, clustering is managed without prior knowledge of the cluster number by continuously optimizing an objective function based on robust estimation; however, it needs a good similarity matrix calculated beforehand as an input to produce a good clustering outcome. The pseudo code for spectral clustering algorithm is shown in Table 3.2.

Table 3.2 The pseudo code for the spectral clustering algorithm
Input: X ∈ ℝ^{n×d} (data matrix), K (the cluster number)
Output: K cluster centers and the cluster indicator of each data point
• Compute S ∈ ℝ^{n×n} to measure the similarity between every pair of data points;
• Compute L = D − S, where D = [d_{ij}]_{n×n} is a diagonal matrix with d_{ii} = Σ_j (s_{ij} + s_{ji})/2;
• Generate the spectral representation using the eigenvectors and eigenvalues of L;
• Conduct K-means clustering on the spectral representation.

Obviously, by replacing the original representation with the spectral representation, spectral clustering algorithm deals with the issue of the similarity measure in K-means clustering algorithm.
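The procedure in Table 3.1 can be sketched in a few lines of Python (a minimal illustration, not the implementation used in the thesis; the function name `kmeans` and the fixed random seed are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means following Table 3.1: random centers, assign, recompute."""
    rng = np.random.default_rng(seed)
    # Initialization: randomly select k distinct data points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # cluster indicators unchanged
            break
        labels = new_labels
        # Step 2: recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Running the sketch twice with different seeds can produce different partitions, which is exactly the initialization issue discussed above.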
However, spectral clustering algorithm learns the similarity matrix and the spectral representation separately, also known as a two-stage strategy, in which the goal of constructing the similarity matrix in the first stage is not to achieve the optimal spectral representation; it is therefore not guaranteed to always outperform K-means clustering algorithm.

3.3 Proposed Algorithm

This thesis proposes a new clustering algorithm, Initialization-Similarity (IS), to simultaneously solve the initialization and similarity measure issues of K-means clustering algorithm in a unified framework. Specifically, IS clustering algorithm uses the sum-of-norms regularization to address the initialization issue, and jointly learns the similarity matrix and the spectral representation to overcome the issue of the multi-stage strategy of spectral clustering algorithm. To achieve this goal, we form the objective function of the IS clustering algorithm as follows:

min_{S,U} (1/2)||X − U||_F^2 + (α/2) Σ_{i,j=1}^{n} s_{i,j} ρ(||u_i − u_j||_2) + β Σ_{i=1}^{n} ||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.3)

where X ∈ ℝ^{n×d} is the data matrix, U ∈ ℝ^{n×d} is the new representation of X, and S ∈ ℝ^{n×n} is the similarity matrix measuring the similarity among data points. ρ(||u_i − u_j||_2) is an implicit function, also known as a robust loss function in robust statistics. Eq. (3.3) aims at learning the new representation U and fixes the initialization of the clustering. Moreover, Eq. (3.3) learns the new representation U while taking the similarity among data points into account, i.e., the higher the similarity s_{i,j} between two data points, the smaller the distance between their corresponding new representations u_i and u_j. Furthermore, we learn the similarity matrix S based on the sample distribution, i.e., it is iteratively updated with the updated U.
This makes the new representation reasonable. Several robust loss functions have been proposed in robust statistics [131, 132]. In this thesis, we employ the Geman-McClure function [133]:

ρ(||u_p − u_q||_2) = μ||u_p − u_q||_2^2 / (μ + ||u_p − u_q||_2^2) (3.4)

Eq. (3.4) is often used to measure how well a prediction model predicts the expected outcome. The closer two points are, the smaller ||u_p − u_q||_2 is, and the higher the similarity s_{p,q} is. As the other parameters in Eq. (3.3) are updated, the distance ||u_p − u_q||_2 for some p, q will become very small, or even u_p = u_q. In this way, the clusters are determined.

Algorithm 3.1 The pseudo code for IS clustering algorithm
Input: X ∈ ℝ^{n×d}
Output: a set of K clusters
Initialization: U = X;
Repeat:
• Update F using Eq. (3.13)
• Update S using Eq. (3.22)
• Update U using Eq. (3.36)
Until U converges
• Apply K-means clustering algorithm on U

In robust statistics, the optimization of a robust loss function is usually difficult or inefficient. To address this, it is common to introduce an auxiliary variable f_{i,j} and a penalty term φ(f_{i,j}) [134-136]; Eq. (3.3) is then equivalent to:

min_{S,U,F} (1/2) Σ_{i=1}^{n} ||x_i − u_i||_2^2 + (α/2) Σ_{i,j=1}^{n} s_{i,j}( f_{i,j}||u_i − u_j||_2^2 + φ(f_{i,j}) ) + β Σ_{i=1}^{n} ||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.5)

where φ(f_{i,j}) = μ(√f_{i,j} − 1)^2, i, j = 1, …, n.

3.4 Optimization

Eq. (3.5) is not jointly convex in F, U, and S, but it is convex in each variable when the others are fixed. To solve Eq. (3.5), the alternating optimization strategy is applied: we optimize each variable while fixing the rest until the algorithm converges. The pseudo-code of IS clustering algorithm is given in Algorithm 3.1.
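The equivalence between Eq. (3.3) and the half-quadratic form of Eq. (3.5) can be checked numerically; the sketch below (hypothetical helper names, treating the scalar d as ||u_p − u_q||_2) evaluates the Geman-McClure loss of Eq. (3.4), the surrogate term of Eq. (3.5), and the closed-form minimizer f of Eq. (3.13):

```python
import math

def rho(d, mu):
    """Geman-McClure robust loss (Eq. 3.4); d stands for ||u_p - u_q||_2."""
    return mu * d ** 2 / (mu + d ** 2)

def f_star(d, mu):
    """Closed-form minimizer over f of the surrogate (Eq. 3.13)."""
    return (mu / (mu + d ** 2)) ** 2

def surrogate(d, mu, f):
    """Half-quadratic surrogate f*d^2 + mu*(sqrt(f) - 1)^2 from Eq. (3.5)."""
    return f * d ** 2 + mu * (math.sqrt(f) - 1) ** 2
```

Plugging f_star back into the surrogate recovers ρ(d) exactly, which is what makes the auxiliary-variable problem in Eq. (3.5) equivalent to Eq. (3.3).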
1) Update F while fixing S and U

While S and U are fixed, the objective function can be rewritten in a simplified form to optimize F:

min_F (α/2) Σ_{i,j=1}^{n} s_{i,j}( f_{i,j}||u_i − u_j||_2^2 + μ(√f_{i,j} − 1)^2 ) (3.6)

Since the optimization of f_{i,j} is independent of the optimization of every other f_{p,q} (i ≠ p, j ≠ q), each f_{i,j} is optimized separately:

min_{f_{i,j}} (α/2)( s_{i,j} f_{i,j}||u_i − u_j||_2^2 + s_{i,j} μ( f_{i,j} − 2√f_{i,j} + 1 ) ) (3.7)

Setting the derivative of Eq. (3.7) with respect to f_{i,j} to zero, we get

(α/2)( s_{i,j}||u_i − u_j||_2^2 + s_{i,j}μ − s_{i,j}μ f_{i,j}^{−1/2} ) = 0 (3.8)

⇒ (α/2) s_{i,j}||u_i − u_j||_2^2 + (α/2) s_{i,j}μ − (α/2) s_{i,j}μ f_{i,j}^{−1/2} = 0 (3.9)

⇒ (α/2) s_{i,j}||u_i − u_j||_2^2 + (α/2) s_{i,j}μ = (α/2) s_{i,j}μ f_{i,j}^{−1/2} (3.10)

⇒ f_{i,j}^{−1/2} = ( ||u_i − u_j||_2^2 + μ ) / μ (3.11)

⇒ f_{i,j}^{1/2} = μ / ( μ + ||u_i − u_j||_2^2 ) (3.12)

⇒ f_{i,j} = ( μ / ( μ + ||u_i − u_j||_2^2 ) )^2 (3.13)

2) Update S while fixing U and F

While U and F are fixed, the objective function of Eq. (3.5) with respect to S is:

min_S (α/2) Σ_{i,j=1}^{n} s_{i,j}( f_{i,j}||u_i − u_j||_2^2 + μ(√f_{i,j} − 1)^2 ) + β Σ_{i=1}^{n} ||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.14)

Since the optimization of s_i is independent of the optimization of every other s_j (i ≠ j, i, j = 1, …, n), each s_i is optimized separately:

min_{s_i} (α/2) Σ_{j=1}^{n} s_{i,j}( f_{i,j}||u_i − u_j||_2^2 + μ(√f_{i,j} − 1)^2 ) + β||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.15)

Let b_{i,j} = f_{i,j}||u_i − u_j||_2^2 and c_{i,j} = μ(√f_{i,j} − 1)^2; Eq. (3.15) is equivalent to:

min_{s_i} (α/2) Σ_{j=1}^{n} s_{i,j}b_{i,j} + (α/2) Σ_{j=1}^{n} s_{i,j}c_{i,j} + β||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.16)

⇒ min_{s_i} (α/2) s_i^T b_i + (α/2) s_i^T c_i + β||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.17)

⇒ min_{s_i} (α/2) s_i^T (b_i + c_i) + β s_i^T s_i, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.18)

⇒ min_{s_i} (α/(2β)) s_i^T (b_i + c_i) + s_i^T s_i, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.19)

⇒ min_{s_i} s_i^T s_i + 2 s_i^T (α/(4β))(b_i + c_i) + ||(α/(4β))(b_i + c_i)||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.20)

(the added constant term ||(α/(4β))(b_i + c_i)||_2^2 does not change the minimizer)

⇒ min_{s_i} || s_i + (α/(4β))(b_i + c_i) ||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.21)

That is, s_i is the Euclidean projection of −(α/(4β))(b_i + c_i) onto the probability simplex. According to the Karush-Kuhn-Tucker (KKT) conditions [137], the optimal solution is

s_{i,j} = max{ −(α/(4β))(b_{i,j} + c_{i,j}) + θ, 0 }, j = 1, …, n (3.22)

where the threshold θ is chosen so that Σ_j s_{i,j} = 1: sorting the values (α/(4β))(b_{i,j} + c_{i,j}) in ascending order as ω_1 ≤ … ≤ ω_n, ρ is the largest j such that −ω_j + (1/j)(1 + Σ_{c=1}^{j} ω_c) > 0, and θ = (1/ρ)(1 + Σ_{c=1}^{ρ} ω_c).

3) Update U while fixing S and F

While S and F are fixed, the objective function can be rewritten in a simplified form to optimize U:

min_U (1/2) Σ_{i=1}^{n} ||x_i − u_i||_2^2 + (α/2) Σ_{i,j=1}^{n} s_{i,j} f_{i,j}||u_i − u_j||_2^2 (3.23)

Let h_{i,j} = s_{i,j} f_{i,j}. Eq. (3.23) is equivalent to:

min_U (1/2)||X − U||_F^2 + (α/2) Σ_{i,j=1}^{n} h_{i,j}||u_i − u_j||_2^2 (3.24)

⇒ min_U (1/2)||X − U||_F^2 + (α/2) tr(U^T L U) (3.25)

where L is the Laplacian matrix of H = [h_{i,j}]_{n×n}.

⇒ min_U (1/2) tr((X − U)^T (X − U)) + (α/2) tr(U^T L U) (3.26)

⇒ min_U (1/2) tr((X^T − U^T)(X − U)) + (α/2) tr(U^T L U) (3.27)

⇒ min_U (1/2) tr(X^T X − 2U^T X + U^T U) + (α/2) tr(U^T L U) (3.28)

Setting the derivative of Eq. (3.28) with respect to U to zero, we get

(1/2)(−2X + 2U) + (α/2)(LU + L^T U) = 0 (3.29)

⇒ −X + U + (α/2)LU + (α/2)L^T U = 0 (3.30)

⇒ U + (α/2)LU + (α/2)L^T U = X (3.31)

⇒ (I + (α/2)L + (α/2)L^T)U = X (3.32)

⇒ (I + (α/2)(L + L^T))U = X (3.33)

⇒ (I + (α/2)(2L))U = X (since L is symmetric) (3.34)

⇒ (I + αL)U = X (3.35)

⇒ U = (I + αL)^{−1} X (3.36)
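The two closed-form updates above can be sketched in numpy (a minimal illustration, not the thesis implementation; symmetrizing H before forming its Laplacian is an assumption, and `project_simplex` implements the θ-thresholding of Eq. (3.22) as a standard Euclidean projection onto the probability simplex):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {s : s >= 0, sum(s) = 1} (the Eq. 3.22 thresholding)."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def update_u(X, S, F, alpha):
    """U update of Eq. (3.36): U = (I + alpha*L)^(-1) X with L the Laplacian of H."""
    H = S * F                                  # h_ij = s_ij * f_ij
    H = (H + H.T) / 2                          # symmetrize (assumption)
    L = np.diag(H.sum(axis=1)) - H             # Laplacian L = D - H
    return np.linalg.solve(np.eye(X.shape[0]) + alpha * L, X)
```

For the S update, each row is obtained as `project_simplex(-(alpha / (4 * beta)) * (b + c))`. Using `np.linalg.solve` avoids forming the explicit inverse in Eq. (3.36), which is the usual numerically preferable choice.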
3.5 Convergence Analysis

In this section, we prove the convergence of the proposed IS clustering algorithm, i.e., that the proposed algorithm reaches at least a locally optimal solution, via Theorem 1.

Theorem 1. IS clustering algorithm decreases the objective function value of Eq. (3.5) in each iteration until it converges.

Proof. Denote by F^(t), S^(t), and U^(t) the results of the t-th iteration for F, S, and U respectively, and denote the objective function value of Eq. (3.5) in the t-th iteration by L(F^(t), S^(t), U^(t)). According to Eq. (3.13) in Section 3.4, F has a closed-form solution, so we have the following inequality:

L(F^(t), S^(t), U^(t)) ≥ L(F^(t+1), S^(t), U^(t)) (3.37)

According to Eq. (3.22), S has a closed-form solution, so we have the following inequality:

L(F^(t+1), S^(t), U^(t)) ≥ L(F^(t+1), S^(t+1), U^(t)) (3.38)

According to Eq. (3.36), U has a closed-form solution, so we have the following inequality:

L(F^(t+1), S^(t+1), U^(t)) ≥ L(F^(t+1), S^(t+1), U^(t+1)) (3.39)

Finally, combining the above three inequalities, we get

L(F^(t), S^(t), U^(t)) ≥ L(F^(t+1), S^(t+1), U^(t+1)) (3.40)

Eq. (3.40) indicates that the objective function value of Eq. (3.5) decreases after each iteration of Algorithm 3.1; since the objective is bounded below by zero, the sequence of objective values converges. This concludes the proof of Theorem 1.

3.6 Experiments

In this section, we evaluate the performance of the proposed Initialization-Similarity (IS) algorithm by comparing it with two benchmark algorithms on ten real UCI data sets in terms of three evaluation metrics [138].
Table 3.3 Description of ten benchmark data sets
Datasets Samples Dimensions Classes
Digital 1797 64 10
MSRA 1799 256 12
Segment 2310 19 7
Solar 323 12 6
USPS 1854 256 10
USPST 2007 256 10
Waveform 5000 21 3
Wine 178 13 3
Wireless 2000 7 4
Yale 165 1024 15

3.6.1 Data Sets

We used ten UCI data sets in the experiments, including standard data sets for handwritten digit recognition, face data sets, a wine data set, etc. The details are listed below and summarized in Table 3.3.

• Digital is made up of 1797 images (8×8), each a hand-written digit.
• MSRA is a face image data set.
• Segment contains instances drawn randomly from a database of 7 outdoor images. It has 2310 instances and 19 continuous attributes describing the images, including saturation, hue, etc.
• Solar describes the main characteristics of solar flares.
• USPS is one of the standard handwritten digit recognition data sets. It contains images of the digits 0 to 9.
• USPST contains 2007 handwritten digit images.
• Waveform has 5000 instances and 3 classes of waves with 21 attributes.
• Wine contains the results of a chemical analysis of wines from three different cultivars, with 13 constituents measured for each of the three types of wine.
• Wireless contains 2000 instances of the signal strengths of seven Wi-Fi signals visible on a smartphone; the decision variable is one of four rooms.
• Yale contains 165 grayscale images (32×32) of 15 individuals; each subject has a different facial expression or configuration.

3.6.2 Comparison Algorithms

The two comparison algorithms are classical clustering algorithms; their details are summarized below.
• K-means clustering algorithm iteratively (re)assigns data points to their nearest cluster center and recalculates the cluster centers, with the goal of minimizing the sum of distances between data points and their cluster centers.

• Spectral clustering algorithm first forms the similarity matrix, then calculates the first K eigenvectors of its Laplacian matrix to define feature vectors; finally, it runs K-means clustering on these features to separate the objects into K classes. There are different ways to calculate the Laplacian matrix. Instead of the simple Laplacian, we used the normalized Laplacian L_sym = D^{−1/2} (D − S) D^{−1/2}, which performs better than the simple Laplacian [139].

For the above two algorithms, K-means clustering conducts clustering directly on the original data, while spectral clustering is a multi-stage strategy that constructs a graph first and then applies K-means clustering algorithm to partition the graph.

3.6.3 Experiment Setup

In the experiments, we first tested the robustness of the proposed IS clustering algorithm by comparing it with K-means clustering and spectral clustering algorithms on real data sets in terms of three evaluation metrics widely used in clustering research. Due to the sensitivity of K-means clustering to its initial cluster centers, we ran K-means clustering and spectral clustering algorithms 20 times and used the average value as the final result. Secondly, we investigated the parameter sensitivity of the proposed IS clustering algorithm (i.e. α and β in Eq. (3.5)) by varying their values and observing the variation in clustering performance. Thirdly, we demonstrated the convergence of Algorithm 3.1 for solving the proposed objective function Eq. (3.5) by checking the number of iterations at which Algorithm 3.1 converges.
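The normalized Laplacian used for the spectral clustering baseline can be sketched as follows (a minimal numpy illustration; the zero-degree handling for isolated vertices is an assumption):

```python
import numpy as np

def normalized_laplacian(S):
    """Symmetric normalized Laplacian L_sym = D^(-1/2) (D - S) D^(-1/2)."""
    S = (S + S.T) / 2                        # ensure the similarity matrix is symmetric
    d = S.sum(axis=1)                        # degrees
    with np.errstate(divide='ignore'):
        inv_sqrt = 1.0 / np.sqrt(d)
    inv_sqrt[~np.isfinite(inv_sqrt)] = 0.0   # isolated vertices get zero (assumption)
    D_inv = np.diag(inv_sqrt)
    return D_inv @ (np.diag(d) - S) @ D_inv
```

The spectral representation is then the matrix of eigenvectors of this L_sym associated with the K smallest eigenvalues (e.g. via `np.linalg.eigh`), on which K-means is run.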
3.6.4 Experimental Results Analysis

We list the clustering performance of all algorithms in Tables 3.4-3.6, which show that our IS clustering algorithm achieved the best performance on all ten data sets in terms of ACC and NMI, and outperformed K-means clustering algorithm on all ten data sets in terms of Purity. IS clustering algorithm outperformed spectral clustering algorithm on seven data sets in terms of Purity but performed slightly worse than spectral clustering algorithm on three data sets (USPS, USPST and Yale), where the difference in Purity between IS clustering algorithm and spectral clustering algorithm was only 1%. More specifically, IS clustering algorithm increased ACC by 6.3% compared to K-means clustering algorithm and 3.3% compared to spectral clustering algorithm; increased NMI by 4.6% compared to K-means clustering algorithm and 4.5% compared to spectral clustering algorithm; and increased Purity by 4.9% compared to K-means clustering algorithm and 2.9% compared to spectral clustering algorithm. Further observations are listed below.

First, both the one-step clustering algorithm (e.g. IS clustering algorithm) and the two-step clustering algorithm (e.g. spectral clustering algorithm) outperformed K-means clustering algorithm. This implies that constructing a graph or learning a new representation of the original data points improves clustering performance; that is, methods using a new representation can produce better clusterings than methods using the original data. The reason could be that original data generally contains some redundant information, which is almost always true for real data sets, and this redundancy corrupts the performance of clustering models.
In contrast, the two similarity matrix-based methods construct a new representation based on the original data to conduct clustering, which mitigates the effect of redundancy in the original data and thereby improves clustering performance.

Second, the one-step clustering algorithm (e.g. IS clustering algorithm) performed better than the two-step clustering algorithm (e.g. spectral clustering algorithm). Compared to the spectral clustering algorithm, which first uses the original data to construct the similarity matrix and then applies orthogonal decomposition to the similarity matrix to output the new representation, our method employs an adaptive learning strategy that dynamically updates the similarity matrix and the new representation in a unified framework. In this way, both the new representation and the similarity matrix of our method capture the intrinsic correlation of the data, so our method readily outputs better clustering results than classical spectral clustering methods. This suggests that when similarity matrix learning and new representation learning share the same goal, clustering results are optimal, whereas a two-step clustering algorithm with separate goals achieves sub-optimal results.

Table 3.4 ACC results of IS algorithm on ten benchmark data sets. The highest score for each data set is highlighted in bold font.
Datasets K-means Spectral IS
Digital 0.73 0.77 0.80
MSRA 0.49 0.50 0.57
Segment 0.55 0.56 0.63
Solar 0.50 0.51 0.55
USPS 0.62 0.67 0.70
USPST 0.66 0.70 0.71
Waveform 0.50 0.51 0.57
Wine 0.65 0.69 0.71
Wireless 0.94 0.96 0.97
Yale 0.39 0.45 0.46
Rank 3 2 1

Table 3.5 NMI results of IS algorithm on ten benchmark data sets. The highest score for each data set is highlighted in bold font.
Datasets K-means Spectral IS
Digital 0.73 0.72 0.78
MSRA 0.59 0.56 0.63
Segment 0.61 0.52 0.63
Solar 0.34 0.34 0.42
USPS 0.61 0.66 0.70
USPST 0.61 0.66 0.68
Waveform 0.36 0.37 0.40
Wine 0.43 0.42 0.43
Wireless 0.88 0.89 0.91
Yale 0.47 0.51 0.51
Rank 2 2 1

Table 3.6 Purity results of IS algorithm on ten benchmark data sets. The highest score for each data set is highlighted in bold font.
Datasets K-means Spectral IS
Digital 0.76 0.78 0.81
MSRA 0.53 0.53 0.58
Segment 0.58 0.58 0.64
Solar 0.55 0.55 0.61
USPS 0.69 0.75 0.74
USPST 0.71 0.77 0.76
Waveform 0.53 0.51 0.59
Wine 0.69 0.69 0.71
Wireless 0.94 0.96 0.97
Yale 0.41 0.47 0.46
Rank 3 2 1

Figure 3.1 ACC results of IS algorithm on ten benchmark data sets

Figure 3.2 NMI results of IS algorithm on ten benchmark data sets

Figure 3.3 Purity results of IS algorithm on ten benchmark data sets

3.6.5 Parameters' Sensitivity

We varied the parameters α and β in the range [10^−2, …, 10^2] and recorded the ACC, NMI and Purity values of the IS clustering algorithm on the ten data sets in Figures 3.4-3.6. First, different data sets needed different ranges of the parameters to achieve the best performance. For example, IS clustering algorithm achieved its best ACC (97%), NMI (91%) and Purity (97%) on the Wireless data set when both α and β were 10, but on the Digital data set it achieved its best ACC (80%), NMI (78%) and Purity (81%) when β = 100 and α = 0.1. This indicates that IS clustering algorithm is data-driven.
Second, the clustering ACC results changed by less than 3% on average when the parameter α varied in the range [10^−2, …, 10^2] on eight out of ten data sets. The smallest average change was 1% (the Wine and Wireless data sets) and the largest average change was 5% (e.g. the Waveform data set) as α varied over this range. This indicates that IS clustering algorithm is not very sensitive to the parameter α. Third, the c