Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author.

Improved K-means Clustering Algorithms

A thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, Massey University

Tong Liu
2020

Abstract

The K-means clustering algorithm divides samples into subsets with the goal of maximizing intra-subset similarity and inter-subset dissimilarity, where similarity measures the relationship between two samples. As an unsupervised learning technique, K-means is considered one of the most widely used clustering algorithms and has been applied in a variety of areas such as artificial intelligence, data mining, biology, psychology, marketing, and medicine. However, K-means is not robust, and its clustering result depends on the initialization, the similarity measure, and the predefined cluster number. Previous research has focused on solving some of these issues individually, but not on solving them in a unified framework; fixing one issue alone does not guarantee the best performance. Improving K-means, one of the most famous and widely used clustering algorithms, by solving its issues simultaneously is therefore both challenging and significant. This thesis conducts extensive research on the K-means clustering algorithm with the aim of improving it. First, we propose the Initialization-Similarity (IS) clustering algorithm to solve the initialization and similarity-measure issues of K-means in a unified way. Specifically, we propose to fix the initialization of the clustering by using sum-of-norms (SON) regularization, which outputs a new representation of the original samples, and to learn the similarity matrix based on the data distribution.
Furthermore, the derived new representation is used to conduct K-means clustering. Second, we propose a Joint Feature Selection with Dynamic Spectral (FSDS) clustering algorithm to solve the issues of cluster number determination, the similarity measure, and the robustness of the clustering by selecting effective features and reducing the influence of outliers simultaneously. Specifically, we propose to learn the similarity matrix based on the data distribution and to add a rank constraint on the Laplacian matrix of the learned similarity matrix so that the cluster number is output automatically. Furthermore, the proposed algorithm employs the L2,1-norm as a sparsity constraint on the regularization term and on the loss function to remove redundant features and to reduce the influence of outliers, respectively. Third, we propose a Joint Robust Multi-view (JRM) spectral clustering algorithm that clusters multi-view data while solving the initialization issue, cluster number determination, similarity measure learning, removal of redundant features, and reduction of outlier influence in a unified way. Finally, the proposed algorithms outperformed state-of-the-art clustering algorithms on real data sets. Moreover, we theoretically prove the convergence of the proposed optimization methods for the proposed objective functions.

Dedication

In memory of my mother and father

Acknowledgments

I would like to express my sincere gratitude to my supervisors, Associate Professor XiaoFeng Zhu and Distinguished Professor Gaven Martin, for all your guidance, encouragement, and kindness. I am deeply thankful for the support and encouragement from Professor Dianne Brunton, Associate Professor Alona Ben-Tal, Associate Professor Evelyn Sattlegger, and Ms. Linh Mills.

Publications Arising from this Thesis

Publications included in this thesis:

• Zhou, J., Liu, T., & Zhu, J. Weighted adjacent matrix for K-means clustering.
Multimedia Tools and Applications, 2019. 78: p. 33415–33434, corresponding author: Liu, T. Incorporated as Chapters 1 and 2.
• Liu, T., Zhu, J., Zhou, J., Zhu, Y., & Zhu, X. Initialization-similarity clustering algorithm. Multimedia Tools and Applications, 2019. 78: p. 33279–33296. Incorporated as Chapter 3.
• Liu, T., & Martin, G. Joint Feature Selection with Dynamic Spectral Clustering. Revised version submitted to Neural Processing Letters (Springer). DOI: 10.1007/s11063-020-10216-9. Incorporated as Chapter 4.
• Liu, T., Martin, G., Zhu, Y., Peng, L., & Li, L. Joint Robust Multi-view Spectral Clustering. Submitted to Neural Processing Letters (Springer). DOI: 10.1007/s11063-020-10257-0. Incorporated as Chapter 5.

Table of Contents

Table of Contents
List of Tables
List of Figures
List of Notations
Chapter 1 Introduction
  1.1 Motivation
  1.2 Research Objectives
  1.3 Thesis Structure
Chapter 2 Literature Review
  2.1 Clustering Algorithms
  2.2 Feature Selection
  2.3 Outlier Reduction
  2.4 Evaluation Measure
  2.5 Summary
Chapter 3 Initialization-Similarity Clustering Algorithm
  3.1 Introduction
  3.2 Motivation
  3.3 Proposed Algorithm
  3.4 Optimization
  3.5 Convergence Analysis
  3.6 Experiments
    3.6.1 Data Sets
    3.6.2 Comparison Algorithms
    3.6.3 Experiment Setup
    3.6.4 Experimental Results Analysis
    3.6.5 Parameters' Sensitivity
    3.6.6 Convergence
  3.7 Conclusion
Chapter 4 Joint Feature Selection with Dynamic Spectral Clustering
  4.1 Introduction
  4.2 Motivation
  4.3 Proposed Algorithm
  4.4 Optimization
  4.5 Convergence Analysis
  4.6 Experiments
    4.6.1 Data Sets
    4.6.2 Comparison Algorithms
    4.6.3 Experiment Setup
    4.6.4 Experimental Results Analysis
    4.6.5 Parameters' Sensitivity
    4.6.6 Convergence
  4.7 Conclusion
Chapter 5 Joint Robust Multi-view Spectral Clustering
  5.1 Introduction
  5.2 Motivation
  5.3 Proposed Algorithm
  5.4 Optimization
  5.5 Convergence Analysis
  5.6 Experiments
    5.6.1 Data Sets
    5.6.2 Comparison Algorithms
    5.6.3 Experiment Setup
    5.6.4 Experimental Results Analysis
    5.6.5 Parameters' Sensitivity
    5.6.6 Convergence
  5.7 Conclusion
Chapter 6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Directions
References

List of Tables

Table 3.1 The pseudo code for K-means clustering algorithm
Table 3.2 The pseudo code for the spectral clustering algorithm
Table 3.3 Description of ten benchmark data sets
Table 3.4 ACC results of IS algorithm on ten benchmark data sets
Table 3.5 NMI results of IS algorithm on ten benchmark data sets
Table 3.6 Purity results of IS algorithm on ten benchmark data sets
Table 4.1 Description of benchmark datasets
Table 4.2 ACC results of FSDS algorithm on eight benchmark data sets
Table 4.3 Purity results of FSDS algorithm on eight benchmark data sets
Table 5.1 The six multi-view benchmark data sets
Table 5.2 ACC results of JRM algorithm on six multi-view data sets
Table 5.3 Purity results of JRM algorithm on six multi-view data sets

List of Figures

Figure 1.1 K-means Flowchart
Figure 1.2 Research Framework
Figure 3.1 ACC results of IS algorithm on ten benchmark data sets
Figure 3.2 NMI results of IS algorithm on ten benchmark data sets
Figure 3.3 Purity results of IS algorithm on ten benchmark data sets
Figure 3.4 ACC results of IS algorithm with respect to different parameter settings
Figure 3.5 NMI results of IS algorithm with respect to different parameter settings
Figure 3.6 Purity results of IS algorithm with respect to different parameter settings
Figure 3.7 Objective function values (OFVs) versus iterations for IS algorithm
Figure 4.1 ACC results of FSDS algorithm on eight benchmark data sets
Figure 4.2 Purity results of FSDS algorithm on eight benchmark data sets
Figure 4.3 ACC results of FSDS algorithm with respect to different parameter settings
Figure 4.4 Purity results of FSDS algorithm with respect to different parameter settings
Figure 4.5 Objective function values (OFVs) versus iterations for FSDS algorithm
Figure 5.1 ACC results of JRM algorithm on six real data sets
Figure 5.2 Purity results of JRM algorithm on four real data sets
Figure 5.3 ACC results of JRM algorithm with respect to different parameter settings
Figure 5.4 Purity results of JRM algorithm with respect to different parameter settings
Figure 5.5 Objective function values (OFVs) versus iterations for JRM algorithm

List of Notations

Symbol      Description
X           Data matrix
x           A vector of X
x_i         The i-th row of X
x_{i,j}     The element in the i-th row and j-th column of X
‖x‖_2       L2-norm of x
‖X‖_{2,1}   L2,1-norm of X
‖X‖_F       The Frobenius norm (Euclidean norm) of X
X^T         The transpose of X
K           Cluster number
V           Number of views
v           View index
X^v         Data matrix in the v-th view

Chapter 1 Introduction

Machine learning is a subfield of artificial intelligence that gives machines the ability to learn and improve automatically without being explicitly programmed to do so.
If the machine learns to label the data automatically, without knowing the pattern of the data beforehand, this type of machine learning is called unsupervised learning. Unsupervised learning is important because it is often difficult to know the pattern of the data in advance [1, 2]. As an unsupervised learning technique, clustering divides a given data set into groups with the goal of maximizing the similarity of data points in the same group and the dissimilarity of data points in different groups [3]. Clustering has been widely applied in scientific data analysis, data mining, biology, psychology, marketing, medicine, insurance, and other fields [4-9]. A search via Google Scholar on Dec 14, 2019 found over 4.1 million entries for the keyword "clustering". The K-means clustering algorithm is one of the most popular and widely used clustering algorithms, and it has been used as a component of many other algorithms because it is simple, reliable, and mathematically tractable [10, 11].

1.1 Motivation

The K-means clustering algorithm operates in the following steps. First, it initializes the cluster centers by randomly selecting K data points as the K cluster centers. Second, it assigns each data point to its nearest cluster center according to a similarity measure, e.g., Euclidean distance. Third, it revises each of the K cluster centers as the mean of its assigned data points. K-means keeps repeating the last two steps until the algorithm converges [12]. The flow chart of the K-means clustering algorithm is shown in Figure 1.1. K-means is considered one of the most used clustering algorithms and has been successfully applied to broad areas. Previous research has addressed some of the issues of K-means, but it has not addressed the limitations of the algorithm in a unified manner.
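The three-step procedure described above can be sketched in code. The following is a minimal NumPy illustration of standard K-means (Lloyd's algorithm), not any of the improved algorithms proposed in this thesis; the two-blob data set and the empty-cluster guard are illustrative assumptions:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm), following the three steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select K data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: revise each center as the mean of its assigned points
        # (keep the old center if a cluster happens to become empty).
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):  # repeat until convergence
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs; K-means should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels, centers = kmeans(X, K=2)
```

Note that the result depends on which K points the random initialization happens to pick, which is exactly the initialization issue discussed in this chapter.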
Addressing the limitations of the K-means clustering algorithm in a unified way is challenging and significant.

Figure 1.1 K-means Flowchart (start; set K initial cluster centers randomly; assign data points to the closest cluster centers; recalculate the new cluster centers; repeat until convergence; end)

First, the clustering result of K-means depends on the initialization of the cluster centers, and randomly choosing the cluster centers may not lead to an optimal result. The randomness of the initialization also makes clustering results difficult to reproduce. Many current clustering algorithms have addressed the initialization problem of K-means [4, 13-15]. For example, Duan et al. developed an algorithm that uses density estimates to select the initial cluster centers [13], and Lakshmi et al. proposed to use nearest neighbors and feature means to decide the initial cluster centers [14].

Second, the clustering result of K-means depends on the similarity measure. K-means assigns each data point to its closest cluster center based on a similarity measure, and Euclidean distance is often used to determine the similarity by calculating the distance between two data points. However, the Euclidean distance measure does not account for factors such as cluster sizes, dependent features, or density [16, 17]. Thus K-means performs poorly on indistinct or poorly separated data sets [18]. Several works have addressed the similarity-measure problem of K-means [19-24]. For example, the spectral clustering algorithm uses a spectral representation in place of the original data points and then conducts K-means clustering. To do this, spectral clustering first generates a similarity matrix and then conducts an eigenvalue decomposition on it to obtain the spectral representation.
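Together with the final K-means step, this spectral pipeline can be sketched as follows. This is a generic illustration, not the algorithms proposed in this thesis: the Gaussian (RBF) kernel, the kernel width `sigma`, and the use of SciPy's `kmeans2` for the final step are all assumptions made for the sketch.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, K, sigma=1.0):
    """Sketch: similarity matrix -> Laplacian eigendecomposition -> K-means."""
    # 1. Build a similarity matrix with a Gaussian (RBF) kernel.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-sq / (2 * sigma ** 2))
    # 2. Normalized Laplacian L = I - D^{-1/2} S D^{-1/2}; its K smallest
    #    eigenvectors form the spectral representation.
    d = S.sum(axis=1)
    L = np.eye(len(X)) - S / np.sqrt(np.outer(d, d))
    _, eigvecs = np.linalg.eigh(L)                 # eigenvalues ascending
    U = eigvecs[:, :K]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    # 3. Conduct K-means on the spectral representation.
    np.random.seed(0)                              # for reproducibility
    _, labels = kmeans2(U, K, minit='++')
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(8, 1, (40, 2))])
labels = spectral_clustering(X, K=2)
```

The quality of the result still hinges on how the similarity matrix is built (here, on the assumed `sigma`), which motivates learning the similarity from the data distribution as proposed later in this thesis.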
Finally, K-means clustering is conducted on the spectral representation.

Third, K-means relies on a given cluster number K. As an unsupervised algorithm, K-means is supposed to be used on unlabelled data, and without labels the cluster number may not be known beforehand. The robust continuous clustering algorithm is able to determine the cluster number automatically [4]; however, it needs a well-constructed similarity matrix as an input in order to produce a good clustering outcome.

Previous clustering algorithms fixed only some of the issues of K-means. A clustering algorithm that addresses these problems separately is easily trapped in sub-optimal results, i.e., it is hard to obtain a globally optimal solution: even if the best initial value or the best similarity matrix is found in isolation, the final result may still not be optimal, because the output of each individual step is not produced according to the requirements of the next step. It would therefore be challenging and significant if a new clustering algorithm could fix the initialization, cluster number determination, and similarity-measure problems of K-means in a unified framework, in which each aspect is reflected in the others so as to achieve globally optimal results.

Real-world data sets often contain high-dimensional features, some of which are insignificant for clustering. Data with high-dimensional features, i.e., high-dimensional data, increases the computation cost and suffers from the "curse of dimensionality". In this circumstance, K-means clustering using Euclidean distance to measure similarity is not robust to data with high-dimensional features [25].
Hence, reducing redundant features is necessary when conducting clustering analysis on high-dimensional data.

Data almost invariably contains noise, outliers, and errors due to inadequate measurement, collection, or processing, or simply inherent variability. Outliers can distort the distribution of the data set. For example, K-means uses the mean of all data points in one cluster to decide the new cluster center, which makes sense when all the data points lie within a normal distance of one another. However, outliers can strongly affect the mean of the whole cluster, pushing the cluster center toward the outlier and thereby distorting the final cluster configuration. Hence, to achieve robust clustering performance, it is necessary to reduce the influence of outliers.

Nowadays, data can be collected from multiple sources or different aspects. For example, images shared on photo-sharing sites such as Instagram or Flickr carry complementary information such as descriptions, tags, location, and video. Data collected from multiple views is called multi-view data, and each view has its own properties that contribute to the understanding of the subject matter. K-means was designed for clustering single-view data; the naive solution for clustering multi-view data with K-means is to concatenate the features across all views. However, such a simple concatenation treats different views equally, even though different views have their own specific properties. Hence, it is essential to improve the K-means clustering algorithm for multi-view data clustering while also solving the aforementioned issues.
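The effect of a single outlier on a cluster mean, discussed above, can be made concrete with a small numeric illustration (the numbers are invented for demonstration):

```python
import numpy as np

# One tight cluster of values around 10, plus a single gross outlier.
cluster = np.array([9.5, 10.0, 10.2, 9.8, 10.5])
with_outlier = np.append(cluster, 100.0)

mean_clean = cluster.mean()             # 10.0: a sensible cluster center
mean_dirty = with_outlier.mean()        # 25.0: dragged far toward the outlier
median_dirty = np.median(with_outlier)  # 10.1: the median barely moves
```

One extreme value moves the mean from 10.0 to 25.0, while a robust statistic such as the median is barely affected; this is the intuition behind using robust losses (such as the L2,1-norm employed later in this thesis) to dampen the influence of outliers.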
1.2 Research Objectives

The aim of this thesis is to design and evaluate new clustering algorithms that overcome the issues of the K-means clustering algorithm. The thesis framework is shown in Figure 1.2. The specific objectives of this thesis are as follows:
• Objective 1: To solve the issues of the initialization and the similarity measure of the K-means clustering algorithm in a unified way.
• Objective 2: To solve the issues of cluster number determination and the similarity measure, and to improve the robustness of clustering by selecting effective features and reducing the influence of outliers, in a unified way.
• Objective 3: To develop a multi-view clustering algorithm that solves the issues of initialization, cluster number determination, the similarity measure, feature selection, and outlier reduction in a unified way.

1.3 Thesis Structure

This thesis is structured as follows.
• Chapter 2 presents a literature review covering clustering analysis, feature selection, outlier reduction, and evaluation measures.
• Chapter 3 presents the Initialization-Similarity (IS) clustering algorithm, which solves the issues of initialization and similarity measure of K-means in a unified way, fulfilling Objective 1. The proposed IS clustering algorithm outperformed both the classical K-means clustering algorithm and the well-known spectral clustering algorithm.
• Chapter 4 presents the Joint Feature Selection with Dynamic Spectral (FSDS) clustering algorithm, which solves the issues of cluster number determination and the similarity measure, and improves the robustness of clustering by selecting useful features and reducing the influence of outliers in a unified way, fulfilling Objective 2.
The proposed FSDS clustering algorithm outperformed the classical K-means clustering algorithm, the well-known spectral clustering algorithm, Clustering and projected clustering with adaptive neighbors (CAN) [24], and Robust continuous clustering (RCC) [4].
• Chapter 5 presents the Joint Robust Multi-view (JRM) spectral clustering algorithm, which solves the initialization, cluster number determination, similarity measure, feature selection, and outlier reduction issues for multi-view data in a unified way, fulfilling Objective 3. The proposed JRM clustering algorithm outperformed the classical K-means clustering algorithm, the Graph-Based System (GBS) [26], Adaptively Weighted Procrustes (AWP) [27], and Multi-view low-rank sparse subspace clustering (MLRSSC) [28].
• Chapter 6 presents the conclusions and future work.

Figure 1.2 Research Framework (K-means issues: initialization, similarity measure, cluster number, robustness. IS in Chapter 3 addresses initialization and the similarity measure; FSDS in Chapter 4 addresses the similarity measure, the cluster number, and robustness via feature selection and outlier reduction; JRM in Chapter 5 addresses initialization, the similarity measure, the cluster number, robustness, and multi-view clustering.)

Chapter 2 Literature Review

Clustering is an unsupervised learning technique that divides a given data set into groups with the goal of maximizing intra-group similarity and inter-group dissimilarity. This chapter reviews the research topics related to this thesis, including clustering algorithms, feature selection techniques, outlier reduction methods, and evaluation metrics.

2.1 Clustering Algorithms

Clustering algorithms can be classified as single-view or multi-view clustering algorithms, based on whether they aim to cluster single-view or multi-view data.
2.1.1 Single-view Clustering

Clustering algorithms can also be broadly categorized into non-graph-based and graph-based approaches, based on whether the algorithm constructs a similarity graph.

A. Non-Graph-Based Algorithms

Non-graph-based algorithms conduct clustering directly on the original data without constructing a similarity graph. They can be further grouped into categories such as partitioning-based, hierarchical-based, distribution-based, density-based, and nature-based algorithms.

Partitioning-based clustering algorithms, also known as centroid-based or distance-based clustering, divide the data at one level into a number of partitions, where each partition represents a cluster. The center of the data points in each partition is regarded as the cluster center of the corresponding cluster. The K-means clustering algorithm is one of the most famous representatives of this kind of clustering algorithm [29]. Specifically, K-means first randomly selects K data points as the K cluster centers, then assigns each data point to its nearest cluster center according to Euclidean distance, and keeps recalculating the cluster centers and reassigning data points until the algorithm converges [12]. However, K-means needs the cluster number as input, so it is not suitable for a data set with an unknown cluster number. It is also sensitive to the initialization of the cluster centers, because the random choice of centers may produce different clustering results on different runs of the algorithm [29]. Furthermore, K-means measures similarity using Euclidean distance, which gives the same importance to all data points without considering factors such as the density, dependent features, shape, patterns, or scale of the data points [30, 31].
For example, it is difficult for K-means to separate non-convex clusters. There are a number of other partitioning-based algorithms, e.g., K-medoids, COTCLUS, and Tabu-search-based clustering. K-medoids chooses data points located near their cluster centers to represent the clusters; the remaining data points are then assigned to the representative centers to which they are most similar, based on the minimal sum of dissimilarities between data points and their corresponding cluster center points [32]. Instead of using only one center for each class, COTCLUS, an improved centroid-based clustering algorithm, uses suitable centroids from another clustering: it finds two centroids from one clustering and replaces them with two centroids from the other clustering in such a way as to maximally decrease the mean square error of the first clustering. It thus constructs a clustering from two suboptimal clustering results, based on the belief that each suboptimal clustering has the benefit of containing some of the correct clusters [33]. After modifying the centroids, it applies K-means for final fine-tuning [33]. A Tabu-based clustering algorithm employs the center-driven approach of K-means under the guidance of Tabu search, a local (neighborhood) search algorithm that accepts worsening moves when no improving move is available and discourages the search from returning to previously visited solutions [34]. K-medoids, COTCLUS, and Tabu-based clustering, like other partitioning-based algorithms, need the cluster number K to be specified before execution.

In comparison with partitioning-based clustering, which divides the data into un-nested clusters, hierarchical-based clustering builds a tree of nested clusters.
Hierarchical-based algorithms, also known as connectivity-based clustering, build a hierarchical relationship among data points to conduct clustering. A hierarchical clustering is usually represented by a tree structure, where each data point is a leaf and each internal node is a cluster. Division and agglomeration are the two common approaches. In the divisive, or top-down, approach, all data points are initially in one cluster and are then divided into smaller clusters recursively. Conversely, the agglomerative, or bottom-up, approach treats each data point as a cluster at the start and then continuously agglomerates pairs of clusters until all clusters have been merged into a single cluster that contains all data points [35, 36]. For example, the hierarchical clustering algorithm for binary data based on cosine similarity (HABOC) uses an agglomerative hierarchical procedure [37]. HABOC assesses similarity between data points, computes the similarity of sets containing multiple data points using cosine similarity, and then exploits a hierarchical clustering method that compresses the data and merges clusters based on the cosine feature vector of a set and its additivity [37]. HABOC needs the cluster number as an initial parameter. Other hierarchical clustering examples include robust clustering using links (ROCK) and clustering using representatives (CURE) [38-44]. The ROCK clustering algorithm draws a number of data points randomly from the original data set as inputs, along with the desired cluster number K. Instead of using distances to conduct clustering, ROCK uses the number of links, defined as the number of common neighbors, as the similarity measure [42].
The reasoning is that data points belonging to the same cluster most likely have a large number of common neighbours, and thus more links. Hence, the larger the number of links between data points, the greater the likelihood that they belong to the same cluster. However, ROCK ignores the possible differences in the similarity measure of different clusters inside the same data set. CURE selects well-scattered points from each cluster as its representatives, and then shrinks them toward the cluster center [40]. It chooses more than one representative point from each cluster by using single-linkage approaches, in which the similarity of two clusters is determined by the similarity of their most similar data points. Finally, the clusters with the closest representative points are merged together. CURE uses random sampling and partitioning to speed up clustering [40]. However, it is limited by choosing a fixed number of scattered data points to represent a cluster, and by applying a constant factor to shrink those representatives towards their cluster centers [45]. CURE also ignores the information about the aggregate interconnectivity of data points in two clusters. Hierarchical clustering algorithms are particularly good when the data has an underlying hierarchical structure [35]. However, the efficiency of hierarchical clustering algorithms is relatively low compared with the linear complexity of partitioning clustering algorithms. Closely related to statistics, the distribution-based clustering algorithms assume that data generated from the same distribution belongs to the same cluster. However, not all data sets follow such distributions, and the distribution parameters have a strong impact on the clustering results [29].
Examples of distribution-based clustering algorithms include the incremental local distribution-based clustering algorithm with the Bayesian adaptive resonance theory (ILBART) [46], the Gaussian mixture model (GMM) [47], and balanced iterative reducing and clustering using hierarchies (BIRCH) [48, 49]. ILBART first obtains some data patterns with snapshots. Then, each data pattern is clustered in three stages: cluster choosing, matching test, and updating learning. The variation of the covariance determinant, the combining threshold, and the choice function are simultaneously considered in determining the local distribution of the winning cluster [46]. ILBART is sensitive to the data order and its computational stability needs to be improved [46]. GMM uses a probabilistic approach and describes each cluster by its cluster center, covariance, and size. It randomly initializes a fixed number of Gaussian distributions over the data and iteratively optimizes the parameters of each Gaussian distribution, such as the mean, variance, and weight of each cluster. Finally, it calculates the probabilities of data points belonging to each of the clusters [47]. However, many real data sets may not follow Gaussian distributions. Besides the Gaussian distribution assumption, choosing the initial number of Gaussian distributions and the random initialization are also issues of the Gaussian mixture model [50]. BIRCH clustering algorithm summarizes the data in a way that retains as much distribution information as possible, and then conducts the clustering on the data summary. Specifically, BIRCH clustering algorithm takes the original data set and the desired cluster number as inputs, and then conducts clustering in four phases. It first computes the clustering feature tree. Second, it builds a smaller clustering feature tree and regroups crowded sub-clusters into larger ones.
Third, it computes the cluster centers of each cluster and uses an adaptation of agglomerative clustering to cluster all the leaves of the clustering feature tree. Fourth, it uses the cluster centers to conduct the final clustering. BIRCH is sensitive to the data order and to non-spherical clusters [39]. Density-based clustering algorithms partition data points into clusters defined as dense regions of data points separated by low-density regions. Examples of density-based clustering algorithms include density-based spatial clustering of applications with noise (DBSCAN) [51, 52], an attempt at improving density-based clustering algorithms (AIDCA) [53], and the two-phase clustering algorithm with a density exploring distance measure (TADEDM) [54]. DBSCAN recognizes each cluster by finding a distinctive density of points that is notably higher than outside the cluster. Minimum points, core points, border points, and neighbourhood are important concepts in DBSCAN. Minimum points define the minimum number of points required to form a cluster. A core point is a point which has at least the minimum number of points within the neighbourhood distance from itself. A border point is a point that has at least one core point within the neighbourhood distance. The neighbourhood value defines the cut-off distance from a core point within which a data point is clustered as part of the cluster. A point is density-reachable if it is within the neighbourhood distance from a core point. A core point and all the points within its neighbourhood distance form a core set. All the overlapping core sets are grouped together to form a cluster. A point that is neither a core nor a border point, and has fewer than the minimum number of points within the neighbourhood distance from itself, is a noise point. DBSCAN is not entirely deterministic because some border points could be reachable from more than one cluster.
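The core/border/noise distinction described above can be illustrated with a short sketch (the function name, thresholds, and data are invented for this example; it classifies points but does not perform the full DBSCAN cluster expansion):

```python
import numpy as np

def dbscan_point_types(X, eps, min_pts):
    """Classify each point as 'core', 'border', or 'noise' following the
    DBSCAN definitions above (a sketch, not the full DBSCAN algorithm)."""
    n = len(X)
    # pairwise distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # a point's eps-neighbourhood (the point itself is included)
    neighbours = d <= eps
    # core point: at least min_pts points within the neighbourhood distance
    is_core = neighbours.sum(axis=1) >= min_pts
    types = []
    for i in range(n):
        if is_core[i]:
            types.append('core')
        elif any(is_core[j] for j in range(n) if neighbours[i, j] and j != i):
            # border point: within eps of at least one core point
            types.append('border')
        else:
            # neither core nor border: noise
            types.append('noise')
    return types

# 1-D toy data: a dense run of four points and one isolated point
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.0]])
types = dbscan_point_types(X, eps=0.15, min_pts=3)
```

The full algorithm would then group overlapping core sets into clusters; here only the point classification step is shown.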
DBSCAN depends on the distance threshold estimation and it cannot handle data sets with largely varying densities [51]. AIDCA creates adaptive grids on the data and then merges cells based on local density to form clusters [53]. It considers each axis of the grid space separately and creates a number of initial bins for each axis. These uniform bins have a size equal to the range of the data on the axis divided by the number of bins. The density is the sum of the data points in each bin, resulting in a histogram. AIDCA goes through each bin and compares its density with the neighboring one. If the density is less than the set merge-value, the two bins are merged and will be part of the same grid square in the final adaptive grids. As a result, neighboring grids are likely to have differing densities. The distribution of grids should result in a small number of denser grids that contain cluster centers, surrounded by a number of less dense grid cells that constitute the edges of the cluster and are merged into the core [53]. AIDCA needs to know the number of bins to create and the set merge-value, and it is difficult to determine the correct set of parameters. TADEDM is a two-phase clustering method, which applies K-means clustering algorithm in the first phase to obtain the initial clusters that are used as inputs to the second phase [54]. In the second phase, all the data points are clustered using K-means clustering with a density exploring distance measure, under which data points close in distance have high affinity with each other, as do data points located in the same cluster [54]. Because it uses the K-means clustering algorithm, TADEDM requires the cluster number as a prior and also suffers from the initialization problem. The density-based algorithms are based on the assumption that the data points in a high-density region belong to the same cluster.
However, the results of density-based algorithms will suffer if the densities of data points differ greatly. Moreover, most density-based algorithms are also sensitive to the parameter estimation [55]. Imitating the behavior of natural and biological systems, some nature-inspired optimization algorithms have been developed [56]. The nature-inspired optimization algorithms are combined with clustering algorithms to obtain the global optimum solution. The crow search algorithm (CSA) combines the K-means clustering algorithm with the intelligent behaviour of crows to obtain the global optimum solution. CSA requires the cluster number to conduct the clustering [57]. The krill herd algorithm (KHA) models the behaviour of individual krill within a larger krill swarm to find the cluster center [58]. It randomly initializes the data structure representing a single krill, then iteratively generates the fitness function for each krill (data point) of the population, which is similar to calculating the optimized functions for the coordinates of the krill's position [58]. The flower pollination algorithm (FPA) is another example of nature-inspired optimization procedures, inspired by the process of flower pollination. Specifically, to mimic this behavior, FPA employs the Levy flight distribution, a random walk in which the step lengths have heavier tails than the exponential distribution [58, 59]. CSA, KHA, and FPA, like most current nature-inspired algorithms, lack clear mathematical and theoretical proofs of convergence [60]. B. Graph-Based Algorithms Instead of conducting clustering directly on the original data points, most graph-based clustering algorithms first construct a graph and then apply a clustering algorithm to partition the graph. A graph representation captures the high-order relationships among data points, which makes it easier to interpret the complex relationships inherent in the
data points than interpreting them from the original data points directly. A graph is a set of nodes or vertices connected by edges which have weights associated with them. A node or vertex of the graph represents a data point and an edge represents the relationship between data points. The similarity graph represents the similarities between data points. The similarity graph is represented by the similarity matrix, a square symmetric adjacency matrix, where the row and column indices represent the data points, and the entries indicate whether pairs of data points are connected or not. Two vertices are connected if the similarity between the corresponding data points is larger than a certain threshold. The edges within a cluster should have high weight values because data points within the same cluster are similar to each other. The edges between clusters should have low weight values because data points in different clusters are dissimilar from each other. The clustering problem is then transformed into a graph cutting problem. The graph is cut into subgraphs, each subgraph being a cluster. The nodes in a cluster are well connected to nodes in the same cluster but not to nodes outside the cluster. Spectral clustering algorithm is a typical example of graph-based algorithms and has become increasingly popular. Spectral clustering algorithm first creates a similarity matrix and a diagonal degree matrix, whose diagonal entries are the sums of all the weights on each row of the similarity matrix. Then it defines a feature representation by computing the first K eigenvectors of the Laplacian matrix, which is the degree matrix minus the similarity matrix. Finally, it runs K-means clustering on these features to separate objects into K clusters [61]. Spectral clustering algorithm is a multi-step algorithm and it requires the cluster number to be predefined.
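The pipeline described above (similarity matrix, degree matrix, Laplacian, first K eigenvectors) can be sketched as follows. This is a hedged NumPy illustration with an assumed Gaussian similarity and invented data; the resulting rows would normally be fed to K-means:

```python
import numpy as np

def spectral_embedding(X, k, sigma=1.0):
    """Sketch of the spectral clustering pipeline: similarity matrix ->
    degree matrix -> unnormalised Laplacian -> first k eigenvectors."""
    # Gaussian similarity between every pair of points (an assumed choice)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # degree matrix: diagonal of row sums of the similarity matrix
    D = np.diag(W.sum(axis=1))
    # Laplacian: degree matrix minus similarity matrix
    L = D - W
    # eigenvectors of the k smallest eigenvalues give the new representation
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :k]

# Two well-separated pairs of points (made-up data)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
U = spectral_embedding(X, k=2)
```

For well-separated clusters the graph is nearly disconnected, so rows of `U` belonging to the same cluster are almost identical, which is why running K-means on this embedding separates the clusters easily.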
While some graph-based algorithms construct coefficient vectors of two data points to analyse the similarity between them [62], some construct a hypergraph to represent a set of spatial data [63, 64]. For example, low-rank representation (LRR) identifies the subspace structures from data points and then finds the lowest-rank representation among data points to represent the original data points [65]. The low-rank kernel learning graph-based clustering (LKLGC) algorithm is based on multiple kernel learning with the assumption that the consensus kernel matrix is a low-rank matrix and lies in the neighbourhood of the combined kernel matrix [66]. The spectral clustering algorithm is applied to get the final clustering results for the LKLGC algorithm, hence the cluster number needs to be predefined [66]. A hybrid clustering algorithm based on the minimum spanning tree of natural core points (NCP) first adaptively obtains the neighborhood parameter and finds all the natural core points of the data set; it then breaks the data set into subsets and constructs the minimum spanning tree of the natural core points. Finally, it cuts the maximum edge of the minimum spanning tree of natural core points iteratively until the desired cluster number is obtained [67]. NCP needs the cluster number for its final step of clustering. The hierarchical clustering using dynamic modeling (CHAMELEON) uses a graph partitioning algorithm to initially divide the data points into several relatively small sub-clusters, and then finds the genuine clusters by repeatedly combining these sub-clusters if they are close together and their interconnectivity is high [41]. CHAMELEON is a graph-based two-phase hierarchical clustering algorithm and it requires a predefined cluster number.
The clustering and projected clustering with adaptive neighbors algorithm (CAN) learns the data similarity matrix and then imposes a rank constraint on the Laplacian matrix of the data similarity matrix [24]. At the end of the process, the connected components in the resulting similarity matrix represent the clusters of the original data points [24]. CAN learns the data similarity matrix and the clustering structure simultaneously, but it needs to know the cluster number beforehand. Robust continuous clustering (RCC) continuously optimizes a robust objective based on robust estimation [4]. RCC optimizes clustering and its new representation learning jointly [68]. In the RCC algorithm, each data point has a dedicated representative, which is located at the data point initially. Throughout the clustering process, the representatives move and combine into clusters. Although the objective function of RCC is not convex, the optimization is performed by using standard linear least squares solvers [4]. RCC does not need prior knowledge of the cluster number; however, it needs the similarity matrix beforehand. Graph-based clustering algorithms improve on non-graph-based clustering algorithms by generating a new representation of the original data points. However, current graph-based clustering algorithms use a multi-stage strategy which learns the similarity matrix, the new representation, or the clustering structure separately. The first-stage goal of learning a similarity matrix does not always match the second-stage goal of achieving the optimal new representation, and thus they are not guaranteed to always outperform non-graph-based clustering algorithms. Moreover, most graph-based clustering algorithms still use non-graph-based clustering algorithms in the final stage and thus do not simultaneously solve the initialization, similarity measure, or cluster number issues of non-graph-based clustering algorithms.
2.1.2 Multi-view Clustering The existing multi-view clustering algorithms can be broadly categorized into concatenation-based, distribution-based, and centralization-based approaches. A concatenation-based multi-view algorithm conducts clustering on the concatenated feature vectors of all views. Examples of concatenation-based algorithms include concatenation K-means clustering and feature concatenation multi-view subspace clustering [69]. K-means clustering algorithm was developed for single-view data sets. For multi-view data sets, K-means clustering algorithm conducts clustering on the concatenated features across all views. This simple concatenation approach does not consider the unique nature of different views, even though different views have their own specific properties for their features. Furthermore, it may lead to the critical issue of the "curse of dimensionality", which refers to a fixed number of data points becoming increasingly "sparse" as the dimensionality increases. The "curse of dimensionality" affects the clustering results [70]. A distribution-based multi-view algorithm conducts clustering on every view of a multi-view data set individually, and then synthesizes the results from the individual views for final clustering. For example, the co-regularized spectral clustering algorithm uses a single objective function for each individual view and combines the spectral graphs from different views for final K-means clustering [71]. The low-rank multi-view matrix completion (lrMMC) algorithm first seeks a low-dimensional representation in which the common subspace is constrained to be low rank, together with combination weights which are learned to explore the complementarity between different views [72]. The mutual kernel completion algorithm applies different predefined kernels to different views; these kernels are then combined into a unified kernel [73].
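The concatenation-based approach described above amounts to stacking the per-view feature vectors side by side and clustering the result. A minimal sketch (view contents, the helper `kmeans_lite`, and its parameters are all invented for illustration):

```python
import numpy as np

# Two hypothetical views of the same four samples, e.g. colour features
# and texture features (made-up data for illustration).
view1 = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])  # 2 features
view2 = np.array([[1.0], [1.1], [9.0], [9.1]])                      # 1 feature

# Concatenation-based approach: stack the feature vectors of each view.
# The dimensionality grows with the number of views, which is the root
# of the "curse of dimensionality" issue noted above.
X = np.hstack([view1, view2])          # shape: (4 samples, 3 features)

def kmeans_lite(X, k, n_iter=20, seed=0):
    """Minimal Lloyd-iteration K-means run on the concatenated features."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its nearest center
        labels = np.linalg.norm(
            X[:, None] - centers[None, :], axis=2).argmin(axis=1)
        # recompute each center as the mean of its assigned samples
        centers = np.array([X[labels == j].mean(axis=0)
                            if (labels == j).any() else centers[j]
                            for j in range(k)])
    return labels

labels = kmeans_lite(X, k=2)
```

Note that the concatenation treats all features uniformly, illustrating why the view-specific properties mentioned above are lost.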
An ensemble approach to multi-view multi-instance learning builds models on multiple heterogeneous data views by combining view learners and pursuing consensus among the weighted classes [74]. However, distribution-based multi-view algorithms do not fully use the information across views and thus may fail to produce reasonable clustering results [75]. Compared with concatenation-based and distribution-based approaches, a centralization-based approach achieves better performance since it takes information from all views of a multi-view data set to conduct clustering [76]. A weighted hybrid fusion method constructs an objective function with a rank consistency constraint [77]. The graph-based system (GBS) automatically weights the constructed graph of each view, and then generates a unified graph matrix [26]. Although it dynamically generates the weight of each graph matrix, GBS needs the number of neighbors as a prior. Furthermore, the learning of the unified graph and the construction of the individual graphs are performed in two separate stages. Adaptively weighted Procrustes (AWP) weights each view according to its clustering capacity and forms a weighted Procrustes average problem accordingly [27]. AWP requires the spectral embedding matrix to be calculated beforehand as an input. The goal of computing the spectral embedding matrix is different from that of the second stage of multi-view clustering, and thus optimal performance is not guaranteed. Multi-view low-rank sparse subspace clustering (MLRSSC) jointly learns an affinity matrix constrained by sparsity and low rank, while at the same time balancing the agreements across different views [28]. MLRSSC learns the joint affinity matrix first, and then uses the spectral clustering algorithm to complete the final clustering. The learning of the affinity matrix and the final spectral clustering are in two separate stages. Thus, optimal clustering results cannot be guaranteed.
2.2 Feature Selection Real-world data sets are rich in information and often contain high-dimensional features. However, not all features are effective for clustering algorithms. High-dimensional features not only increase the computational time for machine learning, but also increase the risk of overfitting. Dimensionality reduction aims to reduce the dimensions of data by obtaining a set of principal features or removing redundant and dependent features [78]. It transforms the features from a high-dimensional space to a low-dimensional space. It can be applied to reduce complexity, avoid overfitting, and reduce the influence of outliers. Feature selection is one of the dimensionality reduction approaches. Feature selection selects useful features from the original features or filters irrelevant or redundant features from the original data set. Feature selection techniques can be broadly categorized into three types: filter methods, wrapper methods, and embedded methods. The filter feature selection methods filter out unimportant or redundant features from the original data set based on certain criteria [79]. Some use mutual information or correlation to select the most relevant features [80]. The feature selection for multi-labeled variables method selects features via maximizing the conditional dependency between features [79]. An unsupervised filter feature selection method for mixed data (USFSM) evaluates the relevance of features by their contributions to defining good cluster structures, analysing the changes in the spectrum of the normalized Laplacian matrix when a feature is excluded [81]. The filter techniques have the advantages of speed and scalability [82, 83], and they are useful for selecting a generic set of features for all machine learning models.
However, in some cases, features selected through filter methods may not be the most optimal set of features for some specific algorithms. The wrapper feature selection methods are used to select the most optimal features for the specified algorithms [84, 85]. There are different wrapper approaches. A meta-heuristic wrapper method uses random encircling and the imitative behavior of the Kestrel bird for the optimal selection of features [86]. The sequential approach adds or removes features sequentially; the bio-inspired approach introduces randomness into the process to reach the global optimum; the iterative approach converts the feature selection problem into an estimation problem [81]. A sequential method outputs both a ranking of relevant features and an optimal partition by using the Mahalanobis metric (a multivariate distance metric which measures the distance between a data point and a distribution) and the K-means clustering algorithm [84]. Localized feature selection (LFS), an iterative algorithm, uses a randomized rounding approach when the weights of regions are fixed [85]. However, traditional wrapper methods usually have poor generalization ability, high complexity, low computational efficiency, and high computational cost [82, 87]. In embedded approaches, feature selection is an integrated part of the learning algorithm. The embedded approaches can be generally divided into two types: decision tree algorithms and regularization techniques. Decision tree algorithms select features recursively during the tree growth process [88, 89].
The tree growth process is also the process of feature selection. Some feature selection methods are based on the bee colony algorithm and gradient boosting decision trees [88]. Others use classification and regression tree (CART) decision tree algorithms to select features for 3D depth video [89]. The L1-norm, L2-norm, or L2,1-norm have been used for feature selection in regularization-based algorithms, whose objective function is the minimization of a regularized cost. The key difference between the regularization techniques is the regularization term or penalty term. In L1-norm regularization, the absolute value of the magnitude of the coefficients is the penalty term. In L2-norm regularization, the squared magnitude of the coefficients is the penalty term. In L2,1-norm regularization, the penalty term is the non-squared magnitude of the coefficients. For a matrix $\mathbf{M} \in \mathbb{R}^{n \times m}$, the L1-norm, L2-norm, and L2,1-norm are defined in Eq. 2.1, Eq. 2.2 and Eq. 2.3 respectively [90]:

$\|\mathbf{M}\|_1 = \sum_{i=1}^{n} \sum_{j=1}^{m} |m_{i,j}|$   (2.1)

$\|\mathbf{M}\|_2 = \left( \sum_{i=1}^{n} \sum_{j=1}^{m} m_{i,j}^2 \right)^{1/2}$   (2.2)

$\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{n} \left( \sum_{j=1}^{m} m_{i,j}^2 \right)^{1/2}$   (2.3)

Specifically, the L1-norm generates element-wise sparsity while the L2,1-norm generates row-wise sparsity. That is, an L2,1-norm penalty on the regularization term makes some rows of the generated projection matrix zero. The redundant features are filtered out, as the unrepresentative features corresponding to the zero rows do not participate in the clustering process; consequently, L2,1-norm-based algorithms have better interpretability than L1-norm-based algorithms in feature selection models [91, 92]. The L2-norm cannot generate sparsity, which makes it less effective in feature selection models [93]. The L2,1-norm-based approaches are also more robust than the L2-norm-based approaches, and recently the L2,1-norm has been used to improve the robustness of feature selection algorithms [94, 95].
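The three norms defined in Eqs. 2.1–2.3 can be computed directly; the short sketch below evaluates them on a small made-up matrix whose second row is entirely zero, mimicking the row-wise sparsity that an L2,1 penalty encourages:

```python
import numpy as np

# Small matrix with an all-zero second row (made-up for illustration).
M = np.array([[1.0, -2.0],
              [0.0,  0.0],
              [3.0,  4.0]])

l1  = np.abs(M).sum()                       # Eq. (2.1): sum of |entries|
l2  = np.sqrt((M ** 2).sum())               # Eq. (2.2): sqrt of sum of squares
l21 = np.sqrt((M ** 2).sum(axis=1)).sum()   # Eq. (2.3): sum of row L2 norms
```

Note how the zero row contributes nothing to the L2,1-norm: penalising the sum of row norms makes it cheap to zero out entire rows, which is exactly the row-wise sparsity used to discard redundant features.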
For instance, an L2,1-norm regularization term can be imposed on the objective function to achieve feature selection and capture the discriminative structure information [94]. The L2,1-norm has also been used on both the reconstruction error and the sparse constraint term to extract representative 2D image features [95]. 2.3 Outlier Reduction Real data often contains outliers, which are data points inconsistent with most of the other data points in a given data set [96, 97]. Outliers can result from an inadequate procedure of data measurement, collection, or handling, or from inherent variability in the underlying data domain. Outliers can significantly affect the clustering results. Outlier detection and robust clustering algorithms are often used to tackle the outlier problem. Outlier detection algorithms detect outliers, i.e., data points that deviate from most of the other data points. Most of the existing outlier detection studies focus on unsupervised outlier detection [98]. Examples of outlier detection algorithms include distance-based outlier detection [99], dimension-based outlier detection [100], density-based outlier detection [101], frequent-pattern-based outlier detection [102, 103], and cluster-based outlier detection [104]. To minimize the impact of outliers, robust clustering has been investigated in different areas.
Some algorithms learn a robust metric to measure the similarity between points by taking the outliers into account [105, 106]; some use the L1-norm or L2,1-norm to remove the outliers [107, 108]; some assign different weights to the data and the outliers during the clustering process [109]; some decompose outliers into a low-rank part [66, 110]; and some conduct ensemble or fusion-based clustering, combining different partition results to deliver a more robust result [111, 112]. The L1-norm, L2-norm, or L2,1-norm have been used in the regularization terms of the objective functions of clustering algorithms [113]. A non-convex multi-task generalization of the L2,1-norm regularization is used to learn a few features common across multiple tasks [114]. The L2,1-norm regularized regression model has been used for joint feature selection from multiple tasks [115]. L2,1-norm regularization encourages multiple predictors to share similar sparsity patterns [115]. Formally, let $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) \in \mathbb{R}^{p \times n}$, $\mathbf{V} = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n) \in \mathbb{R}^{k \times n}$, and $\mathbf{U} \in \mathbb{R}^{p \times k}$. In L1-norm-based robust clustering algorithms, the absolute value of the magnitude of the coefficients is used in the loss function:

$\min_{\mathbf{U},\mathbf{V}} E_1(\mathbf{U},\mathbf{V}) = \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_1$   (2.4)

Specifically, the L1-norm generates element-wise sparsity. As outliers correspond to row-wise sparsity instead of element-wise sparsity, L1-norm-based algorithms do not have good interpretability in outlier reduction. In L2-norm-based robust clustering algorithms, the squared magnitude of the coefficients is the penalty term:

$\min_{\mathbf{U},\mathbf{V}} E_2(\mathbf{U},\mathbf{V}) = \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_F^2$   (2.5)

The L2-norm is calculated as the square root of the sum of the squared vector values. For an outlier, the residual $\|\mathbf{x}_i - \mathbf{U}\mathbf{v}_i\|$ is larger than the residuals of non-outliers. After squaring, the residual of the outlier could dominate the loss function.
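The domination effect of squared residuals can be made concrete with a small numeric sketch. The residual matrix below is made up: four well-fitted samples and one outlier sample with a large residual (samples are columns, following the $\mathbf{X} \in \mathbb{R}^{p \times n}$ convention above):

```python
import numpy as np

# Hypothetical residual matrix X - UV: five samples (columns), the last
# being an outlier with a much larger residual.
R = np.array([[0.1, 0.2, 0.1, 0.2, 10.0],
              [0.2, 0.1, 0.2, 0.1, 10.0]])

col_norms = np.linalg.norm(R, axis=0)   # per-sample residual ||x_i - U v_i||

l2_loss  = (col_norms ** 2).sum()       # squared residuals, as in Eq. (2.5)
l21_loss = col_norms.sum()              # unsquared residuals, as in Eq. (2.6)

# Share of the total loss contributed by the outlier (last sample)
l2_share  = col_norms[-1] ** 2 / l2_loss
l21_share = col_norms[-1] / l21_loss
```

Under the squared (L2-style) loss the single outlier accounts for essentially the whole objective, whereas under the unsquared (L2,1-style) loss its share is noticeably smaller, which is the robustness argument made in the text.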
The L2-norm is also used to calculate the Euclidean distance of a vector from the origin of the vector space. Euclidean distance is often used in clustering algorithms to calculate similarity; the L2-norm is therefore also called the Euclidean norm. The L2,1-norm-based loss is defined in the following equation:

$\min_{\mathbf{U},\mathbf{V}} E_{2,1}(\mathbf{U},\mathbf{V}) = \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_{2,1}$   (2.6)

The L2,1-norm generates row-wise sparsity. As the outliers corresponding to row-wise sparsity do not participate in the clustering process, L2,1-norm-based algorithms have better interpretability than L1-norm-based algorithms in outlier removal [91, 92]. The residual $\|\mathbf{x}_i - \mathbf{U}\mathbf{v}_i\|$ of an outlier is not squared, which reduces the influence of the outlier compared to the L2-norm-based loss function. Thus, L2,1-norm-based algorithms can achieve more robust clustering results than L2-norm-based algorithms. The L2,1-norm performs more robustly and stably than the L2-norm when outliers exist [116]. According to the structure of the constraints, structural sparsity is often obtained by the L2,1-norm. L2,1-norm regularization encourages multiple predictors to share similar sparsity patterns [115]. L2,1-norm-based functions are robust to outliers [117, 118]. 2.4 Evaluation Measure To assess the performance of the proposed algorithms against related algorithms, we adopted three popular evaluation metrics for clustering algorithms: accuracy (ACC), normalized mutual information (NMI), and Purity [119]. ACC measures the percentage of samples correctly clustered. NMI measures the pairwise similarity between two partitions. Purity measures the percentage of each cluster containing correctly clustered samples [11, 120]. The definitions of these three evaluation metrics are given below.

$\mathrm{ACC} = N_{correct} / N$   (2.7)

where $N_{correct}$ represents the number of correctly clustered samples, and $N$ represents the total number of samples.
$\mathrm{NMI}(A, B) = \dfrac{\sum_{i=1}^{C_A} \sum_{j=1}^{C_B} n_{ij} \log \left( \frac{n_{ij}\, n}{n_i^A n_j^B} \right)}{\sqrt{\left( \sum_{i=1}^{C_A} n_i^A \log \frac{n_i^A}{n} \right) \left( \sum_{j=1}^{C_B} n_j^B \log \frac{n_j^B}{n} \right)}}$   (2.8)

where $A$ and $B$ represent two partitions of $n$ samples into $C_A$ and $C_B$ clusters respectively, $n_{ij}$ is the number of samples shared by the i-th cluster of $A$ and the j-th cluster of $B$, and $n_i^A$ and $n_j^B$ are the numbers of samples in the corresponding clusters.

$\mathrm{Purity} = \sum_{i=1}^{k} (S_i / n)\, P_i$   (2.9)

where $k$ represents the number of clusters and $n$ represents the total number of samples. $S_i$ represents the number of samples in the i-th cluster, and $P_i$ represents the fraction of correctly clustered samples in the i-th cluster. To rank the performance of different algorithms, we used dense ranking, in which the highest accuracy rate receives number 1 and the next accuracy rate receives the immediately following ranking number; equal accuracy rates receive the same ranking number. Thus, if A ranks ahead of B and C (which compare equal), which are both ranked ahead of D, then A gets ranking number 1 ("first"), B gets ranking number 2 ("joint second"), C also gets ranking number 2 ("joint second"), and D gets ranking number 3 ("third"). 2.5 Summary As one of the most famous and widely used clustering algorithms, K-means clustering algorithm still has its limitations. It is difficult to determine the cluster number K to obtain a good clustering result without prior knowledge. Different initializations may produce completely different clustering results. Using Euclidean distance as the similarity measure is limiting for real-world data. Real-world data contains redundant features and outliers, and without reducing the influence of redundant features and outliers it is hard to achieve optimal results. Existing methods have solved only some of these problems. All these issues of K-means clustering algorithm need to be addressed to improve it.
Chapter 3

Initialization-Similarity Clustering Algorithm

3.1 Introduction

Due to its random initialization and its use of Euclidean distance as the similarity measure, K-means clustering algorithm is not guaranteed to produce optimal and stable results. Much of the literature has addressed parts of this problem [4, 13-15, 121, 122]; however, previous research focused on solving individual issues and has not solved the initialization and the similarity measure in a unified framework. As an innovative clustering method, spectral clustering has been widely applied in fields such as data mining, computer vision, machine learning, and pattern recognition in recent years [123, 124]. To fix the similarity measure issue of K-means clustering algorithm, spectral clustering generates a similarity matrix, then obtains the spectral representation, and finally applies K-means clustering to get the final clustering results. Fixing only one of the two issues does not guarantee the best performance. Solving the similarity and initialization issues of K-means clustering algorithm simultaneously can be considered an improvement over existing algorithms because it can lead to better outputs.

The proposed Initialization-Similarity (IS) clustering algorithm aims to solve the initialization and similarity measure issues simultaneously. Specifically, we fix the initialization of the clustering by using sum-of-norms (SON) regularization [125]. Moreover, the SON regularization outputs a new representation of the original samples. The proposed IS clustering algorithm then learns the similarity matrix based on the data distribution: the similarity is high if the distance between the new representations of two data points is small.
Furthermore, the derived new representation is used to conduct K-means clustering. Finally, we employ an alternating strategy to solve the proposed objective function. Experimental results on real-world benchmark data sets demonstrate that IS clustering algorithm outperforms the comparison clustering algorithms in terms of three evaluation metrics for clustering: accuracy (ACC), normalized mutual information (NMI), and Purity. We briefly summarize the contributions of the proposed IS clustering algorithm as follows:

• IS clustering algorithm fixes the initialization by using the sum-of-norms regularization, which makes the clustering robust and reproducible. In contrast, previous clustering algorithms use randomly selected cluster centers to initialize K-means clustering and thus output unstable or varying clustering results [126].

• Previous spectral clustering algorithms use the spectral representation in place of the original representation when conducting K-means clustering. To do this, a spectral clustering algorithm first generates the similarity matrix and then conducts eigenvalue decomposition on the Laplacian matrix of the similarity matrix to obtain the spectral representation. This is a two-step strategy in which the goal of the first step does not guarantee the best clustering result. In contrast, IS clustering algorithm learns the similarity matrix and the new representation simultaneously; the performance is more promising when the two steps are combined in a unified way.

• Experimental results on ten public data sets show that the proposed IS clustering algorithm outperforms both K-means clustering and spectral clustering algorithms, which implies that simultaneously addressing the two issues of K-means clustering algorithm is feasible and effective.

This section has laid the background of the research inquiry.
The remainder of this chapter is organized as follows: Section 3.2 discusses the motivation behind the development of IS clustering algorithm. Section 3.3 introduces the proposed Initialization-Similarity (IS) algorithm. Section 3.4 provides the optimization process. Section 3.5 provides the convergence analysis. Section 3.6 discusses the experiments we conducted and presents their results. The conclusions, limitations and future research directions are presented in Section 3.7.

3.2 Motivation

To discover how other algorithms improve K-means clustering algorithm, we investigated both K-means clustering algorithm and spectral clustering algorithm, another widely used clustering algorithm, in detail. K-means clustering algorithm aims at minimizing the total intra-cluster variance, represented by an objective function known as the squared error function shown in Eq. (3.1):

Σ_{j=1}^{K} Σ_{i=1}^{d_j} ||x_i − h_j||^2 (3.1)

where K is the cluster number, d_j is the number of data points in the j-th cluster, x_i is the i-th data point of cluster j, h_j is the cluster center of the j-th cluster, and ||x_i − h_j||^2 is the squared Euclidean distance between x_i and h_j.

K-means clustering algorithm can be reformulated as a nonnegative matrix factorization, as in Eq. (3.2) [127]:

min_{H,F} ||X − FH||_F^2 (3.2)

where F ∈ ℝ^{n×K} is the cluster indicator matrix of X ∈ ℝ^{n×d} and H ∈ ℝ^{K×d} is the cluster center matrix. K-means clustering algorithm randomly chooses the initial cluster centers. Based on both Eq. (3.1) and Eq. (3.2), it is obvious that different initializations may have different effects on the clustering results [128, 129], which implies that it is difficult to reproduce the clustering results. Some algorithms were developed to address this issue.
For example, the novel centroid selection approaches for K-means-clustering-based recommender systems first select one random data point as the initial cluster center, then select each subsequent cluster center with a probability that favours points far from the existing centers, until all K cluster centers are found; the first cluster center is still selected randomly, which affects the clustering results [128]. Random swap-based algorithms, such as the efficiency-of-random-swap clustering algorithm, first select the cluster centers randomly, then repeatedly remove one randomly selected cluster center and replace it with a randomly selected data point. This is a trial-and-error approach without a clear number of iterations [130].

Moreover, Eq. (3.2) also shows that the outcome of the K-means clustering objective function depends only on the Euclidean distance between the data points and the cluster centers, which is how K-means clustering algorithm defines the similarity measure between two data points: the smaller the distance between two data points, the more similar they are, and the larger the distance, the more dissimilar they are. Euclidean distance does not reveal other underlying factors such as cluster sizes, shape, dependent features or density [18, 30]. Thus the similarity measure is an issue of K-means clustering algorithm.

To address the similarity measure issue of K-means clustering algorithm, spectral clustering algorithm uses the spectral representation to replace the original representation. To achieve this, spectral clustering algorithm first builds a similarity matrix and conducts eigenvalue decomposition on its Laplacian matrix to obtain the spectral representation. The pseudo code for K-means clustering algorithm is listed in Table 3.1.

Table 3.1 The pseudo code for K-means clustering algorithm
Input: X (data matrix), K (the cluster number)
Output: K cluster centers and the cluster indicator of each data point
Initialization: randomly select K cluster centers h_1, h_2, …, h_K;
Repeat:
1. Assign each data point x_i to its nearest cluster center using Euclidean distance;
2. Recalculate the new cluster centers h_1, h_2, …, h_K;
Until convergence (the cluster indicator of each data point is unchanged).

A spectral clustering algorithm creates a similarity matrix first and then defines feature vectors; it then runs the K-means clustering algorithm to conduct the clustering [61]. Thus, a spectral clustering algorithm finds the data similarity matrix and the spectral representation in separate stages, and its use of the K-means clustering algorithm requires the cluster number beforehand. Other algorithms, e.g. CAN, learn the data similarity matrix and the clustering structure simultaneously, but again need to know the cluster number beforehand. In the algorithm RCC, clustering is managed without prior knowledge of the cluster number by continuously optimizing an objective function based on robust estimation; however, it needs a good similarity matrix calculated beforehand as an input to produce a good clustering outcome. The pseudo code for spectral clustering algorithm is shown in Table 3.2.

Table 3.2 The pseudo code for the spectral clustering algorithm
Input: X ∈ ℝ^{n×d} (data matrix), K (the cluster number)
Output: K cluster centers and the cluster indicator of each data point
• Compute S ∈ ℝ^{n×n} to measure the similarity between every pair of data points;
• Compute L = D − S, where D = [d_{ij}]_{n×n} is a diagonal matrix with d_{ii} = Σ_j (s_{ij} + s_{ji})/2;
• Generate the spectral representation using the eigenvectors and eigenvalues of L;
• Conduct K-means clustering on the spectral representation.

Obviously, by replacing the original representation with the spectral representation, spectral clustering algorithm deals with the issue of the similarity measure in K-means clustering algorithm.
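The procedure in Table 3.1 can be sketched in a few lines of Python (a minimal illustration, not the implementation used in the thesis; the function name `kmeans` and the fixed random seed are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means following Table 3.1: random centers, assign, recompute."""
    rng = np.random.default_rng(seed)
    # Initialization: randomly select k distinct data points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # cluster indicators unchanged
            break
        labels = new_labels
        # Step 2: recompute each center as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Running the sketch twice with different seeds can produce different partitions, which is exactly the initialization issue discussed above.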
However, spectral clustering algorithm learns the similarity matrix and the spectral representation separately, also known as a two-stage strategy, in which the goal of constructing the similarity matrix in the first stage is not to achieve the optimal spectral representation; it is therefore not guaranteed to always outperform K-means clustering algorithm.

3.3 Proposed Algorithm

This thesis proposes a new clustering algorithm, Initialization-Similarity (IS), to simultaneously solve the initialization and similarity measure issues of K-means clustering algorithm in a unified framework. Specifically, IS clustering algorithm uses the sum-of-norms regularization to address the initialization issue, and jointly learns the similarity matrix and the spectral representation to overcome the issue of the multi-stage strategy of spectral clustering algorithm. To achieve this goal, we form the objective function of the IS clustering algorithm as follows:

min_{S,U} (1/2)||X − U||_F^2 + (α/2) Σ_{i,j=1}^{n} s_{i,j} ρ(||u_i − u_j||_2) + β Σ_{i=1}^{n} ||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.3)

where X ∈ ℝ^{n×d} is the data matrix, U ∈ ℝ^{n×d} is the new representation of X, and S ∈ ℝ^{n×n} is the similarity matrix measuring the similarity among data points. ρ(||u_i − u_j||_2) is an implicit function, also known as a robust loss function in robust statistics. Eq. (3.3) aims at learning the new representation U and fixes the initialization of the clustering. Moreover, Eq. (3.3) learns the new representation U while taking the similarity among data points into account, i.e., the higher the similarity s_{i,j} between two data points, the smaller the distance between their corresponding new representations u_i and u_j. Furthermore, we learn the similarity matrix S based on the sample distribution, i.e., it is iteratively updated with the updated U.
This makes the new representation reasonable. Several robust loss functions have been proposed in robust statistics [131, 132]. In this thesis, we employ the Geman-McClure function [133]:

ρ(||u_p − u_q||_2) = μ||u_p − u_q||_2^2 / (μ + ||u_p − u_q||_2^2) (3.4)

Eq. (3.4) is often used to measure how well a prediction model predicts the expected outcome. The closer two points are, the smaller ||u_p − u_q||_2 is, and the higher the similarity s_{p,q} is. As the other parameters in Eq. (3.3) are updated, the distance ||u_p − u_q||_2 for some p, q will become very small, or even u_p = u_q. In this way, the clusters are determined.

Algorithm 3.1 The pseudo code for IS clustering algorithm
Input: X ∈ ℝ^{n×d}
Output: a set of K clusters
Initialization: U = X;
Repeat:
• Update F using Eq. (3.13)
• Update S using Eq. (3.22)
• Update U using Eq. (3.36)
Until U converges
• Apply K-means clustering algorithm on U

In robust statistics, the optimization of a robust loss function is usually difficult or inefficient. To address this, it is common to introduce an auxiliary variable f_{i,j} and a penalty term φ(f_{i,j}) [134-136]; Eq. (3.3) is then equivalent to:

min_{S,U,F} (1/2) Σ_{i=1}^{n} ||x_i − u_i||_2^2 + (α/2) Σ_{i,j=1}^{n} s_{i,j}( f_{i,j}||u_i − u_j||_2^2 + φ(f_{i,j}) ) + β Σ_{i=1}^{n} ||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.5)

where φ(f_{i,j}) = μ(√f_{i,j} − 1)^2, i, j = 1, …, n.

3.4 Optimization

Eq. (3.5) is not jointly convex in F, U, and S, but it is convex in each variable when the others are fixed. To solve Eq. (3.5), the alternating optimization strategy is applied: we optimize each variable while fixing the rest until the algorithm converges. The pseudo-code of IS clustering algorithm is given in Algorithm 3.1.
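The equivalence between Eq. (3.3) and the half-quadratic form of Eq. (3.5) can be checked numerically; the sketch below (hypothetical helper names, treating the scalar d as ||u_p − u_q||_2) evaluates the Geman-McClure loss of Eq. (3.4), the surrogate term of Eq. (3.5), and the closed-form minimizer f of Eq. (3.13):

```python
import math

def rho(d, mu):
    """Geman-McClure robust loss (Eq. 3.4); d stands for ||u_p - u_q||_2."""
    return mu * d ** 2 / (mu + d ** 2)

def f_star(d, mu):
    """Closed-form minimizer over f of the surrogate (Eq. 3.13)."""
    return (mu / (mu + d ** 2)) ** 2

def surrogate(d, mu, f):
    """Half-quadratic surrogate f*d^2 + mu*(sqrt(f) - 1)^2 from Eq. (3.5)."""
    return f * d ** 2 + mu * (math.sqrt(f) - 1) ** 2
```

Plugging f_star back into the surrogate recovers ρ(d) exactly, which is what makes the auxiliary-variable problem in Eq. (3.5) equivalent to Eq. (3.3).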
1) Update F while fixing S and U

While S and U are fixed, the objective function can be rewritten in a simplified form to optimize F:

min_F (α/2) Σ_{i,j=1}^{n} s_{i,j}( f_{i,j}||u_i − u_j||_2^2 + μ(√f_{i,j} − 1)^2 ) (3.6)

Since the optimization of f_{i,j} is independent of the optimization of every other f_{p,q} (i ≠ p, j ≠ q), each f_{i,j} is optimized separately:

min_{f_{i,j}} (α/2)( s_{i,j} f_{i,j}||u_i − u_j||_2^2 + s_{i,j} μ( f_{i,j} − 2√f_{i,j} + 1 ) ) (3.7)

Setting the derivative of Eq. (3.7) with respect to f_{i,j} to zero, we get

(α/2)( s_{i,j}||u_i − u_j||_2^2 + s_{i,j}μ − s_{i,j}μ f_{i,j}^{−1/2} ) = 0 (3.8)

⇒ (α/2) s_{i,j}||u_i − u_j||_2^2 + (α/2) s_{i,j}μ − (α/2) s_{i,j}μ f_{i,j}^{−1/2} = 0 (3.9)

⇒ (α/2) s_{i,j}||u_i − u_j||_2^2 + (α/2) s_{i,j}μ = (α/2) s_{i,j}μ f_{i,j}^{−1/2} (3.10)

⇒ f_{i,j}^{−1/2} = ( ||u_i − u_j||_2^2 + μ ) / μ (3.11)

⇒ f_{i,j}^{1/2} = μ / ( μ + ||u_i − u_j||_2^2 ) (3.12)

⇒ f_{i,j} = ( μ / ( μ + ||u_i − u_j||_2^2 ) )^2 (3.13)

2) Update S while fixing U and F

While U and F are fixed, the objective function of Eq. (3.5) with respect to S is:

min_S (α/2) Σ_{i,j=1}^{n} s_{i,j}( f_{i,j}||u_i − u_j||_2^2 + μ(√f_{i,j} − 1)^2 ) + β Σ_{i=1}^{n} ||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.14)

Since the optimization of s_i is independent of the optimization of every other s_j (i ≠ j, i, j = 1, …, n), each s_i is optimized separately:

min_{s_i} (α/2) Σ_{j=1}^{n} s_{i,j}( f_{i,j}||u_i − u_j||_2^2 + μ(√f_{i,j} − 1)^2 ) + β||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.15)

Let b_{i,j} = f_{i,j}||u_i − u_j||_2^2 and c_{i,j} = μ(√f_{i,j} − 1)^2; Eq. (3.15) is equivalent to:

min_{s_i} (α/2) Σ_{j=1}^{n} s_{i,j}b_{i,j} + (α/2) Σ_{j=1}^{n} s_{i,j}c_{i,j} + β||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.16)

⇒ min_{s_i} (α/2) s_i^T b_i + (α/2) s_i^T c_i + β||s_i||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.17)

⇒ min_{s_i} (α/2) s_i^T (b_i + c_i) + β s_i^T s_i, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.18)

⇒ min_{s_i} (α/(2β)) s_i^T (b_i + c_i) + s_i^T s_i, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.19)

⇒ min_{s_i} s_i^T s_i + 2 s_i^T (α/(4β))(b_i + c_i) + ||(α/(4β))(b_i + c_i)||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.20)

(the added constant term ||(α/(4β))(b_i + c_i)||_2^2 does not change the minimizer)

⇒ min_{s_i} || s_i + (α/(4β))(b_i + c_i) ||_2^2, s.t. ∀i, s_{i,j} ≥ 0, s_i^T e = 1 (3.21)

That is, s_i is the Euclidean projection of −(α/(4β))(b_i + c_i) onto the probability simplex. According to the Karush-Kuhn-Tucker (KKT) conditions [137], the optimal solution is

s_{i,j} = max{ −(α/(4β))(b_{i,j} + c_{i,j}) + θ, 0 }, j = 1, …, n (3.22)

where the threshold θ is chosen so that Σ_j s_{i,j} = 1: sorting the values (α/(4β))(b_{i,j} + c_{i,j}) in ascending order as ω_1 ≤ … ≤ ω_n, ρ is the largest j such that −ω_j + (1/j)(1 + Σ_{c=1}^{j} ω_c) > 0, and θ = (1/ρ)(1 + Σ_{c=1}^{ρ} ω_c).

3) Update U while fixing S and F

While S and F are fixed, the objective function can be rewritten in a simplified form to optimize U:

min_U (1/2) Σ_{i=1}^{n} ||x_i − u_i||_2^2 + (α/2) Σ_{i,j=1}^{n} s_{i,j} f_{i,j}||u_i − u_j||_2^2 (3.23)

Let h_{i,j} = s_{i,j} f_{i,j}. Eq. (3.23) is equivalent to:

min_U (1/2)||X − U||_F^2 + (α/2) Σ_{i,j=1}^{n} h_{i,j}||u_i − u_j||_2^2 (3.24)

⇒ min_U (1/2)||X − U||_F^2 + (α/2) tr(U^T L U) (3.25)

where L is the Laplacian matrix of H = [h_{i,j}]_{n×n}.

⇒ min_U (1/2) tr((X − U)^T (X − U)) + (α/2) tr(U^T L U) (3.26)

⇒ min_U (1/2) tr((X^T − U^T)(X − U)) + (α/2) tr(U^T L U) (3.27)

⇒ min_U (1/2) tr(X^T X − 2U^T X + U^T U) + (α/2) tr(U^T L U) (3.28)

Setting the derivative of Eq. (3.28) with respect to U to zero, we get

(1/2)(−2X + 2U) + (α/2)(LU + L^T U) = 0 (3.29)

⇒ −X + U + (α/2)LU + (α/2)L^T U = 0 (3.30)

⇒ U + (α/2)LU + (α/2)L^T U = X (3.31)

⇒ (I + (α/2)L + (α/2)L^T)U = X (3.32)

⇒ (I + (α/2)(L + L^T))U = X (3.33)

⇒ (I + (α/2)(2L))U = X (since L is symmetric) (3.34)

⇒ (I + αL)U = X (3.35)

⇒ U = (I + αL)^{−1} X (3.36)
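The two closed-form updates above can be sketched in numpy (a minimal illustration, not the thesis implementation; symmetrizing H before forming its Laplacian is an assumption, and `project_simplex` implements the θ-thresholding of Eq. (3.22) as a standard Euclidean projection onto the probability simplex):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {s : s >= 0, sum(s) = 1} (the Eq. 3.22 thresholding)."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def update_u(X, S, F, alpha):
    """U update of Eq. (3.36): U = (I + alpha*L)^(-1) X with L the Laplacian of H."""
    H = S * F                                  # h_ij = s_ij * f_ij
    H = (H + H.T) / 2                          # symmetrize (assumption)
    L = np.diag(H.sum(axis=1)) - H             # Laplacian L = D - H
    return np.linalg.solve(np.eye(X.shape[0]) + alpha * L, X)
```

For the S update, each row is obtained as `project_simplex(-(alpha / (4 * beta)) * (b + c))`. Using `np.linalg.solve` avoids forming the explicit inverse in Eq. (3.36), which is the usual numerically preferable choice.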
3.5 Convergence Analysis

In this section, we prove the convergence of the proposed IS clustering algorithm, i.e., that the proposed algorithm reaches at least a locally optimal solution, via Theorem 1.

Theorem 1. IS clustering algorithm decreases the objective function value of Eq. (3.5) in each iteration until it converges.

Proof. Denote by F^(t), S^(t), and U^(t) the results of the t-th iteration for F, S, and U respectively, and denote the objective function value of Eq. (3.5) in the t-th iteration by L(F^(t), S^(t), U^(t)). According to Eq. (3.13) in Section 3.4, F has a closed-form solution, so we have the following inequality:

L(F^(t), S^(t), U^(t)) ≥ L(F^(t+1), S^(t), U^(t)) (3.37)

According to Eq. (3.22), S has a closed-form solution, so we have the following inequality:

L(F^(t+1), S^(t), U^(t)) ≥ L(F^(t+1), S^(t+1), U^(t)) (3.38)

According to Eq. (3.36), U has a closed-form solution, so we have the following inequality:

L(F^(t+1), S^(t+1), U^(t)) ≥ L(F^(t+1), S^(t+1), U^(t+1)) (3.39)

Finally, combining the above three inequalities, we get

L(F^(t), S^(t), U^(t)) ≥ L(F^(t+1), S^(t+1), U^(t+1)) (3.40)

Eq. (3.40) indicates that the objective function value of Eq. (3.5) decreases after each iteration of Algorithm 3.1; since the objective is bounded below by zero, the sequence of objective values converges. This concludes the proof of Theorem 1.

3.6 Experiments

In this section, we evaluate the performance of the proposed Initialization-Similarity (IS) algorithm by comparing it with two benchmark algorithms on ten real UCI data sets in terms of three evaluation metrics [138].
Table 3.3 Description of ten benchmark data sets
Datasets Samples Dimensions Classes
Digital 1797 64 10
MSRA 1799 256 12
Segment 2310 19 7
Solar 323 12 6
USPS 1854 256 10
USPST 2007 256 10
Waveform 5000 21 3
Wine 178 13 3
Wireless 2000 7 4
Yale 165 1024 15

3.6.1 Data Sets

We used ten UCI data sets in the experiments, including standard data sets for handwritten digit recognition, face data sets, a wine data set, etc. The details are listed below and summarized in Table 3.3.

• Digital is made up of 1797 images (8×8), each a hand-written digit.
• MSRA is a face image data set.
• Segment contains instances drawn randomly from a database of 7 outdoor images. It has 2310 instances and 19 continuous attributes describing the images, including saturation, hue, etc.
• Solar describes the main characteristics of solar flares.
• USPS is one of the standard handwritten digit recognition data sets. It contains images of the digits 0 to 9.
• USPST contains 2007 handwritten digit images.
• Waveform has 5000 instances and 3 classes of waves with 21 attributes.
• Wine contains the results of a chemical analysis of wines from three different cultivars, with 13 constituents measured for each of the three types of wine.
• Wireless contains 2000 instances of the signal strengths of seven Wi-Fi signals visible on a smartphone; the decision variable is one of four rooms.
• Yale contains 165 grayscale images (32×32) of 15 individuals; each subject has a different facial expression or configuration.

3.6.2 Comparison Algorithms

The two comparison algorithms are classical clustering algorithms; their details are summarized below.
• K-means clustering algorithm iteratively (re)assigns data points to their nearest cluster center and recalculates the cluster centers, with the goal of minimizing the sum of distances between data points and their cluster centers.

• Spectral clustering algorithm first forms the similarity matrix, then calculates the first K eigenvectors of its Laplacian matrix to define feature vectors; finally, it runs K-means clustering on these features to separate the objects into K classes. There are different ways to calculate the Laplacian matrix. Instead of the simple Laplacian, we used the normalized Laplacian L_sym = D^{−1/2} (D − S) D^{−1/2}, which performs better than the simple Laplacian [139].

For the above two algorithms, K-means clustering conducts clustering directly on the original data, while spectral clustering is a multi-stage strategy that constructs a graph first and then applies K-means clustering algorithm to partition the graph.

3.6.3 Experiment Setup

In the experiments, we first tested the robustness of the proposed IS clustering algorithm by comparing it with K-means clustering and spectral clustering algorithms on real data sets in terms of three evaluation metrics widely used in clustering research. Due to the sensitivity of K-means clustering to its initial cluster centers, we ran K-means clustering and spectral clustering algorithms 20 times and used the average value as the final result. Secondly, we investigated the parameter sensitivity of the proposed IS clustering algorithm (i.e. α and β in Eq. (3.5)) by varying their values and observing the variation in clustering performance. Thirdly, we demonstrated the convergence of Algorithm 3.1 for solving the proposed objective function Eq. (3.5) by checking the number of iterations at which Algorithm 3.1 converges.
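The normalized Laplacian used for the spectral clustering baseline can be sketched as follows (a minimal numpy illustration; the zero-degree handling for isolated vertices is an assumption):

```python
import numpy as np

def normalized_laplacian(S):
    """Symmetric normalized Laplacian L_sym = D^(-1/2) (D - S) D^(-1/2)."""
    S = (S + S.T) / 2                        # ensure the similarity matrix is symmetric
    d = S.sum(axis=1)                        # degrees
    with np.errstate(divide='ignore'):
        inv_sqrt = 1.0 / np.sqrt(d)
    inv_sqrt[~np.isfinite(inv_sqrt)] = 0.0   # isolated vertices get zero (assumption)
    D_inv = np.diag(inv_sqrt)
    return D_inv @ (np.diag(d) - S) @ D_inv
```

The spectral representation is then the matrix of eigenvectors of this L_sym associated with the K smallest eigenvalues (e.g. via `np.linalg.eigh`), on which K-means is run.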
3.6.4 Experimental Results Analysis

We list the clustering performance of all algorithms in Tables 3.4-3.6, which show that our IS clustering algorithm achieved the best performance on all ten data sets in terms of ACC and NMI, and outperformed K-means clustering algorithm on all ten data sets in terms of Purity. IS clustering algorithm outperformed spectral clustering algorithm on seven data sets in terms of Purity but performed slightly worse than spectral clustering algorithm on three data sets (USPS, USPST and Yale), where the difference in Purity between IS clustering algorithm and spectral clustering algorithm was only 1%. More specifically, IS clustering algorithm increased ACC by 6.3% compared to K-means clustering algorithm and 3.3% compared to spectral clustering algorithm; increased NMI by 4.6% compared to K-means clustering algorithm and 4.5% compared to spectral clustering algorithm; and increased Purity by 4.9% compared to K-means clustering algorithm and 2.9% compared to spectral clustering algorithm. Further observations are listed below.

First, both the one-step clustering algorithm (e.g. IS clustering algorithm) and the two-step clustering algorithm (e.g. spectral clustering algorithm) outperformed K-means clustering algorithm. This implies that constructing a graph or learning a new representation of the original data points improves clustering performance; that is, methods using a new representation can produce better clusterings than methods using the original data. The reason could be that original data generally contains some redundant information, which is almost always true for real data sets, and this redundancy corrupts the performance of clustering models.
In contrast, the two similarity matrix-based methods construct a new representation based on the original data to conduct clustering, which mitigates the effect of redundancy in the original data and thereby improves clustering performance.

Second, the one-step clustering algorithm (e.g. IS clustering algorithm) performed better than the two-step clustering algorithm (e.g. spectral clustering algorithm). Compared to the spectral clustering algorithm, which first uses the original data to construct the similarity matrix and then applies orthogonal decomposition to the similarity matrix to output the new representation, our method employs an adaptive learning strategy that dynamically updates the similarity matrix and the new representation in a unified framework. In this way, both the new representation and the similarity matrix of our method capture the intrinsic correlation of the data, so our method readily outputs better clustering results than classical spectral clustering methods. This suggests that when similarity matrix learning and new representation learning share the same goal, clustering results are optimal, whereas a two-step clustering algorithm with separate goals achieves sub-optimal results.

Table 3.4 ACC results of IS algorithm on ten benchmark data sets. The highest score for each data set is highlighted in bold font.
Datasets K-means Spectral IS
Digital 0.73 0.77 0.80
MSRA 0.49 0.50 0.57
Segment 0.55 0.56 0.63
Solar 0.50 0.51 0.55
USPS 0.62 0.67 0.70
USPST 0.66 0.70 0.71
Waveform 0.50 0.51 0.57
Wine 0.65 0.69 0.71
Wireless 0.94 0.96 0.97
Yale 0.39 0.45 0.46
Rank 3 2 1

Table 3.5 NMI results of IS algorithm on ten benchmark data sets. The highest score for each data set is highlighted in bold font.
Datasets K-means Spectral IS
Digital 0.73 0.72 0.78
MSRA 0.59 0.56 0.63
Segment 0.61 0.52 0.63
Solar 0.34 0.34 0.42
USPS 0.61 0.66 0.70
USPST 0.61 0.66 0.68
Waveform 0.36 0.37 0.40
Wine 0.43 0.42 0.43
Wireless 0.88 0.89 0.91
Yale 0.47 0.51 0.51
Rank 2 2 1

Table 3.6 Purity results of IS algorithm on ten benchmark data sets. The highest score for each data set is highlighted in bold font.
Datasets K-means Spectral IS
Digital 0.76 0.78 0.81
MSRA 0.53 0.53 0.58
Segment 0.58 0.58 0.64
Solar 0.55 0.55 0.61
USPS 0.69 0.75 0.74
USPST 0.71 0.77 0.76
Waveform 0.53 0.51 0.59
Wine 0.69 0.69 0.71
Wireless 0.94 0.96 0.97
Yale 0.41 0.47 0.46
Rank 3 2 1

Figure 3.1 ACC results of IS algorithm on ten benchmark data sets

Figure 3.2 NMI results of IS algorithm on ten benchmark data sets

Figure 3.3 Purity results of IS algorithm on ten benchmark data sets

3.6.5 Parameters' Sensitivity

We varied the parameters α and β in the range [10^−2, …, 10^2] and recorded the ACC, NMI and Purity values of the IS clustering algorithm on the ten data sets in Figures 3.4-3.6. First, different data sets needed different ranges of the parameters to achieve the best performance. For example, IS clustering algorithm achieved its best ACC (97%), NMI (91%) and Purity (97%) on the Wireless data set when both α and β were 10, but on the Digital data set it achieved its best ACC (80%), NMI (78%) and Purity (81%) when β = 100 and α = 0.1. This indicates that IS clustering algorithm is data-driven.
Second, the clustering ACC results changed by less than 3% on average when the parameter α varied in the range [10^−2, …, 10^2] on eight out of ten data sets. The smallest average change was 1% (the Wine and Wireless data sets) and the largest average change was 5% (e.g. the Waveform data set) as α varied over this range. This indicates that IS clustering algorithm is not very sensitive to the parameter α. Third, the c