Initialization-similarity clustering algorithm

The classic k-means clustering algorithm randomly selects centroids for initialization, which may produce unstable clustering results. Moreover, random initialization makes the clustering results hard to reproduce. The spectral clustering algorithm follows a two-step strategy: it first generates a similarity matrix and then conducts eigenvalue decomposition on the Laplacian matrix of the similarity matrix to obtain the spectral representation. However, the goal of the first step in the spectral clustering algorithm does not guarantee the best clustering result. To address these issues, this paper proposes an Initialization-Similarity (IS) algorithm which learns the similarity matrix and the new representation in a unified way and fixes the initialization using sum-of-norms regularization to make the clustering more robust. Experimental results on ten real-world benchmark datasets demonstrate that our IS clustering algorithm outperforms the comparison clustering algorithms in terms of three evaluation metrics for clustering, namely accuracy (ACC), normalized mutual information (NMI), and Purity.


Introduction
As an unsupervised learning technique, clustering is designed to divide all the samples into subsets with the goal of maximizing intra-subset similarity and inter-subset dissimilarity [32,50,58]. Clustering has been widely applied in biology, psychology, marketing, medicine, etc. [5,21,42,46].
Clustering algorithms can generally be classified into two categories, non-graph-based approaches [60] and graph-based approaches [44], based on whether the algorithm constructs a similarity matrix. A non-graph-based approach conducts clustering directly on the original data without constructing any graph, such as a similarity matrix, to measure the similarity among sample points. Examples of non-graph-based algorithms include the k-means clustering algorithm [30], locality-sensitive-hashing-based clustering [1], and mean shift [9]. A graph-based approach first constructs a graph and then applies a clustering algorithm to partition the graph; examples include the spectral clustering algorithm [35], the k+-isomorphism method [39], a graph clustering framework based on potential game optimization [7], bag of visual graphs [44], and low-rank kernel learning for graph-based clustering [22].
The k-means clustering algorithm is a benchmark and widely used non-graph-based clustering algorithm due to its simplicity and mathematical tractability [41,61]. Specifically, k-means first conducts initialization by randomly selecting k samples as the k centroids, and then assigns each sample to its nearest centroid according to a similarity measurement (e.g., Euclidean distance). After this, k-means updates the k centroids and reassigns each sample to a cluster, repeating until the algorithm converges [19].
The result of the k-means clustering algorithm depends on the initial guess of the centroids. Randomly choosing the cluster centroids may not lead to a fruitful result, and it also makes the results hard to reproduce. The result of k-means also depends on the similarity measure. Euclidean distance is often used in k-means to determine the similarity, i.e., to calculate the distance between samples. However, Euclidean distance weights the underlying factors unequally and does not account for factors such as cluster sizes, dependent features, or density [12,45]. The k-means clustering algorithm also performs poorly on indistinct or poorly separated datasets [12].
Many studies have addressed the initialization problem of the k-means clustering algorithm [11,14,25,27,33,42]. For example, Duan et al. proposed computing the density to select the initial centroids [14]. Lakshmi et al. proposed using nearest neighbors and feature means to decide the initial centroids [25]. Meanwhile, many studies have addressed the similarity problem of the k-means clustering algorithm [4,34,37,39,40,54]. The Cosine-Euclidean similarity matrix (CE) employs the cosine similarity of spectral information and the classical Euclidean distance to construct a similarity matrix [54]. Low-rank representation (LRR) identifies the lowest-rank representation among sample points to represent the data samples [29].
However, previous research has focused on solving one of these issues at a time, not on solving the initialization of clustering and the similarity measure in a unified framework. Fixing only one of the two issues does not guarantee the best performance. Solving the similarity and initialization issues of the k-means clustering algorithm simultaneously can be considered an improvement over existing algorithms because it could lead to better outputs. It is therefore significant that our proposed clustering algorithm solves the initialization and similarity issues simultaneously.
Our proposed Initialization-Similarity (IS) clustering algorithm aims to solve the above two issues in a unified way. Specifically, we fix the initialization of the clustering using sum-of-norms (SON) regularization [28]. Moreover, the SON regularization outputs a new representation of the original samples. Our proposed IS clustering algorithm then learns the similarity matrix based on the data distribution: the similarity is high if the distance between the new representations of two data points is small. Furthermore, the derived new representation is used to conduct k-means clustering. Finally, we employ an alternating strategy to solve the proposed objective function. Experimental results on real-world benchmark datasets demonstrate that our IS clustering algorithm outperforms the comparison clustering algorithms in terms of three evaluation metrics for clustering, namely accuracy (ACC), normalized mutual information (NMI), and Purity.
We briefly summarize the contributions of our proposed IS clustering algorithm as follows:
• The fixed initialization of our IS clustering algorithm using sum-of-norms regularization makes the clustering robust and reproducible. In contrast, previous clustering algorithms use randomly selected centroids to initialize k-means clustering and thus output unstable or varying clustering results [24].
• Previous spectral clustering algorithms use the spectral representation to replace the original representation when conducting k-means clustering. To do this, a spectral clustering algorithm first generates the similarity matrix and then conducts eigenvalue decomposition on the Laplacian matrix of the similarity matrix to obtain the spectral representation. This is a two-step strategy in which the goal of the first step does not guarantee the best clustering result. In contrast, our IS clustering algorithm learns the similarity matrix and the new representation simultaneously. The performance is more promising when the two steps are combined in a unified way.
• Our experiments on ten public datasets showed that our proposed IS clustering algorithm outperforms both the k-means and spectral clustering algorithms. This implies that simultaneously addressing the two issues of the k-means clustering algorithm is feasible and better suited to real data.
This section has laid out the background of our research inquiry. The remainder of the paper is organized as follows: Section 2 discusses existing relevant clustering algorithms. Section 3 introduces our IS clustering algorithm. Section 4 describes the experiments we conducted and presents their results. The conclusions, limitations, and future research directions are presented in Section 5.

Related work
In this section, we review the relevant clustering algorithms including non-graph-based algorithms and graph-based algorithms.

Non-graph-based algorithms
Non-graph-based algorithms conduct clustering directly on the original data. The k-means clustering algorithm is the most famous representative of non-graph-based algorithms. However, k-means is not suitable for a dataset with an unknown number of clusters, and it is sensitive to the initialization of the centroids [52]. Furthermore, choosing a distance measure is very challenging for the k-means clustering algorithm [45,59].
Other non-graph-based algorithms include distribution-based, hierarchy-based, and density-based algorithms. Popular distribution-based algorithms include the Gaussian mixture model (GMM) [38] and distribution-based clustering of large spatial databases (DBCLASD) [53]. Distribution-based algorithms assume that data generated from the same distribution belong to the same cluster; however, not every dataset is well described by a mixture of distributions, and the parameters have a strong impact on the clustering results [52]. Hierarchy-based algorithms include robust clustering using links (ROCK) [17] and clustering using representatives (CURE) [18]. Hierarchy-based algorithms build a hierarchical relationship among samples to conduct clustering, and they also need a predefined number of clusters. Density-based algorithms include mean shift [9] and ordering points to identify the clustering structure (OPTICS) [2]. Density-based algorithms are based on the assumption that samples in a high-density region belong to the same cluster. However, the results of density-based algorithms suffer when the density of the samples is uneven, and these algorithms are also sensitive to their parameters [52].

Graph-based algorithms
Instead of conducting clustering directly on the original samples, most graph-based clustering algorithms first construct a graph and then apply a clustering algorithm to partition it. A node of the graph represents a sample and an edge represents the relationship between samples. A graph representation captures high-order relationships among samples, which makes the complex relationships inherent in the samples easier to interpret than working from the original samples directly. The spectral clustering algorithm is a typical example of a graph-based algorithm. In the literature on graph-based algorithms, the Cosine-Euclidean algorithm employs the cosine similarity of spectral information and the classical Euclidean distance to construct a similarity matrix [54]. Under the assumption that pairwise similarity values between elements are normally distributed and that tight groups of highly similar elements likely belong to the same cluster, the cluster identification via connectivity kernels (CLICK) algorithm recursively partitions a weighted graph into components using minimum-cut computations [43]. Some graph-based algorithms construct a hypergraph to represent a set of spatial data [8,15], while others compare the coefficient vectors of two samples to analyze their similarity [51]. For example, low-rank representation (LRR) identifies the subspace structures of the samples and then finds the lowest-rank representation among the samples to represent them [29]. Least squares regression (LSR) exploits data correlation and encourages a grouping effect for subspace segmentation [31]. The smooth representation (SMR) model introduces enforced grouping effect conditions, which are explicitly enforced in the sample self-representation model [20]. Chameleon uses a graph partitioning algorithm to cluster the samples into several relatively small sub-clusters, and then finds the genuine clusters by repeatedly combining these sub-clusters [23].
Graph-based clustering algorithms improve on non-graph-based clustering algorithms in how they represent the original samples. However, current graph-based clustering algorithms use a two-stage strategy that learns the similarity matrix and the spectral representation separately. The first-stage goal of learning a similarity matrix does not always match the second-stage goal of achieving the optimal spectral representation, so graph-based algorithms are not guaranteed to always outperform non-graph-based ones. Moreover, most graph-based clustering algorithms still run a non-graph-based clustering algorithm in the final stage, and thus do not solve the initialization issue of non-graph-based clustering algorithms.
3 Proposed algorithm

Symbols
Given a data matrix X = {x_1; x_2; …; x_n} ∈ ℝ^{n×d}, where n and d are the numbers of samples and features, respectively, we denote matrices by boldface uppercase letters, vectors by boldface lowercase letters, and scalars by italic letters, and we summarize the symbols used in this paper in Table 1.

K-means clustering algorithm
The k-means clustering algorithm is one of the most famous non-graph-based algorithms due to its simplicity. It aims at minimizing the total intra-cluster variance, represented by an objective function known as the squared error function:

J = Σ_{j=1}^{k} Σ_{i=1}^{C_j} ‖x_i − h_j‖²  (1)
where C_j is the number of sample points in the j-th cluster, k is the number of clusters, and h_j is the j-th centroid; ‖x_i − h_j‖ is the Euclidean distance between x_i and h_j. The k-means clustering algorithm can be reformulated as a nonnegative matrix factorization as follows [48]:

min_{G,H} ‖X − GH‖²_F, s.t. G ∈ {0, 1}^{n×k} is a cluster indicator matrix with exactly one nonzero element per row  (2)

Based on both Eq. (1) and Eq. (2), it is obvious that different initialization methods may lead to different clustering results [36,55]. This implies that it is difficult to reproduce the clustering results. Moreover, Eq. (2) also shows that the outcome of the clustering objective function depends only on the Euclidean distance between each sample and its centroid, while Euclidean distance does not reveal other underlying factors such as cluster sizes, shape, dependent features, or density [12,45]. Thus the similarity measurement is an issue of the k-means algorithm. The pseudo code for the k-means clustering algorithm is shown in Table 2.
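As a concrete illustration of the procedure summarized in Table 2, the following is a minimal NumPy sketch of k-means with random centroid initialization; the function name and defaults are ours, not part of the paper's specification:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means sketch: random initial centroids, then alternate
    nearest-centroid assignment and centroid recomputation."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    H = X[rng.choice(len(X), size=k, replace=False)].copy()  # random init
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # assign each sample to its nearest centroid (Euclidean distance)
        d2 = ((X[:, None, :] - H[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # cluster indicators unchanged
            break
        labels = new_labels
        # recompute each centroid as the mean of its assigned samples
        for j in range(k):
            if (labels == j).any():
                H[j] = X[labels == j].mean(axis=0)
    return labels, H
```

Because the initial centroids are drawn at random, different seeds can give different partitions, which is exactly the reproducibility issue discussed above.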
To address the similarity measurement issue of the k-means algorithm, the spectral clustering algorithm uses the spectral representation to replace the original representation. To achieve this, spectral clustering first builds a similarity matrix and then conducts eigenvalue decomposition on its Laplacian matrix to obtain the reduced spectral representation. The pseudo code for the spectral clustering algorithm is shown in Table 3.
By replacing the original representation with the spectral representation, spectral clustering deals with the similarity measurement issue of the k-means clustering algorithm. However, spectral clustering learns the similarity matrix and the spectral representation separately, which is known as a two-stage strategy: the goal of constructing the similarity matrix in the first stage does not aim at achieving the optimal spectral representation, so spectral clustering is not guaranteed to always outperform the k-means clustering algorithm.
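The two-stage strategy just described can be sketched as follows. This is an illustrative implementation only: the Gaussian similarity, the unnormalized Laplacian, and the deterministic k-means initialization are our own choices for reproducibility, not the paper's exact setup:

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0):
    """Two-stage spectral clustering sketch: (1) build a Gaussian
    similarity matrix and take the eigenvectors of the k smallest
    Laplacian eigenvalues as the spectral representation, (2) run
    a simple k-means on that representation."""
    # stage 1: similarity matrix and its (unnormalized) Laplacian
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * sigma ** 2))
    L = np.diag(S.sum(axis=1)) - S
    _, vecs = np.linalg.eigh(L)          # eigh sorts eigenvalues ascending
    U = vecs[:, :k]                      # spectral representation
    # stage 2: k-means on U (deterministic spread-out init for the sketch)
    H = U[np.linspace(0, len(U) - 1, k).astype(int)]
    for _ in range(100):
        labels = ((U[:, None] - H[None]) ** 2).sum(-1).argmin(1)
        H = np.array([U[labels == j].mean(0) if (labels == j).any() else H[j]
                      for j in range(k)])
    return labels
```

Note that stage 1 never looks at the clustering objective of stage 2, which is precisely the mismatch the IS algorithm is designed to remove.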

Initialization-similarity clustering algorithm
This paper proposes a new clustering algorithm, Initialization-Similarity (IS), to simultaneously solve the initialization issue of the k-means clustering algorithm and the similarity issue of the spectral clustering algorithm in a unified framework. Specifically, the IS clustering algorithm uses sum-of-norms regularization to address the initialization issue, and jointly learns the similarity matrix and the spectral representation to overcome the two-stage strategy of the spectral clustering algorithm. To achieve this, we form the objective function of the IS clustering algorithm as follows:

min_{U,S} Σ_{i} ‖x_i − u_i‖₂² + α Σ_{i,j} s_{i,j} ρ(‖u_i − u_j‖₂) + β‖S‖²_F, s.t. ∀i, s_iᵀ1 = 1, s_{i,j} ≥ 0  (3)

where S ∈ ℝ^{n×n} is the similarity matrix that measures the similarity among data points, and ρ(·) is an implicit function, known as a robust loss function, which is used in robust statistics to limit the effect of noise and to automatically generate the number of clusters. Eq. (3) learns the new representation U and fixes the initialization of the clustering. Moreover, Eq. (3) learns U while taking the similarity among sample points into account: the higher the similarity s_{i,j} between two samples, the smaller the distance between their corresponding new representations u_i and u_j. Furthermore, we learn the similarity matrix S based on the sample distribution, i.e., S is iteratively updated from the updated U. This makes the new representation reasonable.

Table 2 The pseudo code for the k-means clustering algorithm [19]
Input: X (data matrix), k (the number of clusters)
Output: k centroids and the cluster indicator of each data point
Initialization: randomly select k samples as the initial centroids h_1, h_2, …, h_k;
Repeat:
1. Assign each sample x_i to the nearest cluster j using Euclidean distance;
2. Recalculate the new centroids h_1, h_2, …, h_k;
Until convergence (the cluster indicator of each data point is unchanged).
A number of robust loss functions have been proposed in robust statistics to limit the influence of noise and outliers [3,56]. In this paper, we employ the Geman-McClure function [16]:

ρ(‖u_p − u_q‖₂) = μ‖u_p − u_q‖₂² / (μ + ‖u_p − u_q‖₂²)  (4)

where μ is a scale parameter. This function is often used to measure how well a model predicts the expected outcome. The closer two representations are, the smaller ‖u_p − u_q‖₂ is and the higher the similarity s_{p,q} is. As the other parameters in Eq. (3) are updated, the distance ‖u_p − u_q‖₂ for some pairs p, q becomes very small, or even u_p = u_q. In that case, the number of distinct representations, and hence the number of clusters, falls below n. In this way, the clusters are determined.
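Assuming the standard Geman-McClure form ρ(y) = μy²/(μ + y²) (our reading of Eq. (4)), a few lines of NumPy show its key property: it behaves like the squared loss for small residuals but saturates at μ for large ones, so outlying pairs cannot dominate the objective:

```python
import numpy as np

def geman_mcclure(y, mu=1.0):
    """Geman-McClure robust loss: ~y^2 near zero, saturates at mu."""
    return mu * y ** 2 / (mu + y ** 2)

# near zero it tracks the squared loss; for large residuals it saturates
small, large = geman_mcclure(0.1), geman_mcclure(100.0)
assert abs(small - 0.1 ** 2) < 1e-3   # approximately quadratic near the origin
assert abs(large - 1.0) < 1e-3        # bounded by mu = 1
```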
In robust statistics, directly optimizing a robust loss function is usually difficult or inefficient. To address this, it is common to introduce an auxiliary variable f_{i,j} and a penalty term φ(f_{i,j}) [6,26,57], so that Eq. (3) is equivalent to:

min_{U,S,F} Σ_{i} ‖x_i − u_i‖₂² + α Σ_{i,j} s_{i,j} (f_{i,j}‖u_i − u_j‖₂² + φ(f_{i,j})) + β‖S‖²_F, s.t. ∀i, s_iᵀ1 = 1, s_{i,j} ≥ 0  (5)

where φ(f_{i,j}) = μ(√f_{i,j} − 1)², i, j = 1…n. The pseudo code for the IS clustering algorithm is given in Algorithm 1.
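The equivalence behind this lifting can be checked numerically. The sketch below assumes the penalty φ(f) = μ(√f − 1)² stated above and verifies that minimizing f·d² + φ(f) over the auxiliary variable f recovers the Geman-McClure value μd²/(μ + d²), with minimizer f* = (μ/(μ + d²))²:

```python
import numpy as np

mu, d = 1.0, 2.0   # scale parameter and a sample pairwise distance

# lifted objective in the auxiliary variable f (the inner term of Eq. (5))
def lifted(f):
    return f * d ** 2 + mu * (np.sqrt(f) - 1.0) ** 2

# closed-form minimiser obtained by setting the derivative to zero
f_star = (mu / (mu + d ** 2)) ** 2

# a dense grid search agrees with the closed form ...
fs = np.linspace(1e-6, 1.0, 200001)
f_grid = fs[np.argmin(lifted(fs))]
assert abs(f_grid - f_star) < 1e-3

# ... and the minimum value equals the Geman-McClure loss mu*d^2/(mu+d^2)
assert abs(lifted(f_star) - mu * d ** 2 / (mu + d ** 2)) < 1e-9
```

This is why the lifted problem in F, S, U can replace the robust objective: minimizing out F restores the original loss exactly.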

Optimization
Eq. (5) is not jointly convex in F, U, and S, but it is convex in each variable when the others are fixed. To solve Eq. (5), we apply an alternating optimization strategy: we optimize each variable while fixing the others until the algorithm converges. The pseudo code of our IS clustering algorithm is given in Algorithm 1.
(i) Update F while fixing S and U. With S and U fixed, the objective function reduces to:

min_F Σ_{i,j} s_{i,j} (f_{i,j}‖u_i − u_j‖₂² + φ(f_{i,j}))  (6)

Since the optimization of f_{i,j} is independent of the optimization of every other f_{p,q} (i ≠ p, j ≠ q), each f_{i,j} is optimized separately:

min_{f_{i,j}} f_{i,j}‖u_i − u_j‖₂² + μ(√f_{i,j} − 1)²  (7)

Setting the derivative of Eq. (7) with respect to f_{i,j} to zero, we get:

f_{i,j} = (μ / (μ + ‖u_i − u_j‖₂²))²  (8)

(ii) Update S while fixing U and F. With U and F fixed, the objective function Eq. (5) with respect to S is:

min_S α Σ_{i,j} s_{i,j} d_{i,j} + β‖S‖²_F, s.t. ∀i, s_iᵀ1 = 1, s_{i,j} ≥ 0, where d_{i,j} = f_{i,j}‖u_i − u_j‖₂² + φ(f_{i,j})  (9)

Since the optimization of s_i is independent of the optimization of every other s_j (i ≠ j, i, j = 1, …, n), each row s_i is optimized separately:

min_{s_i} α s_iᵀd_i + β‖s_i‖₂², s.t. s_iᵀ1 = 1, s_{i,j} ≥ 0  (10)

By completing the square, Eq. (10) is equivalent to:

min_{s_i} ‖s_i + (α/2β) d_i‖₂², s.t. s_iᵀ1 = 1, s_{i,j} ≥ 0  (11)

According to the Karush-Kuhn-Tucker (KKT) conditions [47], the optimal solution s_i is:

s_i = (−(α/2β) d_i + η1)₊  (12)

where η is the Lagrange multiplier chosen so that s_iᵀ1 = 1 and (·)₊ = max(·, 0), i.e., the Euclidean projection of −(α/2β)d_i onto the probability simplex.

(iii) Update U while fixing S and F. With S and F fixed, the objective function can be rewritten as:

min_U Σ_{i} ‖x_i − u_i‖₂² + α Σ_{i,j} s_{i,j} f_{i,j}‖u_i − u_j‖₂²  (13)

Let h_{i,j} = s_{i,j} f_{i,j}. Eq. (13) is equivalent to:

min_U ‖X − U‖²_F + α Σ_{i,j} h_{i,j}‖u_i − u_j‖₂²  (14)

After taking the derivative of Eq. (14) with respect to U and setting it to zero, we get:

2(U − X) + 4αL_h U = 0  (15)

where L_h is the Laplacian matrix of the symmetrized weight matrix (H + Hᵀ)/2. Eq. (15) is solved to find U:

U = (I + 2αL_h)⁻¹ X  (16)
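As an illustration of the alternating scheme, here is a hypothetical single pass implementing the F update and the U update under the reconstructed formulas above (closed-form F, then a linear system for U). The S update, a simplex projection, is omitted and S is held fixed; the interface and parameter defaults are our own:

```python
import numpy as np

def is_clustering_step(X, U, S, mu=1.0, alpha=1.0):
    """One hypothetical alternating pass for the IS objective:
    (i) closed-form F update f_ij = (mu/(mu+||u_i-u_j||^2))^2, then
    (iii) U update by solving (I + 2*alpha*L_h) U = X, where L_h is
    the Laplacian of the symmetrised weights h_ij = s_ij * f_ij.
    The S update (a KKT/simplex projection) is omitted here."""
    n = len(X)
    d2 = ((U[:, None] - U[None]) ** 2).sum(-1)   # pairwise squared distances
    F = (mu / (mu + d2)) ** 2                    # F update, Eq. (8)
    H = 0.5 * ((S * F) + (S * F).T)              # symmetrised weight matrix
    L_h = np.diag(H.sum(axis=1)) - H             # graph Laplacian of H
    U_new = np.linalg.solve(np.eye(n) + 2 * alpha * L_h, X)
    return F, U_new
```

Each pass pulls the representations of highly similar samples toward each other while keeping every u_i anchored near its x_i, which is how repeated passes collapse samples of one cluster onto a shared representation.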

Convergence analysis
In this section, we prove the convergence of our proposed IS clustering algorithm, i.e., that it reaches at least a locally optimal solution. To this end, we use Theorem 1.
Theorem 1 IS clustering algorithm decreases the objective function value of Eq. ( 5) until it converges.
Proof. Denote by F⁽ᵗ⁾, S⁽ᵗ⁾, and U⁽ᵗ⁾ the results of the t-th iteration for F, S, and U, and denote the objective function value of Eq. (5) at iteration t by J(F⁽ᵗ⁾, S⁽ᵗ⁾, U⁽ᵗ⁾). According to Eq. (8) in Section 3.4, F has a closed-form solution, so we have:

J(F⁽ᵗ⁺¹⁾, S⁽ᵗ⁾, U⁽ᵗ⁾) ≤ J(F⁽ᵗ⁾, S⁽ᵗ⁾, U⁽ᵗ⁾)

According to Eq. (12) in Section 3.4, S has a closed-form solution, so we have:

J(F⁽ᵗ⁺¹⁾, S⁽ᵗ⁺¹⁾, U⁽ᵗ⁾) ≤ J(F⁽ᵗ⁺¹⁾, S⁽ᵗ⁾, U⁽ᵗ⁾)

According to Eq. (16) in Section 3.4, U has a closed-form solution, so we have:

J(F⁽ᵗ⁺¹⁾, S⁽ᵗ⁺¹⁾, U⁽ᵗ⁺¹⁾) ≤ J(F⁽ᵗ⁺¹⁾, S⁽ᵗ⁺¹⁾, U⁽ᵗ⁾)

Finally, combining the three inequalities above, we get:

J(F⁽ᵗ⁺¹⁾, S⁽ᵗ⁺¹⁾, U⁽ᵗ⁺¹⁾) ≤ J(F⁽ᵗ⁾, S⁽ᵗ⁾, U⁽ᵗ⁾)  (19)

Eq. (19) indicates that the objective function value of Eq. (5) decreases after each iteration of Algorithm 1. This concludes the proof of Theorem 1.

Experiments
In this section, we evaluate the performance of our proposed Initialization-Similarity (IS) algorithm by comparing it with two benchmark algorithms on ten real UCI datasets in terms of three evaluation metrics.

Experiment setting
Dataset We used ten UCI datasets in our experiments, including standard datasets for handwritten digit recognition, face datasets, and others. We summarize them in Table 4.
Comparison algorithms The two comparison algorithms are classical clustering algorithms; their details are summarized below.
• The k-means clustering algorithm iteratively (re)assigns samples to their nearest centroid and recalculates the centroids, with the goal of minimizing the sum of distances between samples and their centroids.
• The spectral clustering algorithm first forms the similarity matrix, and then computes the first k eigenvectors of its Laplacian matrix to define the feature vectors. Finally, it runs k-means clustering on these features to separate the objects into k classes. There are different ways to calculate the Laplacian matrix. Instead of the simple Laplacian L = D − S, we used the symmetric normalized Laplacian L_sym = D^{−1/2}(D − S)D^{−1/2}, which performs better than the simple Laplacian [10].
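The symmetric normalized Laplacian used for the spectral baseline can be computed as in this small helper of ours (assuming the standard normalization; a well-known property, that its eigenvalues lie in [0, 2] with a zero for each connected component, makes a convenient sanity check):

```python
import numpy as np

def normalized_laplacian(S):
    """Symmetric normalized Laplacian L_sym = D^(-1/2) (D - S) D^(-1/2),
    where D is the degree matrix of the similarity matrix S."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ (np.diag(d) - S) @ D_inv_sqrt

# eigenvalues of L_sym lie in [0, 2], with a 0 per connected component
S = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
vals = np.linalg.eigvalsh(normalized_laplacian(S))
assert abs(vals[0]) < 1e-9 and vals[-1] <= 2.0 + 1e-9
```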
For the above two algorithms, k-means clustering conducts clustering directly on the original data while spectral clustering is a two-stage based strategy, which constructs a graph first and then applies k-means clustering algorithm to partition the graph.
Experiment set-up In our experiments, firstly, we tested the robustness of our proposed IS clustering algorithm by comparing it with the k-means and spectral clustering algorithms on real datasets in terms of three evaluation metrics widely used in clustering research. Due to the sensitivity of k-means clustering to its initial centroids, we ran the k-means and spectral clustering algorithms 20 times and reported the average value as the final result. Secondly, we investigated the parameter sensitivity of our proposed IS clustering algorithm (i.e., α and β in Eq. (5)) by varying their values and observing the variation in clustering performance. Thirdly, we demonstrated the convergence of Algorithm 1 for solving the proposed objective function Eq. (5) by checking the number of iterations needed for convergence.
Evaluation measures To compare our IS clustering algorithm with the related algorithms, we adopted three popular evaluation metrics for clustering algorithms: accuracy (ACC), normalized mutual information (NMI), and Purity [49]. ACC measures the percentage of samples clustered correctly. NMI measures the similarity between two partitions. Purity measures the percentage of each cluster consisting of correctly clustered samples [13,61]. The definitions of these three evaluation metrics are given below.
ACC = N_correct / N, where N_correct is the number of correctly clustered samples and N is the total number of samples.
NMI(A, B) = MI(A, B) / √(H(A)·H(B)), where A and B are two partitions of the n samples into C_A and C_B clusters, respectively, MI(A, B) is the mutual information between the two partitions, and H(·) denotes entropy.
Purity = Σ_{i=1}^{k} (S_i / n) · P_i, where k is the number of clusters, n is the total number of samples, S_i is the number of samples in the i-th cluster, and P_i is the fraction of samples in the i-th cluster that belong to its majority class.
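These metrics can be implemented directly. The sketch below is our own minimal version: ACC is computed by a brute-force search over mappings from cluster ids to class ids, which is only practical for small k (in practice the Hungarian method, e.g. `scipy.optimize.linear_sum_assignment`, is used instead):

```python
import numpy as np
from itertools import permutations
from collections import Counter

def purity(labels_pred, labels_true):
    """Sum over clusters of the majority-class count, divided by n."""
    n = len(labels_true)
    total = 0
    for c in set(labels_pred):
        members = [labels_true[i] for i in range(n) if labels_pred[i] == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n

def acc(labels_pred, labels_true, k):
    """Clustering accuracy: best agreement over all mappings from
    predicted cluster ids to class ids (brute force, small k only)."""
    best = 0.0
    for perm in permutations(range(k)):
        mapped = [perm[c] for c in labels_pred]
        best = max(best, float(np.mean(np.array(mapped) == np.array(labels_true))))
    return best

pred = [0, 0, 1, 1, 1]
true = [1, 1, 0, 0, 1]
assert acc(pred, true, 2) == 0.8
assert purity(pred, true) == 0.8
```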

Experimental results
We list the clustering performance of all algorithms in Table 5. First, the one-step clustering algorithm (our IS algorithm) performed better than the two-step clustering algorithm (the spectral clustering algorithm). The reason could be that in our algorithm the similarity matrix and the new representation are learned toward the optimal clustering result, whereas the two-step clustering algorithm achieves sub-optimal results.
Second, both the one-step clustering algorithm (our IS clustering algorithm) and the two-step clustering algorithm (spectral clustering) outperformed the k-means clustering algorithm. This implies that constructing a graph or learning a new representation of the original samples improves clustering performance. Third, different datasets needed different parameter ranges to achieve the best performance. For example, the IS clustering algorithm achieved its best ACC (97%), NMI (91%) and Purity (97%) on the Wireless dataset when both α and β were 10, but on the Digital dataset it achieved its best ACC (80%), NMI (78%) and Purity (81%) when β = 100 and α = 0.1. This indicates that our IS clustering algorithm is data-driven. Fourth, although our IS clustering algorithm was not very sensitive to the parameters α and β, it was slightly more sensitive to α than to β.
Convergence Figure 4 shows the trend of the objective values generated by Algorithm 1 with respect to the number of iterations. From Fig. 4, we can see that Algorithm 1 monotonically decreased the objective function value of Eq. (5) until it converged. It is worth noting that the convergence rate of Algorithm 1 was relatively fast: it converged within 20 iterations on all the datasets used.

Table 3 The pseudo code for the simple spectral clustering algorithm
Input: X ∈ ℝ^{n×d} (data matrix), k (the number of clusters)
Output: k centroids and the cluster indicator of each data point
• Compute S ∈ ℝ^{n×n} to measure the similarity between every pair of data points;
• Compute L = D − S, where D = [d_{ij}]_{n×n} is the diagonal matrix with d_{ii} = Σ_{j=1}^{n} s_{ij};
• Generate the spectral representation using the eigenvectors and eigenvalues of L;
• Conduct k-means clustering on the spectral representation.
Multimedia Tools and Applications (2019) 78:33279-33296

Parameters' sensitivity
We varied the parameters α and β in the range [10⁻², 10²], and recorded the ACC, NMI and Purity values of the clustering results on the ten datasets for our IS clustering algorithm in Figs. 1, 2 and 3.

Fig. 1 ACC of our IS clustering algorithm with respect to different parameter settings

Fig. 2 NMI of our IS clustering algorithm with respect to different parameter settings

Fig. 3 Purity of our IS clustering algorithm with respect to different parameter settings

Table 1 Description of symbols used in this paper
X  The data matrix
x_i  The i-th row of X
x_{i,j}  The element in the i-th row and j-th column of X
‖X‖₂  The ℓ₂ norm of X
‖X‖_F  The Frobenius norm of X
Xᵀ  The transpose of X

Table 4 Description of ten benchmark datasets

Table 5 shows that our IS clustering algorithm achieved the best performance on all ten datasets in terms of ACC and NMI, and outperformed the k-means clustering algorithm on all ten datasets in terms of Purity. Our IS clustering algorithm also outperformed the spectral clustering algorithm in terms of Purity on seven of the ten datasets, performing slightly worse on the remaining three (USPT, USPST and Yale), where the difference in Purity between the two algorithms was only 1%. More specifically, our IS clustering algorithm increased ACC by 6.3% compared to the k-means clustering algorithm and by 3.3% compared to the spectral clustering algorithm; it increased NMI by 4.6% and 4.5%, respectively, and Purity by 4.9% and 2.9%, respectively. Other observations are listed in the following sections.

Table 5 Performance of all algorithms on ten benchmark datasets