Non-Hierarchical Cluster Analysis
Non-hierarchical cluster analysis (often known as the K-means clustering method) forms a grouping of a set of units into a pre-determined number of groups, using an iterative algorithm that optimizes a chosen criterion. Starting from an initial classification, units are transferred from one group to another, or swapped with units from other groups, until no further improvement can be made to the criterion value. There is no guarantee that the solution thus obtained will be globally optimal; by starting from a different initial classification it is sometimes possible to obtain a better classification.
However, starting from a good initial classification greatly increases the chances of producing an optimal or near-optimal solution. (source: http://www.asreml.com/products/genstat/mva/NonHierarchicalClusterAnalysis.htm)

The algorithm is called k-means, where k is the number of clusters you want, since a case is assigned to the cluster for which its distance to the cluster mean is smallest. The action in the algorithm centers on finding the k means. You start out with an initial set of means and classify cases based on their distances to the centers.
Next, you compute the cluster means again, using the cases that are assigned to each cluster; then you reclassify all cases based on the new set of means. You keep repeating this step until the cluster means don’t change much between successive steps. Finally, you calculate the means of the clusters once again and assign the cases to their permanent clusters. (source: http://www.norusis.com/pdf/SPC_v13.pdf)

Steps in Non-Hierarchical Cluster Analysis
In this method, the desired number of clusters is specified in advance and the ’best’ solution is chosen.
The steps in such a method are as follows:
1. Choose initial cluster centers (essentially, a set of observations that are far apart; each subject forms a cluster of one, and its center is the value of the variables for that subject).
2. Assign each subject to its “nearest” cluster, defined in terms of the distance to the centroid.
3. Find the centroids of the clusters that have been formed.
4. Re-calculate the distance from each subject to each centroid, and move observations that are not in the cluster to which they are closest.
5. Continue until the centroids remain relatively stable.
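The steps above can be sketched in a few lines of Python with numpy. This is a minimal illustration (not an optimized implementation), using the far-apart-observations heuristic from step 1 and Euclidean distance throughout; the function name is hypothetical.

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    """Minimal k-means following the steps above.

    Assumes no cluster goes empty during the iterations (fine for a sketch).
    """
    # Step 1: choose k far-apart observations as the initial centers
    # (first observation, then repeatedly the point farthest from all centers).
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dists.argmax()])
    centers = np.array(centers)

    for _ in range(max_iter):
        # Step 2: assign each subject to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute the centroid of each cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 4-5: reassign and stop once the centroids are stable.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

On two well-separated groups of points, the loop typically converges in two or three passes, which mirrors the behavior reported for small data sets later in this section.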
Non-hierarchical cluster analysis tends to be used when large data sets are involved. It is sometimes preferred because it allows subjects to move from one cluster to another (this is not possible in hierarchical cluster analysis where a subject, once assigned, cannot move to a different cluster). Two disadvantages of non-hierarchical cluster analysis are: (1) it is often difficult to know how many clusters you are likely to have and therefore the analysis may have to be repeated several times and (2) it can be very sensitive to the choice of initial cluster centers.
Again, it may be worth trying different ones to see what impact this has. One possible strategy is to use a hierarchical approach initially to determine how many clusters there are in the data, and then to use the cluster centers obtained from it as initial cluster centers in the non-hierarchical method. (source: http://mlsc.lboro.ac.uk/resources/statistics/Clusteranalysis.pdf)

Determining the Number of Clusters in a Data Set
Determining the number of clusters in a data set, a quantity often labeled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.
The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster.
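This trade-off can be explored with the common "elbow" heuristic: compute the total within-cluster error for several values of k and look for the point where further increases stop paying off. A sketch with scikit-learn on hypothetical data (three artificial blobs):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated blobs in two dimensions.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

# Total within-cluster sum of squares (sklearn's inertia_) for k = 1..6.
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)}

# The error always shrinks as k grows; the "elbow" is where the marginal
# gain drops off sharply -- here after k = 3, the true number of blobs.
```

Plotting `wss` against k makes the elbow visible; more formal alternatives (silhouette scores, gap statistics) exist, but the idea is the same balance described above.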
If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen using other multivariate analysis techniques. (source: http://en.wikipedia.org/wiki/K-means_clustering)

Features of Non-Hierarchical Cluster Analysis
Here are some features that distinguish non-hierarchical from hierarchical cluster analysis:
1. At times, there is an interpretive advantage to non-hierarchical clusters. For example, assume that if the data are divided into three clusters, units A and B will be in the same cluster.
It may often make sense that if the data are divided into, say, two clusters, A and B will be in different clusters. This result is impossible with a hierarchical method.
2. With hierarchical methods, execution time and necessary storage generally increase with the square of the number of objects being clustered. With k-means, execution time grows linearly with the product of the number of units, the number of variables, the maximum number of clusters desired (usually much less than the number of units), and the (unpredictable) number of iterations.
Storage increases linearly with the product of the number of objects and the number of variables. Thus, it is possible to use k-means to cluster much larger numbers of objects.
3. K-means provides a method whereby the degree of clustering in the data can be evaluated. It is a property of most cluster analysis techniques (including k-means) that the desired number of clusters will be formed whether or not the data are clustered in any practical sense. K-means allows one to compare the degree of clustering observed in the actual data with the clustering observed in comparable randomized data. However, in distinction to the hierarchical methods, which are guaranteed to find the best solution given the clustering criteria, k-means does not necessarily find the optimal solution for a given level of clustering. (It is like multidimensional scaling in this respect.) (source: http://pdf-world.net/pdf/110374/KMEANS-Nonhierarchical-Cluster-Analysis-pdf.php)

Summary
• Non-hierarchical or K-means clustering is the partitioning of data into a user-specified number of clusters, followed by iteratively re-assigning observations to clusters until some numerical criterion is met. It is applicable to large numbers of cases, which would be time-consuming under the tree-like construction process of the hierarchical method.
• This method is very dependent on the ability of the researcher to select the centroids in a practical, objective, or theoretical manner.

Sample Problem (from IBM SPSS Statistics 19 Statistical Procedures Companion by Marija Norusis, using SPSS)

Roman Pottery: The Example
When you are studying old objects in hopes of determining their historical origins, you look for similarities and differences between the objects in as many dimensions as possible.
If the objects are intact, you can compare styles, colors, and other easily visible characteristics. You can also use sophisticated devices such as high-energy beam lines or spectrophotometers to study the chemical composition of the materials from which they are made. Hand (1994) reports data on the percentage of oxides of five metals for 26 samples of Romano-British pottery described by Tubb et al. (1980). The five metals are aluminum (Al), iron (Fe), magnesium (Mg), calcium (Ca), and sodium (Na).
In this section, you’ll study whether the samples form distinct clusters and whether these clusters correspond to areas where the pottery was excavated.

Before You Start
Whenever you use a statistical procedure that calculates distances, you have to worry about the impact of the different units in which variables are measured. Variables that have large values will have a large impact on the distance compared to variables that have smaller values. In this example, the average percentages of the oxides differ quite a bit, so it’s a good idea to standardize the variables to a mean of 0 and a standard deviation of 1. (Standardized variables are used in the example.) You also have to specify the number of clusters (k) that you want produced.

Tip: If you have a large data file, you can take a random sample of the data and try to determine a good number, or range of numbers, for a cluster solution based on the hierarchical clustering procedure. You can also use hierarchical cluster analysis to estimate starting values for the k-means algorithm.

Initial Cluster Centers
The first step in k-means clustering is finding the k centers. This is done iteratively.
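The tip above — running a hierarchical pass first and feeding its group centroids to k-means as starting centers — can be sketched with scipy and scikit-learn. The data array here is a hypothetical stand-in, not the pottery measurements:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical standardized data matrix (rows = cases, columns = variables).
X = rng.normal(size=(40, 5))

# Ward's hierarchical clustering gives a provisional 3-group solution...
groups = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")

# ...and the group centroids become the starting centers for k-means.
centers = np.array([X[groups == g].mean(axis=0) for g in (1, 2, 3)])
km = KMeans(n_clusters=3, init=centers, n_init=1).fit(X)
```

Seeding with hierarchical centroids tends to give better starting points than arbitrary cases, which matters because, as noted earlier, k-means can be sensitive to the initial centers.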
You start with an initial set of centers and then modify them until the change between two iterations is small enough. If you have good guesses for the centers, you can use those as initial starting points; otherwise, you can let SPSS find k cases that are well-separated and use these values as initial cluster centers. Figure 16-8 shows the initial centers for the pottery example. [pic]

Warning: K-means clustering is very sensitive to outliers, since they will usually be selected as initial cluster centers. This will result in outliers forming clusters with small numbers of cases.
Before you start a cluster analysis, screen the data for outliers and remove them from the initial analysis. The solution may also depend on the order of the cases in the file. After the initial cluster centers have been selected, each case is assigned to the closest cluster, based on its distance from the cluster centers. After all of the cases have been assigned to clusters, the cluster centers are recomputed, based on all of the cases in the cluster. Case assignment is done again, using these updated cluster centers.
You keep assigning cases and recomputing the cluster centers until no cluster center changes appreciably or the maximum number of iterations (10 by default) is reached. From Figure 16-9, you see that three iterations were enough for the pottery data. [pic]

Tip: You can update the cluster centers after each case is classified, instead of after all cases are classified, if you select the Use Running Means check box in the Iterate dialog box.

Final Cluster Centers
After iteration stops, all cases are assigned to clusters, based on the last set of cluster centers.
After all of the cases are clustered, the cluster centers are computed one last time. Using the final cluster centers, you can describe the clusters. In Figure 16-10, you see that cluster 1 has an average sodium percentage that is much higher than the other clusters. Cluster 2 has higher-than-average values for calcium, iron, and magnesium, an average value for sodium, and a smaller-than-average value for aluminum. Cluster 3 has below-average values for all of the minerals except aluminum. [pic]

Differences between Clusters
You can compute F ratios that describe the differences between the clusters.
As the footnote in Figure 16-11 warns, the observed significance levels should not be interpreted in the usual fashion because the clusters have been selected to maximize the differences between clusters. The point of Figure 16-11 is to give you a handle on the differences for each of the variables among the clusters. If the observed significance level for a variable is large, you can be pretty sure that the variable doesn’t contribute much to the separation of the clusters. [pic] You see in Figure 16-12 that 2 cases are assigned to the first cluster, 14 to the second, and 10 to the third cluster.
You don’t like to see clusters with very few cases unless they are really different from the remaining cases. For each case, you can save the cluster to which it is assigned, as well as the distance to its cluster center. [pic] If you plot the distances to their cluster centers for all of the cases, as in Figure 16-13, you can see whether there are cases that are outliers. Clusters 2 and 3 are unremarkable. Because cluster 1 has only two cases, you see only one point on the plot. (The cluster center is halfway between the two cases, so their distances from it are equal and are plotted at the same location.) [pic]

You could decrease the number of clusters to two, which would most likely eliminate the two-case cluster at the expense of making the clusters more heterogeneous. Or you could increase the number of clusters and see how the solution changes. In this example, increasing the number of clusters causes clusters 2 and 3 to split, while cluster 1 remains with the two cases that are characterized by very high levels of sodium.

Locations of the Pottery
The pottery in the data set was found in one of four locations.
To see whether pottery found at the same site is homogeneous with respect to metallic composition, you can cross-tabulate the site where a pot was found and the cluster to which it was assigned, as shown in Figure 16-14. [pic]

You see that the anomalous pottery in cluster 1 was found at the first site. Most of the cases in cluster 2 were also from the first site, although two of them were from the second. Pottery from the third and fourth sites was all in cluster 3. It looks like there is a relationship between the metallic composition of a piece of pottery and the site where it was excavated.
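The whole workflow of this example — standardize the oxide variables, cluster with k = 3, then cross-tabulate cluster against excavation site — can be sketched with scikit-learn and pandas. The numbers below are hypothetical stand-ins with site-specific chemical signatures, not the actual Tubb et al. (1980) measurements:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical oxide values (Al, Fe, Mg, Ca, Na) for 30 samples from
# three sites with different chemical signatures -- illustrative only.
sites = np.repeat(["site1", "site2", "site3"], 10)
site_means = {"site1": [18, 6, 1.0, 0.2, 0.30],
              "site2": [14, 7, 4.0, 1.0, 0.20],
              "site3": [12, 2, 0.5, 0.1, 0.05]}
X = np.vstack([rng.normal(site_means[s], 0.1) for s in sites])

# Standardize each oxide to mean 0, standard deviation 1, then cluster.
Z = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)

# Cross-tabulate excavation site against assigned cluster.
print(pd.crosstab(pd.Series(sites, name="site"),
                  pd.Series(km.labels_, name="cluster")))
```

With well-separated chemical signatures, each site falls entirely into one cluster, which is the pattern the SPSS cross-tabulation in Figure 16-14 is meant to reveal (or refute) for the real data.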