Cluster Analysis
Cluster analysis is the unsupervised cousin of the supervised classification methods. Both produce groupings of cases; the difference is that supervised methods are trained to recover a known outcome, while cluster analysis discovers groupings from the data structure alone, without reference to an external criterion. This makes cluster analysis useful for exploration and structure discovery, and unreliable for confirmatory inference. That limitation is routinely ignored in applied research, where cluster solutions are reported as if they were established facts about the population rather than artifacts of the chosen method and sample.
The example most readers have seen is the workplace-personality archetype. A consultancy administers a 60-item personality inventory to a sample of professionals, runs k-means clustering on the response data, picks k = 4 because the elbow looks roughly there, and reports “four workplace personality types: the Driver, the Collaborator, the Analyst, the Caregiver.” The four-type structure looks compelling. It usually doesn’t replicate cleanly across samples, doesn’t survive sensitivity testing on the clustering method, and doesn’t predict outcomes better than continuous personality scores. The clusters are a story the data can be made to tell rather than a structural property of the data.
What Cluster Analysis Actually Does
The technique takes a set of cases (people, observations) described on multiple variables (questionnaire items, behavioral measures) and groups them based on a similarity or distance metric. The output is a cluster assignment for each case and a description of the cluster centers (typically the mean values on each variable within each cluster).
The main algorithm families are:
Hierarchical clustering. Builds a tree of nested clusterings, either bottom-up (agglomerative — each case starts in its own cluster, similar clusters get merged) or top-down (divisive — all cases start in one cluster, splits happen recursively). The output is a dendrogram showing the merging structure, from which the analyst selects a number of clusters by choosing where to cut the tree. Hierarchical clustering is interpretable but computationally expensive for large samples and sensitive to the linkage criterion (Ward’s, complete, single).
Partitioning methods. Direct partition of cases into a pre-specified number of clusters. K-means is the canonical example: pick k, initialize k centroids (at random or with a seeding scheme such as k-means++), assign each case to its nearest centroid, recompute each centroid as the mean of its assigned cases, and repeat until the assignments stop changing. K-means is fast but implicitly favors roughly spherical clusters of similar size, and it converges only to a local optimum, so different random starts can produce different final clusterings.
Model-based clustering. Treat the data as coming from a mixture of probability distributions (typically Gaussian) and use the EM algorithm to estimate the parameters. The output is a probabilistic cluster membership (each case has a probability of belonging to each cluster) rather than a hard assignment. Model-based clustering is more flexible than k-means and includes formal model-selection criteria (BIC, AIC) for choosing the number of clusters, but is more computationally demanding and harder to interpret.
Density-based methods. DBSCAN and related algorithms group cases that are densely packed in feature space and leave low-density cases as “noise.” Density-based methods can find clusters of arbitrary shape and don’t require pre-specifying the number of clusters, but require tuning of density parameters and don’t work well in high dimensions.
Each algorithm makes different assumptions about cluster shape, size, and density. Different methods applied to the same data routinely produce different clusterings, and the analyst’s choice of method is usually under-justified in applied reports.
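One way to make method-dependence visible is to fit several algorithm families to the same data and measure how much their partitions agree. The sketch below is illustrative only: it assumes scikit-learn, uses synthetic data, and the helper name compare_methods is hypothetical. Agreement is measured with the adjusted Rand index (ARI), where values near 1 mean near-identical partitions and values near 0 mean chance-level agreement.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture


def compare_methods(X, k, seed=0):
    """Fit three algorithm families on the same data; return pairwise ARI."""
    labels = {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X),
        "ward": AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X),
        "gmm": GaussianMixture(n_components=k, random_state=seed).fit_predict(X),
    }
    names = list(labels)
    return {
        (a, b): adjusted_rand_score(labels[a], labels[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Synthetic data (an assumption for illustration): three tight, well-separated
# blobs versus a single structureless Gaussian cloud.
X_sep, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.5, random_state=0
)
X_flat = np.random.default_rng(0).normal(size=(300, 2))

agreement_sep = compare_methods(X_sep, k=3)
agreement_flat = compare_methods(X_flat, k=3)

for pair, ari in agreement_sep.items():
    print("separated", pair, round(ari, 3))
for pair, ari in agreement_flat.items():
    print("structureless", pair, round(ari, 3))
```

On real data the interesting case is the middle ground: moderate agreement that shifts with the linkage rule or covariance structure, which is exactly what an applied report should disclose.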
Choosing the Number of Clusters
The number-of-clusters problem is the hardest part of applied cluster analysis. The data don’t typically reveal a “true” number; the methods either require k as input (partitioning, model-based) or require a cut point (hierarchical), and the choice is left to the analyst.
The standard diagnostics:
Elbow plot. Plot a within-cluster-sum-of-squares-style measure against the number of clusters; pick the k where the curve “elbows” (the marginal improvement from adding more clusters drops). The elbow is rarely sharp in real data, and reasonable analysts looking at the same plot often disagree about where the elbow is.
Silhouette scores. Compute, for each case, its average distance to the other cases in its own cluster (a) and its average distance to the cases in the nearest other cluster (b). The silhouette for that case is (b - a) / max(a, b), which ranges from -1 to 1. The mean silhouette score over all cases gives a measure of cluster separation. Higher is better; the k that maximizes the mean silhouette is one choice criterion, but silhouettes can be uninformative when the underlying structure is genuinely continuous rather than discrete.
Gap statistic. Compare the log of the within-cluster sum of squares for the data against its expected value under a reference null distribution (typically uniform over the range of the data). The simplest rule picks the k where the gap is largest; the rule proposed with the method picks the smallest k whose gap is at least the gap at k + 1 minus one standard error. Gap statistics are sensitive to the reference distribution choice and computationally expensive.
Stability analysis. Run the clustering on bootstrap samples or with different random initializations and measure how consistent the cluster assignments are. High stability suggests the solution is reproducible; low stability suggests the cluster structure is an artifact of the specific run.
Model-selection criteria. For model-based clustering, BIC or AIC provide formal criteria for selecting the number of mixture components. These are the most principled criteria but only apply to model-based methods and assume the model family is correctly specified.
A common pattern in applied work is to report a single criterion (usually the elbow) and present the resulting k as “the” solution. The honest version reports multiple criteria, the inconsistency between them, and the rationale for the chosen k given the inconsistency.
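Reporting several criteria side by side is cheap to do. The following is a minimal sketch, assuming scikit-learn and synthetic data with four planted groups; the function name k_diagnostics and the dictionary keys are hypothetical. It tabulates the within-cluster sum of squares (the elbow-plot quantity), the mean silhouette, and the BIC of a Gaussian mixture for each candidate k.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def k_diagnostics(X, k_values, seed=0):
    """Return one row of diagnostics per candidate number of clusters."""
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        gmm = GaussianMixture(n_components=k, n_init=3, random_state=seed).fit(X)
        rows.append({
            "k": k,
            "wcss": km.inertia_,                             # elbow-plot quantity
            "silhouette": silhouette_score(X, km.labels_),   # mean over all cases
            "bic": gmm.bic(X),                               # lower is better
        })
    return rows


# Synthetic data (an assumption): four groups on the corners of a square.
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=1.0,
    random_state=1,
)

table = k_diagnostics(X, range(2, 8))
best_by_silhouette = max(table, key=lambda r: r["silhouette"])["k"]
best_by_bic = min(table, key=lambda r: r["bic"])["k"]

for row in table:
    print(row["k"], round(row["wcss"], 1), round(row["silhouette"], 3), round(row["bic"], 1))
print("silhouette picks", best_by_silhouette, "| BIC picks", best_by_bic)
```

With cleanly planted groups the criteria tend to agree. The informative output on real questionnaire data is the table itself, including the places where the criteria point to different values of k.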
When the Clusters Aren’t Real
Cluster analysis will always produce clusters. The algorithm is constructed to partition the data, and partitioning happens whether or not there is real cluster structure in the underlying population. This is the central problem with cluster-based reporting: the existence of clusters in the output does not establish the existence of clusters in the population.
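A quick way to see this is to cluster data that has no subgroups by construction. The sketch below assumes scikit-learn and simulates a single multivariate normal population; the sample size and number of variables are arbitrary choices for illustration. K-means still returns the requested four clusters, each with a non-trivial share of the sample.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# One homogeneous population: 1000 cases on 5 independent standard-normal
# variables. There is no cluster structure to find.
X = rng.normal(size=(1000, 5))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Share of the sample assigned to each of the four clusters.
shares = np.bincount(km.labels_, minlength=4) / len(X)
print(np.round(shares, 3))

# The centroids differ from one another, so the solution can be "profiled"
# and named even though the population is a single undifferentiated cloud.
print(np.round(km.cluster_centers_, 2))
```

Nothing in the output flags the absence of structure; that has to come from separate checks such as stability and cross-method agreement.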
Several diagnostic patterns suggest the clusters are an artifact:
Cluster instability across samples. The same analysis run on a different sample from the same population produces different clusterings. Bootstrapped resampling that produces highly variable cluster assignments suggests the structure isn’t replicable.
Cluster instability across methods. K-means produces one solution, hierarchical produces a different one, model-based produces a third. When the solutions don’t converge, the structure is method-dependent rather than data-driven.
Continuous within-cluster gradients. Cluster centers are well-separated, but the cases within each cluster are spread along the same dimensions that separate the centers. This is the signature of an underlying continuous structure that has been forced into discrete bins by the clustering algorithm.
Equal-sized clusters when there shouldn’t be. Many algorithms (k-means especially) have biases toward producing clusters of similar size, even when the underlying density is non-uniform. If the clusters all come out at 22-28% of the sample and the construct theory doesn’t predict that, the equal sizes are probably an algorithmic artifact.
Failure to predict external criteria. Cluster membership doesn’t predict outcomes better than the continuous variables used to derive the clusters. If the clusters are real, they should carry incremental predictive information beyond the underlying scores. If they don’t, the clustering is descriptive only and shouldn’t be used for inference.
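The first of these checks, instability across samples, can be sketched with a simple bootstrap. This is one possible procedure under stated assumptions, not a standard recipe: it assumes scikit-learn, uses k-means, and the helper name bootstrap_stability is hypothetical. Each bootstrap model is fit on a resample and then used to label the original cases, so its labels can be compared with the full-sample solution using the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def bootstrap_stability(X, k, n_boot=20, seed=0):
    """Mean ARI between the full-sample solution and bootstrap refits."""
    rng = np.random.default_rng(seed)
    base = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    scores = []
    for b in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))   # resample rows with replacement
        boot = KMeans(n_clusters=k, n_init=10, random_state=seed + b + 1).fit(X[idx])
        # Label the ORIGINAL cases with the bootstrap model so both labelings
        # cover the same rows; ARI ignores how the clusters are numbered.
        scores.append(adjusted_rand_score(base, boot.predict(X)))
    return float(np.mean(scores))


# Synthetic data (an assumption): planted groups versus no structure at all.
X_sep, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.5, random_state=0
)
X_flat = np.random.default_rng(1).normal(size=(300, 2))

stable_score = bootstrap_stability(X_sep, k=3)
flat_score = bootstrap_stability(X_flat, k=3)
print("planted groups:", round(stable_score, 3))
print("no structure:  ", round(flat_score, 3))
```

There is no universal cutoff for "stable enough"; the comparison against a structureless baseline of the same size and dimensionality is usually more informative than the raw number.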
The Continuous-versus-Categorical Question
The deeper problem with much applied cluster analysis is that the underlying construct is continuous, not categorical. Personality traits, attitudes, engagement levels, leadership styles — most psychological constructs distribute continuously in the population, and the cluster solutions impose categorical structure where the data structure is dimensional.
The cost is that the cluster-based reporting loses information. Two cases assigned to the same cluster can have meaningfully different positions on the underlying continuous dimensions; the cluster label flattens them into the same description. Two cases assigned to different clusters can be closer to each other on the underlying dimensions than to most other members of their own clusters; the cluster boundary creates an apparent distinction that isn’t there.
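Both effects are easy to reproduce on simulated data. The sketch below assumes scikit-learn and a single normally distributed trait (a deliberately simple, hypothetical setup). It splits the trait into two clusters with k-means, then compares the distance between the two cases on either side of the boundary with the spread inside one cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# One continuous trait, sorted so that neighbouring rows are neighbouring scores.
scores = np.sort(rng.normal(size=500)).reshape(-1, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)

# In one dimension each k-means cluster is an interval, so on sorted data the
# label changes exactly once: at the cluster boundary.
cut = int(np.flatnonzero(np.diff(labels) != 0)[0])

# Distance between the two cases that sit on either side of the boundary.
gap_across_boundary = float(scores[cut + 1, 0] - scores[cut, 0])

# Distance between the most extreme members of the lower cluster.
low = scores[: cut + 1, 0]
spread_within_low = float(low.max() - low.min())

print("different labels, distance apart:", round(gap_across_boundary, 4))
print("same label, distance apart:      ", round(spread_within_low, 4))
```

The two boundary cases receive different type labels despite being nearly identical, while the two ends of the lower cluster share a label despite being far apart.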
The defensible cases for clustering continuous data are:
- The use case requires discrete groupings (segment-based marketing, course-track placement, treatment-group assignment).
- The clusters are validated externally — they predict relevant outcomes that the continuous scores don’t, or they correspond to theoretically meaningful subtypes with independent evidence.
- The clustering is presented as exploratory structure-discovery, with appropriate caveats about replication and method-dependence.
The indefensible cases — and the most common ones in applied work — are reports that present cluster solutions as if they were structural properties of the population, when the underlying construct is continuous, the cluster boundaries are imposed by the algorithm, and the cluster labels are post-hoc descriptions rather than predictively useful categories.
What Good Cluster Analysis Looks Like
The defensible applied use of cluster analysis includes:
- Multiple algorithms. Hierarchical, partitioning, and model-based methods applied to the same data, with consistency across methods reported as evidence of stable structure.
- Multiple k values. Diagnostics for several plausible numbers of clusters, with explicit reasoning for the chosen value and acknowledgment of nearby alternatives.
- Stability checks. Bootstrap resampling or split-sample validation showing that the cluster solution is replicable.
- External validation. Tests that cluster membership predicts outcomes the underlying variables don’t predict equally well — establishing that the clustering carries incremental information.
- Honest reporting of fit. Within-cluster homogeneity statistics, silhouette distributions, and assignment confidence (for model-based methods) reported alongside the cluster descriptions.
A cluster analysis with all of these properties is rare in applied reports. A cluster analysis with none of them is common. The difference between the two is usually invisible to the audience consuming the report, which is one reason cluster-based archetypes and personality-type frameworks dominate the popular HR-content landscape despite weak underlying psychometric evidence.
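The external-validation check in particular can be sketched in a few lines. Everything here is simulated and hypothetical: the trait scores, the outcome, and the effect sizes are made up, and the outcome is constructed to depend linearly on the continuous scores. The sketch assumes scikit-learn and compares cross-validated R² for three predictor sets: cluster membership alone, continuous scores alone, and both together.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600

# Hypothetical continuous trait scores and an outcome that depends on them.
scores = rng.normal(size=(n, 3))
outcome = scores @ np.array([0.6, 0.3, 0.0]) + rng.normal(scale=0.7, size=n)

# Cluster labels derived from the scores only (the outcome is never used).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)
dummies = np.eye(4)[labels][:, 1:]   # one-hot membership, first cluster as reference


def cv_r2(features):
    """Mean 5-fold cross-validated R^2 of a linear model on the given features."""
    model = LinearRegression()
    return float(cross_val_score(model, features, outcome, cv=5, scoring="r2").mean())


r2_clusters = cv_r2(dummies)
r2_scores = cv_r2(scores)
r2_both = cv_r2(np.hstack([scores, dummies]))

print("clusters only:    ", round(r2_clusters, 3))
print("continuous scores:", round(r2_scores, 3))
print("scores + clusters:", round(r2_both, 3))
```

If adding cluster membership to the continuous scores does not raise out-of-sample fit, the clusters are summarizing the scores rather than adding information. A stricter version would refit the clustering inside each training fold.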
When to Use Something Else
For most applied psychometric problems, the alternatives to clustering perform better:
- Continuous scoring with profile interpretation. Report the continuous scores on each dimension and let the user interpret patterns, rather than imposing cluster categories.
- Factor analysis and dimension reduction. When the question is “what are the underlying dimensions,” factor analysis is the appropriate method. Cluster analysis fits the question “what are the discrete subtypes,” but only after factor analysis has established which dimensions exist.
- Item response theory ability estimation. When the data are item responses and the underlying construct is continuous, IRT provides better estimates than cluster analysis can.
- Latent class analysis. A model-based method specifically designed for discrete categorical latent structure with categorical observed variables. More appropriate than k-means for many applied psychometric clustering problems, with more principled model-selection criteria.
The right method depends on the structure of the underlying construct. Cluster analysis is appropriate when the construct is genuinely categorical (clinical subtypes, market segments with discrete value structures) and inappropriate when the construct is continuous and the clustering would impose artificial boundaries on dimensional data. In the work I’ve done at Gyfted, clustering is used sparingly and almost always in combination with continuous-score reporting — the cluster narrative is communicatively useful, but the underlying decisions rest on the continuous variables that the clustering summarizes.