Fatskills
Study Guide: Intro to Marketing Research: Cluster Analysis Non-Hierarchical Methods K-Means Choosing Number of Clusters Elbow Method Silhouette Score
Source: https://www.fatskills.com/marketing-management/chapter/marketing-research-mktresearch-cluster-analysis-non-hierarchical-methods-k-means-choosing-number-of-clusters-elbow-method-silhouette-score

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

What It Is

Non-hierarchical clustering methods, of which K-Means is the most widely used, segment customers by their characteristics without building the nested tree of clusters that hierarchical methods produce. The analyst fixes the number of clusters (K) in advance, and the algorithm assigns each customer to the nearest cluster center. Wedel and Kamakura (2000) is a standard reference on market segmentation methods, including clustering approaches of this kind. In banking, for example, K-Means can group customers by demographic and behavioral characteristics so that marketing strategies can be tailored to each segment. This matters for marketing decision-making because it lets businesses target specific customer groups more effectively.

Key Terms & Concepts

  • K-Means Clustering: An unsupervised machine learning algorithm that partitions data points into K clusters by assigning each point to its nearest centroid, so that points within a cluster are as similar as possible.
    • Example: Segmenting bank customers by demographic and behavioral characteristics (see Wedel and Kamakura, 2000, for segmentation methods).
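  • Number of Clusters (K): The number of clusters the algorithm is asked to find. It must be chosen before K-Means runs.
    • Rule of thumb: K ≈ √(n / 2), where n is the sample size. This gives only a rough starting point, not the optimal K.
    • Example: For a sample size of 100, K ≈ √50 ≈ 7.07, so start at about 7 and check nearby values with the Elbow Method or Silhouette Score.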
  • Elbow Method: A technique for choosing the number of clusters (K) by plotting the within-cluster sum of squares (WCSS) against K.
    • Example: WCSS always falls as K increases, so the plot is read for its "elbow", the point after which adding clusters yields only small further decreases. That K is a reasonable choice.
  • Silhouette Score: A measure of how well each data point fits its assigned cluster compared with the nearest other cluster.
    • Formula: Silhouette Score = (b - a) / max(a, b), where a is the point's average distance to the other points in its own cluster and b is its average distance to the points in the nearest other cluster.
    • Example: A score close to 1 indicates a well-clustered point, a score near 0 a point on the border between two clusters, and a negative score a point that probably belongs in a different cluster.
  • Centroid: The center of a cluster, calculated as the mean of each variable across all data points in the cluster.
    • Example: The centroid of a cluster of customer demographic data would be the mean age, income, and education level of all customers in that cluster.
  • Distance Metric: A measure of how dissimilar two data points are. The smaller the distance, the more similar the points.
    • Example: Euclidean distance is the standard distance metric used in K-Means clustering.
  • Initialization: The choice of starting centroids for the clusters, often made at random.
    • Example: Different random starting centroids can lead to different final cluster assignments, so it's essential to run K-Means several times and keep the solution with the lowest WCSS.
  • Convergence: The point at which K-Means stops because the centroids, and therefore the cluster assignments, no longer change.
    • Example: Convergence is typically declared when the change in centroids falls below a chosen threshold (e.g., 0.01) or a maximum number of iterations is reached.
  • Within-Cluster Sum of Squares (WCSS): The sum of squared distances between each data point and the centroid of its assigned cluster.
    • Formula: WCSS = ∑_k ∑_(x_i in C_k) ||x_i - μ_k||^2, where C_k is cluster k, x_i is a data point in that cluster, and μ_k is its centroid.
    • Example: A lower WCSS indicates that the data points are closer to their centroids, meaning the clusters are more compact.
  • Between-Cluster Sum of Squares (BCSS): The sum of squared distances between each centroid and the overall mean, weighted by cluster size.
    • Formula: BCSS = ∑_k n_k ||μ_k - μ_mean||^2, where n_k is the number of points in cluster k, μ_k is its centroid, and μ_mean is the overall mean.
    • Example: A higher BCSS indicates that the clusters are more widely separated from one another.
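
The terms above (centroid, initialization, convergence, WCSS, BCSS, Elbow Method) can be tied together in a minimal sketch of K-Means using only the Python standard library. It is an illustration under stated assumptions, not a production implementation: the six customers are hypothetical, the helper names (k_means, best_run, and so on) are invented for this example, and BCSS is weighted by cluster size, which is the convention under which WCSS + BCSS equals the total sum of squares.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean (the centroid) of a non-empty list of points."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(points, k, seed, tol=1e-9, max_iter=100):
    """One run of Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialization: k random data points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(mean_point(members) if members else centroids[j])
        shift = max(squared_distance(old, new)
                    for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= tol:  # convergence: centroids have stopped moving
            break
    return centroids, labels


def wcss(points, centroids, labels):
    """Sum of squared distances from each point to its own centroid."""
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(points, labels))


def bcss(points, centroids, labels):
    """Size-weighted squared distances from each centroid to the overall mean."""
    overall = mean_point(points)
    return sum(labels.count(j) * squared_distance(c, overall)
               for j, c in enumerate(centroids))


def best_run(points, k, restarts=10):
    """Repeat k_means with different seeds and keep the lowest-WCSS run."""
    runs = [k_means(points, k, seed) for seed in range(restarts)]
    return min(runs, key=lambda run: wcss(points, *run))


# Hypothetical customers: (age, income in thousands).
customers = [(25, 30), (27, 32), (26, 31), (55, 90), (57, 92), (56, 91)]

# Elbow Method: record the best WCSS for each candidate K.
wcss_by_k = {}
for k in range(1, 5):
    centroids, labels = best_run(customers, k)
    wcss_by_k[k] = wcss(customers, centroids, labels)
    print(k, round(wcss_by_k[k], 2), round(bcss(customers, centroids, labels), 2))
```

For this toy data, WCSS falls from 6758 at K = 1 to 8 at K = 2. Because WCSS cannot go below zero, larger values of K have very little left to remove, which is what an elbow at K = 2 looks like. The variables are left unscaled here only to keep the arithmetic easy to follow.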

Common Misunderstandings

  • Misunderstanding: K-Means clustering always produces the same results.
  • Correction: K-Means clustering can produce different results due to random initialization, so it's essential to run the algorithm multiple times with different initializations.
  • Misunderstanding: The number of clusters (K) is always equal to the number of distinct customer segments.
  • Correction: K is an input chosen by the analyst, and K-Means returns exactly K clusters whether or not the data contain K meaningful groups. Clusters should be checked for separation and interpretability before they are treated as customer segments.
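  • Misunderstanding: The Silhouette Score only describes individual data points and says nothing about the clustering as a whole.
  • Correction: The Silhouette Score is calculated for each data point, but its average across all points is a widely used measure of overall clustering quality and a common basis for comparing values of K.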
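
The silhouette is computed per point and is often averaged over all points to compare clusterings. The sketch below shows both steps with the standard library only. The function name silhouette_values, the six customers, and the two label sets are assumptions made for this example, and b is taken as the mean distance to the nearest other cluster, which is the standard definition.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b); needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # common convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != lab)
        values.append((b - a) / max(a, b))
    return values


# Hypothetical customers: (age, income in thousands).
customers = [(25, 30), (27, 32), (26, 31), (55, 90), (57, 92), (56, 91)]

sensible = silhouette_values(customers, [0, 0, 0, 1, 1, 1])   # groups kept together
scrambled = silhouette_values(customers, [0, 1, 0, 1, 0, 1])  # groups mixed up

mean_sensible = sum(sensible) / len(sensible)
mean_scrambled = sum(scrambled) / len(scrambled)
print(round(mean_sensible, 3), round(mean_scrambled, 3))
```

The labeling that keeps the two natural groups together earns a much higher average silhouette than the scrambled one, which is why the average is useful for comparing candidate solutions.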

Quick Application / Identification

Scenario: A marketing manager wants to segment customers based on their demographic characteristics (age, income, education level) using K-Means clustering. The data consists of 100 customers with the following characteristics:

Age    Income    Education Level
25     50000     Bachelor's
35     70000     Master's
45     90000     Ph.D.
...    ...       ...

Task: Identify the number of clusters (K) that would be suitable for this data.

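Answer: The rule of thumb gives K ≈ √(100 / 2) ≈ 7.07, so K = 7 is a reasonable starting point.

Explanation: The rule of thumb only narrows the search. The marketing manager should run K-Means for a range of values around 7, compare them using the Elbow Method and Silhouette Score, and keep the K whose clusters are both well separated and interpretable as customer segments. Before clustering, education level must be coded numerically and all three variables standardized, because income would otherwise dominate the distance calculation.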
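
The arithmetic for the starting K, and one way to prepare the scenario's variables for K-Means, might be sketched as follows. Only the three rows shown in the table are used. The ordinal coding of education (1, 2, 3) and the choice of z-score standardization are assumptions made for illustration, not part of the scenario.

```python
import math
from statistics import mean, pstdev

n_customers = 100
k_start = math.sqrt(n_customers / 2)  # rule-of-thumb starting point, about 7.07
print(round(k_start, 2), round(k_start))

# Assumed ordinal coding of education level (hypothetical mapping).
EDUCATION_RANK = {"Bachelor's": 1, "Master's": 2, "Ph.D.": 3}

rows = [
    (25, 50000, "Bachelor's"),
    (35, 70000, "Master's"),
    (45, 90000, "Ph.D."),
]
numeric = [(age, income, EDUCATION_RANK[edu]) for age, income, edu in rows]


def standardize(table):
    """Z-score each column so no variable dominates Euclidean distance.

    Assumes every column varies; a constant column would divide by zero.
    """
    scaled_columns = []
    for column in zip(*table):
        mu, sigma = mean(column), pstdev(column)
        scaled_columns.append([(value - mu) / sigma for value in column])
    return [tuple(row) for row in zip(*scaled_columns)]


scaled = standardize(numeric)
for row in scaled:
    print(tuple(round(value, 3) for value in row))
```

After standardization each variable has mean 0 and standard deviation 1, so a difference of one standard deviation in age counts as much as one in income.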

Last-Minute Revision

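  • ⚠️ K-Means clustering does not assume normally distributed data, but it works best when clusters are roughly spherical and similar in size, and when variables are on comparable scales.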
  • The number of clusters (K) should be determined using the Elbow Method or Silhouette Score.
  • The Silhouette Score ranges from -1 to 1, where a high score indicates that a data point is well-clustered.
  • The centroid is the mean value of all data points in a cluster.
  • Initialization can lead to different cluster assignments, so it's essential to run K-Means multiple times with different initializations.
  • Convergence is typically achieved when the change in centroids is less than a certain threshold (e.g., 0.01).
  • WCSS measures the sum of squared distances between each data point and its assigned centroid.
  • BCSS measures the sum of squared distances between each centroid and the overall mean, weighted by cluster size.
  • The number of clusters (K) may not always match the number of distinct customer segments.
  • The Silhouette Score is calculated for each data point; its average across all points summarizes overall clustering quality.
  • K-Means clustering can be sensitive to outliers.
  • The choice of distance metric can affect the results of K-Means clustering.
  • K-Means clustering requires numerical data; categorical variables must be coded numerically first or clustered with a method designed for them, such as K-Modes.
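
The point about outliers follows from the centroid being a mean. A small sketch with hypothetical incomes (in thousands) shows how far a single extreme customer can drag it, with the median shown for comparison:

```python
from statistics import mean, median

# Hypothetical incomes (in thousands) for one cluster of customers.
incomes = [48, 50, 52, 51, 49]
incomes_with_outlier = incomes + [500]  # one extreme customer added

centroid_clean = mean(incomes)
centroid_with_outlier = mean(incomes_with_outlier)
median_with_outlier = median(incomes_with_outlier)

print(centroid_clean, centroid_with_outlier, median_with_outlier)
# prints: 50 125 50.5
```

One outlier moves the mean from 50 to 125 while the median barely changes, which is why outliers are usually screened before running K-Means and why median-based or medoid-based variants are sometimes preferred for data with extreme values.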