Study Guide: Intro to Marketing Research: Data Preparation and Entry - Detecting Outliers, Univariate Multivariate Mahalanobis Distance
Source: https://www.fatskills.com/marketing-management/chapter/marketing-research-mktresearch-data-preparation-and-entry-detecting-outliers-univariate-multivariate-mahalanobis-distance

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

What It Is

Outlier detection is the process of identifying data points that differ markedly from the rest of the data. In marketing research, outliers can distort summary statistics and model estimates, which undermines the accuracy of the analysis. A frequently cited illustration of why unusual patterns matter is the Tylenol tampering crisis of 1982, in which seven people in the Chicago area died after taking Extra-Strength Tylenol capsules that had been laced with potassium cyanide. The case was never solved, and the victims are generally believed to have been chosen at random rather than targeted. The episode is a product-safety case rather than a statistical one, but it shows why companies watch for unusual patterns: spotting anomalies early helps them respond quickly and protect consumers.

Key Terms & Concepts

  • Outlier: A data point that differs markedly from the rest of the data.
  • Univariate outlier detection: Identifying outliers by examining one variable at a time.
  • Multivariate outlier detection: Identifying observations whose combination of values across several variables is unusual, even if no single value is extreme.
  • Mahalanobis distance: A measure of the distance between a data point and the center (mean vector) of a multivariate distribution, scaled by the covariance matrix so that correlations between variables are taken into account.
  • Z-score: A measure of how many standard deviations a data point is away from the mean.
  • Boxplot: A graphical representation of a dataset that shows the median, quartiles, and outliers.
  • Q-Q plot: A plot of the quantiles of the data against the quantiles of a theoretical distribution, usually the normal; points that fall far from the straight reference line suggest non-normality or outliers.
  • Density plot: A graphical representation of a dataset that shows the distribution of the data.
  • K-means clustering: A method used to group similar data points into clusters.
  • Principal component analysis (PCA): A method used to reduce the dimensionality of a dataset.
  • Hotelling's T-squared statistic: A multivariate generalization of the t-statistic; applied to individual observations, it flags points that lie far from the mean vector and so can be used to screen for multivariate outliers.
  • Box-Cox transformation: A power transformation that makes positive, skewed data more nearly normal.
  • Robust regression: Regression methods that estimate the relationship between variables while limiting the influence of outliers.
  • Cook's distance: A measure of how much the fitted regression changes when a single observation is removed.
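
The z-score and boxplot entries above can be turned into a quick univariate check. The following is a minimal Python sketch, not a definitive implementation: the satisfaction scores, the |z| > 2.5 cutoff, and the function names are illustrative assumptions, and the boxplot rule uses the conventional 1.5 × IQR fences.

```python
import statistics

def z_scores(values):
    """Z-score of each value, using the sample mean and sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], the rule a boxplot uses for its whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical satisfaction scores on a 0-100 scale (made up for illustration)
scores = [62, 65, 58, 70, 66, 61, 64, 67, 63, 100]

z = z_scores(scores)
# Assumed cutoff of 2.5 standard deviations; see the note below on small samples
z_flagged = [s for s, zi in zip(scores, z) if abs(zi) > 2.5]
iqr_flagged = iqr_outliers(scores)

print(z_flagged)    # scores flagged by the z-score rule
print(iqr_flagged)  # scores flagged by the boxplot (IQR) rule
```

In a sample of size n, the largest attainable |z| computed with the sample standard deviation is (n - 1) / √n, about 2.85 for n = 10, so the familiar |z| > 3 rule could never flag a point in a sample this small. That is why the sketch assumes a lower cutoff.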

Common Misunderstandings

  • Misunderstanding: Outliers are always bad and should be removed from the dataset.
  • Correction: Outliers can be genuine observations that reveal unusual but real patterns, or they can be data-entry or measurement errors. Errors should be corrected or removed, while genuine extreme values may need to be kept. Understand the context and purpose of the analysis before deciding whether to remove outliers.
  • Misunderstanding: Mahalanobis distance is only used for multivariate outlier detection.
  • Correction: Mahalanobis distance can be applied to a single variable, where it reduces to the absolute value of the z-score. It is most useful, and most commonly used, for multivariate outlier detection.
  • Misunderstanding: Outliers are always easy to identify.
  • Correction: Outliers can be difficult to identify, especially in large datasets. It's essential to use multiple methods and visualizations to detect outliers.
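
The point about Mahalanobis distance and single variables can be checked numerically. With one variable the covariance matrix is just the variance, so the distance should equal the absolute z-score. The sketch below assumes made-up scores and hypothetical function names.

```python
import math
import statistics

def mahalanobis_1d(x, values):
    """Mahalanobis distance of x from the mean of `values` for a single variable.

    With one variable, D^2 = (x - mean)^2 / variance.
    """
    mean = statistics.mean(values)
    var = statistics.variance(values)  # sample variance
    return math.sqrt((x - mean) ** 2 / var)

def abs_z(x, values):
    """Absolute z-score of x, using the sample standard deviation."""
    return abs(x - statistics.mean(values)) / statistics.stdev(values)

# Hypothetical satisfaction scores (made up for illustration)
values = [62, 65, 58, 70, 66, 61, 64, 67, 63, 100]

d = mahalanobis_1d(100, values)
z = abs_z(100, values)
same = math.isclose(d, z)
print(d, z, same)
```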

Quick Application / Identification

Scenario: A marketing research firm is analyzing customer satisfaction data for a new product. The data shows a customer who has a satisfaction score of 100, which is significantly higher than the rest of the data. What method would you use to detect this outlier?

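Answer: A z-score or a boxplot. The scenario involves a single variable, and for one variable the Mahalanobis distance reduces to the absolute z-score. If the satisfaction score is analyzed together with other variables, use Mahalanobis distance.

Explanation: A z-score shows how many standard deviations the score of 100 lies from the mean, and a boxplot flags the score if it falls beyond the whiskers. With several variables, Mahalanobis distance measures how far the customer's data point lies from the center of the multivariate distribution, which can reveal outliers that no single variable shows.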
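
To see Mahalanobis distance at work on more than one variable, the sketch below extends the scenario with a second, hypothetical variable (repurchase intent). All data values, names, and the 0.95 cutoff level are assumptions made for illustration. The 2×2 covariance matrix is inverted by hand to keep the example dependency-free, and the chi-square cutoff uses the closed form that holds for exactly 2 degrees of freedom.

```python
import math
import statistics

def mahalanobis_sq_2d(point, xs, ys):
    """Squared Mahalanobis distance of `point` from the centroid of (xs, ys)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form (dx, dy) S^-1 (dx, dy)^T with the 2x2 inverse written out
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def chi2_cutoff_2df(p):
    """Quantile of the chi-square distribution with 2 degrees of freedom.

    For 2 df the CDF is 1 - exp(-x / 2), so the quantile is -2 * ln(1 - p).
    """
    return -2.0 * math.log(1.0 - p)

# Hypothetical respondents: satisfaction and repurchase intent, both on a 1-10 scale
satisfaction = [2, 3, 4, 5, 5, 6, 7, 8, 9, 8]
repurchase = [3, 3, 5, 4, 6, 6, 7, 7, 9, 2]

d2 = [mahalanobis_sq_2d((x, y), satisfaction, repurchase)
      for x, y in zip(satisfaction, repurchase)]
cutoff = chi2_cutoff_2df(0.95)  # assumed level, chosen for this small sample
flagged = [i for i, v in enumerate(d2) if v > cutoff]
print(flagged)  # indices of respondents beyond the cutoff
```

The last respondent (satisfaction 8, repurchase intent 2) is not extreme on either variable alone, but the combination runs against the positive correlation in the rest of the data, which is what the distance picks up. A stricter level such as 0.001 is often used with larger samples; in a sample of n observations the squared distance cannot exceed (n - 1)² / n, so very strict cutoffs are unreachable in small samples.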

Last-Minute Revision

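  • Z-score formula: Z = (X - μ) / σ, where μ is the mean and σ is the standard deviation.
  • Mahalanobis distance formula: D^2 = (X - μ)^T Σ^(-1) (X - μ), where μ is the mean vector and Σ is the covariance matrix.
  • Hotelling's T-squared statistic formula: T^2 = (X - μ)^T Σ^(-1) (X - μ) for a single observation X, with μ and Σ estimated from the sample; this has the same form as the squared Mahalanobis distance.
  • Box-Cox transformation formula: Y = (X^λ - 1) / λ for λ ≠ 0, and Y = ln(X) for λ = 0 (X must be positive).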
  • Robust regression: A method that uses estimators less sensitive to extreme values than ordinary least squares, such as least absolute deviations or Huber M-estimation, to estimate the relationship between variables.
  • Cook's distance: A measure of how much the fitted regression changes when a single observation is removed.
  • Q-Q plot: A plot of the quantiles of the data against the quantiles of a theoretical distribution, usually the normal.
  • Density plot: A graphical representation of a dataset that shows the distribution of the data.
  • K-means clustering: A method used to group similar data points into clusters.
  • Principal component analysis (PCA): A method used to reduce the dimensionality of a dataset.
  • Outlier detection and handling methods: Z-score, boxplot, Q-Q plot, density plot, Mahalanobis distance, Hotelling's T-squared statistic, Cook's distance, k-means clustering, and PCA help detect outliers; Box-Cox transformation and robust regression reduce their influence.
  • Assumption of normality: Many statistical methods assume that the data is normally distributed.
  • Assumption of equal variance: Many statistical methods assume that the variance is the same across groups or across levels of the predictors (homoscedasticity).
  • Assumption of linearity: Many statistical methods assume that the relationship between variables is linear.
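
Cook's distance from the list above can be computed directly for a simple linear regression. The sketch below is one way to do it under stated assumptions: the ad spend and sales figures are made up, the function name is hypothetical, and the 4/n cutoff is a common rule of thumb rather than a fixed standard. It uses D_i = (e_i^2 / (p × MSE)) × h_i / (1 - h_i)^2, where e_i is the residual, h_i is the leverage, and p = 2 coefficients (intercept and slope).

```python
def cooks_distances(xs, ys):
    """Cook's distance for each observation in a simple linear regression of ys on xs."""
    n = len(xs)
    p = 2  # number of fitted coefficients: intercept and slope
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Ordinary least squares fit
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)

    distances = []
    for x, e in zip(xs, residuals):
        # Leverage of an observation in simple linear regression
        leverage = 1 / n + (x - mean_x) ** 2 / sxx
        distances.append((e ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances

# Hypothetical data: ad spend vs. sales, with one unusual final observation
ad_spend = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [2, 4, 6, 8, 10, 12, 14, 30]

d = cooks_distances(ad_spend, sales)
cutoff = 4 / len(ad_spend)  # rule-of-thumb threshold (an assumption, not a fixed standard)
flagged = [i for i, di in enumerate(d) if di > cutoff]
print(flagged)  # indices of influential observations
```

The final observation combines high leverage (it sits at the edge of the ad spend range) with a large residual, which is the combination Cook's distance is designed to catch.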