Fatskills
Practice. Master. Repeat.
Study Guide: AP Statistics (AP Stats): Outliers and Influential Points in Regression
Source: https://www.fatskills.com/ap-statistics/chapter/ap-stats-ap-statistics-outliers-and-influential-points-in-regression

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

AP Statistics: Outliers and Influential Points in Regression – Exam-Ready Study Guide


What This Is

Outliers and influential points are data points that can distort the results of a linear regression analysis. An outlier is a point with an unusually large residual (far from the regression line), while an influential point is a point that, if removed, significantly changes the slope, y-intercept, or correlation of the regression line. These concepts are critical on the AP exam because they test your ability to assess the reliability of a regression model—key in real-world scenarios like predicting house prices from square footage, analyzing the effect of study time on test scores, or evaluating the impact of a single extreme value (e.g., a billionaire’s income in a salary dataset).


Key Terms & Formulas

  • Outlier (in regression): A point with a large residual (observed y – predicted ŷ). Often identified by applying the 1.5 × IQR rule to the residuals: flag any residual below Q1 – 1.5×IQR or above Q3 + 1.5×IQR.
  • Influential point: A point that substantially changes the regression line (slope, intercept, or r) when removed. Often has high leverage.
  • Leverage (hᵢ): A measure of how far a point’s x-value is from the mean of x. High-leverage points have x-values far from x̄. Statistical software reports it; the TI-84’s LinReg(a+bx) output does not include it, so on the calculator judge leverage from the scatterplot.
  • Residual (e): e = y – ŷ (observed – predicted). Plotted on a residual plot to check linearity.
  • Residual plot: A scatterplot of residuals (e) vs. x or ŷ. No pattern = good linear fit; curved pattern = nonlinear relationship.
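  • Cook’s Distance (D): A measure of a point’s influence that combines its residual and its leverage. D > 1 suggests high influence. The TI-84 has no built-in command for it: store the residuals (STAT → EDIT → L3 = RESID), then calculate D manually or read it from software output.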
  • Correlation coefficient (r): Measures strength/direction of linear relationship. −1 ≤ r ≤ 1. Influential points can artificially inflate or deflate r.
  • Coefficient of determination (r²): Proportion of variance in y explained by x. 0 ≤ r² ≤ 1. Influential points can mislead by making the model seem stronger/weaker than it is.
  • Slope (b) of LSRL: b = r × (s_y / s_x). Influential points can drastically change b.
  • TI-84: Linear Regression: STAT → CALC → 8:LinReg(a+bx) L1, L2, Y1 (stores equation in Y1 for predictions).
  • TI-84: Residuals: After running regression, STAT → EDIT → L3 = RESID (pre-calculated residuals).
  • TI-84: Residual Plot: 2nd → Y= (STAT PLOT) → Plot1 → Xlist: L1, Ylist: L3 → ZOOM → 9:ZoomStat.
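
Leverage and Cook’s Distance are normally read from software output, but for a one-predictor regression they can be computed directly. The sketch below is one way to do that in Python (study aid only, not something you would do on the exam). It assumes the standard simple-regression formulas hᵢ = 1/n + (xᵢ – x̄)² / Σ(x – x̄)² and Dᵢ = eᵢ²·hᵢ / (2·s²·(1 – hᵢ)²) with s² = SSE/(n – 2). The function names and the hours/score data are made up for illustration.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def diagnostics(xs, ys):
    """Return one (x, residual, leverage, cooks_d) tuple per data point."""
    n = len(xs)
    a, b = lsrl(xs, ys)
    x_bar = mean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s2 = sum(e ** 2 for e in residuals) / (n - 2)  # residual variance
    rows = []
    for x, e in zip(xs, residuals):
        h = 1 / n + (x - x_bar) ** 2 / sxx           # leverage
        d = (e ** 2 / (2 * s2)) * h / (1 - h) ** 2   # 2 = number of coefficients (a and b)
        rows.append((x, e, h, d))
    return rows

# Hypothetical data: hours studied vs. test score, with one x-value far from the rest.
hours = [1, 2, 3, 4, 5, 12]
scores = [52, 58, 61, 67, 70, 60]

rows = diagnostics(hours, scores)
for x, e, h, d in rows:
    print(f"x={x:>2}  residual={e:7.2f}  leverage={h:.3f}  D={d:.3f}")
```

In this made-up data the point at x = 12 has a smaller residual than several other points, yet it is the only one with D > 1. Its leverage is what makes it influential.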

Step-by-Step / Process Flow

How to analyze outliers and influential points in an FRQ:

  1. Plot the data and regression line
    • Create a scatterplot (2nd → Y= → Plot1 → L1, L2).
    • Run linear regression (STAT → CALC → 8:LinReg(a+bx) L1, L2, Y1).
    • Sketch the LSRL on the scatterplot.

  2. Identify potential outliers
    • Calculate residuals (e = y – ŷ) or use L3 = RESID.
    • Check for points with large residuals (visually or using the 1.5×IQR rule).
    • Look for points far from the regression line in the scatterplot.

  3. Check for influential points
    • Leverage: Points with x-values far from x̄ (e.g., min/max x).
    • Cook’s Distance: Calculate D for each point (if D > 1, it’s influential).
    • Remove the point: Re-run regression without it. If the slope/intercept changes substantially, the point is influential.

  4. Interpret the impact
    • Does the point strengthen or weaken r? (Compare r with/without the point.)
    • Does it change the slope’s sign or magnitude? (Compare b with/without the point.)
    • Does it affect predictions? (Compare ŷ for key x-values.)

  5. Draw conclusions
    • If the point is not influential, keep it in the model.
    • If it’s influential, consider:
      • Removing it (if it’s a data entry error).
      • Reporting results with and without the point.
      • Using a nonlinear model if the residual plot suggests curvature.
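
The process above can be sketched in Python as a study aid: fit the line, flag residuals outside the 1.5×IQR fences, then refit without each flagged point and compare. This is a minimal sketch, not the calculator procedure. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted residuals (other quartile conventions can give slightly different fences), and the data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, median

def fit(xs, ys):
    """Return (intercept a, slope b, correlation r) for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / sqrt(sxx * syy)

def iqr_fences(values):
    """Return the lower and upper 1.5*IQR fences (quartiles = medians of the halves)."""
    v = sorted(values)
    half = len(v) // 2
    q1, q3 = median(v[:half]), median(v[-half:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data: the last student studied 1 hour but scored 95.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 1]
scores = [52, 58, 63, 69, 74, 80, 85, 91, 95]

# Steps 1-2: fit the line and flag residuals outside the fences.
a, b, r = fit(hours, scores)
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]
low, high = iqr_fences(residuals)
flagged = [i for i, e in enumerate(residuals) if e < low or e > high]

# Steps 3-4: refit without each flagged point and compare slope and r.
for i in flagged:
    a_wo, b_wo, r_wo = fit(hours[:i] + hours[i + 1:], scores[:i] + scores[i + 1:])
    print(f"point ({hours[i]}, {scores[i]}): residual {residuals[i]:.1f}")
    print(f"  slope {b:.2f} -> {b_wo:.2f},  r {r:.2f} -> {r_wo:.2f}")
```

Here only the (1, 95) point is flagged, and removing it raises both the slope and r, so it is an outlier that is also influential.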

Common Mistakes

  • Mistake: Assuming all outliers are influential. Correction: Not all outliers have high leverage. A point with a large residual but x near x̄ may not change the regression line much. Check leverage and Cook’s Distance!

  • Mistake: Ignoring the residual plot when assessing linearity. Correction: A scatterplot alone can hide nonlinear patterns. Always check the residual plot—a curved pattern means the linear model is inappropriate, even if r is high.

  • Mistake: Deleting influential points without justification. Correction: Only remove points if they’re data errors (e.g., typos) or not representative of the population. Never remove points just to improve r or r².

  • Mistake: Confusing r with r². Correction: r measures strength/direction of the linear relationship; r² measures proportion of variance explained. An influential point can change r from 0.8 to 0.3 (big impact) but r² from 0.64 to 0.09 (even bigger impact).

  • Mistake: Forgetting to turn on DIAGNOSTIC for r and r² on the TI-84. Correction: Always run 2nd → 0 (CATALOG) → DiagnosticOn → ENTER before regression. Otherwise, r and r² won’t display!


AP Exam Insights

  • Tricky Distinction: The AP exam loves to test whether a point is an outlier, influential, both, or neither. Remember:
    • Outlier = large residual.
    • Influential = changes the regression line (often high leverage).
    • A point can be one, both, or neither!

  • Common FRQ Setup:
    • Given a scatterplot with a labeled point (e.g., "Point A").
    • Asked to:
      1. Explain why Point A is an outlier.
      2. Determine if Point A is influential (often requires re-running regression without it).
      3. Describe how removing Point A affects r, r², or the slope.

  • Calculator Pitfall: Students forget to store the regression equation in Y1, making it hard to calculate residuals or make predictions. Always use LinReg(a+bx) L1, L2, Y1!

  • Context Matters: The AP exam expects contextual explanations. For example:
    • ✗ "Removing the point increases r."
    • ✓ "Removing the point (a student who studied 1 hour but scored 95%) increases r because the remaining data better fits a linear model."

Quick Check Questions

1. Multiple Choice

A regression analysis of y = house price (in $1000s) vs. x = square footage yields the following:
  • LSRL: ŷ = 50 + 0.1x
  • r = 0.85
  • Residual for a 2,000 sq. ft. house: −$150,000

Which of the following is true?
(A) The house is an outlier but not influential.
(B) The house is influential but not an outlier.
(C) The house is both an outlier and influential.
(D) The house is neither an outlier nor influential.

Answer: (A) The house is an outlier (large residual: −$150,000) but not necessarily influential (we’d need to check leverage/Cook’s Distance).
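
To see how large that residual is in context, here is the arithmetic as a short Python sketch. Units are $1000s, as in the question; the function name is made up for illustration.

```python
def predict_price(sqft):
    """LSRL from the question: y-hat = 50 + 0.1x, with price in $1000s."""
    return 50 + 0.1 * sqft

predicted = predict_price(2000)   # predicted price in $1000s
residual = -150                   # given in the question, in $1000s
observed = predicted + residual   # because residual = observed - predicted

print(f"predicted: ${predicted * 1000:,.0f}")
print(f"observed:  ${observed * 1000:,.0f}")
```

The line predicts about $250,000 for this house, so a residual of −$150,000 means it actually sold for about $100,000.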


2. FRQ Part

A researcher fits an LSRL to predict y = crop yield (kg) from x = fertilizer amount (g). The data are below:

x (g)    y (kg)
10       20
20       30
30       45
40       50
50       60
100      20

a. Identify the potential outlier. Explain why it might be an outlier.
b. Without calculating, explain whether this point is likely influential. Justify your answer.

Answer:
a. The point (100, 20) is a potential outlier because its y-value (20 kg) is far below what the trend of the other five points suggests. Those five points have LSRL ŷ = 11 + 1x, which gives ŷ = 111 kg at x = 100, so the point sits about 91 kg below that trend.
b. It is likely influential because its x-value (100 g) is far from the mean x (≈ 42 g), giving it high leverage. Removing it would likely increase the slope and r.
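
Part (b) asks for reasoning without calculation, but when studying it helps to confirm the reasoning numerically. The sketch below fits the LSRL to the table with and without (100, 20), using the usual least-squares formulas. The function name `fit` is made up for illustration.

```python
from math import sqrt
from statistics import mean

def fit(xs, ys):
    """Return (intercept a, slope b, correlation r) for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / sqrt(sxx * syy)

# Data from the table above.
fertilizer = [10, 20, 30, 40, 50, 100]
crop_yield = [20, 30, 45, 50, 60, 20]

a_all, b_all, r_all = fit(fertilizer, crop_yield)                 # all six points
a_five, b_five, r_five = fit(fertilizer[:-1], crop_yield[:-1])    # without (100, 20)

trend_at_100 = a_five + b_five * 100          # what the other five points predict at x = 100
resid_all = 20 - (a_all + b_all * 100)        # residual of (100, 20) from the six-point line

print(f"all six points:   slope {b_all:.3f}, r {r_all:.3f}")
print(f"without (100,20): slope {b_five:.3f}, r {r_five:.3f}")
print(f"five-point trend at x = 100: {trend_at_100:.0f} kg")
print(f"residual of (100, 20) from the six-point line: {resid_all:.1f} kg")
```

With all six points the slope is slightly negative and r is close to 0; without (100, 20) the slope is 1 kg per gram and r is about 0.99. The point's residual from the six-point line is only about −15 kg, because a high-leverage point pulls the line toward itself. That is why comparing the fit with and without the point is more revealing than its residual alone.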


Last-Minute Cram Sheet

  1. Outlier = large residual (use 1.5×IQR rule for residuals).
  2. Influential point = changes regression line (check leverage/Cook’s Distance).
  3. Leverage = how far x is from x̄ (high leverage = potential influence).
  4. Cook’s Distance (D) > 1 = influential point. Not on TI-84 by default—calculate manually!
  5. Residual plot should show no pattern for a good linear fit.
  6. TI-84: LinReg(a+bx) L1, L2, Y1 → stores equation in Y1 for predictions.
  7. TI-84: L3 = RESID → pre-calculated residuals.
  8. TI-84: DiagnosticOn → shows r and r² in regression output.
  9. Removing an influential point can increase or decrease r—don’t assume!
  10. Always justify in context! (e.g., "The point is an outlier because its residual is 3 SDs from the mean.")