Study Guide: TECH **Hypothesis Testing in Python (t-test, Chi-Square, p-values) – Zero-Fluff Study Guide**
Source: https://www.fatskills.com/introdution-to-engineering/chapter/tech-hypothesis-testing-in-python-t-test-chi-square-p-values-zero-fluff-study-guide

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

For Data Scientists who need to validate assumptions, debug experiments, and ship statistically sound models.


1. What This Is & Why It Matters

Hypothesis testing is how you check whether your data are consistent with an assumption. It quantifies evidence; it does not prove anything outright. Think of it like a courtroom trial for your model’s predictions:
- Null Hypothesis (H₀): "The defendant (your model’s assumption) is innocent (correct)."
- Alternative Hypothesis (H₁): "The defendant is guilty (wrong)."
- p-value: The probability of seeing data at least as extreme as yours if H₀ were true. If p < 0.05, you "reject H₀" (guilty verdict).
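
To make the p-value definition concrete, here is a minimal permutation-test sketch. The conversion rates, group sizes, and seed are made up for illustration. It shuffles the group labels (which is what "H₀ is true" means here) and counts how often a shuffled difference is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical 0/1 conversion outcomes for two groups (made-up rates).
control = rng.binomial(1, 0.10, size=2000)
treatment = rng.binomial(1, 0.13, size=2000)
observed = treatment.mean() - control.mean()

# Under H0 the group labels are interchangeable, so shuffle them many times
# and record the difference in means each time.
pooled = np.concatenate([control, treatment])
n_perm = 5000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:2000].mean() - shuffled[2000:].mean()

# Two-sided p-value: share of shuffles at least as extreme as what we observed.
p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
print(f"observed difference: {observed:.4f}, permutation p-value: {p_value:.4f}")
```

The t-test and chi-square test used later in this guide reach the same kind of number through a formula instead of shuffling.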

Why this matters in production:
- A/B tests: Did your new recommendation algorithm actually improve click-through rates, or was it luck?
- Feature selection: Does this new feature statistically improve model accuracy, or is it noise?
- Data drift: Is today’s customer behavior significantly different from last month’s? (If yes, retrain your model.)
- Regulatory compliance: If you’re in healthcare/finance, you must prove your model’s decisions aren’t biased (e.g., chi-square for fairness testing).

Real-world scenario:
You’re a DS at an e-commerce company. Your team launches a new checkout UI, and conversion rates look higher. But your boss asks: "Is this a real improvement, or just random noise? Should we roll it out to all users?" Hypothesis testing gives you the answer.


2. Core Concepts & Components

| Term | Definition | Production Insight |
|---|---|---|
| Null Hypothesis (H₀) | Default assumption: "No effect" or "No difference." | Always start here. If you can’t reject H₀, your "improvement" is likely noise. |
| Alternative Hypothesis (H₁) | What you want to show: "There is an effect." | You never "prove" H₁ or "accept" H₀; you either reject H₀ or fail to reject it. |
| p-value | Probability of observing data at least as extreme as yours if H₀ were true. | ⚠️ p < 0.05 ≠ "H₁ is true." It means data this extreme would be unlikely if H₀ were true. |
| Significance Level (α) | Threshold for rejecting H₀ (usually 0.05). | If α = 0.05 and H₀ is true, you’ll wrongly reject it about 5% of the time (Type I error). |
| t-test | Tests if the means of two groups are different. | Use for continuous data (e.g., "Do users spend more with the new UI?"). |
| Independent t-test | Compares means of two independent groups (e.g., control vs. treatment). | The standard (Student’s) version assumes equal variance; use Welch’s t-test if variances differ. |
| Paired t-test | Compares means of the same group before/after (e.g., pre/post-treatment). | More powerful than independent t-test when data is paired. |
| Chi-square test | Tests if categorical variables are independent (e.g., "Is gender related to purchase?"). | Use for A/B test results on categorical outcomes, fairness audits, or feature selection. |
| Degrees of Freedom (df) | Number of values free to vary in a test. | For the independent (pooled) t-test: df = n₁ + n₂ - 2. For chi-square: df = (rows - 1) × (cols - 1). |
| Effect Size | Measures the magnitude of the difference (e.g., Cohen’s d). | A "significant" p-value doesn’t mean the effect is meaningful. Always report effect size. |
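
The Type I error row is easy to check empirically. The sketch below runs many "A/A tests" where both groups come from the same distribution, so H₀ is true by construction, and counts how often p falls below α. The sample size, number of experiments, and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 2000

false_positives = 0
for _ in range(n_experiments):
    # Both groups are drawn from the same distribution, so any "effect" is noise.
    a = rng.normal(loc=0.0, scale=1.0, size=50)
    b = rng.normal(loc=0.0, scale=1.0, size=50)
    _, p = stats.ttest_ind(a, b, equal_var=True)
    false_positives += int(p < alpha)

type_i_rate = false_positives / n_experiments
print(f"Share of A/A tests with p < {alpha}: {type_i_rate:.3f}")
```

The printed share should sit near 0.05, which is exactly what choosing α = 0.05 signs you up for.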


3. Step-by-Step Hands-On: Running Hypothesis Tests with scipy


Prerequisites

  • Python 3.8+ (use python --version to check).
  • Install the dependencies: pip install scipy pandas numpy matplotlib
  • A dataset with two groups to compare (we’ll use a synthetic A/B test dataset).


Task: Validate an A/B Test for a New Checkout UI

Goal: Determine if the new UI statistically improves conversion rates.


Step 1: Load and Inspect Data

import pandas as pd
import numpy as np
from scipy import stats

# Load synthetic A/B test data (conversion = 1 if purchased, 0 otherwise)
data = pd.read_csv("ab_test_data.csv")  # Columns: user_id, group (control/treatment), conversion
print(data.head())
print("\nGroup sizes:", data["group"].value_counts())

Expected output:


   user_id      group  conversion
0        1  treatment           1
1        2    control           0
2        3  treatment           0
3        4    control           1
4        5  treatment           1

Group sizes: treatment    5000
control 5000
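
If you don't have an ab_test_data.csv on hand, a stand-in with the same columns can be generated. This is a sketch under stated assumptions: the true conversion rates (12.3% and 13.8%) and the seed are illustrative choices, so the test statistics you get will not match the example outputs in this guide.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_per_group = 5000

# Assumed true conversion rates; chosen only for illustration.
rates = {"control": 0.123, "treatment": 0.138}

# 5000 "control" labels followed by 5000 "treatment" labels, with matching outcomes.
group = np.repeat(list(rates), n_per_group)
conversion = np.concatenate(
    [rng.binomial(1, rate, size=n_per_group) for rate in rates.values()]
)

data = pd.DataFrame({"group": group, "conversion": conversion})
data = data.sample(frac=1, random_state=7).reset_index(drop=True)  # shuffle rows
data.insert(0, "user_id", np.arange(1, len(data) + 1))

# data.to_csv("ab_test_data.csv", index=False)  # uncomment to write the file
print(data.groupby("group")["conversion"].mean())
```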

Step 2: Check Assumptions

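  • t-test: Data should be continuous (binary 0/1 outcomes work as proportions when samples are large), the group means should be approximately normally distributed (with thousands of users per group, the central limit theorem covers this), and the standard version assumes equal variance.
  • Chi-square: Data should be categorical, with expected frequencies of at least 5 per cell.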
# Check normality (for t-test)
control = data[data["group"] == "control"]["conversion"]
treatment = data[data["group"] == "treatment"]["conversion"]

# Plot distributions (optional)
import matplotlib.pyplot as plt
plt.hist(control, alpha=0.5, label="Control")
plt.hist(treatment, alpha=0.5, label="Treatment")
plt.legend()
plt.show()

# Check variance equality (Levene's test)
levene_stat, levene_p = stats.levene(control, treatment)
print(f"Levene's test p-value: {levene_p:.4f}")  # If p > 0.05, there is no evidence that the variances differ

Output:


Levene's test p-value: 0.1234  # No evidence of unequal variances (standard t-test is reasonable)
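
The Levene check can be folded into a small helper that picks Student's or Welch's t-test for you. This is a sketch, not a prescribed workflow: compare_means is a hypothetical name, the sample data are synthetic, and many practitioners skip the check and default to Welch's test (equal_var=False).

```python
import numpy as np
from scipy import stats

def compare_means(a, b, alpha=0.05):
    """Run Levene's test, then Student's or Welch's t-test accordingly."""
    _, levene_p = stats.levene(a, b)
    equal_var = bool(levene_p > alpha)  # True when there is no evidence of unequal variances
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
    return {"levene_p": levene_p, "equal_var": equal_var, "t": t_stat, "p": p_value}

rng = np.random.default_rng(1)
# Hypothetical order values: same mean, very different spread.
narrow = rng.normal(loc=50, scale=5, size=200)
wide = rng.normal(loc=50, scale=20, size=200)

result = compare_means(narrow, wide)
print(result)
```

With spreads this different, Levene's test flags the variances and the helper falls back to Welch's test.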

Step 3: Run the Appropriate Test

Option A: t-test (for continuous data; on 0/1 conversions with large samples it behaves like a two-proportion z-test)


t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=True)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")

Output:


t-statistic: 2.867, p-value: 0.0042  # Reject H₀: new UI improves conversions!

Option B: Chi-square (if data is categorical)


# Create contingency table
contingency_table = pd.crosstab(data["group"], data["conversion"])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square statistic: {chi2_stat:.3f}, p-value: {p_value:.4f}")

Output:


Chi-square statistic: 8.212, p-value: 0.0042  # Same conclusion!
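
Why do the two options agree? On a 2x2 table the uncorrected chi-square statistic is almost exactly the square of the pooled t-statistic. One caveat worth knowing: stats.chi2_contingency applies Yates' continuity correction by default on 2x2 tables, which nudges the p-value up slightly. The sketch below uses hypothetical counts (615 and 690 conversions out of 5000 each) to show all three numbers side by side.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: 615 of 5000 control users and 690 of 5000 treatment users convert.
control = np.repeat([1, 0], [615, 4385])
treatment = np.repeat([1, 0], [690, 4310])

table = np.array([
    [4385, 615],   # control: not converted, converted
    [4310, 690],   # treatment: not converted, converted
])

t_stat, t_p = stats.ttest_ind(treatment, control, equal_var=True)
chi2_raw, p_raw, _, _ = stats.chi2_contingency(table, correction=False)
chi2_yates, p_yates, _, _ = stats.chi2_contingency(table)  # default: Yates' correction on 2x2

print(f"t^2 = {t_stat**2:.3f}, chi2 (no correction) = {chi2_raw:.3f}, chi2 (Yates) = {chi2_yates:.3f}")
print(f"p-values: t-test {t_p:.4f}, chi2 {p_raw:.4f}, chi2 with Yates {p_yates:.4f}")
```

If your chi-square p-value is a touch higher than the t-test's, the continuity correction is the likely reason.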

Step 4: Interpret Results

  • p-value = 0.0042 (well below 0.05) → Reject H₀: the data support the new UI improving conversions.
  • Effect size (Cohen’s d for t-test):

    mean_diff = treatment.mean() - control.mean()
    # Pooled standard deviation (simple average of the two variances; fine for equal group sizes)
    pooled_std = np.sqrt((treatment.std()**2 + control.std()**2) / 2)
    cohen_d = mean_diff / pooled_std
    print(f"Cohen's d: {cohen_d:.3f}")  # 0.2 = small, 0.5 = medium, 0.8 = large

    Output:

    Cohen's d: 0.127  # Below the conventional "small" threshold, but statistically significant
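
Stakeholders usually find a confidence interval on the lift easier to act on than Cohen's d. Here is a minimal sketch of a 95% interval for the difference in conversion rates, using the normal approximation. The counts are hypothetical, picked to match rates of 12.3% and 13.8% with 5000 users per group.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, chosen to match conversion rates of 12.3% and 13.8%.
n_control, conv_control = 5000, 615
n_treatment, conv_treatment = 5000, 690

p_c = conv_control / n_control
p_t = conv_treatment / n_treatment
diff = p_t - p_c

# Unpooled standard error of the difference between two proportions.
se = np.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
z = stats.norm.ppf(0.975)  # about 1.96 for a 95% interval
ci_low, ci_high = diff - z * se, diff + z * se

print(f"Absolute lift: {diff:.3%} (95% CI {ci_low:.3%} to {ci_high:.3%})")
```

An interval that excludes zero tells the same story as p < 0.05, and its width shows how precisely the lift is pinned down.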

Step 5: Report Findings

Template for stakeholders:


"The new checkout UI increased conversion rates from 12.3% to 13.8% (p = 0.0042, Cohen’s d = 0.13). While the effect is small, it is statistically significant. We recommend rolling out the new UI to all users."




4. Production-Ready Best Practices


Statistical Rigor

  • Always check assumptions (normality, variance equality) before running tests.
  • Report effect sizes (e.g., Cohen’s d, Cramer’s V) alongside p-values. A "significant" p-value doesn’t mean the effect is meaningful.
  • Use Bonferroni correction for multiple comparisons (e.g., if testing 10 hypotheses, set α = 0.05/10 = 0.005).
  • Power analysis before running experiments: Use statsmodels.stats.power to determine sample size needed to detect an effect.
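
The Bonferroni arithmetic from the list above, as a short sketch. The ten p-values are invented for illustration.

```python
import numpy as np

# Hypothetical p-values from 10 separate metric comparisons.
p_values = np.array([0.001, 0.004, 0.012, 0.030, 0.041, 0.049, 0.120, 0.350, 0.600, 0.900])
alpha = 0.05

naive_hits = p_values < alpha              # no correction
bonferroni_alpha = alpha / len(p_values)   # 0.05 / 10 = 0.005
bonferroni_hits = p_values < bonferroni_alpha

print(f"Significant without correction: {naive_hits.sum()}")
print(f"Significant at alpha = {bonferroni_alpha:.3f}: {bonferroni_hits.sum()}")
```

Six of these comparisons clear 0.05, but only two survive the corrected threshold of 0.005.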

Code Maintainability

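  • Wrap tests in functions for reusability:

    def run_ab_test(control, treatment, test_type="t"):
        if test_type == "t":
            return stats.ttest_ind(treatment, control, equal_var=True)
        elif test_type == "chi2":
            # crosstab needs group labels vs. outcomes, not the two samples side by side
            outcome = np.concatenate([control, treatment])
            group = np.repeat(["control", "treatment"], [len(control), len(treatment)])
            return stats.chi2_contingency(pd.crosstab(group, outcome))
        raise ValueError(f"Unknown test_type: {test_type}")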
  • Log test parameters (e.g., sample sizes, p-values, effect sizes) for reproducibility.
  • Use pingouin for advanced tests (e.g., ANOVA, post-hoc tests): pip install pingouin
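
One way to log test parameters is to emit a single JSON record per test. This is a sketch only: run_and_log_ttest, the logger name, and the record fields are hypothetical choices, and the data are synthetic.

```python
import json
import logging

import numpy as np
from scipy import stats

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ab_test")

def run_and_log_ttest(control, treatment, test_name):
    """Run an independent t-test and log the inputs and results as one JSON record."""
    t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=True)
    record = {
        "test_name": test_name,
        "n_control": int(len(control)),
        "n_treatment": int(len(treatment)),
        "mean_control": float(np.mean(control)),
        "mean_treatment": float(np.mean(treatment)),
        "t_stat": float(t_stat),
        "p_value": float(p_value),
    }
    logger.info(json.dumps(record))
    return record

rng = np.random.default_rng(3)
record = run_and_log_ttest(rng.normal(10, 2, 500), rng.normal(10.3, 2, 500), "checkout_ui_v2")
```

Casting to plain int and float keeps the record JSON-serializable, since NumPy scalars are not.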

Business Impact

  • Align tests with business goals (e.g., "Does this feature increase revenue?" vs. "Does it increase clicks?").
  • Set a minimum detectable effect (MDE) before running tests (e.g., "We only care if conversion improves by ≥2%").
  • Monitor for Simpson’s Paradox (e.g., a test looks positive overall but negative for key segments).
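
Simpson's Paradox is easier to spot once you have seen it in numbers. The sketch below uses made-up counts in which the treatment converts worse in both segments yet looks better overall, purely because the two groups have a different desktop/mobile mix.

```python
import pandas as pd

# Hypothetical aggregated results: users and conversions per segment and group.
results = pd.DataFrame({
    "segment":     ["desktop", "desktop", "mobile", "mobile"],
    "group":       ["control", "treatment", "control", "treatment"],
    "users":       [200, 800, 800, 200],
    "conversions": [60, 224, 80, 16],
})

# Conversion rate within each segment.
results["rate"] = results["conversions"] / results["users"]
by_segment = results.pivot(index="segment", columns="group", values="rate")

# Conversion rate after pooling the segments.
totals = results.groupby("group")[["users", "conversions"]].sum()
overall = totals["conversions"] / totals["users"]

print(by_segment)
print(overall)
```

In a randomized A/B test, a segment mix this lopsided usually points to a bucketing problem, so it is worth checking group sizes per segment before trusting the pooled result.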


5. ⚠️ Common Mistakes & Traps

| Mistake | Symptom | Fix/Prevention |
|---|---|---|
| P-hacking (running tests until p < 0.05) | "Significant" results that don’t replicate. | Pre-register hypotheses and analysis plans. Use Bonferroni correction. |
| Ignoring effect size | "Significant" p-value but tiny effect. | Always report effect size (e.g., Cohen’s d, Cramer’s V). |
| Using t-test for non-normal data | False positives/negatives. | Use non-parametric tests (e.g., Mann-Whitney U) or transform data (log, sqrt). |
| Chi-square with small samples | Expected frequencies < 5 in contingency table. | Use Fisher’s exact test for small samples. |
| Confusing statistical vs. practical significance | "Significant" result with no business impact. | Set a minimum detectable effect (MDE) before running tests. |
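
For the small-sample chi-square trap, the expected counts returned by stats.chi2_contingency tell you when to switch to Fisher's exact test. A minimal sketch with a hypothetical 2x2 table:

```python
import numpy as np
from scipy import stats

# Hypothetical small 2x2 table (rows: group, columns: outcome).
table = np.array([
    [8, 2],
    [1, 5],
])

# chi2_contingency also returns the expected counts under independence.
_, chi2_p, _, expected = stats.chi2_contingency(table)
print("Smallest expected count:", expected.min())

if expected.min() < 5:
    # Too sparse for the chi-square approximation; use the exact test instead.
    odds_ratio, p_value = stats.fisher_exact(table)
    print(f"Fisher's exact test p-value: {p_value:.4f}")
else:
    p_value = chi2_p
    print(f"Chi-square p-value: {p_value:.4f}")
```

Note that stats.fisher_exact handles 2x2 tables; larger sparse tables need a different approach.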


6. Exam/Certification Focus

Typical question patterns:
  1. Interpret a p-value:
    "A t-test returns p = 0.03. What does this mean?"
    - ❌ "The null hypothesis is false."
    - ✅ "There’s a 3% chance of observing data at least this extreme if the null hypothesis were true."

  2. Choose the right test:
    "You want to compare the mean heights of two groups. Which test?"
    - ✅ Independent t-test (if normally distributed).
    - ❌ Chi-square (for categorical data).

  3. Effect size vs. p-value:
    "A test has p = 0.01 and Cohen’s d = 0.02. What’s the takeaway?"
    - ✅ "Statistically significant but practically negligible."

  4. Chi-square assumptions:
    "When can’t you use a chi-square test?"
    - ✅ When >20% of cells in the contingency table have expected frequencies < 5.

Key trap distinctions:
- t-test vs. chi-square:
  - t-test: Continuous data, compares means.
  - Chi-square: Categorical data, tests independence.
- Independent vs. paired t-test:
  - Independent: Two separate groups (e.g., control vs. treatment).
  - Paired: Same group before/after (e.g., pre/post-treatment).
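
The independent vs. paired distinction matters in practice because the paired test removes user-to-user variation. A sketch with synthetic pre/post data (the means, spreads, and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Hypothetical pre/post measurements for the same 30 users.
before = rng.normal(loc=50, scale=10, size=30)
after = before + rng.normal(loc=2, scale=1, size=30)  # each user shifts by about 2

_, p_paired = stats.ttest_rel(after, before)
_, p_independent = stats.ttest_ind(after, before, equal_var=True)

print(f"Paired t-test p-value:      {p_paired:.6f}")
print(f"Independent t-test p-value: {p_independent:.4f}")
```

The same data give a tiny paired p-value and an unimpressive independent one, because the independent test drowns a consistent per-user shift in between-user noise.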


7. Hands-On Challenge (with Solution)

Challenge:
You’re given a dataset of customer satisfaction scores (1-5) for two product versions. Run a test to determine if Version B is statistically better than Version A.

Data:


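import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(0)  # fixed seed so the results are reproducible
data = pd.DataFrame({
    "version": ["A"] * 100 + ["B"] * 100,
    "score": np.concatenate([np.random.normal(3.5, 1, 100), np.random.normal(3.7, 1, 100)]),
})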

Solution:


a_scores = data[data["version"] == "A"]["score"]
b_scores = data[data["version"] == "B"]["score"]

# Check normality (Shapiro-Wilk test)
print("Shapiro-Wilk p-values:", stats.shapiro(a_scores).pvalue, stats.shapiro(b_scores).pvalue)

# Run t-test (assuming normality)
t_stat, p_value = stats.ttest_ind(b_scores, a_scores, equal_var=True)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")

# Effect size (Cohen's d)
mean_diff = b_scores.mean() - a_scores.mean()
pooled_std = np.sqrt((a_scores.std()**2 + b_scores.std()**2) / 2)
cohen_d = mean_diff / pooled_std
print(f"Cohen's d: {cohen_d:.3f}")

Why it works:
- Shapiro-Wilk checks normality (p > 0.05 → no evidence against normality).
- t-test compares means of two independent groups.
- Cohen’s d quantifies the effect size.
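
Since the challenge asks whether Version B is better (not merely different), a one-sided test is a reasonable variant. This sketch assumes SciPy 1.6 or newer, where ttest_ind accepts the alternative argument. The data are regenerated here with an arbitrary seed so the block runs on its own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
a_scores = rng.normal(3.5, 1, 100)
b_scores = rng.normal(3.7, 1, 100)

t_stat, p_two_sided = stats.ttest_ind(b_scores, a_scores, equal_var=True)
# H1: mean(B) > mean(A). Requires SciPy 1.6 or newer.
_, p_one_sided = stats.ttest_ind(b_scores, a_scores, equal_var=True, alternative="greater")

print(f"t = {t_stat:.3f}, two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one. Decide on the direction before looking at the data, otherwise this halving becomes a form of p-hacking.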


8. Rapid-Reference Crib Sheet

| Task | Code | Notes |
|---|---|---|
| Independent t-test | stats.ttest_ind(group1, group2, equal_var=True) | Use equal_var=False if variances differ (Welch’s t-test). |
| Paired t-test | stats.ttest_rel(before, after) | For same subjects before/after. |
| Chi-square test | stats.chi2_contingency(pd.crosstab(group, outcome)) | Check that expected frequencies are at least 5. |
| Mann-Whitney U | stats.mannwhitneyu(group1, group2) | Non-parametric alternative to t-test. |
| Shapiro-Wilk test | stats.shapiro(data) | Tests normality (p > 0.05 → no evidence against normality). |
| Levene’s test | stats.levene(group1, group2) | Tests equal variance (p > 0.05 → no evidence of unequal variance). |
| Effect size (Cohen’s d) | (mean1 - mean2) / pooled_std | 0.2 = small, 0.5 = medium, 0.8 = large. |
| Bonferroni correction | α = 0.05 / n_tests | Adjusts p-value threshold for multiple comparisons. |
| Power analysis | from statsmodels.stats.power import TTestIndPower; analysis = TTestIndPower(); analysis.solve_power(effect_size=0.5, nobs1=None, alpha=0.05, power=0.8) | Calculates required sample size per group. |
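
To see what the power-analysis row computes, here is a SciPy-only sketch that derives the power of a two-sided independent t-test from the noncentral t distribution and searches for the smallest per-group sample size. ttest_power and required_n are hypothetical helper names, and the sketch assumes equal group sizes.

```python
import numpy as np
from scipy import stats

def ttest_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, equal-n independent t-test for a given Cohen's d."""
    df = 2 * n_per_group - 2
    noncentrality = effect_size * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |T| exceeds the critical value when the true effect is effect_size.
    return stats.nct.sf(t_crit, df, noncentrality) + stats.nct.cdf(-t_crit, df, noncentrality)

def required_n(effect_size, alpha=0.05, power=0.8):
    """Smallest per-group sample size that reaches the target power."""
    n = 2
    while ttest_power(effect_size, n, alpha) < power:
        n += 1
    return n

n_needed = required_n(effect_size=0.5)
print(f"Users needed per group for d = 0.5: {n_needed}")
```

The result should land close to what TTestIndPower reports, rounded up to a whole user. Halving the effect size roughly quadruples the required sample, which is why setting the MDE up front matters.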


9. Where to Go Next

  1. SciPy Stats Documentation – Official docs for all tests.
  2. Pingouin – Advanced statistical tests (ANOVA, post-hoc, etc.).
  3. Book: Practical Statistics for Data Scientists (Peter Bruce) – Covers hypothesis testing in depth.
  4. StatQuest: Hypothesis Testing – Best YouTube explanation (Josh Starmer).

