Fatskills
Practice. Master. Repeat.
Study Guide: TECH **Hypothesis Testing in Python (t-test, Chi-Square, p-values) – Zero-Fluff Study Guide**
Source: https://www.fatskills.com/introdution-to-engineering/chapter/tech-hypothesis-testing-in-python-t-test-chi-square-p-values-zero-fluff-study-guide

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

For Data Scientists who need to validate assumptions, debug experiments, and ship statistically sound models.

1. What This Is & Why It Matters

Hypothesis testing is how you check whether your data are consistent with a default assumption. Think of it like a courtroom trial for a claim about your data:
- Null Hypothesis (H₀): "The defendant is innocent." This is the default position: no effect, no difference.
- Alternative Hypothesis (H₁): "The defendant is guilty." There is an effect.
- p-value: The probability of seeing data at least as extreme as yours if H₀ were true. If p < 0.05, you "reject H₀" (guilty verdict). Otherwise you "fail to reject H₀," which is not the same as proving innocence.
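To make the p-value definition concrete, here is a minimal sketch of a permutation test on made-up numbers. The two lists and the `permutation_p_value` helper are illustrative assumptions, not part of this guide's dataset. Shuffling the group labels simulates a world where H₀ is true; the p-value is the share of shuffles that produce a difference at least as large as the observed one.

```python
import numpy as np

def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Relabel observations at random: under H0 the labels carry no information
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0
    return (hits + 1) / (n_permutations + 1)

# Made-up example data: group_b is shifted upward relative to group_a
group_a = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]
group_b = [4.6, 4.4, 4.8, 4.5, 4.7, 4.3, 4.9, 4.5]
p = permutation_p_value(group_a, group_b)
print(f"Permutation p-value: {p:.4f}")
```

A small p here means that random relabeling rarely reproduces a gap this large, which is exactly what "unlikely under H₀" refers to.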

Why this matters in production:
- A/B tests: Did your new recommendation algorithm actually improve click-through rates, or was it luck?
- Feature selection: Does this new feature statistically improve model accuracy, or is it noise?
- Data drift: Is today’s customer behavior significantly different from last month’s? (If yes, retrain your model.)
- Regulatory compliance: If you’re in healthcare/finance, you must prove your model’s decisions aren’t biased (e.g., chi-square for fairness testing).

Real-world scenario:
You’re a DS at an e-commerce company. Your team launches a new checkout UI, and conversion rates look higher. But your boss asks: "Is this a real improvement, or just random noise? Should we roll it out to all users?" Hypothesis testing gives you the answer.

2. Core Concepts & Components

| Term | Definition | Production Insight |
| --- | --- | --- |
| Null Hypothesis (H₀) | Default assumption: "No effect" or "No difference." | Always start here. If you can’t reject H₀, you have no evidence that your "improvement" is more than noise. |
| Alternative Hypothesis (H₁) | What you want to show: "There is an effect." | You either reject H₀ in favor of H₁ or fail to reject H₀. Never say you "accept" H₀. |
| p-value | Probability of observing data at least as extreme as yours if H₀ were true. | ⚠️ `p < 0.05` ≠ "H₁ is true," and p is not the probability that H₀ is true. It only says data this extreme would be unusual under H₀. |
| Significance Level (α) | Threshold for rejecting H₀ (usually `0.05`). | If `α = 0.05`, you’ll wrongly reject a true H₀ about 5% of the time (Type I error). |
| t-test | Tests whether the means of two groups differ. | Use for continuous data (e.g., "Do users spend more with the new UI?"). |
| Independent t-test | Compares means of two independent groups (e.g., control vs. treatment). | The standard (Student’s) version assumes equal variances; use Welch’s t-test if variances differ. |
| Paired t-test | Compares means of the same group before/after (e.g., pre/post-treatment). | More powerful than the independent t-test when data is paired. |
| Chi-square test | Tests if categorical variables are independent (e.g., "Is gender related to purchase?"). | Use for A/B test results, fairness audits, or feature importance. |
| Degrees of Freedom (df) | Number of values free to vary in a test. | For the equal-variance t-test: `df = n₁ + n₂ - 2`. For chi-square: `df = (rows-1)*(cols-1)`. |
| Effect Size | Measures the magnitude of the difference (e.g., Cohen’s d). | A "significant" p-value doesn’t mean the effect is meaningful. Always report effect size. |
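The Type I error row can be checked by simulation. The sketch below is an illustration on synthetic data, not part of the A/B dataset: both groups are drawn from the same distribution, so H₀ is true by construction and every rejection is a false positive. The sample sizes and number of repetitions are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 2000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the same distribution, so H0 is true by construction
    a = rng.normal(loc=0.0, scale=1.0, size=50)
    b = rng.normal(loc=0.0, scale=1.0, size=50)
    _, p = stats.ttest_ind(a, b, equal_var=True)
    if p < alpha:
        false_positives += 1

type_i_rate = false_positives / n_experiments
print(f"Share of false positives: {type_i_rate:.3f}")  # expected to land near alpha
```

The share of rejections should hover around α, which is what "wrongly reject H₀ 5% of the time" means in practice.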

3. Step-by-Step Hands-On: Running Hypothesis Tests with `scipy`

Prerequisites

- Python 3.8+ (run `python --version` to check).
- Install scipy and pandas (plus numpy and matplotlib, used below): `pip install scipy pandas numpy matplotlib`
- A dataset with two groups to compare (we’ll use a synthetic A/B test dataset).
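The guide does not ship `ab_test_data.csv`, so one way to follow along is to generate a stand-in file. The sketch below is an assumption throughout: the conversion rates, the seed, and the group sizes are invented for illustration, and only the column names are taken from Step 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_per_group = 5000

# Assumed conversion rates, chosen only for illustration
rates = {"control": 0.12, "treatment": 0.14}

frames = []
for group, rate in rates.items():
    frames.append(pd.DataFrame({
        "group": group,
        "conversion": rng.binomial(n=1, p=rate, size=n_per_group),
    }))

# Shuffle rows so the groups are interleaved, then assign user ids
data = pd.concat(frames, ignore_index=True).sample(frac=1, random_state=0).reset_index(drop=True)
data.insert(0, "user_id", np.arange(1, len(data) + 1))
data.to_csv("ab_test_data.csv", index=False)
print(data["group"].value_counts())
```

Because the rates are made up, the statistics you get from this file will not match the outputs printed in the steps below.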

Task: Validate an A/B Test for a New Checkout UI

Goal: Determine if the new UI statistically improves conversion rates.

Step 1: Load and Inspect Data

import pandas as pd
import numpy as np
from scipy import stats

# Load synthetic A/B test data (conversion = 1 if purchased, 0 otherwise)
data = pd.read_csv("ab_test_data.csv")  # Columns: user_id, group (control/treatment), conversion
print(data.head())
print("\nGroup sizes:", data["group"].value_counts())

Expected output:

   user_id      group  conversion
0        1  treatment           1
1        2    control           0
2        3  treatment           0
3        4    control           1
4        5  treatment           1

Group sizes: treatment    5000
             control      5000

Step 2: Check Assumptions

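t-test: Data should be continuous and roughly normal within each group, and the standard (Student’s) version also assumes equal variances. With large samples the test is robust to non-normality, which is why it still gives sensible answers on 0/1 conversion data here, although a chi-square or two-proportion z-test is the more natural choice for proportions.
Chi-square: Data should be categorical with independent observations, and every expected cell count should be at least 5.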

# Split the conversion column by group
control = data[data["group"] == "control"]["conversion"]
treatment = data[data["group"] == "treatment"]["conversion"]

# Plot distributions (optional)
import matplotlib.pyplot as plt
plt.hist(control, alpha=0.5, label="Control")
plt.hist(treatment, alpha=0.5, label="Treatment")
plt.legend()
plt.show()

# Check variance equality (Levene's test)
levene_stat, levene_p = stats.levene(control, treatment)
print(f"Levene's test p-value: {levene_p:.4f}")  # p > 0.05: no evidence that the variances differ

Output:

Levene's test p-value: 0.1234  # No evidence of unequal variances (the standard t-test is reasonable)
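The chi-square assumption can be checked in code as well. The sketch below uses a small made-up table in place of the real A/B data (the counts are assumptions) and reads the expected counts that `chi2_contingency` returns as its fourth value.

```python
import pandas as pd
from scipy import stats

# Small made-up dataset standing in for the A/B data (assumption for illustration)
df = pd.DataFrame({
    "group": ["control"] * 40 + ["treatment"] * 40,
    "conversion": [1] * 6 + [0] * 34 + [1] * 10 + [0] * 30,
})

observed = pd.crosstab(df["group"], df["conversion"])
# chi2_contingency returns the expected counts under independence as its fourth value
_, _, _, expected = stats.chi2_contingency(observed)

min_expected = expected.min()
print(pd.DataFrame(expected, index=observed.index, columns=observed.columns))
print(f"Smallest expected count: {min_expected:.1f}")
if min_expected < 5:
    print("Expected counts are too small; consider Fisher's exact test.")
```

Note that the rule is about expected counts, not observed ones: a cell with only 6 observed conversions can still pass if its expected count is 5 or more.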

Step 3: Run the Appropriate Test

Option A: t-test (if data is continuous)

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=True)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")

Output:

t-statistic: 2.867, p-value: 0.0042  # Reject H₀: new UI improves conversions!

Option B: Chi-square (if data is categorical)

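# Create contingency table (rows: group, columns: conversion outcome)
contingency_table = pd.crosstab(data["group"], data["conversion"])
# correction=False turns off Yates' continuity correction, which SciPy applies to 2x2 tables by default
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table, correction=False)
print(f"Chi-square statistic: {chi2_stat:.3f}, p-value: {p_value:.4f}")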

Output:

Chi-square statistic: 8.212, p-value: 0.0042  # Same conclusion!
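The agreement between the two options is not a coincidence. For a 2x2 table, the uncorrected Pearson chi-square statistic equals the square of the pooled two-proportion z statistic, and on large samples the t-test lands very close to both. The sketch below checks this on made-up counts (the table is an assumption) and shows how Yates' correction shifts the p-value.

```python
import numpy as np
from scipy import stats

# Made-up counts: [converted, not converted] per group (assumption for illustration)
table = np.array([[600, 4400],   # control
                  [690, 4310]])  # treatment

# Pooled two-proportion z-test computed by hand
conv = table[:, 0]
n = table.sum(axis=1)
p_pool = conv.sum() / n.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z = (conv[1] / n[1] - conv[0] / n[0]) / se
p_z = 2 * stats.norm.sf(abs(z))

# Pearson chi-square without and with Yates' continuity correction
chi2_raw, p_raw, _, _ = stats.chi2_contingency(table, correction=False)
chi2_yates, p_yates, _, _ = stats.chi2_contingency(table, correction=True)

print(f"z^2 = {z**2:.4f}, chi2 (no correction) = {chi2_raw:.4f}")
print(f"p (z-test) = {p_z:.4f}, p (chi2) = {p_raw:.4f}, p (Yates) = {p_yates:.4f}")
```

With the correction enabled the p-value is slightly larger, so state which variant you ran when you report results.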

Step 4: Interpret Results

p-value = 0.0042 (well below 0.05) → Reject H₀. The positive t-statistic means the treatment group converts at a higher rate than control.
Effect size (Cohen’s d for t-test):

mean_diff = treatment.mean() - control.mean()
pooled_std = np.sqrt((treatment.std()**2 + control.std()**2) / 2)
cohen_d = mean_diff / pooled_std
print(f"Cohen's d: {cohen_d:.3f}")  # 0.2 = small, 0.5 = medium, 0.8 = large

Output:

Cohen's d: 0.127  # Below the conventional "small" threshold of 0.2, but statistically significant
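For a 0/1 outcome, stakeholders often find the lift in percentage points easier to read than Cohen's d. The sketch below computes the absolute lift with a Wald confidence interval using only the standard library. The `lift_with_ci` helper and the counts are assumptions made for illustration, and the Wald interval is one simple choice among several.

```python
from math import sqrt
from statistics import NormalDist

def lift_with_ci(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Absolute lift (treatment minus control) with a Wald interval (hypothetical helper)."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c
    # Unpooled standard error of the difference between two proportions
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, (diff - z * se, diff + z * se)

# Made-up counts for illustration only
diff, (low, high) = lift_with_ci(conv_c=600, n_c=5000, conv_t=690, n_t=5000)
print(f"Absolute lift: {diff:.3%} (95% CI {low:.3%} to {high:.3%})")
```

An interval that excludes zero tells the same story as the significant p-value, and it also shows how small the true lift could plausibly be.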

Step 5: Report Findings

Template for stakeholders:

"The new checkout UI increased conversion rates from 12.3% to 13.8% (p = 0.0042, Cohen’s d = 0.13). While the effect is small, it is statistically significant. We recommend rolling out the new UI to all users."

4. Production-Ready Best Practices

Statistical Rigor

Always check assumptions (normality, variance equality) before running tests.
Report effect sizes (e.g., Cohen’s d, Cramer’s V) alongside p-values. A "significant" p-value doesn’t mean the effect is meaningful.
Use Bonferroni correction for multiple comparisons (e.g., if testing 10 hypotheses, set α = 0.05/10 = 0.005).
Power analysis before running experiments: Use statsmodels.stats.power to determine sample size needed to detect an effect.
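The Bonferroni rule above fits in a few lines. The sketch below is a minimal illustration: the `bonferroni` helper and the five p-values are made up, and no external library is needed.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold and which hypotheses are rejected (hypothetical helper)."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# Made-up p-values from five hypothetical tests
p_values = [0.003, 0.012, 0.030, 0.041, 0.200]
threshold, rejected = bonferroni(p_values)
print(f"Per-test threshold: {threshold:.3f}")
for p, r in zip(p_values, rejected):
    print(f"p = {p:.3f} -> {'reject H0' if r else 'fail to reject H0'}")
```

Four of these p-values fall below 0.05, but only one survives the corrected threshold, which is the point of the correction.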

Code Maintainability

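Wrap tests in functions for reusability:

def run_ab_test(control, treatment, test_type="t"):
    """Compare the outcomes of two groups, passed as pandas Series."""
    if test_type == "t":
        return stats.ttest_ind(treatment, control, equal_var=True)
    elif test_type == "chi2":
        # Stack both groups, then cross-tabulate group label against outcome
        outcomes = pd.concat([control, treatment], ignore_index=True)
        labels = ["control"] * len(control) + ["treatment"] * len(treatment)
        contingency = pd.crosstab(pd.Series(labels, name="group"), outcomes)
        return stats.chi2_contingency(contingency)
    raise ValueError(f"Unknown test_type: {test_type!r}")
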
Log test parameters (e.g., sample sizes, p-values, effect sizes) for reproducibility.
Use pingouin for advanced tests (e.g., ANOVA, post-hoc tests): `pip install pingouin`
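One possible shape for the logging advice is to collect everything into a single dictionary and emit it as JSON. The sketch below is an assumption in every detail: the `summarize_t_test` helper, the logger name, and the field names are hypothetical, and it uses Welch's t-test (`equal_var=False`) rather than the equal-variance version used elsewhere in this guide.

```python
import json
import logging

import numpy as np
from scipy import stats

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ab_test")  # hypothetical logger name

def summarize_t_test(control, treatment):
    """Run Welch's t-test and collect the fields worth logging (hypothetical helper)."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    result = stats.ttest_ind(treatment, control, equal_var=False)
    pooled_std = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    return {
        "test": "welch_t",
        "n_control": int(control.size),
        "n_treatment": int(treatment.size),
        "t_statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "cohen_d": float((treatment.mean() - control.mean()) / pooled_std),
    }

# Synthetic data, generated only to exercise the helper
rng = np.random.default_rng(1)
summary = summarize_t_test(rng.normal(10, 2, 200), rng.normal(10.5, 2, 200))
logger.info(json.dumps(summary))
```

Casting to plain `int` and `float` keeps the record JSON-serializable, since NumPy scalar types are not accepted by `json.dumps`.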

Business Impact

Align tests with business goals (e.g., "Does this feature increase revenue?" vs. "Does it increase clicks?").
Set a minimum detectable effect (MDE) before running tests (e.g., "We only care if conversion improves by ≥2%").
Monitor for Simpson’s Paradox (e.g., a test looks positive overall but negative for key segments).
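Setting an MDE pays off because it fixes the sample size before the test starts. The sketch below uses the common normal-approximation formula for a two-sided comparison of two proportions. The `sample_size_per_group` helper and the 10% baseline are assumptions, and the result is an approximation rather than an exact power calculation.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided two-proportion test (hypothetical helper).

    baseline: control conversion rate; mde: absolute lift you want to detect.
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)

# Assumed scenario: 10% baseline conversion, 2 percentage point MDE
n = sample_size_per_group(baseline=0.10, mde=0.02)
print(f"Users needed per group: {n}")
```

Halving the MDE roughly quadruples the required sample, so agree on the smallest effect worth shipping before collecting data.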

5. ⚠️ Common Mistakes & Traps

| Mistake | Symptom | Fix/Prevention |
| --- | --- | --- |
| P-hacking (running tests until p < 0.05) | "Significant" results that don’t replicate. | Pre-register hypotheses and analysis plans. Use Bonferroni correction. |
| Ignoring effect size | "Significant" p-value but tiny effect. | Always report effect size (e.g., Cohen’s d, Cramer’s V). |
| Using t-test for non-normal data | False positives/negatives, especially with small or heavily skewed samples. | Use non-parametric tests (e.g., Mann-Whitney U) or transform data (log, sqrt). |
| Chi-square with small samples | Expected frequencies < 5 in contingency table. | Use Fisher’s exact test for small samples. |
| Confusing statistical vs. practical significance | "Significant" result with no business impact. | Set a minimum detectable effect (MDE) before running tests. |
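Two of the fixes in the table map directly onto SciPy calls. The sketch below runs Fisher's exact test on a tiny made-up table and Mann-Whitney U on skewed synthetic samples; all numbers are assumptions chosen to trigger the situations described above.

```python
import numpy as np
from scipy import stats

# Small 2x2 table where expected counts fall below 5 (made-up counts)
small_table = [[3, 1],
               [1, 3]]
odds_ratio, p_fisher = stats.fisher_exact(small_table, alternative="two-sided")
print(f"Fisher's exact p-value: {p_fisher:.4f}")

# Skewed, made-up samples where a t-test's normality assumption is doubtful
rng = np.random.default_rng(7)
group1 = rng.exponential(scale=1.0, size=30)
group2 = rng.exponential(scale=1.5, size=30)
u_stat, p_mwu = stats.mannwhitneyu(group1, group2, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p-value: {p_mwu:.4f}")
```

Mann-Whitney U compares the two distributions through ranks rather than means, so word the conclusion accordingly instead of reporting a difference in means.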

6. Exam/Certification Focus

Typical question patterns:
1. Interpret a p-value:
"A t-test returns p = 0.03. What does this mean?"
- ❌ "The null hypothesis is false."
- ✅ "If the null hypothesis were true, there would be a 3% chance of observing data at least this extreme."

2. Choose the right test:
"You want to compare the mean heights of two groups. Which test?"
- ✅ Independent t-test (if normally distributed).
- ❌ Chi-square (for categorical data).

3. Effect size vs. p-value:
"A test has p = 0.01 and Cohen’s d = 0.02. What’s the takeaway?"
- ✅ "Statistically significant but practically negligible."

4. Chi-square assumptions:
"When can’t you use a chi-square test?"
- ✅ When >20% of cells in the contingency table have expected frequencies < 5.

Key trap distinctions:
- t-test vs. chi-square:
  - t-test: Continuous data, compares means.
  - Chi-square: Categorical data, tests independence.
- Independent vs. paired t-test:
  - Independent: Two separate groups (e.g., control vs. treatment).
  - Paired: Same group before/after (e.g., pre/post-treatment).
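The paired vs. independent distinction matters for power, and a quick simulation makes that visible. The sketch below uses made-up before/after measurements in which every subject shifts by a small, consistent amount; the numbers are assumptions chosen to make the contrast obvious.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Made-up before/after measurements on the same 30 subjects
before = rng.normal(loc=50, scale=10, size=30)
after = before + rng.normal(loc=1.0, scale=1.0, size=30)  # small, consistent shift per subject

_, p_paired = stats.ttest_rel(after, before)
_, p_independent = stats.ttest_ind(after, before, equal_var=True)

print(f"Paired t-test p-value:      {p_paired:.4f}")
print(f"Independent t-test p-value: {p_independent:.4f}")
```

The paired test works on per-subject differences, so the large between-subject spread cancels out. The independent test treats that spread as noise and can miss the shift.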

7. Hands-On Challenge (with Solution)

Challenge:
You’re given a dataset of customer satisfaction scores (1-5) for two product versions. Run a test to determine if Version B is statistically better than Version A.

Data:

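import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)  # fixed seed so the example is reproducible
data = pd.DataFrame({
    "version": ["A"]*100 + ["B"]*100,
    "score": np.concatenate([np.random.normal(3.5, 1, 100), np.random.normal(3.7, 1, 100)])
})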

Solution:

a_scores = data[data["version"] == "A"]["score"]
b_scores = data[data["version"] == "B"]["score"]

# Check normality (Shapiro-Wilk test)
print("Shapiro-Wilk p-values:", stats.shapiro(a_scores).pvalue, stats.shapiro(b_scores).pvalue)

# Run t-test (assuming normality)
t_stat, p_value = stats.ttest_ind(b_scores, a_scores, equal_var=True)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")

# Effect size (Cohen's d)
mean_diff = b_scores.mean() - a_scores.mean()
pooled_std = np.sqrt((a_scores.std()**2 + b_scores.std()**2) / 2)
cohen_d = mean_diff / pooled_std
print(f"Cohen's d: {cohen_d:.3f}")

Why it works:
- Shapiro-Wilk checks normality (p > 0.05 → no evidence against normality).
- t-test compares means of two independent groups.
- Cohen’s d quantifies the effect size.
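The challenge asks whether Version B is better, which is a directional question, while the solution above runs a two-sided test. A possible variation is a one-sided test via the `alternative` argument of `ttest_ind` (available in SciPy 1.6 and later). The sketch below regenerates comparable synthetic scores with its own seed, so its numbers will differ from the solution's; it also uses Welch's version (`equal_var=False`) as a cautious default.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
a_scores = rng.normal(3.5, 1, 100)
b_scores = rng.normal(3.7, 1, 100)

# Two-sided asks "does B differ from A?"; one-sided asks "is B greater than A?"
t_stat, p_two_sided = stats.ttest_ind(b_scores, a_scores, equal_var=False)
_, p_one_sided = stats.ttest_ind(b_scores, a_scores, equal_var=False, alternative="greater")

print(f"t = {t_stat:.3f}")
print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one. Choose the direction before looking at the data, or the halving becomes a form of p-hacking.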

8. Rapid-Reference Crib Sheet

| Task | Code | Notes |
| --- | --- | --- |
| Independent t-test | `stats.ttest_ind(group1, group2, equal_var=True)` | Use `equal_var=False` if variances differ (Welch’s t-test). |
| Paired t-test | `stats.ttest_rel(before, after)` | For same subjects before/after. |
| Chi-square test | `stats.chi2_contingency(pd.crosstab(group, outcome))` | Check that expected frequencies are at least 5. |
| Mann-Whitney U | `stats.mannwhitneyu(group1, group2)` | Non-parametric alternative to t-test. |
| Shapiro-Wilk test | `stats.shapiro(data)` | Tests normality (p > 0.05 → no evidence against normality). |
| Levene’s test | `stats.levene(group1, group2)` | Tests equal variance (p > 0.05 → no evidence of unequal variances). |
| Effect size (Cohen’s d) | `(mean1 - mean2) / pooled_std` | 0.2 = small, 0.5 = medium, 0.8 = large. |
| Bonferroni correction | `α = 0.05 / n_tests` | Adjusts p-value threshold for multiple comparisons. |
| Power analysis | `from statsmodels.stats.power import TTestIndPower` `analysis = TTestIndPower()` `analysis.solve_power(effect_size=0.5, nobs1=None, alpha=0.05, power=0.8)` | Calculates required sample size per group. |
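Cramér's V is recommended earlier as the effect size to pair with a chi-square test, but the crib sheet has no entry for it. A minimal sketch, assuming the plain formula V = sqrt(χ² / (n · (min(rows, cols) − 1))) with no bias correction; the `cramers_v` helper and the counts are made up for illustration.

```python
import numpy as np
from scipy import stats

def cramers_v(table):
    """Cramér's V for an r x c contingency table (hypothetical helper, no bias correction)."""
    table = np.asarray(table)
    # Use the uncorrected statistic so V reaches 1 for a perfect association
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Made-up counts: rows are groups, columns are outcome categories
table = [[600, 4400],
         [690, 4310]]
v = cramers_v(table)
print(f"Cramér's V: {v:.3f}")
```

V ranges from 0 (no association) to 1 (perfect association), so it can be compared across tables of different sizes.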

9. Where to Go Next

Scipy Stats Documentation – Official docs for all tests.
Pingouin – Advanced statistical tests (ANOVA, post-hoc, etc.).
Book: Practical Statistics for Data Scientists (Peter Bruce) – Covers hypothesis testing in depth.
StatQuest: Hypothesis Testing – Best YouTube explanation (Josh Starmer).

Without work one finishes nothing. - Ralph Waldo Emerson
© 2026 Fatskills.com

All trademarks, logos and brand names are the property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, trademarks and brands does not imply endorsement.
