Fatskills
Practice. Master. Repeat.
Study Guide: Data Science and Machine Learning 101: Machine Learning Core Supervised Learning Regression Linear Polynomial Regularization Evaluation MSE RMSE R²
Source: https://www.fatskills.com/introdution-to-engineering/chapter/data-science-and-machine-learning-data-science-and-machine-learning-machine-learning-core-supervised-learning-regression-linear-polynomial-regularization-evaluation-mse-rmse-r%C2%B2

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.

What This Is

Supervised regression learns a mapping (f(\mathbf{x})) from input features (\mathbf{x}) to a continuous target (y) using labeled examples. It’s the workhorse when you need to predict a numeric quantity—e.g., estimating next‑month house prices from location, size, and age, or forecasting daily electricity demand from weather and calendar data. Because the target is continuous, the model’s error can be measured directly, making regression ideal for budgeting, capacity planning, and any “how much?” business question.


Key Terms & Formulas

  • Linear Regression – Model: (\hat{y}= \beta_0 + \sum_{j=1}^{p}\beta_j x_j). (\beta) are coefficients learned by minimizing squared error.
  • Ordinary Least Squares (OLS) – Objective: (\displaystyle \min_{\beta}\; \sum_{i=1}^{n}(y_i-\hat{y}_i)^2). Gives closed‑form (\beta = (X^\top X)^{-1}X^\top y) when (X^\top X) is invertible.
  • Polynomial Regression – Extends linear model with powers of features: (\hat{y}= \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d). Captures curvature while still fitting with OLS.
  • L1 Regularization (Lasso) – Penalty: (\displaystyle \lambda \sum_{j=1}^{p} |\beta_j|). Drives some coefficients exactly to 0 → built‑in feature selection.
  • L2 Regularization (Ridge) – Penalty: (\displaystyle \lambda \sum_{j=1}^{p} \beta_j^2). Shrinks coefficients toward 0 but never eliminates them; reduces variance.
  • Elastic Net – Combination: (\displaystyle \lambda_1\sum |\beta_j| + \lambda_2\sum \beta_j^2). Balances sparsity (L1) and stability (L2).
  • Mean Squared Error (MSE) – (\displaystyle \text{MSE}= \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2). Primary loss for regression; lower = better.
  • Root Mean Squared Error (RMSE) – (\displaystyle \text{RMSE}= \sqrt{\text{MSE}}). Same units as (y); easier to interpret.
  • R‑squared (R²) – (\displaystyle R^2 = 1 - \frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2}). Proportion of variance explained; 1 → perfect fit, 0 → no better than always predicting the mean (\bar{y}), negative → worse than that baseline.
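  • Train‑Test Split – Typical split: 70‑80 % train, 20‑30 % test via train_test_split. The stratify argument preserves class proportions and belongs to classification; for time‑ordered data, keep the split chronological (shuffle=False) or cross‑validate with TimeSeriesSplit.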
  • Cross‑Validation (k‑fold) – Repeatedly train on (k-1) folds, validate on the held‑out fold; average metric gives a more robust estimate of generalization error.
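
The OLS closed form and the three evaluation formulas above can be checked by hand in a few lines of NumPy. This is a minimal sketch on synthetic data: the data-generating coefficients (2, 3, -1.5), the noise level, and all variable names are assumptions made for illustration, not part of any real dataset.

```python
import numpy as np

# Synthetic data (assumption): y = 2 + 3*x1 - 1.5*x2 + Gaussian noise
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Prepend a column of ones so beta[0] plays the role of the intercept beta_0.
X_design = np.column_stack([np.ones(n), X])

# Solve the normal equations (X^T X) beta = X^T y.
# np.linalg.solve gives the same beta as the explicit inverse, but is
# numerically safer than forming (X^T X)^{-1}.
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

y_hat = X_design @ beta
residuals = y - y_hat

mse = np.mean(residuals ** 2)                      # mean squared error
rmse = np.sqrt(mse)                                # same units as y
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print("beta:", np.round(beta, 2))
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

The recovered coefficients should land close to the values used to generate the data, and the RMSE should be close to the noise standard deviation, which is a quick sanity check that the formulas are wired up correctly.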


Step‑by‑Step / Process Flow

  1. Load & Inspect
    python
    import pandas as pd
    df = pd.read_csv('house_prices.csv')
    df.head(); df.describe()
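  2. Clean & Engineer – Handle missing values, encode categoricals, create interaction/polynomial features (PolynomialFeatures), and scale numeric columns (StandardScaler). Fit any learned transform such as the scaler on the training rows only, i.e. after the split in step 3, so that test data does not leak into preprocessing.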
  3. Split
    python
    from sklearn.model_selection import train_test_split
    # X is the feature matrix and y the numeric target, both built from df in step 2
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
  4. Baseline Model – Fit ordinary least‑squares linear regression.
    python
    from sklearn.linear_model import LinearRegression
    lin = LinearRegression().fit(X_train, y_train)
  5. Evaluate – Compute MSE, RMSE, R² on the hold‑out set.
    python
    from sklearn.metrics import mean_squared_error, r2_score
    preds = lin.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    rmse = mse ** 0.5  # square root of MSE, in the same units as y
    r2 = r2_score(y_test, preds)
  6. Regularize & Tune – Use RidgeCV / LassoCV / ElasticNetCV to search over the regularization strength (\lambda) (named alpha in scikit‑learn) with cross‑validation, then re‑evaluate on the hold‑out set.
    python
    from sklearn.linear_model import RidgeCV
    ridge = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5).fit(X_train, y_train)
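
The steps above can be chained into a single scikit-learn pipeline, which keeps feature expansion and scaling fitted on the training split only. The sketch below is one possible arrangement, not the only one. Since house_prices.csv is not available here, it uses a synthetic one-feature dataset with a curved relationship; the data, the degree of 2, and the alpha grid are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real data (assumption): a quadratic trend plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Expansion and scaling live inside the pipeline, so they are fitted on the
# training split only and merely applied to the test split.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5),
)
model.fit(X_train, y_train)

preds = model.predict(X_test)
mse = mean_squared_error(y_test, preds)
rmse = mse ** 0.5
r2 = r2_score(y_test, preds)

# make_pipeline names each step after its lowercased class name.
best_alpha = model.named_steps["ridgecv"].alpha_
print("chosen alpha:", best_alpha)
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Swapping RidgeCV for LassoCV or ElasticNetCV should only require changing the last pipeline step, because the preprocessing is independent of the estimator.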

Common Mistakes

  • Mistake: Using MSE on a highly skewed target. A few large errors dominate the metric and can hide systematic bias.
    Correction: Transform the target (log, Box‑Cox) or report MAE alongside RMSE, since MAE is less dominated by a few large errors.
  • Mistake: Fitting a high‑degree polynomial without regularization. It overfits the training data and performs poorly on test data.
    Correction: Apply Ridge/Lasso or limit the degree; use cross‑validation to pick the sweet spot.
  • Mistake: Scaling before the train‑test split. This leaks information because the scaler's mean and std are computed with test data included.
    Correction: Fit the scaler on the training set (scaler.fit(X_train)) and apply the same transformation to both train and test (scaler.transform).
  • Mistake: Ignoring multicollinearity. OLS coefficients become unstable when features are highly correlated.
    Correction: Detect it with VIF; drop or combine correlated columns, or switch to Ridge, which stabilizes the coefficients under collinearity.
  • Mistake: Evaluating on the same data used for hyper‑parameter search. The reported score is optimistically biased.
    Correction: Reserve a final hold‑out set, or use nested cross‑validation for model selection and evaluation.
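
To make the multicollinearity check concrete, the variance inflation factor of feature j is 1 / (1 − R²_j), where R²_j comes from regressing feature j on all the other features. The sketch below is a hand-rolled version on synthetic data; the helper name vif and the four made-up features are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2_j), where R^2_j is
    obtained by regressing column j on all remaining columns."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2_j = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        scores.append(1.0 / (1.0 - r2_j))
    return np.array(scores)

# Synthetic features (assumption): x3 is almost a linear combination of x1 and x2.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)
x4 = rng.normal(size=500)          # unrelated to the other three
X = np.column_stack([x1, x2, x3, x4])

vifs = vif(X)
for name, value in zip(["x1", "x2", "x3", "x4"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

In this setup x1, x2, and x3 should all show very large values because any one of them is nearly determined by the other two, while x4 should stay near 1.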


Data Science Interview / Practical Insights

  1. “When would you prefer Lasso over Ridge?” – Lasso when you need a sparse model (automatic feature selection) and the number of predictors exceeds the number of observations.
  2. “Explain why R² can be negative.” – R² = 1 – RSS/TSS. If the model’s squared error (RSS) is larger than that of the baseline that always predicts the mean (TSS), the ratio exceeds 1 and R² drops below 0, signaling a worse‑than‑naïve model. This usually shows up on held‑out data rather than on the training set.
  3. “How does polynomial regression differ from adding interaction terms manually?” – Polynomial features automatically generate all powers up to the specified degree, including cross‑terms; manual interaction may miss higher‑order combos.
  4. “What’s the effect of the regularization strength λ on bias‑variance?” – Larger λ increases bias (risking under‑fitting) but reduces variance (curbing over‑fitting); the sweet spot is found via CV.
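
The negative R² answer is easy to verify with a tiny worked example. The three-point arrays below are made up for illustration; the predictions deliberately run in the opposite direction to the targets.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 2.0, 1.0])   # predictions that trend the wrong way

rss = np.sum((y_true - y_pred) ** 2)          # 4 + 0 + 4 = 8
tss = np.sum((y_true - y_true.mean()) ** 2)   # 1 + 0 + 1 = 2
r2_manual = 1.0 - rss / tss                   # 1 - 8/2 = -3

r2_sklearn = r2_score(y_true, y_pred)

# Always predicting the mean gives RSS == TSS, hence R^2 of exactly 0.
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_sklearn, r2_baseline)
```

The manual formula and r2_score agree, and the mean-only baseline sits at 0, which is why any value below 0 means the model is doing worse than that baseline.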

Quick Check Questions

  1. Scenario: Your model’s training RMSE is 5, but test RMSE is 20.
    Answer: The model is over‑fitting; increase regularization (e.g., raise λ in Ridge/Lasso) or reduce model complexity.

  2. Scenario: You have 10,000 features but only 200 samples.
    Answer: Use Lasso (or Elastic Net) to enforce sparsity, or first perform dimensionality reduction (PCA) before regression.

  3. Scenario: After adding a quadratic term, R² improves from 0.70 to 0.71, but RMSE barely changes.
    Answer: The extra term adds little predictive power; the small R² gain may be noise—prefer the simpler model to avoid unnecessary complexity.
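
Scenario 2 can be tried out at a smaller scale. The sketch below uses 200 features and 50 samples instead of 10,000 and 200, with only five features actually driving the target; the sizes, coefficients, and alpha values are assumptions chosen for illustration rather than tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Scaled-down version of the scenario (assumption): more features than samples.
rng = np.random.default_rng(0)
n_samples, n_features = 50, 200
X = rng.normal(size=(n_samples, n_features))

true_coef = np.zeros(n_features)
true_coef[:5] = [4.0, -3.0, 2.5, -2.0, 1.5]   # only the first 5 features matter
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.5, max_iter=10_000).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso zero coefficients:", n_zero_lasso, "of", n_features)
print("Ridge zero coefficients:", n_zero_ridge, "of", n_features)
```

Lasso should set most of the 200 coefficients exactly to zero, while Ridge shrinks them but leaves all of them non-zero. In practice alpha would be chosen with LassoCV rather than fixed by hand.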


Last‑Minute Cram Sheet (10 one‑liners)

  1. OLS closed‑form: (\beta = (X^\top X)^{-1}X^\top y).
  2. Ridge loss: (\text{MSE} + \lambda\|\beta\|_2^2); Lasso loss: (\text{MSE} + \lambda\|\beta\|_1).
  3. RMSE = √MSE – same units as the target, easier to communicate to stakeholders.
  4. R² = 1 – (RSS/TSS); negative R² ⇒ model worse than predicting the mean.
  5. PolynomialFeatures(degree=d, include_bias=False) creates every power and cross‑term of the inputs up to total degree (d), without the constant column.
  6. Cross‑validation (k‑fold) reduces variance of the performance estimate compared to a single train‑test split.
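  7. StandardScaler subtracts the mean and divides by the std; ⚠️ it does not require Gaussian features, but both statistics are sensitive to outliers. MinMaxScaler maps each feature to a fixed range such as [0, 1] and is also outlier‑sensitive; for heavily skewed features, consider a log transform or RobustScaler.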
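  8. Elastic Net in scikit‑learn: alpha sets the overall penalty strength and l1_ratio the L1 share, roughly α = λ₁ + λ₂ and l1_ratio = λ₁/(λ₁+λ₂); the exact mapping depends on scikit‑learn’s scaling of the loss and its factor of ½ on the L2 term.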
  9. VIF > 5 signals problematic multicollinearity; consider dropping or regularizing.
  10. Bias‑variance trade‑off: ↑λ → ↑bias, ↓variance; ↓λ → ↓bias, ↑variance.

Keep this guide handy; you now have the core theory, the practical workflow, and the interview‑ready nuggets to own any regression‑focused data‑science task. Happy modeling!


