ML Interview QA - 2

10 essential ML interview questions on evaluation metrics, feature engineering, PCA, XGBoost, imbalanced data, missing values, and data leakage — with diagrams and examples.

Author

Vectoring AI

Published

17 May 2026

Keywords

machine learning interview, precision recall F1, ROC AUC curve, imbalanced dataset, feature engineering, PCA, XGBoost gradient boosting, missing data imputation, data leakage, curse of dimensionality, generative discriminative models

Introduction

This is Part 2 of our ML Interview QA series. It covers 10 questions on evaluation metrics, feature engineering, and data handling — the practical skills that separate candidates who build models from those who build reliable models.

For foundational concepts (bias-variance, algorithms, ensembles), see ML Interview QA - 1.

Q1: Explain precision, recall, and F1-score.

Answer:

These metrics go beyond accuracy to measure specific types of errors in classification.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph cm["Confusion Matrix"]
        direction TB
        TP["True Positive (TP)<br/>Correctly predicted positive"]
        FP["False Positive (FP)<br/>Incorrectly predicted positive<br/>(Type I error)"]
        FN["False Negative (FN)<br/>Missed positive<br/>(Type II error)"]
        TN["True Negative (TN)<br/>Correctly predicted negative"]
    end

    cm --> PREC["Precision = TP/(TP+FP)<br/>Of those I flagged,<br/>how many are correct?"]
    cm --> REC["Recall = TP/(TP+FN)<br/>Of all positives,<br/>how many did I catch?"]
    PREC --> F1["F1 = 2·P·R/(P+R)<br/>Harmonic mean<br/>balances both"]
    REC --> F1

    style TP fill:#56cc9d,stroke:#333,color:#fff
    style TN fill:#56cc9d,stroke:#333,color:#fff
    style FP fill:#ff7851,stroke:#333,color:#fff
    style FN fill:#ffce67,stroke:#333
    style F1 fill:#6cc3d5,stroke:#333,color:#fff
    style cm fill:#fff,color:#fff

Formulas

\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad F1 = \frac{2 \cdot P \cdot R}{P + R}

Example: Email Spam Filter

Out of 1000 emails: 50 actual spam, 950 legitimate.

Scenario	TP	FP	FN	Precision	Recall	F1
Aggressive filter	48	30	2	48/78 = 0.62	48/50 = 0.96	0.75
Conservative filter	40	2	10	40/42 = 0.95	40/50 = 0.80	0.87

Aggressive: Catches almost all spam (high recall) but blocks 30 good emails (low precision)
Conservative: Rarely blocks good emails (high precision) but misses 10 spam (lower recall)
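
The table values follow directly from the confusion counts. A minimal sketch that reproduces them, assuming a hypothetical helper named `prf` (not part of the original example):

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

aggressive = prf(tp=48, fp=30, fn=2)
conservative = prf(tp=40, fp=2, fn=10)
print([round(v, 2) for v in aggressive])    # [0.62, 0.96, 0.75]
print([round(v, 2) for v in conservative])  # [0.95, 0.8, 0.87]
```

Note that true negatives never enter the calculation, which is why these metrics stay informative when legitimate emails vastly outnumber spam.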

The Precision-Recall Tradeoff

graph LR
    linkStyle default stroke:#000,color:#000
    subgraph low_threshold["Low Threshold (0.2)"]
        LT["Predict more as positive<br/>↑ Recall, ↓ Precision"]
    end
    subgraph mid_threshold["Medium Threshold (0.5)"]
        MT["Balanced trade-off"]
    end
    subgraph high_threshold["High Threshold (0.8)"]
        HT["Predict fewer as positive<br/>↓ Recall, ↑ Precision"]
    end

    low_threshold --> mid_threshold --> high_threshold

    style low_threshold fill:#6cc3d5,stroke:#333,color:#fff
    style mid_threshold fill:#56cc9d,stroke:#333,color:#fff
    style high_threshold fill:#ffce67,stroke:#333
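
The tradeoff can be seen by sweeping the threshold over a small set of scores. The labels and scores below are made up for illustration, and `precision_recall_at` is a hypothetical helper:

```python
# Hypothetical labels and model scores, for illustration only
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.25, 0.10]

def precision_recall_at(threshold):
    """Flag every sample scoring at or above the threshold as positive."""
    pred = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(pred, y_true))
    fp = sum(p and y == 0 for p, y in zip(pred, y_true))
    fn = sum((not p) and y == 1 for p, y in zip(pred, y_true))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for t in (0.2, 0.5, 0.8):
    p, r = precision_recall_at(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.2: precision=0.57, recall=1.00
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.8: precision=1.00, recall=0.50
```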

When to Prioritize Which

Metric	Prioritize When	Example
Precision	False positives are costly	Spam filter (don’t block important emails)
Recall	False negatives are costly	Cancer screening (don’t miss tumors)
F1	Need single balanced metric	General classification with imbalanced classes
F-beta	Custom tradeoff needed	F2 (recall 2x important), F0.5 (precision 2x important)
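
For reference, F-beta generalizes F1 by weighting recall beta times as heavily as precision. A short sketch using the aggressive spam filter counts from the example above (the function name `f_beta` is an assumption):

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 48 / 78, 48 / 50      # aggressive filter: precision, recall
f1 = f_beta(p, r, 1.0)
f2 = f_beta(p, r, 2.0)       # rewards the filter's high recall
f05 = f_beta(p, r, 0.5)      # penalizes its low precision
print(round(f1, 2), round(f2, 2), round(f05, 2))  # 0.75 0.86 0.66
```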

Application

Fraud detection: Optimize recall (catch as much fraud as possible) and accept some false positives, which human reviewers then check
Search engines: Optimize precision (show only relevant results)
Medical AI: Regulatory bodies often require minimum recall thresholds
Content moderation: Balance — too aggressive frustrates users, too lenient misses harmful content

Q2: What is the ROC curve and AUC?

Answer:

The ROC curve (Receiver Operating Characteristic) visualizes classifier performance across all possible thresholds by plotting True Positive Rate vs. False Positive Rate.

TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph roc["ROC Curve Interpretation"]
        direction TB
        PERFECT["Perfect classifier<br/>AUC = 1.0<br/>(top-left corner)"]
        GOOD["Good classifier<br/>AUC = 0.85<br/>(curve above diagonal)"]
        RANDOM["Random guessing<br/>AUC = 0.5<br/>(diagonal line)"]
        WORST["Inverse classifier<br/>AUC = 0.0<br/>(below diagonal)"]
    end

    style PERFECT fill:#56cc9d,stroke:#333,color:#fff
    style GOOD fill:#6cc3d5,stroke:#333,color:#fff
    style RANDOM fill:#ffce67,stroke:#333
    style WORST fill:#ff7851,stroke:#333,color:#fff
    style roc fill:#fff,color:#333

How It Works: Threshold Sweep

graph LR
    linkStyle default stroke:#000,color:#000
    A["Model outputs<br/>probabilities<br/>for each sample"] --> B["Sweep threshold<br/>from 0.0 to 1.0"]
    B --> C["At each threshold:<br/>compute TPR and FPR"]
    C --> D["Plot all (FPR, TPR)<br/>points → ROC curve"]
    D --> E["Area Under Curve<br/>= AUC score"]

    style E fill:#56cc9d,stroke:#333,color:#fff
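
The sweep can be written out in a few lines of plain Python. This is a sketch on made-up scores, with hypothetical helper names `roc_points` and `trapezoid_auc`; in practice `roc_curve` and `roc_auc_score` from scikit-learn do this work:

```python
def roc_points(y_true, scores):
    """Sweep the threshold over every distinct score, highest first."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]                # hypothetical labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # hypothetical model scores
auc = trapezoid_auc(roc_points(y_true, scores))
print(round(auc, 3))  # 0.889
```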

Example: Comparing Two Models

from sklearn.metrics import roc_auc_score, roc_curve

# Model A: Logistic Regression
y_prob_A = model_A.predict_proba(X_test)[:, 1]
auc_A = roc_auc_score(y_test, y_prob_A)  # 0.82

# Model B: Random Forest
y_prob_B = model_B.predict_proba(X_test)[:, 1]
auc_B = roc_auc_score(y_test, y_prob_B)  # 0.91

# Model B has better discrimination power
# It ranks positives higher than negatives more consistently

Interpretation of AUC = 0.91: If you randomly pick one positive sample and one negative sample, there’s a 91% probability that the model assigns a higher score to the positive sample.
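
This ranking interpretation can be checked by counting pairs directly. A small sketch on made-up scores (`pairwise_auc` is a hypothetical helper; ties count as half a win):

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 1, 0, 0]                # hypothetical labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # hypothetical model scores
auc = pairwise_auc(y_true, scores)
print(round(auc, 3))  # 0.889: 8 of the 9 pairs are ranked correctly
```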

When ROC-AUC Fails: Imbalanced Data

graph TD
    linkStyle default stroke:#000,color:#000
    A["Dataset: 10,000 samples<br/>9,900 negative, 100 positive"] --> B["Model predicts ALL as negative"]
    B --> C["FPR = 0/(0+9900) = 0<br/>TPR = 0/(0+100) = 0"]
    B --> D["Accuracy = 99%<br/>Looks great!"]
    A --> E["A scoring model can show high ROC-AUC:<br/>9,900 negatives keep FPR small<br/>even when most alerts are false"]

    E --> F["Solution: Use PR-AUC<br/>(Precision-Recall AUC)<br/>for imbalanced data"]

    style D fill:#ff7851,stroke:#333,color:#fff
    style F fill:#56cc9d,stroke:#333,color:#fff

ROC-AUC vs. PR-AUC

Metric	Best For	Why
ROC-AUC	Balanced datasets	Considers both classes equally
PR-AUC	Imbalanced datasets (rare positives)	Focuses on positive class performance
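
A quick calculation shows why the two metrics diverge under imbalance. The class counts match the diagram above; the operating point (80% TPR, 5% FPR) is a hypothetical choice:

```python
n_pos, n_neg = 100, 9_900     # class counts from the diagram above
tpr, fpr = 0.80, 0.05         # hypothetical operating point on the ROC curve

tp = tpr * n_pos              # 80 positives caught
fp = fpr * n_neg              # 495 false alarms
precision = tp / (tp + fp)
print(f"FPR={fpr:.0%}, recall={tpr:.0%}, precision={precision:.1%}")
# FPR=5%, recall=80%, precision=13.9%
```

A 5% false positive rate looks excellent on the ROC curve, yet about six of every seven alerts are wrong. The precision-recall curve exposes this because precision depends on the absolute number of false positives.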

Application

Model selection: Compare models that output probabilities (higher AUC = better ranking)
Threshold selection: Pick the operating point on the ROC curve that matches business needs
Clinical trials: Evaluate diagnostic tests across different decision thresholds
Credit scoring: Regulators compare AUC across demographic groups for fairness

Q3: How do you handle imbalanced datasets?

Answer:

Class imbalance occurs when one class vastly outnumbers the other (e.g., 99% negative, 1% positive). Standard accuracy becomes meaningless — a model predicting “always negative” gets 99% accuracy.

graph TD
    linkStyle default stroke:#000,color:#000
    PROBLEM["Imbalanced Dataset<br/>e.g., 1% fraud, 99% legitimate"] --> APPROACH["Multi-level approach"]

    APPROACH --> L1["Level 1: Metrics<br/>(change how you measure)"]
    APPROACH --> L2["Level 2: Algorithm<br/>(change how model learns)"]
    APPROACH --> L3["Level 3: Data<br/>(change the data itself)"]

    L1 --> L1A["Use F1, PR-AUC, recall<br/>instead of accuracy"]
    L2 --> L2A["Class weights<br/>Threshold tuning<br/>Cost-sensitive learning"]
    L3 --> L3A["SMOTE (oversample minority)<br/>Undersample majority<br/>Collect more minority data"]

    style L1 fill:#56cc9d,stroke:#333,color:#fff
    style L2 fill:#6cc3d5,stroke:#333,color:#fff
    style L3 fill:#ffce67,stroke:#333

Strategy Priority (use in order)

graph TD
    linkStyle default stroke:#000,color:#000
    S1["1. Fix your METRICS first<br/>Stop using accuracy"] --> S2["2. Try CLASS WEIGHTS<br/>(free, no data changes)"]
    S2 --> S3["3. Tune THRESHOLD<br/>(adjust decision boundary)"]
    S3 --> S4["4. Try RESAMPLING<br/>(SMOTE, undersampling)"]
    S4 --> S5["5. Use specialized ENSEMBLES<br/>(Balanced RF, EasyEnsemble)"]

    style S1 fill:#56cc9d,stroke:#333,color:#fff
    style S2 fill:#56cc9d,stroke:#333,color:#fff
    style S3 fill:#6cc3d5,stroke:#333,color:#fff
    style S4 fill:#ffce67,stroke:#333,color:#fff
    style S5 fill:#ff7851,stroke:#333,color:#fff

Example: Fraud Detection (0.3% fraud rate)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_recall_curve

# BAD: Default model
rf_default = RandomForestClassifier(n_estimators=100)
rf_default.fit(X_train, y_train)
# Accuracy: 99.7% → but catches only 20% of fraud!

# BETTER: Class weights
rf_weighted = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 1, 1: 333}  # inverse of class frequency
)
rf_weighted.fit(X_train, y_train)
# Recall: 85% fraud caught, precision: 12% → many false alerts

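# BEST: Threshold tuning after weighting
# Tune on a held-out validation split (X_val, y_val assumed), not on the test set
y_proba = rf_weighted.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, y_proba)
# Keep thresholds where precision ≥ 5% AND recall ≥ 80%
# (precisions/recalls carry one extra trailing entry, hence [:-1])
ok = (precisions[:-1] >= 0.05) & (recalls[:-1] >= 0.80)
threshold = thresholds[ok].max() if ok.any() else 0.5
# Lowering the threshold raises recall and usually lowers precision
# At 8% precision, about 92 of every 100 alerts are false alarms for human reviewers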

SMOTE (Synthetic Minority Oversampling)

SMOTE creates synthetic minority samples by interpolating between existing minority samples and their k-nearest neighbors.

graph LR
    linkStyle default stroke:#000,color:#000
    A["Minority sample A"] --> MID["New synthetic sample<br/>(random point between A and B)"]
    B["Nearest neighbor B"] --> MID

    style MID fill:#56cc9d,stroke:#333,color:#fff
    style A fill:#6cc3d5,stroke:#333,color:#fff
    style B fill:#6cc3d5,stroke:#333,color:#fff

Caution: Always apply SMOTE only on training data (after splitting) — never on test/validation sets.
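
The interpolation step itself is simple. The sketch below shows only that step for one pair of points; a full SMOTE implementation also searches for the k nearest minority neighbors (the imbalanced-learn library provides one). The helper name `smote_point` and the sample values are assumptions:

```python
import random

def smote_point(a, b, rng):
    """Return a random point on the segment between minority samples a and b."""
    u = rng.random()  # interpolation factor in [0, 1)
    return [ai + u * (bi - ai) for ai, bi in zip(a, b)]

rng = random.Random(0)
a = [1.0, 2.0]   # hypothetical minority sample
b = [3.0, 6.0]   # one of its nearest minority-class neighbors
synthetic = smote_point(a, b, rng)
print(synthetic)
```

Because every synthetic point lies between two real minority samples, SMOTE adds no new information outside the region the minority class already occupies.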

Application

Domain	Imbalance Ratio	Strategy
Fraud detection	1:1000	Class weights + threshold tuning + human review
Disease diagnosis	1:100	SMOTE + ensemble + high recall threshold
Manufacturing defects	1:500	Anomaly detection (one-class SVM, Isolation Forest)
Click prediction	1:50	Calibrated probabilities + ranking metrics

Q4: What is feature engineering and why does it matter?

Answer:

Feature engineering is the process of creating, transforming, and selecting input features to improve model performance. It often has a greater impact than model choice or hyperparameter tuning.

graph LR
    linkStyle default stroke:#000,color:#000
    RAW["Raw Data"] --> FE["Feature Engineering"]

    FE --> CREATE["Create new features<br/>(domain knowledge)"]
    FE --> TRANSFORM["Transform existing features<br/>(scaling, encoding)"]
    FE --> SELECT["Select relevant features<br/>(remove noise)"]

    CREATE --> EX1["age + income <br/> → income_per_year_of_age"]
    CREATE --> EX2["timestamp <br/> → hour_of_day, is_weekend"]
    CREATE --> EX3["lat + lon <br/> → distance_to_store"]

    TRANSFORM --> EX4["log(income) <br/> — reduce skew"]
    TRANSFORM --> EX5["one-hot(city) <br/> — encode categories"]
    TRANSFORM --> EX6["StandardScaler <br/> — normalize ranges"]

    SELECT --> EX7["Remove correlated features"]
    SELECT --> EX8["L1 regularization <br/> → sparsity"]
    SELECT --> EX9["Tree importance scores"]

    style FE fill:#56cc9d,stroke:#333,color:#fff
    style CREATE fill:#6cc3d5,stroke:#333,color:#fff
    style TRANSFORM fill:#ffce67,stroke:#333
    style SELECT fill:#ff7851,stroke:#333,color:#fff

Example: Predicting Taxi Trip Duration

Raw features: pickup_time, pickup_lat, pickup_lon, dropoff_lat, dropoff_lon

Engineered features (much more predictive):

import numpy as np

# Distance (Haversine formula; `haversine` is assumed to be a user-defined, vectorized helper)
df['distance_km'] = haversine(
    df['pickup_lat'], df['pickup_lon'],
    df['dropoff_lat'], df['dropoff_lon']
)

# Time-based features
df['hour'] = df['pickup_time'].dt.hour
df['is_rush_hour'] = df['hour'].isin([7,8,9,17,18,19]).astype(int)
df['is_weekend'] = df['pickup_time'].dt.dayofweek.isin([5,6]).astype(int)

# Interaction features
df['distance_x_rush'] = df['distance_km'] * df['is_rush_hour']
# ^ During rush hour, distance has a MUCH bigger impact on duration

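# Aggregation features
# Caution: 'speed' derives from trip duration (the target), so compute this
# average on training rows only and join it onto validation/test rows
df['avg_speed_this_hour'] = df.groupby('hour')['speed'].transform('mean')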

Result: Model fit improves from R² = 0.45 (raw features) to R² = 0.82 (engineered features), using the same model with better features.
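
The example calls a `haversine` helper without defining it. One possible scalar version is sketched below, assuming coordinates in degrees and a mean Earth radius of 6371 km. To apply it to whole pandas columns as in the snippet, the `math` functions would need to be swapped for their NumPy equivalents.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

d = haversine(0.0, 0.0, 0.0, 1.0)  # one degree of longitude on the equator
print(round(d, 1))  # 111.2
```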

Feature Selection Methods

Method	Type	How it works	When to use
Correlation filter	Filter	Remove features correlated > 0.95 with others	Quick first pass
Mutual information	Filter	Keep features with high MI with target	Non-linear relationships
Recursive elimination	Wrapper	Repeatedly remove least important feature	When compute allows
L1 regularization	Embedded	Model zeros out irrelevant weights	Linear models
Tree importance	Embedded	Features that reduce impurity most	Tree-based models

Application

E-commerce: RFM features (Recency, Frequency, Monetary) from transaction logs
NLP: TF-IDF, n-grams, embedding features from text
Finance: Moving averages, volatility, technical indicators from price data
Computer Vision: HOG features, edge histograms (classical), or learned features (deep learning)

Q5: What is the curse of dimensionality?

Answer:

As features increase, the feature space grows exponentially, making data increasingly sparse and distance metrics less meaningful.

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph d1["1D: Line"]
        D1["10 points fill a line well<br/>Dense coverage"]
    end
    subgraph d2["2D: Square"]
        D2["10 points in a square<br/>Getting sparse"]
    end
    subgraph d3["3D: Cube"]
        D3["10 points in a cube<br/>Very sparse"]
    end
    subgraph d100["100D: Hypercube"]
        D100["10 points in 100 dimensions<br/>Essentially EMPTY<br/>Need 10¹⁰⁰ points to fill!"]
    end

    d1 --> d2 --> d3 --> d100

    style d1 fill:#56cc9d,stroke:#333,color:#fff
    style d2 fill:#6cc3d5,stroke:#333,color:#fff
    style d3 fill:#ffce67,stroke:#333
    style d100 fill:#ff7851,stroke:#333,color:#fff

Why This Matters: Distances Become Meaningless

In high dimensions, the nearest and farthest neighbors of a query point become almost equally far away for many data distributions (for example, independent features), so the ratio of maximum to minimum distance approaches 1:

\lim_{d \to \infty} \frac{dist_{max} - dist_{min}}{dist_{min}} = 0

This means all points are approximately equidistant, which undermines distance-based algorithms.
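
The effect is easy to observe empirically. The sketch below draws uniform random points (an assumption chosen for simplicity) and measures the relative contrast between the farthest and nearest neighbor of a query point; `relative_contrast` is a hypothetical helper:

```python
import math
import random

def relative_contrast(n_points, dim, rng):
    """(max - min) / min over distances from one query point to the rest."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(42)
low = relative_contrast(200, 2, rng)
high = relative_contrast(200, 500, rng)
print(f"2-D contrast: {low:.2f}, 500-D contrast: {high:.2f}")
```

The 2-D contrast should come out many times larger than the 500-D one: in low dimensions some neighbor is much closer than the rest, while in 500 dimensions every point sits at nearly the same distance.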

Example: KNN Fails in High Dimensions

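from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

# Low dimension: all 5 features are informative
# (n_redundant=0 is required, otherwise 5 informative + 2 redundant > 5 features)
X_low, y_low = make_classification(n_samples=1000, n_features=5, n_informative=5,
                                   n_redundant=0, random_state=0)
acc_low = cross_val_score(knn, X_low, y_low, cv=5).mean()

# High dimension: the same 5 informative features plus 495 noise features
X_high, y_high = make_classification(n_samples=1000, n_features=500, n_informative=5,
                                     n_redundant=0, random_state=0)
acc_high = cross_val_score(knn, X_high, y_high, cv=5).mean()

print(f"5 features: {acc_low:.2f} | 500 features: {acc_high:.2f}")
# Exact numbers vary with the seed; expect a clear drop toward chance level,
# because with 500 features, "nearest" neighbors aren't really near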

Models Most Affected

Severely affected	Somewhat resilient	Why the resilient model copes better
KNN	Decision Trees	Trees split on one feature at a time
K-Means	Random Forest	Feature subsampling helps
SVM (RBF kernel)	Gradient Boosting	Tree-based learners pick informative features split by split
Gaussian processes	Neural Networks (with dropout)	Learn relevant subspaces

Solutions

graph LR
    linkStyle default stroke:#000,color:#000
    CURSE["Curse of<br/>Dimensionality"] --> S1["Feature Selection<br/>(keep only<br/>informative features)"]
    CURSE --> S2["PCA / Autoencoders<br/>(project to<br/>lower dimensions)"]
    CURSE --> S3["Regularization<br/>(L1 drives irrelevant<br/>weights to zero)"]
    CURSE --> S4["Domain Knowledge<br/>(only include<br/>meaningful features)"]
    CURSE --> S5["Get More Data<br/>(fill the space better)"]

    style S1 fill:#56cc9d,stroke:#333,color:#fff
    style S2 fill:#6cc3d5,stroke:#333,color:#fff
    style S3 fill:#ffce67,stroke:#333

Application

Genomics: 20,000 genes, 100 patients — need aggressive feature selection
Text/NLP: Bag-of-words creates 100K+ features — use TF-IDF + dimensionality reduction
Image data: Raw pixels (millions of dimensions) — use CNNs to learn lower-dimensional representations
Recommendation systems: Millions of items → embedding spaces reduce dimensionality

Q6: Explain PCA (Principal Component Analysis).

Answer:

PCA is an unsupervised technique that finds the directions of maximum variance in the data and projects data onto a lower-dimensional subspace.

graph TD
    linkStyle default stroke:#000,color:#000
    A["Original data<br/>(d dimensions)"] --> B["Standardize features<br/>(mean=0, std=1)"]
    B --> C["Compute covariance matrix<br/>(d × d)"]
    C --> D["Find eigenvectors & eigenvalues"]
    D --> E["Sort by eigenvalue<br/>(variance explained)"]
    E --> F["Select top k components<br/>(capture 95% variance)"]
    F --> G["Project data onto k dimensions"]

    style A fill:#6cc3d5,stroke:#333,color:#fff
    style F fill:#56cc9d,stroke:#333,color:#fff
    style G fill:#ffce67,stroke:#333

How It Works: Intuition

Imagine data scattered in 3D space but most of the spread is in a 2D plane. PCA finds that plane (the directions of maximum variance) and projects all points onto it — reducing from 3D to 2D with minimal information loss.
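
The pipeline in the diagram (standardize, covariance matrix, eigenvectors, projection) can be sketched without any libraries for a tiny two-feature case. This version uses power iteration, which recovers only the top component; real implementations such as scikit-learn's `PCA` use SVD and return all components. The data and helper names are assumptions made for illustration.

```python
import math

def standardize(column):
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

def covariance_matrix(columns):
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]

def top_component(cov, n_iter=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(n_iter):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(len(v)))
                     for i in range(len(v)))
    return v, eigenvalue

# Two hypothetical, strongly correlated features
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
cols = [standardize(x), standardize(y)]
cov = covariance_matrix(cols)
pc1, variance = top_component(cov)
explained = variance / sum(cov[i][i] for i in range(len(cov)))
projected = [pc1[0] * a + pc1[1] * b for a, b in zip(*cols)]  # 2-D -> 1-D
print([round(c, 3) for c in pc1], round(explained, 3))
```

Because the two features are almost perfectly correlated, the first component points along the diagonal and a single dimension retains nearly all of the variance.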

Example: Dimensionality Reduction for Visualization

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Original: 50 features
X_scaled = StandardScaler().fit_transform(X)  # Always scale first!

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# How much information is preserved?
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
# Output: "Variance explained: 72.4%"
# → 2 components capture 72.4% of the total variance

# For modeling: find k that captures 95%
pca_95 = PCA(n_components=0.95)  # auto-select k
X_reduced = pca_95.fit_transform(X_scaled)
print(f"Components needed for 95%: {pca_95.n_components_}")
# Output: "Components needed for 95%: 12"
# → Reduced from 50 to 12 features!

When to Use and When Not

graph LR
    linkStyle default stroke:#000,color:#000
    PCA_NODE["PCA"] --> USE["✅ Use when"]
    PCA_NODE --> AVOID["❌ Avoid when"]

    USE --> U1["Features are correlated<br/>(redundant)"]
    USE --> U2["Visualization needed<br/>(reduce to 2-3D)"]
    USE --> U3["Speed up training<br/>(fewer features)"]
    USE --> U4["Reduce noise<br/>(drop low-variance components)"]

    AVOID --> A1["Features are<br/>already independent"]
    AVOID --> A2["Interpretability is critical<br/>(components are<br/>hard to explain)"]
    AVOID --> A3["Non-linear relationships<br/>dominate (use t-SNE,<br/>UMAP, or autoencoders)"]

    style USE fill:#56cc9d,stroke:#333,color:#fff
    style AVOID fill:#ff7851,stroke:#333,color:#fff

Application

Image compression: Reduce image from 784 pixels (28×28) to 50 components
Genomics: Visualize population structure from thousands of genetic markers
Finance: Identify latent factors driving asset returns
Preprocessing: Remove multicollinearity before linear regression

Q7: What is the difference between generative and discriminative models?

Answer:

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph disc["Discriminative Model"]
        direction TB
        D1["Learns: P(y|x) directly"]
        D2["'Given features,<br/>what's the class?'"]
        D3["Draws decision boundary"]
    end

    disc --> D_EX["Examples:<br/>• Logistic Regression<br/>• SVM<br/>• Neural Networks<br/>• Random Forest"]

    style disc fill:#6cc3d5,stroke:#333,color:#fff

graph TD
    linkStyle default stroke:#000,color:#000
    subgraph gen["Generative Model"]
        direction TB
        G1["Learns: P(x|y) and P(y)"]
        G2["'What does each<br/>class look like?'"]
        G3["Models full data distribution"]
    end

    gen --> G_EX["Examples:<br/>• Naive Bayes<br/>• Gaussian Mixture Models<br/>• VAE, GANs<br/>• Hidden Markov Models"]

    style gen fill:#56cc9d,stroke:#333,color:#fff

Intuition: Cat vs. Dog Classifier

Discriminative approach: Learn the boundary between cats and dogs. “This side = cat, that side = dog.” Doesn’t know what a cat or dog looks like — just where the line is.

Generative approach: Learn what cats look like (fur patterns, ear shapes) and what dogs look like separately. Classify new images by asking “Does this look more like a cat or a dog?” Can also generate new cat/dog images.

Understanding the Math

Discriminative — models P(y|x) directly:

Asks: “Given these input features x, what is the probability of each class y?”
Example: Given an email’s word frequencies, directly output P(\text{spam} | \text{words}) = 0.87
Learns the decision boundary without modeling how the data was generated

Generative — models P(x|y) \cdot P(y) then applies Bayes’ rule:

P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)}

P(x|y) = likelihood — “What does data from class y look like?” (e.g., what word patterns do spam emails have?)
P(y) = prior — “How common is class y?” (e.g., 20% of all emails are spam)
P(x) = evidence — normalizing constant (same for all classes, often ignored)
To classify: compute P(x|y) \cdot P(y) for each class, pick the highest
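
A worked instance of the rule, using the 20% spam prior from the list above. The two likelihood values are hypothetical numbers chosen for illustration:

```python
prior = {"spam": 0.20, "ham": 0.80}          # P(y)
likelihood = {"spam": 0.05, "ham": 0.002}    # P(x|y): hypothetical values

joint = {c: likelihood[c] * prior[c] for c in prior}    # P(x|y) * P(y)
evidence = sum(joint.values())                          # P(x)
posterior = {c: joint[c] / evidence for c in joint}     # P(y|x)
predicted = max(joint, key=joint.get)

print(predicted, round(posterior["spam"], 3))  # spam 0.862
```

The predicted class is the same whether the joint values or the posteriors are compared, which is why P(x) can be skipped when only the label is needed.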

Example: Spam Detection — Two Approaches

# Discriminative: Logistic Regression
# Learns P(spam | words) directly
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_tfidf, y_labels)
# Finds the decision boundary in word-frequency space

# Generative: Naive Bayes
# Learns P(words | spam) and P(words | not_spam) separately
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_tfidf, y_labels)
# Models how spam emails "look" vs. how normal emails "look"
# Classifies using Bayes' rule: P(spam|words) ∝ P(words|spam)·P(spam)

Comparison

Aspect	Discriminative	Generative
What it models	P(y|x): boundary	P(x|y)·P(y): full distribution
Accuracy with enough data	Usually higher	Often lower for classification
Small data performance	Can struggle	Often better (stronger assumptions help)
Can generate new data?	No	Yes
Handles missing features	Poorly	Naturally (marginalize out)
Training efficiency	Focuses only on boundary	Models more than needed for classification

Application

Discriminative: Most production classification tasks (credit scoring, image classification, NLP)
Generative: Data augmentation (GANs), anomaly detection, handling missing data, text generation (GPT), drug discovery
Modern trend: Generative AI (LLMs, diffusion models) uses generative models for creation, while discriminative models remain dominant for classification/prediction tasks

Q8: What is gradient boosting and how does XGBoost work?

Answer:

Gradient boosting builds an ensemble sequentially: each new model is fit to the residual errors of the current ensemble (more generally, to the negative gradient of the loss, which for squared error is the residual).

graph TD
    linkStyle default stroke:#000,color:#000
    A["Training Data<br/>(X, y)"] --> B["Model 1: Simple tree<br/>Prediction: ŷ₁"]
    B --> C["Compute residuals<br/>r₁ = y - ŷ₁"]
    C --> D["Model 2: Fit residuals r₁<br/>Prediction: ŷ₂"]
    D --> E["Compute residuals<br/>r₂ = y - (ŷ₁ + η·ŷ₂)"]
    E --> F["Model 3: Fit residuals r₂<br/>Prediction: ŷ₃"]
    F --> G["...continue..."]
    G --> H["Final: ŷ = ŷ₁ + η·ŷ₂ + η·ŷ₃ + ..."]

    style A fill:#6cc3d5,stroke:#333,color:#fff
    style H fill:#56cc9d,stroke:#333,color:#fff
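
The loop in the diagram can be sketched from scratch with one-split regression trees (stumps) and squared-error loss. This simplified version starts from the mean of y rather than a first tree and applies the learning rate η to every stump; the helper names and toy data are assumptions:

```python
def fit_stump(x, r):
    """Best single-split regression stump for 1-D inputs x and targets r."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds (right side never empty)
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=50, eta=0.1):
    base = sum(y) / len(y)              # initial prediction: the mean
    stumps, pred = [], [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)  # each new model fits the residuals
        stumps.append(stump)
        pred = [pi + eta * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + eta * sum(s(xi) for s in stumps)

x = [float(i) for i in range(10)]       # toy data: y = x^2
y = [xi ** 2 for xi in x]
model = gradient_boost(x, y)
mean_y = sum(y) / len(y)
mse_before = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_after = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(f"MSE with mean only: {mse_before:.1f}, after boosting: {mse_after:.1f}")
```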

How XGBoost Improves Gradient Boosting

graph LR
    linkStyle default stroke:#000,color:#000
    GB["Standard<br/>Gradient Boosting"] --> XGB["XGBoost<br/>Improvements"]

    XGB --> I1["Regularization<br/>(L1 + L2 on<br/>leaf weights)"]
    XGB --> I2["Second-order gradients<br/>(Newton's method<br/>— faster convergence)"]
    XGB --> I3["Column subsampling<br/>(like Random Forest<br/>— reduces overfitting)"]
    XGB --> I4["Built-in missing<br/>value handling<br/>(learns optimal direction)"]
    XGB --> I5["Tree pruning<br/>(max_depth +<br/>gain-based pruning)"]
    XGB --> I6["Parallel feature<br/>computation<br/>(fast training)"]

    style XGB fill:#56cc9d,stroke:#333,color:#fff
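
The regularization and second-order ideas meet in how a leaf value and a split are scored. The sketch below follows the formulas from the XGBoost documentation for the L2-only case (it ignores the L1 term controlled by `reg_alpha`); the data and function names are assumptions:

```python
def leaf_weight(grads, hessians, reg_lambda=1.0):
    """Optimal leaf value under an L2 penalty: w* = -G / (H + lambda)."""
    return -sum(grads) / (sum(hessians) + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting one leaf into two (higher is better)."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Squared-error loss 1/2 * (pred - y)^2: gradient = pred - y, hessian = 1
y = [10.0, 12.0, 30.0, 32.0]
pred = [20.0] * 4
grads = [p - t for p, t in zip(pred, y)]   # [10, 8, -10, -12]
hess = [1.0] * 4

w_all = leaf_weight(grads, hess)           # a single leaf covering all rows
gain = split_gain(sum(grads[:2]), 2.0, sum(grads[2:]), 2.0)
print(round(w_all, 2), round(gain, 2))     # 0.8 133.07
```

Larger `reg_lambda` shrinks leaf values toward zero, and a split is kept only when its gain exceeds `gamma`, which is the gain-based pruning named in the diagram.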

Example: House Price Prediction

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,        # 500 sequential trees
    max_depth=4,             # shallow trees (high bias per tree)
    learning_rate=0.05,      # shrinkage — small steps
    subsample=0.8,           # 80% of rows per tree
    colsample_bytree=0.8,   # 80% of features per tree
    reg_alpha=0.1,           # L1 regularization
    reg_lambda=1.0,          # L2 regularization
    early_stopping_rounds=50 # stop if no improvement (constructor argument in xgboost >= 1.6)
)

# With eval set for early stopping
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

# Result: RMSE improved from $45K (single tree) to $18K (XGBoost)
print(f"Best iteration: {model.best_iteration}")  # e.g. 312; training runs 50 more rounds before stopping

Key Hyperparameters and Tuning Order

Priority	Parameter	Range	Effect
1st	`learning_rate`	0.01-0.3	Lower = more robust but needs more trees
1st	`n_estimators`	100-5000	Use early stopping to find optimal
2nd	`max_depth`	3-8	Controls tree complexity
2nd	`subsample`	0.6-1.0	Row sampling (regularization)
3rd	`colsample_bytree`	0.6-1.0	Feature sampling (regularization)
3rd	`reg_alpha`, `reg_lambda`	0-10	Weight penalties

Application

Kaggle competitions: XGBoost and LightGBM have won a large share of tabular-data competitions
Industry standard: Fraud detection, credit scoring, recommendation ranking
When to use: Tabular data, roughly up to 1M rows as a rule of thumb (for larger data, LightGBM is often faster)
When NOT to use: Image/text/audio data (use deep learning), very small data (use simpler models)

Q9: How do you handle missing data?

Answer:

Missing data handling requires understanding why data is missing before choosing a strategy.

graph TD
    linkStyle default stroke:#000,color:#000
    MISSING["Missing Data"] --> TYPE["Understand the type"]

    TYPE --> MCAR["MCAR<br/>Missing Completely<br/>at Random<br/>(no pattern)"]
    TYPE --> MAR["MAR<br/>Missing at Random<br/>(depends on<br/>observed features)"]
    TYPE --> MNAR["MNAR<br/>Missing Not at Random<br/>(depends on the<br/>missing value itself)"]

    MCAR --> MCAR_EX["Example: Sensor<br/>randomly fails<br/>→ Safe to drop or impute"]
    MAR --> MAR_EX["Example: Older respondents<br/>skip income question<br/>(age is observed)<br/>→ Impute using other features"]
    MNAR --> MNAR_EX["Example: Sick patients<br/>miss appointments<br/>→ Missingness IS informative"]

    style MCAR fill:#56cc9d,stroke:#333,color:#fff
    style MAR fill:#6cc3d5,stroke:#333,color:#fff
    style MNAR fill:#ff7851,stroke:#333,color:#fff

Strategies Decision Tree

graph TD
    linkStyle default stroke:#000,color:#000
    Q1{"How much is missing?"} -->|">50% of column"| DROP_COL["Drop the column"]
    Q1 -->|"<5% of rows"| DROP_ROW["Drop rows<br/>(if MCAR)"]
    Q1 -->|"5-50%"| Q2{"What type of feature?"}

    Q2 -->|"Numerical"| Q3{"Distribution?"}
    Q2 -->|"Categorical"| CAT["Mode or 'Unknown' category"]

    Q3 -->|"Symmetric"| MEAN["Mean imputation"]
    Q3 -->|"Skewed / outliers"| MEDIAN["Median imputation"]
    Q3 -->|"Complex patterns"| MODEL["Model-based<br/>(KNN, Iterative)"]

    DROP_COL --> FLAG["+ Add missingness indicator<br/>if MNAR suspected"]
    MEDIAN --> FLAG

    style FLAG fill:#ffce67,stroke:#333, color:#000
    style MODEL fill:#56cc9d,stroke:#333,color:#fff

Example: Customer Data with Missing Values

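import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split

# Dataset (df is assumed to be a pandas DataFrame):
# age: 2% missing (random sensor error) → MCAR
# income: 15% missing (older customers skip it more often; age is observed) → MAR
# credit_score: 30% missing (no credit history yet; the gap itself is informative) → MNAR

# CRITICAL: split first, then fit every imputer on TRAINING data only!
num_cols = ['age', 'income', 'credit_score']
X_train, X_test = train_test_split(df[num_cols], random_state=0)
X_train, X_test = X_train.copy(), X_test.copy()

# credit_score: add a missingness flag, then fill with the TRAINING median
median_cs = X_train['credit_score'].median()
for part in (X_train, X_test):
    part['credit_score_missing'] = part['credit_score'].isna().astype(int)  # flag
    part['credit_score'] = part['credit_score'].fillna(median_cs)

# age: median imputation is robust to outliers
imputer_age = SimpleImputer(strategy='median')
X_train[['age']] = imputer_age.fit_transform(X_train[['age']])  # fit + transform
X_test[['age']] = imputer_age.transform(X_test[['age']])        # only transform!

# income: KNN imputation borrows values from similar customers
imputer = KNNImputer(n_neighbors=5)
X_train_imputed = imputer.fit_transform(X_train)   # fit + transform
X_test_imputed = imputer.transform(X_test)         # only transform!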

Common Mistakes

Mistake	Why it’s wrong	Fix
Impute before splitting	Leaks test info into training	Split first, fit imputer on train only
Use mean for skewed data	Mean pulled by outliers	Use median
Drop all missing rows	Loses data + introduces bias	Impute or flag
Ignore MNAR patterns	Loses predictive signal	Add missingness indicator
Impute time series with future	Temporal leakage	Use forward-fill or rolling window
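
For the time-series row, forward-fill means each gap takes the most recent earlier value, so nothing from the future leaks backward. A minimal sketch on a made-up series (`forward_fill` is a hypothetical helper; pandas offers the same behavior through `Series.ffill()`):

```python
def forward_fill(values):
    """Replace each None with the most recent earlier observation."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # a leading gap stays None: no past value exists yet
    return filled

readings = [20.5, None, None, 21.0, None, 19.8]  # hypothetical sensor series
filled = forward_fill(readings)
print(filled)  # [20.5, 20.5, 20.5, 21.0, 21.0, 19.8]
```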

Application

Healthcare: Patient data often has MNAR (sicker patients have more missing tests) — missingness flag is critical
Surveys: Income/age often MAR — use KNN imputer with demographic features
IoT/Sensors: Usually MCAR — simple median/interpolation works
Production systems: Build imputation into the ML pipeline (sklearn Pipeline) so it’s applied consistently at training and inference time

Q10: What is data leakage and how do you prevent it?

Answer:

Data leakage occurs when information that would not be available at prediction time is used during training. It inflates metrics offline but causes catastrophic failure in production.

graph LR
    linkStyle default stroke:#000,color:#000
    subgraph leakage["Data Leakage:<br/>What Happens"]
        direction LR
        L1["Training uses<br/>future/target info"] --> L2["Model gets 99%<br/>accuracy offline"]
        L2 --> L3["Deploy to production"]
        L3 --> L4["Performance drops to 60%<br/>❌ FAILURE"]
    end

    style leakage fill:#ff7851,stroke:#333,color:#fff
    linkStyle default stroke:#000

graph LR
    linkStyle default stroke:#000,color:#000
    subgraph clean["No Leakage:<br/>What Should Happen"]
        direction LR
        C1["Training uses<br/>only available info"] --> C2["Model gets 85%<br/>accuracy offline"]
        C2 --> C3["Deploy to production"]
        C3 --> C4["Performance stays at 83%<br/>✅ SUCCESS"]
    end

    style clean fill:#56cc9d,stroke:#333,color:#fff
    linkStyle default stroke:#000

Intuitive Example: Predicting Hospital Readmission

Imagine you’re building a model to predict whether a patient will be readmitted within 30 days.

Feature	Leakage?	Why
Patient age, diagnosis	✅ Safe	Available at discharge
Length of stay	✅ Safe	Known when patient leaves
“Readmission scheduled” flag	❌ Leakage!	Only exists AFTER readmission happens
Discharge summary mentioning “follow-up in 2 weeks”	⚠️ Subtle leakage	Written by doctor who already decided on readmission plan
Number of future appointments booked	❌ Leakage!	Created after the prediction point

The key question: “Would I have this feature at the moment I need to make the prediction?”

If the answer is no — it’s leakage. The model isn’t learning to predict the future; it’s learning to read the future.

Common Types of Leakage

graph TD
    linkStyle default stroke:#000,color:#000
    LEAK["Data Leakage Types"] --> T1["Target Leakage<br/>(feature derived from target)"]
    LEAK --> T2["Temporal Leakage<br/>(using future data)"]
    LEAK --> T3["Train-Test Contamination<br/>(preprocessing on full data)"]
    LEAK --> T4["Group Leakage<br/>(same entity in train & test)"]

    T1 --> T1_EX["Example: 'diagnosis_code'<br/>predicting 'has_disease'<br/>(code IS the diagnosis!)"]
    T2 --> T2_EX["Example: Using tomorrow's<br/>stock price as a feature<br/>to predict today's"]
    T3 --> T3_EX["Example: Scaling/encoding<br/>fit on full data before split"]
    T4 --> T4_EX["Example: Same patient in<br/>train & test<br/>(memorizes patient, not pattern)"]

    style T1 fill:#ff7851,stroke:#333,color:#fff
    style T2 fill:#ffce67,stroke:#333
    style T3 fill:#6cc3d5,stroke:#333,color:#fff
    style T4 fill:#56cc9d,stroke:#333,color:#fff
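
Group leakage is avoided by splitting on the entity rather than on the row. The sketch below assigns whole patients to one side of the split; the helper `group_split` and the IDs are assumptions, and scikit-learn's `GroupShuffleSplit` and `GroupKFold` cover the same need in practice.

```python
import random

def group_split(groups, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p4"]  # hypothetical
train_idx, test_idx = group_split(patient_ids)
train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
print(train_patients.isdisjoint(test_patients))  # True
```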

Example: Churn Prediction with Leakage

# ❌ LEAKAGE: Feature "days_since_last_login" is computed AFTER the churn event
# If someone churned 30 days ago, days_since_last_login = 30
# The model is just detecting "they already churned" not "they will churn"

# ❌ LEAKAGE: Scaling before splitting
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit on ALL data including test
X_train, X_test = train_test_split(X_scaled)
# Test data statistics leaked into scaler!

# ✅ CORRECT: Split first, then preprocess
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit only on train
X_test_scaled = scaler.transform(X_test)        # transform only

Prevention Checklist

graph TD
    linkStyle default stroke:#000,color:#000
    A["Prevention Strategy"] --> B["1. Split FIRST<br/>before any preprocessing"]
    B --> C["2. Validate feature availability<br/>'Would I have this at inference time?'"]
    C --> D["3. Use time-based splits<br/>for temporal data"]
    D --> E["4. Group by entity<br/>(user, patient, store)"]
    E --> F["5. Sanity check<br/>'Is accuracy suspiciously high?'"]
    F --> G["6. Test with shuffled target<br/>(should drop to chance level,<br/>~50% for balanced binary)"]

    style A fill:#56cc9d,stroke:#333,color:#000
    style B fill:#56cc9d,stroke:#333,color:#000
    style C fill:#56cc9d,stroke:#333,color:#000
    style D fill:#56cc9d,stroke:#333,color:#000
    style E fill:#56cc9d,stroke:#333,color:#000 
    style F fill:#ffce67,stroke:#333,color:#000
    style G fill:#ff7851,stroke:#333,color:#000

Red Flags That Suggest Leakage

Signal	What to check
Accuracy > 95% on first attempt	Too good to be true — inspect features
Single feature dominates importance	May be a proxy for the target
Train and test scores are both near-perfect and nearly identical	Model may be seeing test info
Performance drops dramatically in production	Classic leakage symptom
Cross-validation scores are unstable	Leakage present in some folds

Application

Time series: Always use forward-chaining (train on past, predict future). Never shuffle temporal data.
Medical studies: Ensure no patient appears in both train and test sets.
Feature stores: Implement point-in-time correctness — features computed using only data available at prediction time.
ML pipelines: Use sklearn Pipeline to bundle preprocessing + model, ensuring transforms are fit only on training data during cross-validation.

Summary

Question	Core Concept	Key Takeaway
Q1	Precision/Recall/F1	Choose metrics based on error costs, not defaults
Q2	ROC-AUC	Good for ranking; use PR-AUC for imbalanced data
Q3	Imbalanced data	Fix metrics first, then weights, then resampling
Q4	Feature engineering	Better features beat better models — invest here
Q5	Curse of dimensionality	High dimensions break distance; reduce or regularize
Q6	PCA	Find maximum-variance directions; scale first
Q7	Generative vs. Discriminative	Discriminative for classification; generative for creation
Q8	Gradient Boosting/XGBoost	Sequential error correction; king of tabular data
Q9	Missing data	Understand WHY it’s missing before choosing how to fix
Q10	Data leakage	Split first; validate feature availability at inference time

Previous: ML Interview QA - 1 covers learning paradigms, bias-variance, overfitting, regularization, gradient descent, cross-validation, logistic regression, decision trees, Random Forest, and bagging vs. boosting.
