<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Vectoring AI</title>
<link>https://vectoringai.com/pages/ml-interview.html</link>
<atom:link href="https://vectoringai.com/pages/ml-interview.xml" rel="self" type="application/rss+xml"/>
<description>Machine learning interview questions, concepts, and preparation guides covering algorithms, system design, and practical ML engineering.</description>
<generator>quarto-1.9.36</generator>
<lastBuildDate>Sun, 17 May 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>ML Interview QA - 1</title>
  <dc:creator>Vectoring AI</dc:creator>
  <link>https://vectoringai.com/posts/ml-interview/ML-Interview-QA-1.html</link>
  <description><![CDATA[ 




<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>This is <strong>Part 1</strong> of our ML Interview QA series. It covers 10 foundational questions that appear in nearly every ML Engineer, Data Scientist, and Applied AI interview — from startups to FAANG. Each answer goes beyond surface-level definitions with diagrams, concrete examples, and real-world applications.</p>
<blockquote class="blockquote">
<p>For evaluation metrics, feature engineering, and data handling questions, see <a href="../../posts/ml-interview/ML-Interview-QA-2.html">ML Interview QA - 2</a>.</p>
</blockquote>
<hr>
</section>
<section id="q1-what-is-the-difference-between-supervised-unsupervised-and-reinforcement-learning" class="level2">
<h2 class="anchored" data-anchor-id="q1-what-is-the-difference-between-supervised-unsupervised-and-reinforcement-learning">Q1: What is the difference between supervised, unsupervised, and reinforcement learning?</h2>
<p><strong>Answer:</strong></p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    ML["Machine Learning"] --&gt; SUP["Supervised Learning"]
    ML --&gt; UNSUP["Unsupervised Learning"]
    ML --&gt; RL["Reinforcement Learning"]

    SUP --&gt; SUP_IN["Input: Labeled data&lt;br/&gt;(X, y) pairs"]
    SUP --&gt; SUP_GOAL["Goal: Learn mapping f(X) → y"]
    SUP --&gt; SUP_EX["Examples:&lt;br/&gt;• Classification&lt;br/&gt;• Regression"]

    UNSUP --&gt; UNSUP_IN["Input: Unlabeled data&lt;br/&gt;(X only)"]
    UNSUP --&gt; UNSUP_GOAL["Goal: Find hidden structure"]
    UNSUP --&gt; UNSUP_EX["Examples:&lt;br/&gt;• Clustering&lt;br/&gt;• Dimensionality Reduction"]

    RL --&gt; RL_IN["Input: Environment + Rewards"]
    RL --&gt; RL_GOAL["Goal: Maximize cumulative reward"]
    RL --&gt; RL_EX["Examples:&lt;br/&gt;• Game Playing&lt;br/&gt;• Robotics"]

    style SUP fill:#56cc9d,stroke:#333,color:#fff
    style UNSUP fill:#6cc3d5,stroke:#333,color:#fff
    style RL fill:#ffce67,stroke:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="detailed-breakdown" class="level3">
<h3 class="anchored" data-anchor-id="detailed-breakdown">Detailed Breakdown</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 16%">
<col style="width: 22%">
<col style="width: 29%">
<col style="width: 31%">
</colgroup>
<thead>
<tr class="header">
<th>Aspect</th>
<th>Supervised</th>
<th>Unsupervised</th>
<th>Reinforcement</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Data</td>
<td>Labeled (X, y)</td>
<td>Unlabeled (X)</td>
<td>States, actions, rewards</td>
</tr>
<tr class="even">
<td>Feedback</td>
<td>Direct (correct answers)</td>
<td>None</td>
<td>Delayed (reward signal)</td>
</tr>
<tr class="odd">
<td>Goal</td>
<td>Predict outcome</td>
<td>Discover structure</td>
<td>Maximize long-term reward</td>
</tr>
<tr class="even">
<td>Evaluation</td>
<td>Compare predictions to labels</td>
<td>Internal metrics (silhouette, inertia)</td>
<td>Cumulative reward</td>
</tr>
</tbody>
</table>
</section>
<section id="example-e-commerce-company" class="level3">
<h3 class="anchored" data-anchor-id="example-e-commerce-company">Example: E-commerce Company</h3>
<p>Imagine you work at an e-commerce company:</p>
<ul>
<li><strong>Supervised:</strong> Predict whether a user will buy a product given their browsing history (you have past purchase labels).</li>
<li><strong>Unsupervised:</strong> Segment customers into groups based on behavior patterns (no predefined groups).</li>
<li><strong>Reinforcement:</strong> Train a recommendation agent that learns which products to show to maximize click-through rate over time.</li>
</ul>
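<p>The reinforcement case above can be sketched with a tiny epsilon-greedy agent. This is a minimal illustration, not a production recommender: the three click rates, the function name <code>run_epsilon_greedy</code>, and the values of <code>epsilon</code> and <code>steps</code> are all assumptions made for the example.</p>

```python
import random

def run_epsilon_greedy(click_rates, steps=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent choosing which product to show.

    click_rates holds hypothetical true click probabilities, one per product.
    Returns (estimated click rate per product, times each product was shown).
    """
    rng = random.Random(seed)
    n_products = len(click_rates)
    shown = [0] * n_products
    estimates = [0.0] * n_products

    for _ in range(steps):
        if rng.random() > 1.0 - epsilon:
            # Explore: show a random product.
            choice = rng.randrange(n_products)
        else:
            # Exploit: show the product with the best estimate so far.
            choice = max(range(n_products), key=lambda i: estimates[i])
        # The environment returns a reward: 1 for a click, 0 otherwise.
        reward = 1.0 if click_rates[choice] > rng.random() else 0.0
        shown[choice] += 1
        # Incremental running mean of the rewards seen for this product.
        estimates[choice] += (reward - estimates[choice]) / shown[choice]

    return estimates, shown

estimates, shown = run_epsilon_greedy([0.1, 0.3, 0.6])
best = max(range(len(estimates)), key=lambda i: estimates[i])
print("product with best estimate:", best)
print("times each product was shown:", shown)
```

<p>Unlike the supervised case, the agent never sees a label saying which product was "correct". It only observes the reward for the product it chose, which is why it has to balance exploration against exploitation.</p>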
</section>
<section id="applications" class="level3">
<h3 class="anchored" data-anchor-id="applications">Applications</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 21%">
<col style="width: 78%">
</colgroup>
<thead>
<tr class="header">
<th>Type</th>
<th>Industry Applications</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Supervised</td>
<td>Spam detection, credit scoring, medical diagnosis, price prediction</td>
</tr>
<tr class="even">
<td>Unsupervised</td>
<td>Customer segmentation, anomaly detection, topic modeling, gene clustering</td>
</tr>
<tr class="odd">
<td>Reinforcement</td>
<td>Autonomous driving, game AI (AlphaGo), ad bidding, inventory management</td>
</tr>
</tbody>
</table>
</section>
<section id="when-to-choose" class="level3">
<h3 class="anchored" data-anchor-id="when-to-choose">When to Choose</h3>
<ul>
<li><strong>Use supervised</strong> when you have labeled data and a clear target variable.</li>
<li><strong>Use unsupervised</strong> when you want to explore data structure without predefined categories.</li>
<li><strong>Use reinforcement</strong> when the problem involves sequential decisions with delayed feedback.</li>
</ul>
<hr>
</section>
</section>
<section id="q2-what-is-the-bias-variance-tradeoff" class="level2">
<h2 class="anchored" data-anchor-id="q2-what-is-the-bias-variance-tradeoff">Q2: What is the bias-variance tradeoff?</h2>
<p><strong>Answer:</strong></p>
<p>The expected squared prediction error of a model, averaged over the training sets it could have been fit on, decomposes into three components:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BTotal%20Error%7D%20=%20%5Ctext%7BBias%7D%5E2%20+%20%5Ctext%7BVariance%7D%20+%20%5Ctext%7BIrreducible%20Noise%7D"></p>
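<p>Written out more formally, the decomposition at a single input looks like the sketch below. It assumes squared-error loss and data generated as y = f(x) + ε with zero-mean noise of variance σ². The symbols f (true function), f̂ (model fit on a random training set), and σ² are introduced here for illustration, and the expectations are taken over training sets and noise.</p>

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible Noise}}
```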
<section id="what-is-bias" class="level3">
<h3 class="anchored" data-anchor-id="what-is-bias">What is Bias?</h3>
<p><strong>Bias</strong> measures how far off a model’s average predictions are from the true values. It reflects the <strong>systematic error</strong> introduced by simplifying assumptions in the model.</p>
<p><strong>Example:</strong> Imagine predicting house prices in a neighborhood where prices increase exponentially with size. If you fit a straight line (linear model), your predictions will consistently miss the curve, typically underestimating the smallest and largest houses and overestimating mid-sized ones. That consistent miss is bias.</p>
<ul>
<li><strong>High bias</strong> = the model is too simple to capture the real relationship (underfitting)</li>
<li><strong>Low bias</strong> = the model’s average prediction is close to the true value</li>
</ul>
</section>
<section id="what-is-variance" class="level3">
<h3 class="anchored" data-anchor-id="what-is-variance">What is Variance?</h3>
<p><strong>Variance</strong> measures how much a model’s predictions <strong>change</strong> when trained on different subsets of data. It reflects sensitivity to the specific training data used.</p>
<p><strong>Example:</strong> Imagine training a very complex polynomial model on 100 houses. Now retrain it on a <em>different</em> random sample of 100 houses from the same city. If the two models give wildly different predictions for the same house, that’s high variance — the model is memorizing quirks of each specific sample rather than learning stable patterns.</p>
<ul>
<li><strong>High variance</strong> = the model changes drastically with different training data (overfitting)</li>
<li><strong>Low variance</strong> = the model produces consistent predictions regardless of which training subset is used</li>
</ul>
</section>
<section id="what-is-irreducible-noise" class="level3">
<h3 class="anchored" data-anchor-id="what-is-irreducible-noise">What is Irreducible Noise?</h3>
<p><strong>Irreducible noise</strong> (also called Bayes error) is the inherent randomness in the data that <strong>no model can eliminate</strong>, no matter how perfect.</p>
<p><strong>Example:</strong> Two identical houses (same size, location, age, condition) sell for different prices because one buyer was emotionally attached to the neighborhood and overpaid. This randomness — caused by unmeasured factors, human behavior, or measurement error — sets a floor on prediction error.</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BIrreducible%20Noise%7D%20=%20%5Ctext%7BVar%7D(%5Cepsilon)"></p>
<p>For a fixed set of features, you cannot reduce it by improving your model. The only way to lower it is to change the problem itself, for example by collecting more informative features or cleaner measurements.</p>
</section>
<section id="the-tradeoff" class="level3">
<h3 class="anchored" data-anchor-id="the-tradeoff">The Tradeoff</h3>
<p>The <strong>tradeoff</strong>: as you increase model complexity, bias decreases (the model can capture more patterns) but variance increases (the model becomes more sensitive to training data). The goal is to find the sweet spot that minimizes total error.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph high_bias["High Bias (Underfitting)"]
        direction TB
        HB1["Simple model"]
        HB2["Misses the true pattern"]
        HB3["Both train &amp; val error HIGH"]
    end

    subgraph sweet["Sweet Spot"]
        direction TB
        SS1["Right complexity"]
        SS2["Captures pattern, ignores noise"]
        SS3["Both errors LOW"]
    end

    subgraph high_var["High Variance (Overfitting)"]
        direction TB
        HV1["Complex model"]
        HV2["Fits noise in training data"]
        HV3["Train error LOW, val error HIGH"]
    end

    high_bias --&gt;|"Increase complexity"| sweet
    sweet --&gt;|"Increase complexity"| high_var

    style high_bias fill:#ff7851,stroke:#333,color:#fff
    style sweet fill:#56cc9d,stroke:#333,color:#fff
    style high_var fill:#ffce67,stroke:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
</section>
<section id="intuition-with-a-concrete-example" class="level3">
<h3 class="anchored" data-anchor-id="intuition-with-a-concrete-example">Intuition with a Concrete Example</h3>
<p><strong>Scenario:</strong> Predict house prices from square footage.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 15%">
<col style="width: 29%">
<col style="width: 13%">
<col style="width: 22%">
<col style="width: 18%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>What it does</th>
<th>Bias</th>
<th>Variance</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Linear (1 feature)</td>
<td>Fits a straight line</td>
<td>High — misses non-linear patterns</td>
<td>Low — stable across datasets</td>
<td>Underfits</td>
</tr>
<tr class="even">
<td>Polynomial (degree 2-3)</td>
<td>Fits gentle curves</td>
<td>Low — captures the pattern</td>
<td>Medium — somewhat stable</td>
<td>Good fit</td>
</tr>
<tr class="odd">
<td>Polynomial (degree 20)</td>
<td>Fits every data point</td>
<td>Very low — passes through all points</td>
<td>Very high — completely different on new data</td>
<td>Overfits</td>
</tr>
</tbody>
</table>
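<p>The pattern in the table can be checked empirically by refitting a model on many resampled training sets and looking at its predictions for one fixed input. The sketch below is dependency-free, so it swaps the polynomial models for two simpler stand-ins: a mean predictor (very rigid) and a 1-nearest-neighbor predictor (very flexible). The true function, noise level, sample size, and all function names are assumptions chosen for illustration.</p>

```python
import random
import statistics

def true_f(x):
    # Hypothetical ground-truth relationship.
    return x * x

def sample_training_set(rng, n=30, noise_sd=0.3):
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_mean(xs, ys):
    # Rigid model: always predicts the average training target.
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_1nn(xs, ys):
    # Flexible model: copies the target of the closest training point.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def bias2_and_variance(fit, x0, n_repeats=300, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    preds = [fit(*sample_training_set(rng))(x0) for _ in range(n_repeats)]
    bias2 = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias2, variance

for name, fit in [("mean predictor", fit_mean), ("1-nearest neighbor", fit_1nn)]:
    bias2, variance = bias2_and_variance(fit, x0=0.9)
    print(f"{name}: bias^2={bias2:.3f}, variance={variance:.3f}")
```

<p>The rigid model should show the larger squared bias and the flexible model the larger variance, mirroring the first and last rows of the table.</p>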
</section>
<section id="visual-analogy-dartboard" class="level3">
<h3 class="anchored" data-anchor-id="visual-analogy-dartboard">Visual Analogy: Dartboard</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph low_b_low_v["Low Bias, Low Variance ✅ IDEAL"]
        A1["🎯 Darts clustered at center"]
    end
    subgraph low_b_high_v["Low Bias, High Variance"]
        A2["🎯 Darts scattered around center"]
    end

    style low_b_low_v fill:#56cc9d,stroke:#333,color:#fff
    style low_b_high_v fill:#6cc3d5,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph high_b_low_v["High Bias, Low Variance"]
        A3["🎯 Darts clustered away from center"]
    end
    subgraph high_b_high_v["High Bias, High Variance"]
        A4["🎯 Darts scattered away from center"]
    end

    style high_b_low_v fill:#ffce67,stroke:#333
    style high_b_high_v fill:#ff7851,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<ul>
<li><strong>Bias</strong> = how far the average prediction is from the true value (accuracy)</li>
<li><strong>Variance</strong> = how much predictions scatter across different training sets (consistency)</li>
</ul>
</section>
<section id="how-to-diagnose-and-fix" class="level3">
<h3 class="anchored" data-anchor-id="how-to-diagnose-and-fix">How to Diagnose and Fix</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 40%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Symptom</th>
<th>Diagnosis</th>
<th>Fixes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>High train error + high val error</td>
<td>High bias (underfitting)</td>
<td>More features, more complex model, less regularization</td>
</tr>
<tr class="even">
<td>Low train error + high val error</td>
<td>High variance (overfitting)</td>
<td>More data, regularization, simpler model, dropout</td>
</tr>
<tr class="odd">
<td>Low train error + low val error</td>
<td>Good balance</td>
<td>Deploy! Monitor for drift.</td>
</tr>
</tbody>
</table>
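<p>The table can be turned into a rough rule of thumb in code. This is only a sketch: the function name <code>diagnose</code> and the thresholds <code>acceptable_error</code> and <code>max_gap</code> are hypothetical and should be chosen for your own metric and problem.</p>

```python
def diagnose(train_error, val_error, acceptable_error=0.10, max_gap=0.05):
    """Map train/validation error to a rough diagnosis.

    acceptable_error and max_gap are hypothetical thresholds, not
    universal constants.
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "high bias (underfitting)"
    if val_error - train_error > max_gap:
        # Fits the training data well but does not generalize.
        return "high variance (overfitting)"
    return "good balance"

print(diagnose(train_error=0.30, val_error=0.32))
print(diagnose(train_error=0.01, val_error=0.28))
print(diagnose(train_error=0.04, val_error=0.06))
```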
</section>
<section id="application" class="level3">
<h3 class="anchored" data-anchor-id="application">Application</h3>
<p>In production ML systems, you continuously monitor this tradeoff:</p>
<ul>
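<li><strong>Credit scoring:</strong> A high-bias model systematically misjudges whole groups of applicants (for example, denying good borrowers); a high-variance model makes decisions that flip depending on which data it was last retrained on, so risky applicants slip through on some splits.</li>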
<li><strong>Medical imaging:</strong> High bias misses tumors; high variance gives false positives on certain patient populations.</li>
</ul>
<hr>
</section>
</section>
<section id="q3-what-is-overfitting-and-how-do-you-prevent-it" class="level2">
<h2 class="anchored" data-anchor-id="q3-what-is-overfitting-and-how-do-you-prevent-it">Q3: What is overfitting and how do you prevent it?</h2>
<p><strong>Answer:</strong></p>
<p>Overfitting occurs when a model learns the <strong>noise and specific quirks</strong> of training data rather than the underlying pattern. It performs excellently on training data but poorly on unseen data.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph training["Training Phase"]
        T1["Model sees data points"]
        T2["Learns true pattern ✅"]
        T3["Also memorizes noise ❌"]
        T1 --&gt; T2
        T1 --&gt; T3
    end

    subgraph result["Result"]
        R1["Training accuracy: 99%"]
        R2["Validation accuracy: 72%"]
        R3["GAP = Overfitting signal"]
        R1 --&gt; R3
        R2 --&gt; R3
    end

    training --&gt; result

    style T2 fill:#56cc9d,stroke:#333,color:#fff
    style T3 fill:#ff7851,stroke:#333,color:#fff
    style R3 fill:#ffce67,stroke:#333
    style training fill:#6cc3d5,stroke:#333,color:#fff
    style result fill:#6cc3d5,stroke:#333,color:#fff

</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="example-spam-detection" class="level3">
<h3 class="anchored" data-anchor-id="example-spam-detection">Example: Spam Detection</h3>
<p>You train a spam classifier on 1000 emails. An overfit model might learn:</p>
<ul>
<li>“Emails from john@company.com sent at 3:14 PM on Tuesday are spam” (memorizing specific instances)</li>
<li>Instead of: “Emails containing ‘free money’ + suspicious links are spam” (learning the pattern)</li>
</ul>
<p>When new spam arrives from different senders, the overfit model fails.</p>
</section>
<section id="prevention-techniques-ordered-by-priority" class="level3">
<h3 class="anchored" data-anchor-id="prevention-techniques-ordered-by-priority">Prevention Techniques (ordered by priority)</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    A["Overfitting Detected"] --&gt; B["1. Get more data&lt;br/&gt;(best fix if possible)"]
    B --&gt; C["2. Regularization&lt;br/&gt;(L1/L2 penalty)"]
    C --&gt; D["3. Early stopping&lt;br/&gt;(stop before memorizing)"]
    D --&gt; E["4. Cross-validation&lt;br/&gt;(reliable evaluation)"]
    E --&gt; F["5. Dropout&lt;br/&gt;(neural networks)"]
    F --&gt; G["6. Feature selection&lt;br/&gt;(remove noise features)"]
    G --&gt; H["7. Ensemble methods&lt;br/&gt;(bagging reduces variance)"]

    style A fill:#ff7851,stroke:#333,color:#fff
    style B fill:#56cc9d,stroke:#333,color:#fff
    style C fill:#56cc9d,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
</section>
<section id="detailed-example-early-stopping" class="level3">
<h3 class="anchored" data-anchor-id="detailed-example-early-stopping">Detailed Example: Early Stopping</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Training a neural network with early stopping</span></span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.neural_network <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> MLPClassifier</span>
<span id="cb1-3"></span>
<span id="cb1-4">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MLPClassifier(</span>
<span id="cb1-5">    hidden_layer_sizes<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>),</span>
<span id="cb1-6">    max_iter<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>,</span>
<span id="cb1-7">    early_stopping<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ← monitor score on a held-out validation split</span></span>
<span id="cb1-8">    validation_fraction<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>,   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ← hold out 10% for monitoring</span></span>
<span id="cb1-9">    n_iter_no_change<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ← stop if no improvement for 10 epochs</span></span>
<span id="cb1-10">)</span></code></pre></div></div>
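<p><strong>What happens:</strong> Suppose the validation score last improved at epoch 47. Training continues for 10 more epochs without sufficient improvement and then stops around epoch 57, instead of running all 1000 iterations (by which point training loss would be near zero but validation performance would have degraded).</p>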
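<p>The patience rule behind a setting like <code>n_iter_no_change</code> can be sketched in a few lines. This is a generic illustration that watches a list of validation losses. It is not scikit-learn's internal implementation, and the function name, <code>patience</code>, and <code>min_delta</code> are assumptions made for the example.</p>

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs pass without the loss improving
    on the best value by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if best_loss - loss > min_delta:
            # New best validation loss: remember where it happened.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical loss curve: improves until epoch 47, then gets worse.
val_losses = [abs(epoch - 47) + 1.0 for epoch in range(100)]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=10)
print("best epoch:", best_epoch, "stopped at epoch:", stop_epoch)
```

<p>In a real training loop you would also save a copy of the weights at <code>best_epoch</code> so they can be restored after stopping.</p>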
</section>
<section id="when-to-worry-about-overfitting" class="level3">
<h3 class="anchored" data-anchor-id="when-to-worry-about-overfitting">When to Worry About Overfitting</h3>
<ul>
<li>Small datasets with many features (curse of dimensionality)</li>
<li>Very deep decision trees or large neural networks</li>
<li>Training for too many epochs</li>
<li>No regularization applied</li>
<li>Data leakage inflating apparent performance, e.g., using future information during training, including target-derived features, or fitting preprocessing such as scaling or encoding on the full dataset before splitting. The model looks strong in development but fails in production because it relied on information it will not have at inference time.</li>
</ul>
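<p>The preprocessing leak in the last bullet is easy to show with a standardization step. In the sketch below the feature values and helper names (<code>fit_scaler</code>, <code>apply_scaler</code>) are hypothetical. The point is only that the scaling statistics must come from the training split.</p>

```python
import statistics

def fit_scaler(values):
    """Learn the mean and standard deviation used for standardization."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return mean, std

def apply_scaler(values, scaler):
    mean, std = scaler
    return [(v - mean) / std for v in values]

# Hypothetical feature column; the last value is held out as the test set.
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = feature[:4], feature[4:]

# Correct: statistics come from the training split only.
scaler = fit_scaler(train)
train_scaled = apply_scaler(train, scaler)
test_scaled = apply_scaler(test, scaler)

# Leaky: statistics computed on the full dataset, test value included.
leaky_scaler = fit_scaler(feature)

print("train-only mean/std:", scaler)
print("full-data mean/std: ", leaky_scaler)
```

<p>The leaky statistics are pulled toward the held-out value, so anything trained on data scaled that way has indirectly seen the test set.</p>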
<hr>
</section>
</section>
<section id="q4-explain-the-difference-between-l1-and-l2-regularization." class="level2">
<h2 class="anchored" data-anchor-id="q4-explain-the-difference-between-l1-and-l2-regularization.">Q4: Explain the difference between L1 and L2 regularization.</h2>
<p><strong>Answer:</strong></p>
<p>Regularization adds a <strong>penalty term</strong> to the loss function to discourage overly complex models.</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BL1%20(Lasso):%7D%20%5Cquad%20%5Cmathcal%7BL%7D_%7Btotal%7D%20=%20%5Cmathcal%7BL%7D_%7Bdata%7D%20+%20%5Clambda%20%5Csum_%7Bi%7D%20%7Cw_i%7C"></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BL2%20(Ridge):%7D%20%5Cquad%20%5Cmathcal%7BL%7D_%7Btotal%7D%20=%20%5Cmathcal%7BL%7D_%7Bdata%7D%20+%20%5Clambda%20%5Csum_%7Bi%7D%20w_i%5E2"></p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph L1["L1 Regularization (Lasso)"]
        direction TB
        L1A["Penalty: λΣ|w|"]
        L1B["Diamond-shaped constraint"]
        L1C["Pushes weights to EXACTLY zero"]
        L1D["Result: Sparse model&lt;br/&gt;(automatic feature selection)"]
        L1A --&gt; L1B --&gt; L1C --&gt; L1D
    end

    subgraph L2["L2 Regularization (Ridge)"]
        direction TB
        L2A["Penalty: λΣw²"]
        L2B["Circular constraint"]
        L2C["Shrinks ALL weights toward zero"]
        L2D["Result: Small but non-zero weights&lt;br/&gt;(handles multicollinearity)"]
        L2A --&gt; L2B --&gt; L2C --&gt; L2D
    end

    style L1 fill:#56cc9d,stroke:#333,color:#fff
    style L2 fill:#6cc3d5,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="example-predicting-house-prices-with-50-features" class="level3">
<h3 class="anchored" data-anchor-id="example-predicting-house-prices-with-50-features">Example: Predicting House Prices with 50 Features</h3>
<p>Suppose you have 50 features including relevant ones (sqft, bedrooms) and irrelevant ones (owner’s birthday, day of listing):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.linear_model <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Lasso, Ridge</span>
<span id="cb2-2"></span>
<span id="cb2-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># L1 — Lasso: drives irrelevant feature weights to zero</span></span>
<span id="cb2-4">lasso <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Lasso(alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>)</span>
<span id="cb2-5">lasso.fit(X_train, y_train)</span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Result: 12 features have non-zero weights, 38 are exactly zero</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># → Automatic feature selection!</span></span>
<span id="cb2-8"></span>
<span id="cb2-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># L2 — Ridge: shrinks all weights but keeps them non-zero</span></span>
<span id="cb2-10">ridge <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Ridge(alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>)</span>
<span id="cb2-11">ridge.fit(X_train, y_train)</span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Result: all 50 features have small non-zero weights</span></span>
<span id="cb2-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># → Better when all features contribute a little</span></span></code></pre></div></div>
</section>
<section id="geometric-intuition" class="level3">
<h3 class="anchored" data-anchor-id="geometric-intuition">Geometric Intuition</h3>
<p>The L1 penalty creates a diamond-shaped constraint region. The loss function’s contours are more likely to intersect the diamond at a <strong>corner</strong> (where some weights = 0). The L2 penalty creates a circular constraint, so intersections happen away from axes (weights shrink but don’t reach zero).</p>
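<p>The same effect shows up algebraically in one dimension. As a sketch, assume the data loss for a single weight is ½(w − z)², where z is the unregularized solution (an assumption made to get closed forms). Adding λ|w| gives the soft-thresholding rule, while adding λw² gives a pure rescaling. The function names are illustrative.</p>

```python
def lasso_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if -lam > z:
        return z + lam
    return 0.0  # small weights are set to exactly zero

def ridge_1d(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2: shrinks but never zeroes."""
    return z / (1.0 + 2.0 * lam)

for z in [2.0, 0.5, 0.05, -0.05]:
    print(f"z={z:+.2f}  lasso={lasso_1d(z, 0.1):+.4f}  ridge={ridge_1d(z, 0.1):+.4f}")
```

<p>Any weight whose unregularized value lies within λ of zero is removed entirely by the L1 rule, whereas the L2 rule only scales it down by a constant factor.</p>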
</section>
<section id="elastic-net-combining-l1-and-l2" class="level3">
<h3 class="anchored" data-anchor-id="elastic-net-combining-l1-and-l2">Elastic Net: Combining L1 and L2</h3>
<p><strong>Elastic Net</strong> blends both penalties, giving you feature selection (from L1) and stability with correlated features (from L2):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BElastic%20Net:%7D%20%5Cquad%20%5Cmathcal%7BL%7D_%7Btotal%7D%20=%20%5Cmathcal%7BL%7D_%7Bdata%7D%20+%20%5Clambda%20%5Cleft(%20%5Calpha%20%5Csum_%7Bi%7D%20%7Cw_i%7C%20+%20(1%20-%20%5Calpha)%20%5Csum_%7Bi%7D%20w_i%5E2%20%5Cright)"></p>
<p>Here <img src="https://latex.codecogs.com/png.latex?%5Clambda"> controls the overall regularization strength, and the mixing ratio <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cin%20%5B0,%201%5D"> controls the balance between L1 and L2: <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%201"> is pure Lasso, <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200"> is pure Ridge. In practice, Elastic Net is preferred when you have groups of correlated features — it selects or drops them together rather than picking one arbitrarily as Lasso does.</p>
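<p>The penalty term itself is simple to compute. The sketch below follows the formula above directly. The function name and the example weights are assumptions. Note that scikit-learn's <code>ElasticNet</code> names things differently: there <code>alpha</code> is the overall strength and <code>l1_ratio</code> is the mixing ratio, and its objective uses a different scaling.</p>

```python
def elastic_net_penalty(weights, lam, alpha):
    """lam * (alpha * sum|w_i| + (1 - alpha) * sum w_i^2)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

weights = [1.0, -2.0, 0.5]  # hypothetical model weights
print("pure Lasso (alpha=1):", elastic_net_penalty(weights, lam=0.1, alpha=1.0))
print("pure Ridge (alpha=0):", elastic_net_penalty(weights, lam=0.1, alpha=0.0))
print("even blend (alpha=0.5):", elastic_net_penalty(weights, lam=0.1, alpha=0.5))
```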
</section>
<section id="comparison-table" class="level3">
<h3 class="anchored" data-anchor-id="comparison-table">Comparison Table</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 18%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 30%">
</colgroup>
<thead>
<tr class="header">
<th>Aspect</th>
<th>L1 (Lasso)</th>
<th>L2 (Ridge)</th>
<th>Elastic Net</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Penalty shape</td>
<td>Diamond</td>
<td>Circle</td>
<td>Blend</td>
</tr>
<tr class="even">
<td>Feature selection</td>
<td>Yes (sparse)</td>
<td>No</td>
<td>Partial</td>
</tr>
<tr class="odd">
<td>Correlated features</td>
<td>Picks one arbitrarily</td>
<td>Distributes weight evenly</td>
<td>Handles well</td>
</tr>
<tr class="even">
<td>Computation</td>
<td>No closed form; iterative solver (e.g., coordinate descent)</td>
<td>Closed-form solution</td>
<td>Iterative</td>
</tr>
<tr class="odd">
<td>Best for</td>
<td>High-dimensional, many irrelevant features</td>
<td>Multicollinearity, all features relevant</td>
<td>Best of both worlds</td>
</tr>
</tbody>
</table>
</section>
<section id="applications-1" class="level3">
<h3 class="anchored" data-anchor-id="applications-1">Applications</h3>
<ul>
<li><strong>L1:</strong> Genomics (select 50 relevant genes from 20,000), text classification (select key words)</li>
<li><strong>L2:</strong> Financial models (correlated market features), image processing (all pixels matter)</li>
<li><strong>Elastic Net:</strong> When you want both feature selection and stability with correlated features</li>
</ul>
<hr>
</section>
</section>
<section id="q5-what-is-gradient-descent-and-what-are-its-variants" class="level2">
<h2 class="anchored" data-anchor-id="q5-what-is-gradient-descent-and-what-are-its-variants">Q5: What is gradient descent and what are its variants?</h2>
<p><strong>Answer:</strong></p>
<p>Gradient descent is the fundamental optimization algorithm that iteratively adjusts model parameters to minimize a loss function by moving in the direction of steepest descent.</p>
<p><img src="https://latex.codecogs.com/png.latex?w_%7Bt+1%7D%20=%20w_t%20-%20%5Ceta%20%5Ccdot%20%5Cnabla_w%20%5Cmathcal%7BL%7D(w_t)"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Ceta"> is the learning rate and <img src="https://latex.codecogs.com/png.latex?%5Cnabla_w%20%5Cmathcal%7BL%7D"> is the gradient of the loss with respect to weights.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    A["Initialize weights randomly"] --&gt; B["Compute loss L(w)"]
    B --&gt; C["Compute gradient ∂L/∂w"]
    C --&gt; D["Update: w = w - η · ∂L/∂w"]
    D --&gt; E{"Converged?"}
    E --&gt;|No| B
    E --&gt;|Yes| F["Final weights"]

    style A fill:#6cc3d5,stroke:#333,color:#fff
    style D fill:#56cc9d,stroke:#333,color:#fff
    style F fill:#ffce67,stroke:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="concrete-example-quadratic-loss-function" class="level3">
<h3 class="anchored" data-anchor-id="concrete-example-quadratic-loss-function">Concrete Example: Quadratic Loss Function</h3>
<p>Let’s walk through gradient descent step-by-step on a simple quadratic loss:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D(w)%20=%20(w%20-%203)%5E2"></p>
<p>The minimum is at <img src="https://latex.codecogs.com/png.latex?w%20=%203"> where <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D%20=%200">. The gradient is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20%5Cmathcal%7BL%7D%7D%7B%5Cpartial%20w%7D%20=%202(w%20-%203)"></p>
<p><strong>Setup:</strong> Initial weight <img src="https://latex.codecogs.com/png.latex?w_0%20=%200">, learning rate <img src="https://latex.codecogs.com/png.latex?%5Ceta%20=%200.1"></p>
<table class="caption-top table">
<colgroup>
<col style="width: 5%">
<col style="width: 7%">
<col style="width: 19%">
<col style="width: 22%">
<col style="width: 45%">
</colgroup>
<thead>
<tr class="header">
<th>Step</th>
<th><img src="https://latex.codecogs.com/png.latex?w_t"></th>
<th><img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BL%7D(w_t)"></th>
<th>Gradient <img src="https://latex.codecogs.com/png.latex?2(w_t%20-%203)"></th>
<th>Update <img src="https://latex.codecogs.com/png.latex?w_%7Bt+1%7D%20=%20w_t%20-%20%5Ceta%20%5Ccdot%20%5Ctext%7Bgrad%7D"></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>0</td>
<td>0.000</td>
<td>9.000</td>
<td>-6.000</td>
<td><img src="https://latex.codecogs.com/png.latex?0%20-%200.1%20%5Ctimes%20(-6)%20=%200.600"></td>
</tr>
<tr class="even">
<td>1</td>
<td>0.600</td>
<td>5.760</td>
<td>-4.800</td>
<td><img src="https://latex.codecogs.com/png.latex?0.6%20-%200.1%20%5Ctimes%20(-4.8)%20=%201.080"></td>
</tr>
<tr class="odd">
<td>2</td>
<td>1.080</td>
<td>3.686</td>
<td>-3.840</td>
<td><img src="https://latex.codecogs.com/png.latex?1.08%20-%200.1%20%5Ctimes%20(-3.84)%20=%201.464"></td>
</tr>
<tr class="even">
<td>3</td>
<td>1.464</td>
<td>2.359</td>
<td>-3.072</td>
<td><img src="https://latex.codecogs.com/png.latex?1.464%20-%200.1%20%5Ctimes%20(-3.072)%20=%201.771"></td>
</tr>
<tr class="odd">
<td>4</td>
<td>1.771</td>
<td>1.510</td>
<td>-2.458</td>
<td><img src="https://latex.codecogs.com/png.latex?1.771%20-%200.1%20%5Ctimes%20(-2.458)%20=%202.017"></td>
</tr>
<tr class="even">
<td>5</td>
<td>2.017</td>
<td>0.966</td>
<td>-1.966</td>
<td><img src="https://latex.codecogs.com/png.latex?2.017%20-%200.1%20%5Ctimes%20(-1.966)%20=%202.214"></td>
</tr>
<tr class="odd">
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
<td>…</td>
</tr>
<tr class="even">
<td>20</td>
<td>2.965</td>
<td>0.0012</td>
<td>-0.069</td>
<td>→ converging to <img src="https://latex.codecogs.com/png.latex?w%20=%203"></td>
</tr>
</tbody>
</table>
<p><strong>Key observations:</strong></p>
<ul>
<li>The loss decreases monotonically: <img src="https://latex.codecogs.com/png.latex?9.0%20%E2%86%92%205.76%20%E2%86%92%203.69%20%E2%86%92%202.36%20%E2%86%92%20...%E2%86%92%200"></li>
<li>The gradient magnitude shrinks as we approach the minimum (steps get smaller)</li>
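<li>With <img src="https://latex.codecogs.com/png.latex?%5Ceta%20=%200.1">, convergence is steady. With <img src="https://latex.codecogs.com/png.latex?%5Ceta%20=%200.9">, each step would overshoot and the weight would oscillate around the minimum while still converging; with <img src="https://latex.codecogs.com/png.latex?%5Ceta%20=%201.0">, it would bounce between 0 and 6 indefinitely, and any larger value would diverge</li>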
</ul>
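<p>The table above can be reproduced with a few lines of Python. This is a minimal sketch: the function name <code>gradient_descent</code> and its parameters are illustrative choices, not part of any library.</p>

```python
def gradient_descent(w0, lr, steps):
    """Run plain gradient descent on L(w) = (w - 3)^2 and return every iterate."""
    history = [w0]
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)   # dL/dw for the quadratic loss
        w = w - lr * grad    # update rule: w = w - lr * grad
        history.append(w)
    return history


# Same setup as the table: w0 = 0, learning rate 0.1.
steady = gradient_descent(w0=0.0, lr=0.1, steps=20)
for t in (0, 1, 2, 3, 4, 5, 20):
    w = steady[t]
    print(f"step {t:2d}: w = {w:.3f}, loss = {(w - 3) ** 2:.4f}")

# Same loss with larger learning rates: every step overshoots the minimum.
overshoot = gradient_descent(w0=0.0, lr=0.9, steps=20)
bounce = gradient_descent(w0=0.0, lr=1.0, steps=4)
```

<p>With <code>lr=0.9</code> the iterates land on alternating sides of 3 while closing in on it; with <code>lr=1.0</code> they alternate between 0 and 6 without making progress.</p>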
</section>
<section id="variants-compared" class="level3">
<h3 class="anchored" data-anchor-id="variants-compared">Variants Compared</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph batch["Batch GD"]
        direction TB
        B1["Uses ALL data points"]
        B2["1 update per epoch"]
        B3["Smooth but slow"]
    end

    subgraph sgd["Stochastic GD"]
        direction TB
        S1["Uses 1 random point"]
        S2["N updates per epoch"]
        S3["Noisy but fast"]
    end

    subgraph mini["Mini-Batch GD"]
        direction TB
        M1["Uses batch of 32-256"]
        M2["N/batch_size updates"]
        M3["Best of both worlds"]
    end

    batch --&gt; sgd --&gt; mini

    style batch fill:#ff7851,stroke:#333,color:#fff
    style sgd fill:#ffce67,stroke:#333
    style mini fill:#56cc9d,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
</section>
<section id="example-training-a-linear-regression" class="level3">
<h3 class="anchored" data-anchor-id="example-training-a-linear-regression">Example: Training a Linear Regression</h3>
<p>Dataset: 1 million house prices.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Variant</th>
<th>Computation per update</th>
<th>Updates per epoch</th>
<th>Convergence</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Batch GD</td>
<td>Processes all 1M samples</td>
<td>1</td>
<td>Smooth but very slow</td>
</tr>
<tr class="even">
<td>SGD</td>
<td>Processes 1 sample</td>
<td>1,000,000</td>
<td>Very noisy; needs a decaying learning rate to settle at the minimum</td>
</tr>
<tr class="odd">
<td>Mini-batch (256)</td>
<td>Processes 256 samples</td>
<td>~3,906</td>
<td>Good balance</td>
</tr>
</tbody>
</table>
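<p>The three variants differ only in how many samples feed each gradient estimate, so one loop can cover all of them. The sketch below is a pure-Python illustration on made-up, noise-free data; the function <code>minibatch_gd</code>, its defaults, and the toy dataset are assumptions for demonstration, not a production implementation.</p>

```python
import math
import random


def minibatch_gd(xs, ys, batch_size, lr=0.3, epochs=100, seed=0):
    """Fit y = w*x + b by gradient descent on mean squared error.

    batch_size = len(xs) gives batch GD, batch_size = 1 gives SGD.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1
    return w, b, updates


# Updates per epoch for the 1M-sample example in the table.
n_samples = 1_000_000
updates_per_epoch = {
    name: math.ceil(n_samples / size)
    for name, size in [("Batch GD", n_samples), ("SGD", 1), ("Mini-batch (256)", 256)]
}
print(updates_per_epoch)

# Toy data (made up for illustration): y = 2x + 1 with no noise.
xs = [i / 100 for i in range(100)]
ys = [2 * x + 1 for x in xs]
w, b, updates = minibatch_gd(xs, ys, batch_size=16)
```

<p>Counting the final partial batch, the mini-batch case comes to 3,907 updates per epoch, which matches the table's approximate figure.</p>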
</section>
<section id="learning-rate-impact" class="level3">
<h3 class="anchored" data-anchor-id="learning-rate-impact">Learning Rate Impact</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph too_small["η too small"]
        TS["Very slow convergence&lt;br/&gt;May get stuck in local minima&lt;br/&gt;Wastes compute"]
    end
    subgraph just_right["η just right"]
        JR["Steady convergence&lt;br/&gt;Finds good minimum&lt;br/&gt;Efficient training"]
    end
    subgraph too_large["η too large"]
        TL["Overshoots minimum&lt;br/&gt;Oscillates or diverges&lt;br/&gt;Loss increases!"]
    end

    style too_small fill:#6cc3d5,stroke:#333,color:#fff
    style just_right fill:#56cc9d,stroke:#333,color:#fff
    style too_large fill:#ff7851,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
</section>
<section id="modern-optimizers" class="level3">
<h3 class="anchored" data-anchor-id="modern-optimizers">Modern Optimizers</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 32%">
<col style="width: 29%">
<col style="width: 38%">
</colgroup>
<thead>
<tr class="header">
<th>Optimizer</th>
<th>Key Idea</th>
<th>When to Use</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>SGD + Momentum</td>
<td>Accumulates velocity to accelerate through flat regions</td>
<td>Simple models, well-tuned settings</td>
</tr>
<tr class="even">
<td>AdaGrad</td>
<td>Adapts learning rate per parameter (smaller for frequent features)</td>
<td>Sparse data (NLP, recommenders)</td>
</tr>
<tr class="odd">
<td>RMSProp</td>
<td>Like AdaGrad but uses moving average to avoid shrinking too fast</td>
<td>RNNs, non-stationary objectives</td>
</tr>
<tr class="even">
<td><strong>Adam</strong></td>
<td>Combines momentum + adaptive rates</td>
<td><strong>Default choice</strong> for most deep learning</td>
</tr>
</tbody>
</table>
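<p>To make the first two rows concrete, here is a hedged sketch of the momentum and AdaGrad update rules applied to the quadratic loss from the earlier example. The function names and hyperparameter defaults are illustrative assumptions; real training code would normally use a framework's optimizer classes instead.</p>

```python
import math


def sgd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with classical (heavy-ball) momentum."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad_fn(w)  # accumulate past gradients
        w = w - lr * velocity                    # step along the velocity
    return w


def adagrad(grad_fn, w0, lr=1.0, eps=1e-8, steps=200):
    """AdaGrad: scale each step by the root of the summed squared gradients."""
    w, grad_sq_sum = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        grad_sq_sum += g * g                     # only ever grows
        w = w - lr * g / (math.sqrt(grad_sq_sum) + eps)
    return w


def quadratic_grad(w):
    """Gradient of L(w) = (w - 3)^2."""
    return 2 * (w - 3)


w_momentum = sgd_momentum(quadratic_grad, w0=0.0)
w_adagrad = adagrad(quadratic_grad, w0=0.0)
print(f"momentum: {w_momentum:.4f}, AdaGrad: {w_adagrad:.4f}")
```

<p>Because <code>grad_sq_sum</code> never decreases, AdaGrad's effective learning rate can only shrink; RMSProp replaces that sum with a moving average, and Adam adds a momentum term on top.</p>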
</section>
<section id="application-1" class="level3">
<h3 class="anchored" data-anchor-id="application-1">Application</h3>
<ul>
<li><strong>Deep learning:</strong> Adam with learning rate scheduling (warm-up + cosine decay)</li>
<li><strong>Convex problems:</strong> Batch GD with line search converges to the global minimum on smooth convex objectives</li>
<li><strong>Large-scale production:</strong> Mini-batch SGD with distributed training across GPUs</li>
</ul>
<hr>
</section>
</section>
<section id="q6-what-is-cross-validation-and-why-is-it-important" class="level2">
<h2 class="anchored" data-anchor-id="q6-what-is-cross-validation-and-why-is-it-important">Q6: What is cross-validation and why is it important?</h2>
<p><strong>Answer:</strong></p>
<p>Cross-validation provides a <strong>robust estimate</strong> of model performance by training and evaluating on multiple different splits of the data.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    subgraph fold1["Fold 1"]
        direction LR
        F1_VAL["Val"] --- F1_T1["Train"] --- F1_T2["Train"] --- F1_T3["Train"] --- F1_T4["Train"]
    end
    subgraph fold2["Fold 2"]
        direction LR
        F2_T1["Train"] --- F2_VAL["Val"] --- F2_T2["Train"] --- F2_T3["Train"] --- F2_T4["Train"]
    end
    subgraph fold3["Fold 3"]
        direction LR
        F3_T1["Train"] --- F3_T2["Train"] --- F3_VAL["Val"] --- F3_T3["Train"] --- F3_T4["Train"]
    end
    subgraph fold4["Fold 4"]
        direction LR
        F4_T1["Train"] --- F4_T2["Train"] --- F4_T3["Train"] --- F4_VAL["Val"] --- F4_T4["Train"]
    end
    subgraph fold5["Fold 5"]
        direction LR
        F5_T1["Train"] --- F5_T2["Train"] --- F5_T3["Train"] --- F5_T4["Train"] --- F5_VAL["Val"]
    end

    fold1 --&gt; R1["Score: 0.85"]
    fold2 --&gt; R2["Score: 0.82"]
    fold3 --&gt; R3["Score: 0.87"]
    fold4 --&gt; R4["Score: 0.83"]
    fold5 --&gt; R5["Score: 0.86"]

    R1 --&gt; AVG["Average: 0.846 ± 0.019"]
    R2 --&gt; AVG
    R3 --&gt; AVG
    R4 --&gt; AVG
    R5 --&gt; AVG

    style F1_VAL fill:#ffce67,stroke:#333
    style F2_VAL fill:#ffce67,stroke:#333
    style F3_VAL fill:#ffce67,stroke:#333
    style F4_VAL fill:#ffce67,stroke:#333
    style F5_VAL fill:#ffce67,stroke:#333
    style AVG fill:#56cc9d,stroke:#333,color:#fff
    style fold1 fill:#6cc3d5,stroke:#333,color:#fff
    style fold2 fill:#6cc3d5,stroke:#333,color:#fff
    style fold3 fill:#6cc3d5,stroke:#333,color:#fff
    style fold4 fill:#6cc3d5,stroke:#333,color:#fff
    style fold5 fill:#6cc3d5,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="why-a-single-traintest-split-is-dangerous" class="level3">
<h3 class="anchored" data-anchor-id="why-a-single-traintest-split-is-dangerous">Why a Single Train/Test Split is Dangerous</h3>
<p><strong>Example:</strong> You have 1000 samples and split 80/20. By random chance, your test set might contain mostly “easy” examples → inflated accuracy. Or it might contain mostly “hard” examples → underestimated accuracy.</p>
<p>With 5-fold CV, you get five performance estimates: their average is more reliable than any single split, and their standard deviation tells you how sensitive the model is to the particular split.</p>
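<p>As a rough illustration of the mechanics, the sketch below builds the five train/validation index splits for 1000 samples and summarizes the per-fold scores shown in the diagram. The helper <code>k_fold_indices</code> is a hypothetical name, and it skips the shuffling that is usually applied first.</p>

```python
import statistics


def k_fold_indices(n_samples, k):
    """Yield (train, val) index lists; each sample is validated exactly once."""
    bounds = [i * n_samples // k for i in range(k + 1)]
    for fold in range(k):
        start, stop = bounds[fold], bounds[fold + 1]
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, val


folds = list(k_fold_indices(n_samples=1000, k=5))

# Per-fold scores taken from the diagram above.
scores = [0.85, 0.82, 0.87, 0.83, 0.86]
mean = statistics.mean(scores)
spread = statistics.pstdev(scores)  # population std, NumPy's default convention
print(f"{mean:.3f} +/- {spread:.3f}")
```

<p>This prints <code>0.846 +/- 0.019</code>, the same summary as the diagram.</p>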
</section>
<section id="variants-for-different-scenarios" class="level3">
<h3 class="anchored" data-anchor-id="variants-for-different-scenarios">Variants for Different Scenarios</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    Q["What kind of data?"] --&gt; A["Balanced classification"]
    Q --&gt; B["Imbalanced classification"]
    Q --&gt; C["User-level data"]
    Q --&gt; D["Time series"]

    A --&gt; A1["Standard K-Fold"]
    B --&gt; B1["Stratified K-Fold&lt;br/&gt;(preserves class ratio in each fold)"]
    C --&gt; C1["Group K-Fold&lt;br/&gt;(all data from one user stays together)"]
    D --&gt; D1["Time Series Split&lt;br/&gt;(train on past, validate on future)"]

    style A1 fill:#6cc3d5,stroke:#333,color:#fff
    style B1 fill:#56cc9d,stroke:#333,color:#fff
    style C1 fill:#ffce67,stroke:#333
    style D1 fill:#ff7851,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
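<p>Two of these variants are easy to get wrong, so here is a simplified sketch of what they guarantee. The functions <code>time_series_splits</code> and <code>group_splits</code> and the sample <code>user_ids</code> are hypothetical; scikit-learn's <code>TimeSeriesSplit</code> and <code>GroupKFold</code> provide the same guarantees with different splitting details.</p>

```python
def time_series_splits(n_samples, n_splits):
    """Expanding-window splits: train on everything before the cut, validate on the next block."""
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        val = list(range(i * block, (i + 1) * block))
        yield train, val


def group_splits(groups, k):
    """Assign whole groups to folds so no group appears in both train and validation."""
    unique = sorted(set(groups))
    for fold in range(k):
        held_out = set(unique[fold::k])
        val = [i for i, g in enumerate(groups) if g in held_out]
        train = [i for i, g in enumerate(groups) if g not in held_out]
        yield train, val


ts_folds = list(time_series_splits(n_samples=100, n_splits=4))
for train, val in ts_folds:
    print(f"train 0..{train[-1]}, validate {val[0]}..{val[-1]}")

# One row per event; several rows can belong to the same user (made-up IDs).
user_ids = ["a", "a", "b", "c", "c", "c", "d", "e"]
group_folds = list(group_splits(user_ids, k=2))
```

<p>The time-series folds never validate on data older than the training set, and the group folds never place one user's rows on both sides of a split.</p>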
</section>
<section id="concrete-example-model-selection" class="level3">
<h3 class="anchored" data-anchor-id="concrete-example-model-selection">Concrete Example: Model Selection</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> cross_val_score</span>
<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.ensemble <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RandomForestClassifier, GradientBoostingClassifier</span>
<span id="cb3-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.linear_model <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> LogisticRegression</span>
<span id="cb3-4"></span>
<span id="cb3-5">models <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb3-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Logistic Regression"</span>: LogisticRegression(),</span>
<span id="cb3-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Random Forest"</span>: RandomForestClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>),</span>
<span id="cb3-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Gradient Boosting"</span>: GradientBoostingClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb3-9">}</span>
<span id="cb3-10"></span>
<span id="cb3-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> name, model <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> models.items():</span>
<span id="cb3-12">    scores <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cross_val_score(model, X, y, cv<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, scoring<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'f1'</span>)</span>
<span id="cb3-13">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>name<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>scores<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> ± </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>scores<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>std()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb3-14"></span>
<span id="cb3-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example output (illustrative; actual values depend on X and y):</span></span>
<span id="cb3-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Logistic Regression: 0.782 ± 0.015  ← stable but lower</span></span>
<span id="cb3-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Random Forest:       0.841 ± 0.022  ← good balance</span></span>
<span id="cb3-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Gradient Boosting:   0.856 ± 0.031  ← highest but more variable</span></span></code></pre></div></div>
</section>
<section id="application-2" class="level3">
<h3 class="anchored" data-anchor-id="application-2">Application</h3>
<ul>
<li><strong>Hyperparameter tuning:</strong> Use CV inside GridSearchCV/RandomizedSearchCV to select best hyperparameters without touching the test set.</li>
<li><strong>Model comparison:</strong> The model with highest mean CV score AND acceptable variance wins.</li>
<li><strong>Small datasets:</strong> Use Leave-One-Out CV (k = N) when data is very limited.</li>
</ul>
<hr>
</section>
</section>
<section id="q7-how-does-logistic-regression-work" class="level2">
<h2 class="anchored" data-anchor-id="q7-how-does-logistic-regression-work">Q7: How does logistic regression work?</h2>
<p><strong>Answer:</strong></p>
<p>Logistic regression models the <strong>probability</strong> of a binary outcome by applying the sigmoid function to a linear combination of features:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(y=1%7Cx)%20=%20%5Csigma(w%5ET%20x%20+%20b)%20=%20%5Cfrac%7B1%7D%7B1%20+%20e%5E%7B-(w%5ET%20x%20+%20b)%7D%7D"></p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    A["Input features&lt;br/&gt;x₁, x₂, ..., xₙ"] --&gt; B["Linear combination&lt;br/&gt;z = w₁x₁ + w₂x₂ + ... + b"]
    B --&gt; C["Sigmoid function&lt;br/&gt;σ(z) = 1/(1+e⁻ᶻ)"]
    C --&gt; D["Probability&lt;br/&gt;P(y=1) ∈ [0, 1]"]
    D --&gt; E{"P &gt; threshold?"}
    E --&gt;|Yes| F["Predict: Positive"]
    E --&gt;|No| G["Predict: Negative"]

    style C fill:#56cc9d,stroke:#333,color:#fff
    style D fill:#6cc3d5,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="step-by-step-example-loan-default-prediction" class="level3">
<h3 class="anchored" data-anchor-id="step-by-step-example-loan-default-prediction">Step-by-Step Example: Loan Default Prediction</h3>
<p><strong>Features:</strong> income ($50K), debt_ratio (0.4), credit_score (680)</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 1: Linear combination</span></span>
<span id="cb4-2">z <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03</span>×<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.5</span>)×<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>×<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">680</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">8.0</span>)</span>
<span id="cb4-3">z <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">6.8</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">8.0</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span></span>
<span id="cb4-4"></span>
<span id="cb4-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 2: Sigmoid</span></span>
<span id="cb4-6">P(default) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> e<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.01</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.33</span></span>
<span id="cb4-7"></span>
<span id="cb4-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 3: Decision (threshold = 0.5)</span></span>
<span id="cb4-9"><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.33</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> → Predict: No Default</span></code></pre></div></div>
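<p>The same three steps as runnable Python, using only the standard library. The weights and bias are the illustrative values from the walkthrough above; the dictionary keys are made-up feature names, with income expressed in thousands of dollars.</p>

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative weights and bias from the example above (not fitted to real data).
weights = {"income_k": 0.03, "debt_ratio": -2.5, "credit_score": 0.01}
bias = -8.0
applicant = {"income_k": 50, "debt_ratio": 0.4, "credit_score": 680}

# Step 1: linear combination
z = sum(weights[name] * applicant[name] for name in weights) + bias

# Step 2: sigmoid
p_default = sigmoid(z)

# Step 3: decision at threshold 0.5
threshold = 0.5
prediction = "Default" if p_default >= threshold else "No Default"
print(f"z = {z:.2f}, P(default) = {p_default:.2f}, prediction: {prediction}")
```

<p>Lowering <code>threshold</code> trades more false alarms for fewer missed defaults, which is why the threshold is usually tuned separately from the model.</p>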
</section>
<section id="interpreting-coefficients-as-odds-ratios" class="level3">
<h3 class="anchored" data-anchor-id="interpreting-coefficients-as-odds-ratios">Interpreting Coefficients as Odds Ratios</h3>
<p>Each coefficient represents the change in <strong>log-odds</strong> per unit increase in the feature:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Clog%5Cfrac%7BP%7D%7B1-P%7D%20=%20w%5ET%20x%20+%20b"></p>
<ul>
<li>If <img src="https://latex.codecogs.com/png.latex?w_%7B%5Ctext%7Bincome%7D%7D%20=%200.03">: each $1K increase in income multiplies the odds by <img src="https://latex.codecogs.com/png.latex?e%5E%7B0.03%7D%20%5Capprox%201.03"> (a 3% increase in the odds), holding the other features fixed</li>
<li>If <img src="https://latex.codecogs.com/png.latex?w_%7B%5Ctext%7Bdebt%5C_ratio%7D%7D%20=%20-2.5">: each 0.1 increase in debt ratio multiplies the odds by <img src="https://latex.codecogs.com/png.latex?e%5E%7B-0.25%7D%20%5Capprox%200.78"> (a 22% decrease in the odds)</li>
</ul>
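<p>The conversion from coefficient to odds multiplier is a single exponential. A small sketch, where <code>odds_multiplier</code> is a hypothetical helper name:</p>

```python
import math


def odds_multiplier(coefficient, delta=1.0):
    """Factor by which the odds change when the feature increases by `delta`."""
    return math.exp(coefficient * delta)


income_effect = odds_multiplier(0.03)            # +$1K income
debt_effect = odds_multiplier(-2.5, delta=0.1)   # +0.1 debt ratio
print(f"income: x{income_effect:.2f}, debt ratio: x{debt_effect:.2f}")
```

<p>Note that these factors act on the odds, not on the probability itself; the change in probability depends on where the applicant currently sits on the sigmoid curve.</p>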
</section>
<section id="when-to-use-logistic-regression" class="level3">
<h3 class="anchored" data-anchor-id="when-to-use-logistic-regression">When to Use Logistic Regression</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    LR["Logistic Regression"] --&gt; GOOD["✅ Good for"]
    LR --&gt; BAD["❌ Not ideal for"]

    GOOD --&gt; G1["Interpretable models (finance, healthcare)"]
    GOOD --&gt; G2["Baseline model (fast, reliable)"]
    GOOD --&gt; G3["Approximately linear decision boundaries"]
    GOOD --&gt; G4["Calibrated probability outputs"]

    BAD --&gt; B1["Complex non-linear boundaries"]
    BAD --&gt; B2["Image/text data (use deep learning)"]
    BAD --&gt; B3["Heavy feature interactions (use trees)"]

    style GOOD fill:#56cc9d,stroke:#333,color:#fff
    style BAD fill:#ff7851,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
</section>
<section id="application-3" class="level3">
<h3 class="anchored" data-anchor-id="application-3">Application</h3>
<ul>
<li><strong>Credit scoring:</strong> Banks use logistic regression because regulators require interpretable models.</li>
<li><strong>Medical diagnosis:</strong> Probability output directly gives “risk score” for patients.</li>
<li><strong>A/B testing:</strong> Quick baseline to measure treatment effect.</li>
<li><strong>Production ML:</strong> Often the first model deployed because it’s fast, stable, and explainable.</li>
</ul>
<hr>
</section>
</section>
<section id="q8-what-is-a-decision-tree-and-how-does-it-split" class="level2">
<h2 class="anchored" data-anchor-id="q8-what-is-a-decision-tree-and-how-does-it-split">Q8: What is a decision tree and how does it split?</h2>
<p><strong>Answer:</strong></p>
<p>A decision tree recursively partitions the feature space by selecting the best feature and threshold at each node to maximize class separation (or minimize variance for regression).</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    A["All data&lt;br/&gt;(100 samples)"] --&gt; B{"Income &gt; $50K?"}
    B --&gt;|Yes: 60 samples| C{"Credit Score &gt; 700?"}
    B --&gt;|No: 40 samples| D{"Debt Ratio &gt; 0.5?"}
    C --&gt;|Yes: 45 samples| E["✅ Approve Loan&lt;br/&gt;(90% approve rate)"]
    C --&gt;|No: 15 samples| F["⚠️ Review&lt;br/&gt;(60% approve rate)"]
    D --&gt;|Yes: 25 samples| G["❌ Deny Loan&lt;br/&gt;(85% deny rate)"]
    D --&gt;|No: 15 samples| H["⚠️ Review&lt;br/&gt;(55% approve rate)"]

    style E fill:#56cc9d,stroke:#333,color:#fff
    style G fill:#ff7851,stroke:#333,color:#fff
    style F fill:#ffce67,stroke:#333
    style H fill:#ffce67,stroke:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="how-splitting-works-gini-impurity" class="level3">
<h3 class="anchored" data-anchor-id="how-splitting-works-gini-impurity">How Splitting Works: Gini Impurity</h3>
<p>At each node, the tree evaluates every feature and every candidate threshold to find the split that <strong>minimizes the weighted impurity</strong> of the resulting child nodes.</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BGini%7D(node)%20=%201%20-%20%5Csum_%7Bi=1%7D%5E%7BC%7D%20p_i%5E2"></p>
<p><strong>Example:</strong> A node has 100 samples: 70 Class A, 30 Class B.</p>
<pre class="text"><code>Gini = 1 - (0.7² + 0.3²) = 1 - (0.49 + 0.09) = 0.42

After splitting on "Age &gt; 30":
  Left child:  50 samples (45 A, 5 B)  → Gini = 1 - (0.9² + 0.1²) = 0.18
  Right child: 50 samples (25 A, 25 B) → Gini = 1 - (0.5² + 0.5²) = 0.50

Weighted Gini = (50/100)×0.18 + (50/100)×0.50 = 0.34
Improvement = 0.42 - 0.34 = 0.08 ← the tree selects the split that maximizes this</code></pre>
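<p>The arithmetic above can be checked with a short script. The helpers <code>gini</code> and <code>split_gain</code> are illustrative names; a real tree implementation would run this calculation for every candidate feature and threshold.</p>

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gain(parent, children):
    """Impurity decrease: parent Gini minus the size-weighted Gini of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent) - weighted


parent = [70, 30]                 # 70 Class A, 30 Class B
children = [[45, 5], [25, 25]]    # left and right child after "Age > 30"
gain = split_gain(parent, children)
print(f"parent Gini = {gini(parent):.2f}, improvement = {gain:.2f}")
```

<p>A pure node scores 0 and a 50/50 binary node scores 0.5, so the right child here is as impure as a two-class node can be.</p>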
</section>
<section id="controlling-overfitting" class="level3">
<h3 class="anchored" data-anchor-id="controlling-overfitting">Controlling Overfitting</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 38%">
<col style="width: 20%">
<col style="width: 41%">
</colgroup>
<thead>
<tr class="header">
<th>Hyperparameter</th>
<th>Effect</th>
<th>Typical Values</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>max_depth</code></td>
<td>Limits tree depth</td>
<td>3-10</td>
</tr>
<tr class="even">
<td><code>min_samples_split</code></td>
<td>Minimum samples to allow a split</td>
<td>5-50</td>
</tr>
<tr class="odd">
<td><code>min_samples_leaf</code></td>
<td>Minimum samples in a leaf node</td>
<td>3-20</td>
</tr>
<tr class="even">
<td><code>max_features</code></td>
<td>Features considered per split</td>
<td>sqrt(n), log2(n)</td>
</tr>
<tr class="odd">
<td>Post-pruning</td>
<td>Remove branches that don’t improve validation performance</td>
<td>Cost-complexity pruning (<code>ccp_alpha</code>)</td>
</tr>
</tbody>
</table>
</section>
<section id="advantages-and-disadvantages" class="level3">
<h3 class="anchored" data-anchor-id="advantages-and-disadvantages">Advantages and Disadvantages</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 42%">
<col style="width: 57%">
</colgroup>
<thead>
<tr class="header">
<th>Advantages</th>
<th>Disadvantages</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Highly interpretable (show to stakeholders)</td>
<td>Prone to overfitting</td>
</tr>
<tr class="even">
<td>No feature scaling needed</td>
<td>Unstable (small data changes → different tree)</td>
</tr>
<tr class="odd">
<td>Handles non-linear relationships</td>
<td>Greedy algorithm (not globally optimal)</td>
</tr>
<tr class="even">
<td>Handles mixed data types</td>
<td>Biased toward features with many levels</td>
</tr>
</tbody>
</table>
</section>
<section id="application-4" class="level3">
<h3 class="anchored" data-anchor-id="application-4">Application</h3>
<ul>
<li><strong>Healthcare:</strong> Clinical decision rules (“If blood pressure &gt; X AND cholesterol &gt; Y → high risk”)</li>
<li><strong>Manufacturing:</strong> Root cause analysis (which conditions lead to defects)</li>
<li><strong>Customer service:</strong> Decision flows (routing tickets based on features)</li>
<li><strong>As building blocks:</strong> Foundation for Random Forest and Gradient Boosting</li>
</ul>
<hr>
</section>
</section>
<section id="q9-how-does-random-forest-improve-upon-a-single-decision-tree" class="level2">
<h2 class="anchored" data-anchor-id="q9-how-does-random-forest-improve-upon-a-single-decision-tree">Q9: How does Random Forest improve upon a single decision tree?</h2>
<p><strong>Answer:</strong></p>
<p>Random Forest is a <strong>bagging</strong> ensemble that builds many decorrelated decision trees and aggregates their predictions to reduce variance while maintaining low bias.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    DATA["Training Data&lt;br/&gt;(N samples, M features)"] --&gt; BS1["Bootstrap Sample 1&lt;br/&gt;(N samples with replacement)"]
    DATA --&gt; BS2["Bootstrap Sample 2&lt;br/&gt;(N samples with replacement)"]
    DATA --&gt; BS3["Bootstrap Sample 3&lt;br/&gt;(N samples with replacement)"]
    DATA --&gt; BSN["... Bootstrap Sample K"]

    BS1 --&gt; T1["Tree 1&lt;br/&gt;(random √M features per split)"]
    BS2 --&gt; T2["Tree 2&lt;br/&gt;(random √M features per split)"]
    BS3 --&gt; T3["Tree 3&lt;br/&gt;(random √M features per split)"]
    BSN --&gt; TN["Tree K&lt;br/&gt;(random √M features per split)"]

    T1 --&gt; AGG["Aggregate Predictions"]
    T2 --&gt; AGG
    T3 --&gt; AGG
    TN --&gt; AGG

    AGG --&gt; CLS["Classification: Majority Vote"]
    AGG --&gt; REG["Regression: Average"]

    style DATA fill:#6cc3d5,stroke:#333,color:#fff
    style AGG fill:#56cc9d,stroke:#333,color:#fff
    style CLS fill:#ffce67,stroke:#333
    style REG fill:#ffce67,stroke:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="why-it-works-decorrelation-reduces-variance" class="level3">
<h3 class="anchored" data-anchor-id="why-it-works-decorrelation-reduces-variance">Why It Works: Decorrelation Reduces Variance</h3>
<p><strong>Key insight:</strong> If you average <img src="https://latex.codecogs.com/png.latex?n"> independent predictions each with variance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2">, the ensemble variance is <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2/n">. But trees trained on the same data are correlated, so plain averaging reduces variance by less than that. Random Forest decorrelates them by:</p>
<ol type="1">
<li><strong>Bootstrap sampling:</strong> Each tree sees a different subset of data (~63% unique samples per tree)</li>
<li><strong>Feature randomization:</strong> Each split considers only a random subset of features, commonly <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7BM%7D"> for classification or <img src="https://latex.codecogs.com/png.latex?M/3"> for regression</li>
</ol>
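<p>Both numbers in this argument can be checked directly. The sketch below simulates one bootstrap sample to recover the ~63% figure and uses the standard formula for the variance of an average of equally correlated predictions, ρσ² + (1 − ρ)σ²/n. The function names and the chosen values of ρ and n are assumptions for illustration.</p>

```python
import random


def unique_fraction(n_samples, seed=0):
    """Fraction of distinct rows in one bootstrap sample of size n_samples."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return len(drawn) / n_samples


def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictions with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


n = 10_000
expected = 1 - (1 - 1 / n) ** n       # tends to 1 - 1/e, about 0.632
simulated = unique_fraction(n)
print(f"unique rows: expected {expected:.3f}, simulated {simulated:.3f}")

independent = ensemble_variance(1.0, rho=0.0, n_trees=100)  # sigma^2 / n
correlated = ensemble_variance(1.0, rho=0.5, n_trees=100)   # floor near rho * sigma^2
print(f"variance with rho=0: {independent:.3f}, with rho=0.5: {correlated:.3f}")
```

<p>With correlation 0.5, adding more trees cannot push the variance below about half of a single tree's variance, which is why reducing correlation between trees matters as much as adding trees.</p>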
</section>
<section id="example-fraud-detection" class="level3">
<h3 class="anchored" data-anchor-id="example-fraud-detection">Example: Fraud Detection</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.ensemble <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RandomForestClassifier</span>
<span id="cb6-2"></span>
<span id="cb6-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Illustrative figures: a single decision tree at ~82% accuracy, varying run to run</span></span>
<span id="cb6-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># vs. a Random Forest at ~91% accuracy, stable across runs</span></span>
<span id="cb6-5"></span>
<span id="cb6-6">rf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(</span>
<span id="cb6-7">    n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>,     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 500 trees</span></span>
<span id="cb6-8">    max_depth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>,         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># limit individual tree complexity</span></span>
<span id="cb6-9">    max_features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sqrt'</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># √M features per split</span></span>
<span id="cb6-10">    min_samples_leaf<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># prevent tiny leaves</span></span>
<span id="cb6-11">    oob_score<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># free validation estimate!</span></span>
<span id="cb6-12">)</span>
<span id="cb6-13">rf.fit(X_train, y_train)</span>
<span id="cb6-14"></span>
<span id="cb6-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Out-of-Bag score (free cross-validation):</span></span>
<span id="cb6-16"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"OOB Score: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>rf<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>oob_score_<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.3f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 0.908</span></span>
<span id="cb6-17"></span>
<span id="cb6-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Feature importance:</span></span>
<span id="cb6-19">importances <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rf.feature_importances_</span>
<span id="cb6-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># transaction_amount: 0.25, time_since_last: 0.18, ...</span></span></code></pre></div></div>
</section>
<section id="out-of-bag-oob-estimation" class="level3">
<h3 class="anchored" data-anchor-id="out-of-bag-oob-estimation">Out-of-Bag (OOB) Estimation</h3>
<p>Each bootstrap sample leaves out ~37% of the training rows. Predicting every row using only the trees that never saw it during training gives an “out-of-bag” score: a <strong>free validation estimate</strong> without needing a separate validation set.</p>
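<p>The ~37% figure comes from sampling N rows with replacement: a given row is missed by all N draws with probability (1 − 1/N)<sup>N</sup>, which approaches 1/e ≈ 0.368 as N grows. A minimal sketch using only the standard library (function names are illustrative assumptions):</p>

```python
import math
import random


def expected_unique_fraction(n_rows):
    """Expected share of distinct rows in a bootstrap sample of size n_rows."""
    return 1.0 - (1.0 - 1.0 / n_rows) ** n_rows


def simulated_oob_fraction(n_rows, seed=0):
    """Draw one bootstrap sample and return the share of rows left out."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1.0 - len(in_bag) / n_rows


print(round(expected_unique_fraction(10_000), 3))  # 0.632
print(round(1.0 - math.exp(-1), 3))                # 0.632, the large-N limit
print(simulated_oob_fraction(10_000))  # close to 0.37; exact value depends on the seed
```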
</section>
<section id="comparison-single-tree-vs.-random-forest" class="level3">
<h3 class="anchored" data-anchor-id="comparison-single-tree-vs.-random-forest">Comparison: Single Tree vs.&nbsp;Random Forest</h3>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Aspect</th>
<th>Single Decision Tree</th>
<th>Random Forest</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Bias</td>
<td>Low</td>
<td>Low (similar or slightly higher)</td>
</tr>
<tr class="even">
<td>Variance</td>
<td>High</td>
<td>Low (reduced by averaging)</td>
</tr>
<tr class="odd">
<td>Interpretability</td>
<td>High (single path)</td>
<td>Lower (many trees)</td>
</tr>
<tr class="even">
<td>Overfitting risk</td>
<td>High</td>
<td>Low</td>
</tr>
<tr class="odd">
<td>Training speed</td>
<td>Fast</td>
<td>Slower (but parallelizable)</td>
</tr>
<tr class="even">
<td>Feature importance</td>
<td>Unreliable</td>
<td>More stable (averaged), though impurity-based scores favor high-cardinality features</td>
</tr>
</tbody>
</table>
</section>
<section id="application-5" class="level3">
<h3 class="anchored" data-anchor-id="application-5">Application</h3>
<ul>
<li><strong>Default production model:</strong> When you need something that works well with minimal tuning</li>
<li><strong>Feature selection:</strong> Use feature importances to identify key variables</li>
<li><strong>Anomaly detection:</strong> Isolation Forest (variant) for outlier detection</li>
<li><strong>Missing data:</strong> Handles missing values via surrogate splits in some implementations</li>
</ul>
<hr>
</section>
</section>
<section id="q10-what-is-the-difference-between-bagging-and-boosting" class="level2">
<h2 class="anchored" data-anchor-id="q10-what-is-the-difference-between-bagging-and-boosting">Q10: What is the difference between bagging and boosting?</h2>
<p><strong>Answer:</strong></p>
<p>Both are ensemble methods that combine multiple base learners, but they differ fundamentally in <strong>how</strong> they build and combine those models.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph bagging["BAGGING (Bootstrap Aggregating)"]
        direction TB
        BA["Original Data"] --&gt; B1["Bootstrap 1"]
        BA --&gt; B2["Bootstrap 2"]
        BA --&gt; B3["Bootstrap 3"]
        B1 --&gt; M1["Model 1"]
        B2 --&gt; M2["Model 2"]
        B3 --&gt; M3["Model 3"]
        M1 --&gt; VOTE["Average / Vote"]
        M2 --&gt; VOTE
        M3 --&gt; VOTE
    end

    subgraph boosting["BOOSTING (Sequential)"]
        direction TB
        BO["Original Data"] --&gt; BM1["Model 1"]
        BM1 --&gt; ERR1["Errors from Model 1"]
        ERR1 --&gt; BM2["Model 2&lt;br/&gt;(focuses on errors)"]
        BM2 --&gt; ERR2["Errors from Model 1+2"]
        ERR2 --&gt; BM3["Model 3&lt;br/&gt;(focuses on remaining errors)"]
        BM3 --&gt; WSUM["Weighted Sum"]
    end

    style bagging fill:#56cc9d,stroke:#333,color:#fff
    style boosting fill:#6cc3d5,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="detailed-comparison" class="level3">
<h3 class="anchored" data-anchor-id="detailed-comparison">Detailed Comparison</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 29%">
<col style="width: 33%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="header">
<th>Aspect</th>
<th>Bagging</th>
<th>Boosting</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Training</td>
<td>Parallel (independent)</td>
<td>Sequential (each depends on previous)</td>
</tr>
<tr class="even">
<td>Focus</td>
<td>Random subsets of data</td>
<td>Misclassified / high-error samples</td>
</tr>
<tr class="odd">
<td>Reduces</td>
<td>Variance</td>
<td>Bias</td>
</tr>
<tr class="even">
<td>Overfitting</td>
<td>Resistant</td>
<td>Can overfit if not regularized</td>
</tr>
<tr class="odd">
<td>Speed</td>
<td>Parallelizable → fast</td>
<td>Sequential → slower</td>
</tr>
<tr class="even">
<td>Typical base learner</td>
<td>Deep trees (high variance)</td>
<td>Shallow trees (high bias)</td>
</tr>
<tr class="odd">
<td>Key example</td>
<td>Random Forest</td>
<td>XGBoost, LightGBM, AdaBoost</td>
</tr>
</tbody>
</table>
</section>
<section id="example-predicting-customer-churn" class="level3">
<h3 class="anchored" data-anchor-id="example-predicting-customer-churn">Example: Predicting Customer Churn</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Bagging approach — Random Forest</span></span>
<span id="cb7-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.ensemble <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RandomForestClassifier</span>
<span id="cb7-3">rf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>, max_depth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)</span>
<span id="cb7-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Each tree is deep (low bias, high variance)</span></span>
<span id="cb7-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Averaging reduces variance → good overall</span></span>
<span id="cb7-6"></span>
<span id="cb7-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Boosting approach — XGBoost</span></span>
<span id="cb7-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> xgboost <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> xgb</span>
<span id="cb7-9">xgb_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> xgb.XGBClassifier(</span>
<span id="cb7-10">    n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>,</span>
<span id="cb7-11">    max_depth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shallow trees (high bias)</span></span>
<span id="cb7-12">    learning_rate<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>,   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shrinkage — don't trust each tree fully</span></span>
<span id="cb7-13">    subsample<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span></span>
<span id="cb7-14">)</span>
<span id="cb7-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Each tree corrects previous errors → reduces bias iteratively</span></span></code></pre></div></div>
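<p>To make "each tree corrects previous errors" concrete, here is a minimal from-scratch sketch of gradient boosting for squared error on one feature, using depth-1 trees (stumps). It is a simplification under stated assumptions, not how XGBoost is implemented: real libraries add regularization, second-order gradient information, and much faster split finding. All names and the toy data are hypothetical.</p>

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals.

    Assumes xs contains at least two distinct values.
    Returns (threshold, left_value, right_value).
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, t, left_value, right_value)
    return best[1], best[2], best[3]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        t, left_value, right_value = fit_stump(xs, residuals)
        predictions = [
            p + learning_rate * (left_value if x <= t else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = list(range(20))
ys = [float(x * x) for x in xs]  # toy target, chosen only for illustration

base, boosted = boost(xs, ys)
baseline_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, boosted)
print(final_mse < baseline_mse)  # True: training error falls as stumps are added
```

<p>Each stump alone is a high-bias model, yet the sum of many shrunken stumps fits the curve far better than the constant baseline. The <code>learning_rate</code> plays the same shrinkage role as in the XGBoost call above.</p>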
</section>
<section id="when-to-choose-which" class="level3">
<h3 class="anchored" data-anchor-id="when-to-choose-which">When to Choose Which</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    Q["Which ensemble?"] --&gt; Q1{"Noisy data or&lt;br/&gt;risk of overfitting?"}
    Q1 --&gt;|Yes| BAG["Bagging&lt;br/&gt;(Random Forest)"]
    Q1 --&gt;|No| Q2{"Need maximum accuracy&lt;br/&gt;and can tune carefully?"}
    Q2 --&gt;|Yes| BOOST["Boosting&lt;br/&gt;(XGBoost / LightGBM)"]
    Q2 --&gt;|No| BAG2["Bagging&lt;br/&gt;(safer default)"]

    style BAG fill:#56cc9d,stroke:#333,color:#fff
    style BAG2 fill:#56cc9d,stroke:#333,color:#fff
    style BOOST fill:#6cc3d5,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<p><strong>Choose Bagging when:</strong></p>
<ul>
<li>Data is noisy or has many outliers</li>
<li>You want a robust model with minimal tuning</li>
<li>You need parallelized training for speed</li>
<li>Overfitting is a primary concern</li>
</ul>
<p><strong>Choose Boosting when:</strong></p>
<ul>
<li>You need maximum predictive accuracy (Kaggle competitions, production ranking)</li>
<li>You have clean data and can invest in hyperparameter tuning</li>
<li>Simpler models underfit (high bias) and there are complex patterns left to capture</li>
<li>You have proper validation to detect overfitting</li>
</ul>
</section>
<section id="real-world-application" class="level3">
<h3 class="anchored" data-anchor-id="real-world-application">Real-World Application</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 37%">
<col style="width: 44%">
<col style="width: 18%">
</colgroup>
<thead>
<tr class="header">
<th>Scenario</th>
<th>Recommended</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>First model in production</td>
<td>Random Forest</td>
<td>Robust, minimal tuning</td>
</tr>
<tr class="even">
<td>Kaggle competition</td>
<td>XGBoost/LightGBM</td>
<td>Maximum accuracy</td>
</tr>
<tr class="odd">
<td>Noisy sensor data</td>
<td>Random Forest</td>
<td>Handles noise better</td>
</tr>
<tr class="even">
<td>Ranking / search</td>
<td>LightGBM (LambdaMART)</td>
<td>Industry standard for learning-to-rank</td>
</tr>
<tr class="odd">
<td>Large-scale (millions of rows)</td>
<td>LightGBM</td>
<td>Often faster than XGBoost on large data (histogram-based, leaf-wise growth)</td>
</tr>
</tbody>
</table>
<hr>
</section>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 27%">
<col style="width: 35%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="header">
<th>Question</th>
<th>Core Concept</th>
<th>Key Takeaway</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Q1</td>
<td>Learning paradigms</td>
<td>Match the paradigm to your data and feedback type</td>
</tr>
<tr class="even">
<td>Q2</td>
<td>Bias-variance</td>
<td>Diagnose underfitting vs.&nbsp;overfitting from error patterns</td>
</tr>
<tr class="odd">
<td>Q3</td>
<td>Overfitting</td>
<td>Prevention is cheaper than cure — regularize early</td>
</tr>
<tr class="even">
<td>Q4</td>
<td>Regularization</td>
<td>L1 for feature selection, L2 for stability</td>
</tr>
<tr class="odd">
<td>Q5</td>
<td>Gradient descent</td>
<td>Adam is the default; understand why</td>
</tr>
<tr class="even">
<td>Q6</td>
<td>Cross-validation</td>
<td>Never trust a single split; CV gives a mean score and its spread across folds</td>
</tr>
<tr class="odd">
<td>Q7</td>
<td>Logistic regression</td>
<td>The interpretable baseline every ML engineer should try first</td>
</tr>
<tr class="even">
<td>Q8</td>
<td>Decision trees</td>
<td>Intuitive but overfit; control with depth and pruning</td>
</tr>
<tr class="odd">
<td>Q9</td>
<td>Random Forest</td>
<td>Decorrelation is the magic: averaging less-correlated trees cuts variance</td>
</tr>
<tr class="even">
<td>Q10</td>
<td>Bagging vs.&nbsp;Boosting</td>
<td>Bagging reduces variance; boosting reduces bias</td>
</tr>
</tbody>
</table>
<blockquote class="blockquote">
<p><strong>Next:</strong> <a href="../../posts/ml-interview/ML-Interview-QA-2.html">ML Interview QA - 2</a> covers evaluation metrics (precision, recall, ROC-AUC), feature engineering, PCA, handling imbalanced data, missing values, and data leakage.</p>
</blockquote>


</section>

 ]]></description>
  <guid>https://vectoringai.com/posts/ml-interview/ML-Interview-QA-1.html</guid>
  <pubDate>Sun, 17 May 2026 00:00:00 GMT</pubDate>
  <media:content url="https://vectoringai.com/images/ml-interview/thumb_ML_interview_qa_300.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>ML Interview QA - 2</title>
  <dc:creator>Vectoring AI</dc:creator>
  <link>https://vectoringai.com/posts/ml-interview/ML-Interview-QA-2.html</link>
  <description><![CDATA[ 




<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>This is <strong>Part 2</strong> of our ML Interview QA series. It covers 10 questions on evaluation metrics, feature engineering, and data handling — the practical skills that separate candidates who build models from those who build <strong>reliable</strong> models.</p>
<blockquote class="blockquote">
<p>For foundational concepts (bias-variance, algorithms, ensembles), see <a href="../../posts/ml-interview/ML-Interview-QA-1.html">ML Interview QA - 1</a>.</p>
</blockquote>
<hr>
</section>
<section id="q1-explain-precision-recall-and-f1-score." class="level2">
<h2 class="anchored" data-anchor-id="q1-explain-precision-recall-and-f1-score.">Q1: Explain precision, recall, and F1-score.</h2>
<p><strong>Answer:</strong></p>
<p>These metrics go beyond accuracy to measure <strong>specific types of errors</strong> in classification.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph cm["Confusion Matrix"]
        direction TB
        TP["True Positive (TP)&lt;br/&gt;Correctly predicted positive"]
        FP["False Positive (FP)&lt;br/&gt;Incorrectly predicted positive&lt;br/&gt;(Type I error)"]
        FN["False Negative (FN)&lt;br/&gt;Missed positive&lt;br/&gt;(Type II error)"]
        TN["True Negative (TN)&lt;br/&gt;Correctly predicted negative"]
    end

    cm --&gt; PREC["Precision = TP/(TP+FP)&lt;br/&gt;Of those I flagged,&lt;br/&gt;how many are correct?"]
    cm --&gt; REC["Recall = TP/(TP+FN)&lt;br/&gt;Of all positives,&lt;br/&gt;how many did I catch?"]
    PREC --&gt; F1["F1 = 2·P·R/(P+R)&lt;br/&gt;Harmonic mean&lt;br/&gt;balances both"]
    REC --&gt; F1

    style TP fill:#56cc9d,stroke:#333,color:#fff
    style TN fill:#56cc9d,stroke:#333,color:#fff
    style FP fill:#ff7851,stroke:#333,color:#fff
    style FN fill:#ffce67,stroke:#333
    style F1 fill:#6cc3d5,stroke:#333,color:#fff
    style cm fill:#fff,color:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="formulas" class="level3">
<h3 class="anchored" data-anchor-id="formulas">Formulas</h3>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BPrecision%7D%20=%20%5Cfrac%7BTP%7D%7BTP%20+%20FP%7D%20%5Cqquad%20%5Ctext%7BRecall%7D%20=%20%5Cfrac%7BTP%7D%7BTP%20+%20FN%7D%20%5Cqquad%20F1%20=%20%5Cfrac%7B2%20%5Ccdot%20P%20%5Ccdot%20R%7D%7BP%20+%20R%7D"></p>
</section>
<section id="example-email-spam-filter" class="level3">
<h3 class="anchored" data-anchor-id="example-email-spam-filter">Example: Email Spam Filter</h3>
<p>Out of 1000 emails: 50 actual spam, 950 legitimate.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 29%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 11%">
<col style="width: 11%">
</colgroup>
<thead>
<tr class="header">
<th>Scenario</th>
<th>TP</th>
<th>FP</th>
<th>FN</th>
<th>Precision</th>
<th>Recall</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Aggressive filter</strong></td>
<td>48</td>
<td>30</td>
<td>2</td>
<td>48/78 = 0.62</td>
<td>48/50 = 0.96</td>
<td>0.75</td>
</tr>
<tr class="even">
<td><strong>Conservative filter</strong></td>
<td>40</td>
<td>2</td>
<td>10</td>
<td>40/42 = 0.95</td>
<td>40/50 = 0.80</td>
<td>0.87</td>
</tr>
</tbody>
</table>
<ul>
<li><strong>Aggressive:</strong> Catches almost all spam (high recall) but blocks 30 good emails (low precision)</li>
<li><strong>Conservative:</strong> Rarely blocks good emails (high precision) but misses 10 spam (lower recall)</li>
</ul>
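<p>The table values can be reproduced directly from the confusion-matrix counts. A minimal sketch (the helper name is an assumption; the counts are the ones from the table):</p>

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


for name, tp, fp, fn in [("aggressive", 48, 30, 2), ("conservative", 40, 2, 10)]:
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# aggressive: precision=0.62 recall=0.96 f1=0.75
# conservative: precision=0.95 recall=0.80 f1=0.87
```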
</section>
<section id="the-precision-recall-tradeoff" class="level3">
<h3 class="anchored" data-anchor-id="the-precision-recall-tradeoff">The Precision-Recall Tradeoff</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    subgraph low_threshold["Low Threshold (0.2)"]
        LT["Predict more as positive&lt;br/&gt;↑ Recall, ↓ Precision"]
    end
    subgraph mid_threshold["Medium Threshold (0.5)"]
        MT["Balanced trade-off"]
    end
    subgraph high_threshold["High Threshold (0.8)"]
        HT["Predict fewer as positive&lt;br/&gt;↓ Recall, ↑ Precision"]
    end

    low_threshold --&gt; mid_threshold --&gt; high_threshold

    style low_threshold fill:#6cc3d5,stroke:#333,color:#fff
    style mid_threshold fill:#56cc9d,stroke:#333,color:#fff
    style high_threshold fill:#ffce67,stroke:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
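<p>The trade-off in the diagram can be checked on a handful of scores. The labels and scores below are made up for illustration, and the helper name is an assumption. When nothing is predicted positive, precision is reported as 1.0 by convention.</p>

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention for no positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and model scores
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9]

for t in (0.2, 0.5, 0.8):
    p, r = precision_recall_at(y_true, scores, t)
    print(f"threshold={t} precision={p:.2f} recall={r:.2f}")
# threshold=0.2 precision=0.57 recall=1.00
# threshold=0.5 precision=0.75 recall=0.75
# threshold=0.8 precision=1.00 recall=0.25
```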
</section>
<section id="when-to-prioritize-which" class="level3">
<h3 class="anchored" data-anchor-id="when-to-prioritize-which">When to Prioritize Which</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 24%">
<col style="width: 48%">
<col style="width: 27%">
</colgroup>
<thead>
<tr class="header">
<th>Metric</th>
<th>Prioritize When</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Precision</strong></td>
<td>False positives are costly</td>
<td>Spam filter (don’t block important emails)</td>
</tr>
<tr class="even">
<td><strong>Recall</strong></td>
<td>False negatives are costly</td>
<td>Cancer screening (don’t miss tumors)</td>
</tr>
<tr class="odd">
<td><strong>F1</strong></td>
<td>Need single balanced metric</td>
<td>General classification with imbalanced classes</td>
</tr>
<tr class="even">
<td><strong>F-beta</strong></td>
<td>Custom tradeoff needed</td>
<td>F2 (recall 2x important), F0.5 (precision 2x important)</td>
</tr>
</tbody>
</table>
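<p>F-beta generalizes F1 as (1 + β²)·P·R / (β²·P + R). The sketch below applies it to the two spam filters from the example above; the function name is an assumption.</p>

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta above 1 favors recall, beta below 1 favors precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


filters = {
    "aggressive": (48 / 78, 48 / 50),    # (precision, recall) from the spam example
    "conservative": (40 / 42, 40 / 50),
}

for name, (p, r) in filters.items():
    print(name, round(f_beta(p, r, 2.0), 3), round(f_beta(p, r, 0.5), 3))
# aggressive 0.863 0.663
# conservative 0.826 0.917
```

<p>F2 ranks the aggressive filter higher because it rewards recall, while F0.5 (like F1) prefers the conservative one.</p>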
</section>
<section id="application" class="level3">
<h3 class="anchored" data-anchor-id="application">Application</h3>
<ul>
<li><strong>Fraud detection:</strong> Optimize recall (catch all fraud), accept some false positives that humans review</li>
<li><strong>Search engines:</strong> Optimize precision (show only relevant results)</li>
<li><strong>Medical AI:</strong> Regulatory bodies often require minimum recall thresholds</li>
<li><strong>Content moderation:</strong> Balance — too aggressive frustrates users, too lenient misses harmful content</li>
</ul>
<hr>
</section>
</section>
<section id="q2-what-is-the-roc-curve-and-auc" class="level2">
<h2 class="anchored" data-anchor-id="q2-what-is-the-roc-curve-and-auc">Q2: What is the ROC curve and AUC?</h2>
<p><strong>Answer:</strong></p>
<p>The <strong>ROC curve</strong> (Receiver Operating Characteristic) visualizes classifier performance across <strong>all possible thresholds</strong> by plotting True Positive Rate vs.&nbsp;False Positive Rate.</p>
<p><img src="https://latex.codecogs.com/png.latex?TPR%20=%20%5Cfrac%7BTP%7D%7BTP%20+%20FN%7D%20%5Cqquad%20FPR%20=%20%5Cfrac%7BFP%7D%7BFP%20+%20TN%7D"></p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph roc["ROC Curve Interpretation"]
        direction TB
        PERFECT["Perfect classifier&lt;br/&gt;AUC = 1.0&lt;br/&gt;(top-left corner)"]
        GOOD["Good classifier&lt;br/&gt;AUC = 0.85&lt;br/&gt;(curve above diagonal)"]
        RANDOM["Random guessing&lt;br/&gt;AUC = 0.5&lt;br/&gt;(diagonal line)"]
        WORST["Inverse classifier&lt;br/&gt;AUC = 0.0&lt;br/&gt;(below diagonal)"]
    end

    style PERFECT fill:#56cc9d,stroke:#333,color:#fff
    style GOOD fill:#6cc3d5,stroke:#333,color:#fff
    style RANDOM fill:#ffce67,stroke:#333
    style WORST fill:#ff7851,stroke:#333,color:#fff
    style roc fill:#fff,color:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="how-it-works-threshold-sweep" class="level3">
<h3 class="anchored" data-anchor-id="how-it-works-threshold-sweep">How It Works: Threshold Sweep</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    A["Model outputs&lt;br/&gt;probabilities&lt;br/&gt;for each sample"] --&gt; B["Sweep threshold&lt;br/&gt;from 0.0 to 1.0"]
    B --&gt; C["At each threshold:&lt;br/&gt;compute TPR and FPR"]
    C --&gt; D["Plot all (FPR, TPR)&lt;br/&gt;points → ROC curve"]
    D --&gt; E["Area Under Curve&lt;br/&gt;= AUC score"]

    style E fill:#56cc9d,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
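<p>The sweep can be written out by hand in a few lines. This is a sketch on made-up scores with hypothetical helper names. In practice <code>sklearn.metrics.roc_curve</code> does the same job more efficiently.</p>

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and model scores
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9]

print(trapezoid_auc(roc_points(y_true, scores)))  # 0.75
```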
</section>
<section id="example-comparing-two-models" class="level3">
<h3 class="anchored" data-anchor-id="example-comparing-two-models">Example: Comparing Two Models</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> roc_auc_score, roc_curve</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Model A: Logistic Regression</span></span>
<span id="cb1-4">y_prob_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model_A.predict_proba(X_test)[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb1-5">auc_A <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> roc_auc_score(y_test, y_prob_A)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 0.82</span></span>
<span id="cb1-6"></span>
<span id="cb1-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Model B: Random Forest</span></span>
<span id="cb1-8">y_prob_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model_B.predict_proba(X_test)[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb1-9">auc_B <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> roc_auc_score(y_test, y_prob_B)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 0.91</span></span>
<span id="cb1-10"></span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Model B has better discrimination power</span></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># It ranks positives higher than negatives more consistently</span></span></code></pre></div></div>
<p><strong>Interpretation of AUC = 0.91:</strong> If you randomly pick one positive sample and one negative sample, there’s a 91% probability that the model assigns a higher score to the positive sample.</p>
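<p>This ranking interpretation can be computed literally by comparing every positive with every negative. A small sketch on made-up data (the function name is an assumption), counting ties as half a win:</p>

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.75, 0.9]

print(pairwise_auc(y_true, scores))       # 0.75
print(pairwise_auc(y_true, [0.5] * 8))    # 0.5: constant scores rank no better than chance
```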
</section>
<section id="when-roc-auc-fails-imbalanced-data" class="level3">
<h3 class="anchored" data-anchor-id="when-roc-auc-fails-imbalanced-data">When ROC-AUC Fails: Imbalanced Data</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    A["Dataset: 10,000 samples&lt;br/&gt;9,900 negative, 100 positive"] --&gt; B["Model predicts ALL as negative"]
    B --&gt; C["FPR = 0/(0+9900) = 0&lt;br/&gt;TPR = 0/(0+100) = 0"]
    B --&gt; D["Accuracy = 99%&lt;br/&gt;Looks great!"]
    A --&gt; E["With 9,900 negatives, even hundreds of&lt;br/&gt;false positives keep FPR small, so&lt;br/&gt;ROC-AUC can look misleadingly high"]

    E --&gt; F["Solution: Use PR-AUC&lt;br/&gt;(Precision-Recall AUC)&lt;br/&gt;for imbalanced data"]

    style D fill:#ff7851,stroke:#333,color:#fff
    style F fill:#56cc9d,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
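<p>A little arithmetic shows why the false positive rate hides problems when negatives dominate. The second model's counts below are hypothetical, chosen only to illustrate the effect on the 10,000-sample dataset from the diagram.</p>

```python
n_neg, n_pos = 9_900, 100

# Model 1: predicts everything as negative
accuracy_all_negative = n_neg / (n_neg + n_pos)

# Model 2 (hypothetical counts): catches 80 positives, raises 200 false alarms
tp, fp = 80, 200
fpr = fp / n_neg
tpr = tp / n_pos
precision = tp / (tp + fp)

print(f"accuracy (all negative) = {accuracy_all_negative:.2f}")  # 0.99
print(f"FPR = {fpr:.3f}, TPR = {tpr:.2f}")  # FPR = 0.020, TPR = 0.80
print(f"precision = {precision:.3f}")       # 0.286
```

<p>An operating point with 2% FPR and 80% TPR looks excellent on a ROC curve, yet fewer than one in three flagged cases is a true positive. Precision-recall metrics expose this directly.</p>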
</section>
<section id="roc-auc-vs.-pr-auc" class="level3">
<h3 class="anchored" data-anchor-id="roc-auc-vs.-pr-auc">ROC-AUC vs.&nbsp;PR-AUC</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 34%">
<col style="width: 43%">
<col style="width: 21%">
</colgroup>
<thead>
<tr class="header">
<th>Metric</th>
<th>Best For</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>ROC-AUC</td>
<td>Balanced datasets</td>
<td>Considers both classes equally</td>
</tr>
<tr class="even">
<td>PR-AUC</td>
<td>Imbalanced datasets (rare positives)</td>
<td>Focuses on positive class performance</td>
</tr>
</tbody>
</table>
</section>
<section id="application-1" class="level3">
<h3 class="anchored" data-anchor-id="application-1">Application</h3>
<ul>
<li><strong>Model selection:</strong> Compare models that output probabilities (higher AUC = better ranking)</li>
<li><strong>Threshold selection:</strong> Pick the operating point on the ROC curve that matches business needs</li>
<li><strong>Clinical trials:</strong> Evaluate diagnostic tests across different decision thresholds</li>
<li><strong>Credit scoring:</strong> Regulators compare AUC across demographic groups for fairness</li>
</ul>
<hr>
</section>
</section>
<section id="q3-how-do-you-handle-imbalanced-datasets" class="level2">
<h2 class="anchored" data-anchor-id="q3-how-do-you-handle-imbalanced-datasets">Q3: How do you handle imbalanced datasets?</h2>
<p><strong>Answer:</strong></p>
<p>Class imbalance occurs when one class vastly outnumbers the other (e.g., 99% negative, 1% positive). Standard accuracy becomes meaningless — a model predicting “always negative” gets 99% accuracy.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    PROBLEM["Imbalanced Dataset&lt;br/&gt;e.g., 1% fraud, 99% legitimate"] --&gt; APPROACH["Multi-level approach"]

    APPROACH --&gt; L1["Level 1: Metrics&lt;br/&gt;(change how you measure)"]
    APPROACH --&gt; L2["Level 2: Algorithm&lt;br/&gt;(change how model learns)"]
    APPROACH --&gt; L3["Level 3: Data&lt;br/&gt;(change the data itself)"]

    L1 --&gt; L1A["Use F1, PR-AUC, recall&lt;br/&gt;instead of accuracy"]
    L2 --&gt; L2A["Class weights&lt;br/&gt;Threshold tuning&lt;br/&gt;Cost-sensitive learning"]
    L3 --&gt; L3A["SMOTE (oversample minority)&lt;br/&gt;Undersample majority&lt;br/&gt;Collect more minority data"]

    style L1 fill:#56cc9d,stroke:#333,color:#fff
    style L2 fill:#6cc3d5,stroke:#333,color:#fff
    style L3 fill:#ffce67,stroke:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="strategy-priority-use-in-order" class="level3">
<h3 class="anchored" data-anchor-id="strategy-priority-use-in-order">Strategy Priority (use in order)</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    S1["1. Fix your METRICS first&lt;br/&gt;Stop using accuracy"] --&gt; S2["2. Try CLASS WEIGHTS&lt;br/&gt;(free, no data changes)"]
    S2 --&gt; S3["3. Tune THRESHOLD&lt;br/&gt;(adjust decision boundary)"]
    S3 --&gt; S4["4. Try RESAMPLING&lt;br/&gt;(SMOTE, undersampling)"]
    S4 --&gt; S5["5. Use specialized ENSEMBLES&lt;br/&gt;(Balanced RF, EasyEnsemble)"]

    style S1 fill:#56cc9d,stroke:#333,color:#fff
    style S2 fill:#56cc9d,stroke:#333,color:#fff
    style S3 fill:#6cc3d5,stroke:#333,color:#fff
    style S4 fill:#ffce67,stroke:#333,color:#fff
    style S5 fill:#ff7851,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
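<p>Step 2 in the diagram can be made concrete with the usual inverse-frequency heuristic, which is the formula scikit-learn documents for <code>class_weight="balanced"</code>: <code>n_samples / (n_classes * count)</code>. The helper name <code>balanced_class_weights</code> and the label counts below are assumptions for illustration.</p>

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])"""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical training labels with a 0.3% positive rate
y_train = [0] * 997 + [1] * 3
weights = balanced_class_weights(y_train)
ratio = weights[1] / weights[0]

print(weights)
print(f"minority weight is {ratio:.1f}x the majority weight")
```

<p>Only the ratio between the two weights matters to most learners, so a dictionary such as <code>{0: 1, 1: 333}</code> expresses roughly the same preference for a 0.3% positive rate.</p>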
</section>
<section id="example-fraud-detection-0.3-fraud-rate" class="level3">
<h3 class="anchored" data-anchor-id="example-fraud-detection-0.3-fraud-rate">Example: Fraud Detection (0.3% fraud rate)</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.ensemble <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RandomForestClassifier</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> classification_report, precision_recall_curve</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># BAD: Default model</span></span>
<span id="cb2-5">rf_default <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb2-6">rf_default.fit(X_train, y_train)</span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Accuracy: 99.7% → but catches only 20% of fraud!</span></span>
<span id="cb2-8"></span>
<span id="cb2-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># BETTER: Class weights</span></span>
<span id="cb2-10">rf_weighted <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(</span>
<span id="cb2-11">    n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,</span>
<span id="cb2-12">    class_weight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">333</span>}  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># inverse of class frequency</span></span>
<span id="cb2-13">)</span>
<span id="cb2-14">rf_weighted.fit(X_train, y_train)</span>
<span id="cb2-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Recall: 85% fraud caught, precision: 12% → many false alerts</span></span>
<span id="cb2-16"></span>
<span id="cb2-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># BEST: Threshold tuning after weighting</span></span>
<span id="cb2-18">y_proba <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rf_weighted.predict_proba(X_test)[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb2-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Find threshold where precision ≥ 5% AND recall ≥ 80%</span></span>
<span id="cb2-20">precisions, recalls, thresholds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> precision_recall_curve(y_test, y_proba)</span>
<span id="cb2-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Choose threshold = 0.35 → Recall: 82%, Precision: 8%</span></span>
<span id="cb2-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Precision of 8% means ~92% of alerts are false positives; human reviewers triage them</span></span></code></pre></div></div>
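<p>The threshold search that the comments above describe can be sketched without any library. The function name <code>pick_threshold</code>, the labels, and the probabilities are hypothetical; the rule used here (take the highest threshold that satisfies both constraints) is one reasonable choice among several, not the only one.</p>

```python
def pick_threshold(labels, scores, min_precision, min_recall):
    """Return (threshold, precision, recall) for the highest threshold that
    satisfies both constraints, or None if no threshold does."""
    n_pos = sum(labels)
    for t in sorted(set(scores), reverse=True):
        flagged = [y for y, s in zip(labels, scores) if s >= t]
        tp = sum(flagged)              # flagged samples that are truly positive
        precision = tp / len(flagged)  # never empty: t is one of the scores
        recall = tp / n_pos
        if precision >= min_precision and recall >= min_recall:
            return t, precision, recall
    return None


# Hypothetical validation labels and predicted fraud probabilities
y_val = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
p_val = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

choice = pick_threshold(y_val, p_val, min_precision=0.5, min_recall=0.6)
print(choice)
```

<p>In practice the threshold is best chosen on a validation split and then reported once on the untouched test set; tuning it on the test set itself gives an optimistic estimate.</p>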
</section>
<section id="smote-synthetic-minority-oversampling" class="level3">
<h3 class="anchored" data-anchor-id="smote-synthetic-minority-oversampling">SMOTE (Synthetic Minority Oversampling)</h3>
<p>SMOTE creates <strong>synthetic</strong> minority samples by interpolating between existing minority samples and their k-nearest neighbors.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    A["Minority sample A"] --&gt; MID["New synthetic sample&lt;br/&gt;(random point between A and B)"]
    B["Nearest neighbor B"] --&gt; MID

    style MID fill:#56cc9d,stroke:#333,color:#fff
    style A fill:#6cc3d5,stroke:#333,color:#fff
    style B fill:#6cc3d5,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
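<p><strong>Caution:</strong> Apply SMOTE <strong>only to the training data</strong>, after splitting (and inside each cross-validation fold). Never resample test or validation sets; otherwise synthetic points interpolated from held-out samples leak into training and inflate the scores.</p>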
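<p>The interpolation step itself is short. The following is a simplified sketch of the idea, assuming numeric features and Euclidean distance; <code>smote_sketch</code> is a hypothetical name, and maintained implementations (for example the one in imbalanced-learn) handle many details omitted here.</p>

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        others = [p for p in minority if p is not a]
        neighbours = sorted(others, key=lambda p: math.dist(a, p))[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


# Hypothetical minority-class samples from the training split only
minority_train = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 2.0)]
new_points = smote_sketch(minority_train, n_new=5)
for point in new_points:
    print(point)
```

<p>Because every synthetic point is a convex combination of two real minority samples, it always stays inside the region those samples span, which is also why SMOTE cannot invent genuinely new patterns.</p>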
</section>
<section id="application-2" class="level3">
<h3 class="anchored" data-anchor-id="application-2">Application</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 23%">
<col style="width: 47%">
<col style="width: 29%">
</colgroup>
<thead>
<tr class="header">
<th>Domain</th>
<th>Typical Imbalance Ratio (positive:negative)</th>
<th>Strategy</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Fraud detection</td>
<td>1:1000</td>
<td>Class weights + threshold tuning + human review</td>
</tr>
<tr class="even">
<td>Disease diagnosis</td>
<td>1:100</td>
<td>SMOTE + ensemble + high recall threshold</td>
</tr>
<tr class="odd">
<td>Manufacturing defects</td>
<td>1:500</td>
<td>Anomaly detection (one-class SVM, Isolation Forest)</td>
</tr>
<tr class="even">
<td>Click prediction</td>
<td>1:50</td>
<td>Calibrated probabilities + ranking metrics</td>
</tr>
</tbody>
</table>
<hr>
</section>
</section>
<section id="q4-what-is-feature-engineering-and-why-does-it-matter" class="level2">
<h2 class="anchored" data-anchor-id="q4-what-is-feature-engineering-and-why-does-it-matter">Q4: What is feature engineering and why does it matter?</h2>
<p><strong>Answer:</strong></p>
<p>Feature engineering is the process of creating, transforming, and selecting input features to improve model performance. It often has a <strong>greater impact</strong> than model choice or hyperparameter tuning.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    RAW["Raw Data"] --&gt; FE["Feature Engineering"]

    FE --&gt; CREATE["Create new features&lt;br/&gt;(domain knowledge)"]
    FE --&gt; TRANSFORM["Transform existing features&lt;br/&gt;(scaling, encoding)"]
    FE --&gt; SELECT["Select relevant features&lt;br/&gt;(remove noise)"]

    CREATE --&gt; EX1["age + income &lt;br/&gt; → income_per_year_of_age"]
    CREATE --&gt; EX2["timestamp &lt;br/&gt; → hour_of_day, is_weekend"]
    CREATE --&gt; EX3["lat + lon &lt;br/&gt; → distance_to_store"]

    TRANSFORM --&gt; EX4["log(income) &lt;br/&gt; — reduce skew"]
    TRANSFORM --&gt; EX5["one-hot(city) &lt;br/&gt; — encode categories"]
    TRANSFORM --&gt; EX6["StandardScaler &lt;br/&gt; — normalize ranges"]

    SELECT --&gt; EX7["Remove correlated features"]
    SELECT --&gt; EX8["L1 regularization &lt;br/&gt; → sparsity"]
    SELECT --&gt; EX9["Tree importance scores"]

    style FE fill:#56cc9d,stroke:#333,color:#fff
    style CREATE fill:#6cc3d5,stroke:#333,color:#fff
    style TRANSFORM fill:#ffce67,stroke:#333
    style SELECT fill:#ff7851,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="example-predicting-taxi-trip-duration" class="level3">
<h3 class="anchored" data-anchor-id="example-predicting-taxi-trip-duration">Example: Predicting Taxi Trip Duration</h3>
<p><strong>Raw features:</strong> pickup_time, pickup_lat, pickup_lon, dropoff_lat, dropoff_lon</p>
<p><strong>Engineered features (much more predictive):</strong></p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Distance (Haversine formula); haversine() is a user-defined helper, not a NumPy function</span></span>
<span id="cb3-4">df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'distance_km'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> haversine(</span>
<span id="cb3-5">    df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pickup_lat'</span>], df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pickup_lon'</span>],</span>
<span id="cb3-6">    df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dropoff_lat'</span>], df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'dropoff_lon'</span>]</span>
<span id="cb3-7">)</span>
<span id="cb3-8"></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Time-based features</span></span>
<span id="cb3-10">df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hour'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pickup_time'</span>].dt.hour</span>
<span id="cb3-11">df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'is_rush_hour'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hour'</span>].isin([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">17</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">19</span>]).astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb3-12">df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'is_weekend'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pickup_time'</span>].dt.dayofweek.isin([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>]).astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb3-13"></span>
<span id="cb3-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Interaction features</span></span>
<span id="cb3-15">df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'distance_x_rush'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'distance_km'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'is_rush_hour'</span>]</span>
<span id="cb3-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ^ During rush hour, distance has a MUCH bigger impact on duration</span></span>
<span id="cb3-17"></span>
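<span id="cb3-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Aggregation features (assumes a 'speed' column built from past trips only; deriving it from the target duration would leak)</span></span>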
<span id="cb3-19">df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'avg_speed_this_hour'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df.groupby(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'hour'</span>)[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'speed'</span>].transform(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mean'</span>)</span></code></pre></div></div>
<p><strong>Result:</strong> In this illustrative example, model fit improves from R² = 0.45 (raw features) to R² = 0.82 (engineered features): <strong>same model, better features</strong>.</p>
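<p>The example above calls a <code>haversine</code> helper without defining it. A minimal scalar sketch of the great-circle formula is shown below, assuming coordinates in degrees and a mean Earth radius of 6,371 km. To apply it to whole DataFrame columns, the same expression can be written with NumPy's elementwise functions (<code>np.radians</code>, <code>np.sin</code>, <code>np.cos</code>, <code>np.arcsin</code>, <code>np.sqrt</code>) in place of the <code>math</code> calls.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator
one_degree = haversine(0.0, 0.0, 0.0, 1.0)
print(f"{one_degree:.1f} km")  # 111.2 km
```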
</section>
<section id="feature-selection-methods" class="level3">
<h3 class="anchored" data-anchor-id="feature-selection-methods">Feature Selection Methods</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 15%">
<col style="width: 32%">
<col style="width: 32%">
</colgroup>
<thead>
<tr class="header">
<th>Method</th>
<th>Type</th>
<th>How it works</th>
<th>When to use</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Correlation filter</td>
<td>Filter</td>
<td>Drop one feature from each pair with |correlation| &gt; 0.95</td>
<td>Quick first pass</td>
</tr>
<tr class="even">
<td>Mutual information</td>
<td>Filter</td>
<td>Keep features with high MI with target</td>
<td>Non-linear relationships</td>
</tr>
<tr class="odd">
<td>Recursive elimination</td>
<td>Wrapper</td>
<td>Refit the model repeatedly, removing the least important feature each round</td>
<td>When compute allows</td>
</tr>
<tr class="even">
<td>L1 regularization</td>
<td>Embedded</td>
<td>Model zeros out irrelevant weights</td>
<td>Linear models</td>
</tr>
<tr class="odd">
<td>Tree importance</td>
<td>Embedded</td>
<td>Features that reduce impurity most</td>
<td>Tree-based models</td>
</tr>
</tbody>
</table>
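<p>The first row of the table, the correlation filter, can be sketched in plain Python. The greedy rule used here (keep a feature only if it is not highly correlated with any feature already kept) is one common variant; the function names and the toy columns are hypothetical.</p>

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_filter(columns, threshold=0.95):
    """Greedily keep features whose |correlation| with every kept feature
    does not exceed the threshold."""
    kept = []
    for name, values in columns.items():
        if not any(abs(pearson(values, columns[k])) > threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns: distance_miles is a rescaled copy of distance_km
columns = {
    "distance_km": [1.0, 2.0, 3.0, 4.0, 5.0],
    "distance_miles": [0.62, 1.24, 1.86, 2.49, 3.11],
    "hour": [8, 23, 14, 3, 17],
}
selected = correlation_filter(columns)
print(selected)  # ['distance_km', 'hour']
```

<p>Which member of a correlated pair survives depends on column order, so it is worth placing the more interpretable or cheaper feature first.</p>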
</section>
<section id="application-3" class="level3">
<h3 class="anchored" data-anchor-id="application-3">Application</h3>
<ul>
<li><strong>E-commerce:</strong> RFM features (Recency, Frequency, Monetary) from transaction logs</li>
<li><strong>NLP:</strong> TF-IDF, n-grams, embedding features from text</li>
<li><strong>Finance:</strong> Moving averages, volatility, technical indicators from price data</li>
<li><strong>Computer Vision:</strong> HOG features, edge histograms (classical), or learned features (deep learning)</li>
</ul>
<hr>
</section>
</section>
<section id="q5-what-is-the-curse-of-dimensionality" class="level2">
<h2 class="anchored" data-anchor-id="q5-what-is-the-curse-of-dimensionality">Q5: What is the curse of dimensionality?</h2>
<p><strong>Answer:</strong></p>
<p>As features increase, the feature space grows exponentially, making data increasingly sparse and distance metrics less meaningful.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph d1["1D: Line"]
        D1["10 points fill a line well&lt;br/&gt;Dense coverage"]
    end
    subgraph d2["2D: Square"]
        D2["10 points in a square&lt;br/&gt;Getting sparse"]
    end
    subgraph d3["3D: Cube"]
        D3["10 points in a cube&lt;br/&gt;Very sparse"]
    end
    subgraph d100["100D: Hypercube"]
        D100["10 points in 100 dimensions&lt;br/&gt;Essentially EMPTY&lt;br/&gt;Need 10¹⁰⁰ points to fill!"]
    end

    d1 --&gt; d2 --&gt; d3 --&gt; d100

    style d1 fill:#56cc9d,stroke:#333,color:#fff
    style d2 fill:#6cc3d5,stroke:#333,color:#fff
    style d3 fill:#ffce67,stroke:#333
    style d100 fill:#ff7851,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="why-this-matters-distances-become-meaningless" class="level3">
<h3 class="anchored" data-anchor-id="why-this-matters-distances-become-meaningless">Why This Matters: Distances Become Meaningless</h3>
<p>In high dimensions, under broad conditions (for example, many independent features), the farthest and nearest neighbors of a query point end up at nearly the same distance. The ratio of maximum to minimum distance approaches 1, so the relative contrast vanishes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Clim_%7Bd%20%5Cto%20%5Cinfty%7D%20%5Cfrac%7Bdist_%7Bmax%7D%20-%20dist_%7Bmin%7D%7D%7Bdist_%7Bmin%7D%7D%20%3D%200"></p>
<p>This means <strong>all points are approximately equidistant</strong>, which undermines distance-based algorithms such as KNN and K-Means.</p>
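<p>The effect is easy to observe in a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative contrast from a single random query point; <code>relative_contrast</code> is a hypothetical helper, and the exact values depend on the seed.</p>

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max_dist - min_dist) / min_dist from one random query point
    to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


contrast_low = relative_contrast(2)
contrast_high = relative_contrast(1000)
print(f"d=2:    relative contrast = {contrast_low:.2f}")
print(f"d=1000: relative contrast = {contrast_high:.2f}")
```

<p>In 2 dimensions the farthest point is many times farther away than the nearest one; in 1,000 dimensions the two distances differ by only a small fraction, so "nearest" carries little information.</p>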
</section>
<section id="example-knn-fails-in-high-dimensions" class="level3">
<h3 class="anchored" data-anchor-id="example-knn-fails-in-high-dimensions">Example: KNN Fails in High Dimensions</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.neighbors <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> KNeighborsClassifier</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> make_classification</span>
<span id="cb4-3"></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Low dimension: KNN works great</span></span>
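<span id="cb4-5">X_low, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_classification(n_features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, n_informative<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, n_redundant<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># n_redundant defaults to 2, which would exceed n_features</span></span>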
<span id="cb4-6">knn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KNeighborsClassifier(n_neighbors<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb4-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Accuracy: 92%</span></span>
<span id="cb4-8"></span>
<span id="cb4-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># High dimension: KNN fails</span></span>
<span id="cb4-10">X_high, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_classification(n_features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>, n_informative<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb4-11">knn <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KNeighborsClassifier(n_neighbors<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb4-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Accuracy: 55% — barely better than random!</span></span>
<span id="cb4-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Because with 500 features, "nearest" neighbors aren't really near</span></span></code></pre></div></div>
</section>
<section id="models-most-affected" class="level3">
<h3 class="anchored" data-anchor-id="models-most-affected">Models Most Affected</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 42%">
<col style="width: 45%">
<col style="width: 11%">
</colgroup>
<thead>
<tr class="header">
<th>Severely affected</th>
<th>Somewhat resilient</th>
<th>Why it is resilient</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>KNN</td>
<td>Decision Trees</td>
<td>Trees split on one feature at a time</td>
</tr>
<tr class="even">
<td>K-Means</td>
<td>Random Forest</td>
<td>Feature subsampling helps</td>
</tr>
<tr class="odd">
<td>SVM (RBF kernel)</td>
<td>Gradient Boosting</td>
<td>Sequential error correction</td>
</tr>
<tr class="even">
<td>Gaussian processes</td>
<td>Neural Networks (with dropout)</td>
<td>Learn relevant subspaces</td>
</tr>
</tbody>
</table>
</section>
<section id="solutions" class="level3">
<h3 class="anchored" data-anchor-id="solutions">Solutions</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    CURSE["Curse of&lt;br/&gt;Dimensionality"] --&gt; S1["Feature Selection&lt;br/&gt;(keep only&lt;br/&gt;informative features)"]
    CURSE --&gt; S2["PCA / Autoencoders&lt;br/&gt;(project to&lt;br/&gt;lower dimensions)"]
    CURSE --&gt; S3["Regularization&lt;br/&gt;(L1 drives irrelevant&lt;br/&gt;weights to zero)"]
    CURSE --&gt; S4["Domain Knowledge&lt;br/&gt;(only include&lt;br/&gt;meaningful features)"]
    CURSE --&gt; S5["Get More Data&lt;br/&gt;(fill the space better)"]

    style S1 fill:#56cc9d,stroke:#333,color:#fff
    style S2 fill:#6cc3d5,stroke:#333,color:#fff
    style S3 fill:#ffce67,stroke:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
</section>
<section id="application-4" class="level3">
<h3 class="anchored" data-anchor-id="application-4">Application</h3>
<ul>
<li><strong>Genomics:</strong> 20,000 genes, 100 patients — need aggressive feature selection</li>
<li><strong>Text/NLP:</strong> Bag-of-words creates 100K+ features — use TF-IDF + dimensionality reduction</li>
<li><strong>Image data:</strong> Raw pixels (millions of dimensions) — use CNNs to learn lower-dimensional representations</li>
<li><strong>Recommendation systems:</strong> Millions of items → embedding spaces reduce dimensionality</li>
</ul>
<hr>
</section>
</section>
<section id="q6-explain-pca-principal-component-analysis." class="level2">
<h2 class="anchored" data-anchor-id="q6-explain-pca-principal-component-analysis.">Q6: Explain PCA (Principal Component Analysis).</h2>
<p><strong>Answer:</strong></p>
<p>PCA is an unsupervised technique that finds the <strong>directions of maximum variance</strong> in the data and projects data onto a lower-dimensional subspace.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    A["Original data&lt;br/&gt;(d dimensions)"] --&gt; B["Standardize features&lt;br/&gt;(mean=0, std=1)"]
    B --&gt; C["Compute covariance matrix&lt;br/&gt;(d × d)"]
    C --&gt; D["Find eigenvectors &amp; eigenvalues"]
    D --&gt; E["Sort by eigenvalue&lt;br/&gt;(variance explained)"]
    E --&gt; F["Select top k components&lt;br/&gt;(capture 95% variance)"]
    F --&gt; G["Project data onto k dimensions"]

    style A fill:#6cc3d5,stroke:#333,color:#fff
    style F fill:#56cc9d,stroke:#333,color:#fff
    style G fill:#ffce67,stroke:#333
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="how-it-works-intuition" class="level3">
<h3 class="anchored" data-anchor-id="how-it-works-intuition">How It Works: Intuition</h3>
<p>Imagine data scattered in 3D space but most of the spread is in a 2D plane. PCA finds that plane (the directions of maximum variance) and projects all points onto it — reducing from 3D to 2D with minimal information loss.</p>
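<p>The pipeline in the diagram can be followed by hand in the smallest possible case: two features reduced to one. The sketch below uses the closed-form eigen-decomposition of a symmetric 2×2 matrix, so it needs no linear algebra library; <code>pca_2d</code> and the sample values are hypothetical, and real code would normally rely on a library implementation.</p>

```python
import math


def pca_2d(xs, ys):
    """Project two standardized features onto their first principal component.
    Returns (projected values, fraction of variance explained)."""
    n = len(xs)

    def standardize(values):
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        return [(v - mean) / std for v in values]

    # 1. Standardize (mean 0, std 1)
    zx, zy = standardize(xs), standardize(ys)

    # 2. Covariance matrix [[a, b], [b, c]]
    a = sum(p * p for p in zx) / (n - 1)
    c = sum(q * q for q in zy) / (n - 1)
    b = sum(p * q for p, q in zip(zx, zy)) / (n - 1)

    # 3. Eigenvalues of a symmetric 2x2 matrix, largest first
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + spread, mid - spread

    # 4. Unit eigenvector for the largest eigenvalue
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # 5. Project every point onto that direction
    projected = [p * vx + q * vy for p, q in zip(zx, zy)]
    return projected, lam1 / (lam1 + lam2)


# Hypothetical, strongly correlated features
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
projected, explained = pca_2d(xs, ys)
print(f"variance explained by PC1: {explained:.2%}")
```

<p>For two standardized features the covariance matrix has ones on the diagonal and the correlation <em>r</em> off it, so the first component explains (1 + |<em>r</em>|) / 2 of the variance. Highly correlated features therefore compress well, while independent ones do not.</p>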
</section>
<section id="example-dimensionality-reduction-for-visualization" class="level3">
<h3 class="anchored" data-anchor-id="example-dimensionality-reduction-for-visualization">Example: Dimensionality Reduction for Visualization</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.decomposition <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> PCA</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> StandardScaler</span>
<span id="cb5-3"></span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Original: 50 features</span></span>
<span id="cb5-5">X_scaled <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> StandardScaler().fit_transform(X)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Always scale first!</span></span>
<span id="cb5-6"></span>
<span id="cb5-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Reduce to 2D for visualization</span></span>
<span id="cb5-8">pca <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PCA(n_components<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb5-9">X_2d <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pca.fit_transform(X_scaled)</span>
<span id="cb5-10"></span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># How much information is preserved?</span></span>
<span id="cb5-12"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Variance explained: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>pca<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>explained_variance_ratio_<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2%}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example output: "Variance explained: 72.40%"</span></span>
<span id="cb5-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># → 2 components capture 72.4% of the total variance</span></span>
<span id="cb5-15"></span>
<span id="cb5-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># For modeling: find k that captures 95%</span></span>
<span id="cb5-17">pca_95 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PCA(n_components<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># auto-select k</span></span>
<span id="cb5-18">X_reduced <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pca_95.fit_transform(X_scaled)</span>
<span id="cb5-19"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Components needed for 95%: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>pca_95<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>n_components_<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Output: "Components needed for 95%: 12"</span></span>
<span id="cb5-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># → Reduced from 50 to 12 features!</span></span></code></pre></div></div>
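<p>What the automatic selection does can be approximated by accumulating the explained-variance ratios until the target is reached. The helper name <code>components_for</code> and the ratios below are hypothetical; in real code the ratios would come from <code>explained_variance_ratio_</code> of a PCA fitted with all components.</p>

```python
def components_for(ratios, target=0.95):
    """Smallest k whose first k ratios (sorted descending) sum to at least target."""
    total = 0.0
    for k, ratio in enumerate(ratios, start=1):
        total += ratio
        if total >= target:
            return k
    return len(ratios)


# Hypothetical explained-variance ratios, largest first
ratios = [0.50, 0.25, 0.12, 0.06, 0.04, 0.03]
k_95 = components_for(ratios)
print(k_95)  # 5  (cumulative: 0.50, 0.75, 0.87, 0.93, 0.97)
```

<p>Plotting the cumulative sum against k is a common way to sanity-check the chosen cutoff before committing to it.</p>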
</section>
<section id="when-to-use-and-when-not" class="level3">
<h3 class="anchored" data-anchor-id="when-to-use-and-when-not">When to Use and When Not</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    PCA_NODE["PCA"] --&gt; USE["✅ Use when"]
    PCA_NODE --&gt; AVOID["❌ Avoid when"]

    USE --&gt; U1["Features are correlated&lt;br/&gt;(redundant)"]
    USE --&gt; U2["Visualization needed&lt;br/&gt;(reduce to 2-3D)"]
    USE --&gt; U3["Speed up training&lt;br/&gt;(fewer features)"]
    USE --&gt; U4["Reduce noise&lt;br/&gt;(drop low-variance components)"]

    AVOID --&gt; A1["Features are&lt;br/&gt;already uncorrelated"]
    AVOID --&gt; A2["Interpretability is critical&lt;br/&gt;(components are&lt;br/&gt;hard to explain)"]
    AVOID --&gt; A3["Non-linear relationships&lt;br/&gt;dominate (use t-SNE,&lt;br/&gt;UMAP, or autoencoders)"]

    style USE fill:#56cc9d,stroke:#333,color:#fff
    style AVOID fill:#ff7851,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
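<p>The first "use" and "avoid" branches can be checked numerically. The sketch below is a minimal illustration on synthetic data (the sample size, noise level, and the helper name <code>first_component_share</code> are assumptions made for this example): when features are strongly correlated, the first principal component captures almost all of the variance, while for independent features the variance is spread roughly evenly and PCA has little to compress.</p>

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 5

# Correlated case: every feature is the same hidden signal plus a little noise
signal = rng.normal(size=(n_samples, 1))
X_corr = signal + 0.1 * rng.normal(size=(n_samples, n_features))

# Independent case: every feature is drawn separately
X_indep = rng.normal(size=(n_samples, n_features))

def first_component_share(X):
    """Fraction of total variance captured by the first principal component."""
    X_scaled = StandardScaler().fit_transform(X)
    return PCA().fit(X_scaled).explained_variance_ratio_[0]

share_corr = first_component_share(X_corr)
share_indep = first_component_share(X_indep)
print(f"correlated:  {share_corr:.2f} of variance in PC1")
print(f"independent: {share_indep:.2f} of variance in PC1")
```

<p>With five independent features, a share close to 1/5 per component is what you would expect, so dropping components would discard real information.</p>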
</section>
<section id="application-5" class="level3">
<h3 class="anchored" data-anchor-id="application-5">Application</h3>
<ul>
<li><strong>Image compression:</strong> Reduce a 28×28 image from 784 pixel features to 50 components</li>
<li><strong>Genomics:</strong> Visualize population structure from thousands of genetic markers</li>
<li><strong>Finance:</strong> Identify latent factors driving asset returns</li>
<li><strong>Preprocessing:</strong> Remove multicollinearity before linear regression</li>
</ul>
<hr>
</section>
</section>
<section id="q7-what-is-the-difference-between-generative-and-discriminative-models" class="level2">
<h2 class="anchored" data-anchor-id="q7-what-is-the-difference-between-generative-and-discriminative-models">Q7: What is the difference between generative and discriminative models?</h2>
<p><strong>Answer:</strong></p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph disc["Discriminative Model"]
        direction TB
        D1["Learns: P(y|x) directly"]
        D2["'Given features,&lt;br/&gt;what's the class?'"]
        D3["Draws decision boundary"]
    end

    disc --&gt; D_EX["Examples:&lt;br/&gt;• Logistic Regression&lt;br/&gt;• SVM&lt;br/&gt;• Neural Networks&lt;br/&gt;• Random Forest"]

    style disc fill:#6cc3d5,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    subgraph gen["Generative Model"]
        direction TB
        G1["Learns: P(x|y) and P(y)"]
        G2["'What does each&lt;br/&gt;class look like?'"]
        G3["Models full data distribution"]
    end

    gen --&gt; G_EX["Examples:&lt;br/&gt;• Naive Bayes&lt;br/&gt;• Gaussian Mixture Models&lt;br/&gt;• VAE, GANs&lt;br/&gt;• Hidden Markov Models"]

    style gen fill:#56cc9d,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="intuition-cat-vs.-dog-classifier" class="level3">
<h3 class="anchored" data-anchor-id="intuition-cat-vs.-dog-classifier">Intuition: Cat vs.&nbsp;Dog Classifier</h3>
<p><strong>Discriminative approach:</strong> Learn the boundary between cats and dogs: “This side = cat, that side = dog.” The model never learns what a cat or dog looks like, only where the line is.</p>
<p><strong>Generative approach:</strong> Learn what cats look like (fur patterns, ear shapes) and what dogs look like separately. Classify new images by asking “Does this look more like a cat or a dog?” Can also <strong>generate</strong> new cat/dog images.</p>
</section>
<section id="understanding-the-math" class="level3">
<h3 class="anchored" data-anchor-id="understanding-the-math">Understanding the Math</h3>
<p><strong>Discriminative — models <img src="https://latex.codecogs.com/png.latex?P(y%7Cx)"> directly:</strong></p>
<ul>
<li>Asks: “Given these input features <img src="https://latex.codecogs.com/png.latex?x">, what is the probability of each class <img src="https://latex.codecogs.com/png.latex?y">?”</li>
<li>Example: Given an email’s word frequencies, directly output <img src="https://latex.codecogs.com/png.latex?P(%5Ctext%7Bspam%7D%20%7C%20%5Ctext%7Bwords%7D)%20=%200.87"></li>
<li>Learns the decision boundary without modeling how the data was generated</li>
</ul>
<p><strong>Generative — models <img src="https://latex.codecogs.com/png.latex?P(x%7Cy)%20%5Ccdot%20P(y)"> then applies Bayes’ rule:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?P(y%7Cx)%20=%20%5Cfrac%7BP(x%7Cy)%20%5Ccdot%20P(y)%7D%7BP(x)%7D"></p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?P(x%7Cy)"> = <strong>likelihood</strong> — “What does data from class <img src="https://latex.codecogs.com/png.latex?y"> look like?” (e.g., what word patterns do spam emails have?)</li>
<li><img src="https://latex.codecogs.com/png.latex?P(y)"> = <strong>prior</strong> — “How common is class <img src="https://latex.codecogs.com/png.latex?y">?” (e.g., 20% of all emails are spam)</li>
<li><img src="https://latex.codecogs.com/png.latex?P(x)"> = <strong>evidence</strong> — normalizing constant (same for all classes, often ignored)</li>
<li>To classify: compute <img src="https://latex.codecogs.com/png.latex?P(x%7Cy)%20%5Ccdot%20P(y)"> for each class, pick the highest</li>
</ul>
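<p>The classification rule above can be worked through with plain arithmetic. In this sketch only the 20% spam prior comes from the text; the two likelihood values are hypothetical numbers chosen for illustration.</p>

```python
# Only the 20% prior comes from the text; the likelihoods are assumed values.
prior = {"spam": 0.2, "ham": 0.8}            # P(y)
likelihood = {"spam": 0.05, "ham": 0.002}    # P(x | y) for one observed email x

# Unnormalized scores P(x | y) * P(y) for each class
joint = {y: likelihood[y] * prior[y] for y in prior}

# Evidence P(x) is the sum over classes; dividing by it rescales the scores
# into probabilities but never changes which class scores highest
evidence = sum(joint.values())
posterior = {y: joint[y] / evidence for y in joint}

predicted = max(joint, key=joint.get)
print(posterior, predicted)
```

<p>Because the evidence is the same for every class, the class with the largest unnormalized score is also the class with the largest posterior, which is why it is often skipped when only the label is needed.</p>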
</section>
<section id="example-spam-detection-two-approaches" class="level3">
<h3 class="anchored" data-anchor-id="example-spam-detection-two-approaches">Example: Spam Detection — Two Approaches</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Discriminative: Logistic Regression</span></span>
<span id="cb6-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Learns P(spam | words) directly</span></span>
<span id="cb6-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.linear_model <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> LogisticRegression</span>
<span id="cb6-4">lr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LogisticRegression()</span>
<span id="cb6-5">lr.fit(X_tfidf, y_labels)</span>
<span id="cb6-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Finds the decision boundary in word-frequency space</span></span>
<span id="cb6-7"></span>
<span id="cb6-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generative: Naive Bayes</span></span>
<span id="cb6-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Learns P(words | spam), P(words | not_spam), and the class priors P(spam), P(not_spam)</span></span>
<span id="cb6-10"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.naive_bayes <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> MultinomialNB</span>
<span id="cb6-11">nb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MultinomialNB()</span>
<span id="cb6-12">nb.fit(X_tfidf, y_labels)</span>
<span id="cb6-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Models how spam emails "look" vs. how normal emails "look"</span></span>
<span id="cb6-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Classifies using Bayes' rule: P(spam|words) ∝ P(words|spam)·P(spam)</span></span></code></pre></div></div>
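<p>The snippet above assumes <code>X_tfidf</code> and <code>y_labels</code> already exist. A self-contained version on a tiny corpus might look like the following; the six emails and the test message are made up for illustration, and a corpus this small says nothing about which model is more accurate in practice.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = spam, 0 = not spam
emails = [
    "win money now",
    "free money offer",
    "win free prize now",
    "meeting at noon",
    "project status report",
    "lunch meeting tomorrow",
]
y_labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(emails)

lr = LogisticRegression().fit(X_tfidf, y_labels)   # discriminative
nb = MultinomialNB().fit(X_tfidf, y_labels)        # generative

# Transform the new message with the vocabulary learned from the training emails
X_new = vectorizer.transform(["free prize money"])
lr_pred, nb_pred = lr.predict(X_new)[0], nb.predict(X_new)[0]
lr_proba, nb_proba = lr.predict_proba(X_new)[0], nb.predict_proba(X_new)[0]
print("LogisticRegression:", lr_pred, lr_proba)
print("MultinomialNB:     ", nb_pred, nb_proba)
```

<p>Both models expose the same <code>predict_proba</code> interface; the difference lies in how the probabilities are obtained, not in how they are used.</p>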
</section>
<section id="comparison" class="level3">
<h3 class="anchored" data-anchor-id="comparison">Comparison</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 22%">
<col style="width: 42%">
<col style="width: 34%">
</colgroup>
<thead>
<tr class="header">
<th>Aspect</th>
<th>Discriminative</th>
<th>Generative</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>What it models</td>
<td>P(y|x) — boundary</td>
<td>P(x|y)·P(y) — full distribution</td>
</tr>
<tr class="even">
<td>Accuracy with enough data</td>
<td>Usually higher</td>
<td>Often lower for classification</td>
</tr>
<tr class="odd">
<td>Small data performance</td>
<td>Can struggle</td>
<td>Often better (stronger assumptions help)</td>
</tr>
<tr class="even">
<td>Can generate new data?</td>
<td>No</td>
<td>Yes</td>
</tr>
<tr class="odd">
<td>Handles missing features</td>
<td>Poorly</td>
<td>Naturally (marginalize out)</td>
</tr>
<tr class="even">
<td>Modeling effort</td>
<td>Focuses only on boundary</td>
<td>Models more than needed for classification</td>
</tr>
</tbody>
</table>
</section>
<section id="application-6" class="level3">
<h3 class="anchored" data-anchor-id="application-6">Application</h3>
<ul>
<li><strong>Discriminative:</strong> Most production classification tasks (credit scoring, image classification, NLP)</li>
<li><strong>Generative:</strong> Data augmentation (GANs), anomaly detection, handling missing data, text generation (GPT), drug discovery</li>
<li><strong>Modern trend:</strong> LLMs and diffusion models are generative models used to create text and images, while discriminative models remain dominant for classification and prediction tasks</li>
</ul>
<hr>
</section>
</section>
<section id="q8-what-is-gradient-boosting-and-how-does-xgboost-work" class="level2">
<h2 class="anchored" data-anchor-id="q8-what-is-gradient-boosting-and-how-does-xgboost-work">Q8: What is gradient boosting and how does XGBoost work?</h2>
<p><strong>Answer:</strong></p>
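<p>Gradient boosting builds an ensemble sequentially: each new model is fit to <strong>the residual errors</strong> of the current ensemble. More generally, each model fits the negative gradient of the loss, which for squared error is exactly the residual.</p>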
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    A["Training Data&lt;br/&gt;(X, y)"] --&gt; B["Model 1: Simple tree&lt;br/&gt;Prediction: ŷ₁"]
    B --&gt; C["Compute residuals&lt;br/&gt;r₁ = y - ŷ₁"]
    C --&gt; D["Model 2: Fit residuals r₁&lt;br/&gt;Prediction: ŷ₂"]
    D --&gt; E["Compute residuals&lt;br/&gt;r₂ = y - (ŷ₁ + η·ŷ₂)"]
    E --&gt; F["Model 3: Fit residuals r₂&lt;br/&gt;Prediction: ŷ₃"]
    F --&gt; G["...continue..."]
    G --&gt; H["Final: ŷ = ŷ₁ + η·ŷ₂ + η·ŷ₃ + ..."]

    style A fill:#6cc3d5,stroke:#333,color:#fff
    style H fill:#56cc9d,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
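<p>The loop in the diagram can be sketched in a few lines for squared-error loss. This is a minimal illustration, not how XGBoost is implemented: the synthetic sine data, the tree depth, the learning rate, and the number of rounds are all assumptions chosen for the example.</p>

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

eta = 0.1        # learning rate (η in the diagram)
n_rounds = 50    # residual-fitting trees added after the first model

# Model 1: a shallow tree fit directly to y
first_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred = first_tree.predict(X)
mse_first = np.mean((y - pred) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y - pred                      # r = y - current ensemble prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    pred = pred + eta * tree.predict(X)       # shrink each correction by η
    trees.append(tree)

mse_final = np.mean((y - pred) ** 2)
print(f"train MSE after model 1: {mse_first:.4f}, after boosting: {mse_final:.4f}")
```

<p>Training error keeps falling as trees are added, which is exactly why a validation set and early stopping are needed to decide when to stop.</p>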
<section id="how-xgboost-improves-gradient-boosting" class="level3">
<h3 class="anchored" data-anchor-id="how-xgboost-improves-gradient-boosting">How XGBoost Improves Gradient Boosting</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    GB["Standard&lt;br/&gt;Gradient Boosting"] --&gt; XGB["XGBoost&lt;br/&gt;Improvements"]

    XGB --&gt; I1["Regularization&lt;br/&gt;(L1 + L2 on&lt;br/&gt;leaf weights)"]
    XGB --&gt; I2["Second-order gradients&lt;br/&gt;(Newton's method&lt;br/&gt;— faster convergence)"]
    XGB --&gt; I3["Column subsampling&lt;br/&gt;(like Random Forest&lt;br/&gt;— reduces overfitting)"]
    XGB --&gt; I4["Built-in missing&lt;br/&gt;value handling&lt;br/&gt;(learns optimal direction)"]
    XGB --&gt; I5["Tree pruning&lt;br/&gt;(max_depth +&lt;br/&gt;gain-based pruning)"]
    XGB --&gt; I6["Parallel feature&lt;br/&gt;computation&lt;br/&gt;(fast training)"]

    style XGB fill:#56cc9d,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
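<p>The "second-order gradients" and "regularization" boxes combine into two short formulas from the XGBoost objective: for a leaf with gradient sum G and Hessian sum H, the optimal weight is −G / (H + λ), and a split is scored by how much it improves the sum of G² / (H + λ) terms, minus a per-leaf penalty γ. The sketch below evaluates both on four hypothetical data points with squared-error loss; it omits the L1 term for brevity, and the function names are made up for this example.</p>

```python
def leaf_weight(g_sum, h_sum, lam):
    """Optimal leaf value under an L2 penalty: w* = -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Improvement in the regularized objective from splitting one leaf in two."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Squared-error loss 0.5 * (y - pred)^2: gradient = pred - y, Hessian = 1
y = [1.0, 1.2, 3.0, 3.4]                  # hypothetical targets
pred = [2.0, 2.0, 2.0, 2.0]               # current ensemble prediction
g = [p - t for p, t in zip(pred, y)]
h = [1.0] * len(y)

# Candidate split: first two rows go left, last two go right
g_l, h_l = sum(g[:2]), sum(h[:2])
g_r, h_r = sum(g[2:]), sum(h[2:])
lam, gamma = 1.0, 0.0

w_left = leaf_weight(g_l, h_l, lam)
w_right = leaf_weight(g_r, h_r, lam)
gain = split_gain(g_l, h_l, g_r, h_r, lam, gamma)
print(w_left, w_right, gain)
```

<p>A larger λ shrinks every leaf weight toward zero, and a larger γ makes splits with small gain unprofitable, which is the basis of gain-based pruning.</p>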
</section>
<section id="example-house-price-prediction" class="level3">
<h3 class="anchored" data-anchor-id="example-house-price-prediction">Example: House Price Prediction</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> xgboost <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> xgb</span>
<span id="cb7-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> cross_val_score</span>
<span id="cb7-3"></span>
<span id="cb7-4">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> xgb.XGBRegressor(</span>
<span id="cb7-5">    n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>,        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 500 sequential trees</span></span>
<span id="cb7-6">    max_depth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,             <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shallow trees (high bias per tree)</span></span>
<span id="cb7-7">    learning_rate<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shrinkage — small steps</span></span>
<span id="cb7-8">    subsample<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>,           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 80% of rows per tree</span></span>
<span id="cb7-9">    colsample_bytree<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>,   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 80% of features per tree</span></span>
<span id="cb7-10">    reg_alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>,           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># L1 regularization</span></span>
<span id="cb7-11">    reg_lambda<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>,          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># L2 regularization</span></span>
<span id="cb7-12">    early_stopping_rounds<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># stop if no improvement</span></span>
<span id="cb7-13">)</span>
<span id="cb7-14"></span>
<span id="cb7-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># With eval set for early stopping</span></span>
<span id="cb7-16">model.fit(</span>
<span id="cb7-17">    X_train, y_train,</span>
<span id="cb7-18">    eval_set<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[(X_val, y_val)],</span>
<span id="cb7-19">    verbose<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb7-20">)</span>
<span id="cb7-21"></span>
<span id="cb7-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Illustrative result: RMSE improved from $45K (single tree) to $18K (XGBoost)</span></span>
<span id="cb7-23"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Best iteration: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>model<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>best_iteration<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Stopped at 312 trees</span></span></code></pre></div></div>
</section>
<section id="key-hyperparameters-and-tuning-order" class="level3">
<h3 class="anchored" data-anchor-id="key-hyperparameters-and-tuning-order">Key Hyperparameters and Tuning Order</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 27%">
<col style="width: 30%">
<col style="width: 19%">
<col style="width: 22%">
</colgroup>
<thead>
<tr class="header">
<th>Priority</th>
<th>Parameter</th>
<th>Range</th>
<th>Effect</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1st</td>
<td><code>learning_rate</code></td>
<td>0.01-0.3</td>
<td>Lower = more robust but needs more trees</td>
</tr>
<tr class="even">
<td>1st</td>
<td><code>n_estimators</code></td>
<td>100-5000</td>
<td>Use early stopping to find optimal</td>
</tr>
<tr class="odd">
<td>2nd</td>
<td><code>max_depth</code></td>
<td>3-8</td>
<td>Controls tree complexity</td>
</tr>
<tr class="even">
<td>2nd</td>
<td><code>subsample</code></td>
<td>0.6-1.0</td>
<td>Row sampling (regularization)</td>
</tr>
<tr class="odd">
<td>3rd</td>
<td><code>colsample_bytree</code></td>
<td>0.6-1.0</td>
<td>Feature sampling (regularization)</td>
</tr>
<tr class="even">
<td>3rd</td>
<td><code>reg_alpha</code>, <code>reg_lambda</code></td>
<td>0-10</td>
<td>Weight penalties</td>
</tr>
</tbody>
</table>
</section>
<section id="application-7" class="level3">
<h3 class="anchored" data-anchor-id="application-7">Application</h3>
<ul>
<li><strong>Kaggle competitions:</strong> XGBoost and LightGBM are frequent winners of tabular-data competitions</li>
<li><strong>Industry standard:</strong> Fraud detection, credit scoring, recommendation ranking</li>
<li><strong>When to use:</strong> Small to medium tabular data (roughly &lt; 1M rows as a rule of thumb; on larger data, LightGBM is often faster)</li>
<li><strong>When NOT to use:</strong> Image/text/audio data (use deep learning), very small data (use simpler models)</li>
</ul>
<hr>
</section>
</section>
<section id="q9-how-do-you-handle-missing-data" class="level2">
<h2 class="anchored" data-anchor-id="q9-how-do-you-handle-missing-data">Q9: How do you handle missing data?</h2>
<p><strong>Answer:</strong></p>
<p>Missing data handling requires understanding <strong>why</strong> data is missing before choosing a strategy.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    MISSING["Missing Data"] --&gt; TYPE["Understand the type"]

    TYPE --&gt; MCAR["MCAR&lt;br/&gt;Missing Completely&lt;br/&gt;at Random&lt;br/&gt;(no pattern)"]
    TYPE --&gt; MAR["MAR&lt;br/&gt;Missing at Random&lt;br/&gt;(depends on&lt;br/&gt;observed features)"]
    TYPE --&gt; MNAR["MNAR&lt;br/&gt;Missing Not at Random&lt;br/&gt;(depends on the&lt;br/&gt;missing value itself)"]

    MCAR --&gt; MCAR_EX["Example: Sensor&lt;br/&gt;randomly fails&lt;br/&gt;→ Safe to drop or impute"]
    MAR --&gt; MAR_EX["Example: Older respondents&lt;br/&gt;skip income question&lt;br/&gt;(age is observed)&lt;br/&gt;→ Impute using other features"]
    MNAR --&gt; MNAR_EX["Example: Sick patients&lt;br/&gt;miss appointments&lt;br/&gt;→ Missingness IS informative"]

    style MCAR fill:#56cc9d,stroke:#333,color:#fff
    style MAR fill:#6cc3d5,stroke:#333,color:#fff
    style MNAR fill:#ff7851,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="strategies-decision-tree" class="level3">
<h3 class="anchored" data-anchor-id="strategies-decision-tree">Strategies Decision Tree</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    Q1{"How much is missing?"} --&gt;|"&gt;50% of column"| DROP_COL["Drop the column"]
    Q1 --&gt;|"&lt;5% of rows"| DROP_ROW["Drop rows&lt;br/&gt;(if MCAR)"]
    Q1 --&gt;|"5-50%"| Q2{"What type of feature?"}

    Q2 --&gt;|"Numerical"| Q3{"Distribution?"}
    Q2 --&gt;|"Categorical"| CAT["Mode or 'Unknown' category"]

    Q3 --&gt;|"Symmetric"| MEAN["Mean imputation"]
    Q3 --&gt;|"Skewed / outliers"| MEDIAN["Median imputation"]
    Q3 --&gt;|"Complex patterns"| MODEL["Model-based&lt;br/&gt;(KNN, Iterative)"]

    DROP_COL --&gt; FLAG["+ Add missingness indicator&lt;br/&gt;if MNAR suspected"]
    MEDIAN --&gt; FLAG

    style FLAG fill:#ffce67,stroke:#333, color:#000
    style MODEL fill:#56cc9d,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
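<p>The flowchart can be written down as a small helper, which is a handy way to make the thresholds explicit in an interview. This is only a sketch of the chart above: the function name and arguments are hypothetical, and the 5% and 50% cut-offs are rules of thumb rather than hard limits.</p>

```python
def choose_strategy(missing_frac, kind, skewed=False, complex_patterns=False):
    """Return the strategy the flowchart suggests for one column.

    missing_frac: fraction of values missing in the column (0.0 to 1.0)
    kind: "numerical" or "categorical"
    """
    if missing_frac > 0.50:
        return "drop column"
    if missing_frac < 0.05:
        return "drop rows (if MCAR)"
    if kind == "categorical":
        return "mode or 'Unknown' category"
    if complex_patterns:
        return "model-based (KNN, iterative)"
    if skewed:
        return "median"
    return "mean"

print(choose_strategy(0.02, "numerical"))                  # few values missing
print(choose_strategy(0.15, "numerical", skewed=True))     # e.g. income
print(choose_strategy(0.30, "categorical"))
print(choose_strategy(0.70, "numerical"))
```

<p>Whatever the helper returns, a missingness indicator can still be added on top when MNAR is suspected.</p>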
</section>
<section id="example-customer-data-with-missing-values" class="level3">
<h3 class="anchored" data-anchor-id="example-customer-data-with-missing-values">Example: Customer Data with Missing Values</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb8-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.impute <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> SimpleImputer, KNNImputer</span>
<span id="cb8-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.pipeline <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Pipeline</span>
<span id="cb8-4"></span>
<span id="cb8-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Dataset:</span></span>
<span id="cb8-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># age: 2% missing (random sensor error) → MCAR</span></span>
<span id="cb8-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># income: 15% missing (older customers skip it more often) → MAR</span></span>
<span id="cb8-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># credit_score: 30% missing (new customers) → MNAR</span></span>
<span id="cb8-9"></span>
<span id="cb8-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Strategy 1: Simple imputation</span></span>
<span id="cb8-11">imputer_age <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> SimpleImputer(strategy<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'median'</span>)    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># robust to outliers</span></span>
<span id="cb8-12">imputer_income <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KNNImputer(n_neighbors<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use similar customers</span></span>
<span id="cb8-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># For credit_score: add a flag + impute</span></span>
<span id="cb8-14"></span>
<span id="cb8-15">df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'credit_score_missing'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'credit_score'</span>].isna().astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># flag</span></span>
<span id="cb8-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># credit_score values are filled after the split below, so the fill comes from training rows only</span></span>
<span id="cb8-17"></span>
<span id="cb8-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># CRITICAL: fit imputers on TRAINING data only!</span></span>
<span id="cb8-19"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> train_test_split</span>
<span id="cb8-20">X_train, X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_test_split(X)</span>
<span id="cb8-21"></span>
<span id="cb8-22">imputer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KNNImputer(n_neighbors<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb8-23">X_train_imputed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> imputer.fit_transform(X_train)   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># fit + transform</span></span>
<span id="cb8-24">X_test_imputed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> imputer.transform(X_test)         <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># only transform!</span></span></code></pre></div></div>
</section>
<section id="common-mistakes" class="level3">
<h3 class="anchored" data-anchor-id="common-mistakes">Common Mistakes</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 31%">
<col style="width: 51%">
<col style="width: 17%">
</colgroup>
<thead>
<tr class="header">
<th>Mistake</th>
<th>Why it’s wrong</th>
<th>Fix</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Impute before splitting</td>
<td>Leaks test info into training</td>
<td>Split first, fit imputer on train only</td>
</tr>
<tr class="even">
<td>Use mean for skewed data</td>
<td>Mean pulled by outliers</td>
<td>Use median</td>
</tr>
<tr class="odd">
<td>Drop all missing rows</td>
<td>Loses data + introduces bias</td>
<td>Impute or flag</td>
</tr>
<tr class="even">
<td>Ignore MNAR patterns</td>
<td>Loses predictive signal</td>
<td>Add missingness indicator</td>
</tr>
<tr class="odd">
<td>Impute time series with future</td>
<td>Temporal leakage</td>
<td>Use forward-fill or a backward-looking rolling window</td>
</tr>
</tbody>
</table>
</section>
<section id="application-8" class="level3">
<h3 class="anchored" data-anchor-id="application-8">Application</h3>
<ul>
<li><strong>Healthcare:</strong> Patient data often has MNAR (sicker patients have more missing tests) — missingness flag is critical</li>
<li><strong>Surveys:</strong> Income/age often MAR — use KNN imputer with demographic features</li>
<li><strong>IoT/Sensors:</strong> Usually MCAR — simple median/interpolation works</li>
<li><strong>Production systems:</strong> Build imputation into the ML pipeline (sklearn Pipeline) so it’s applied consistently at training and inference time</li>
</ul>
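<p>One way to build imputation into the pipeline, as the last bullet suggests, is to combine <code>ColumnTransformer</code> with <code>SimpleImputer</code>, whose <code>add_indicator=True</code> option appends a missingness flag column. The sketch below uses synthetic customer data; the column names, missing rates, and target definition are assumptions made for the example.</p>

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "credit_score": rng.normal(650, 50, n),
})
y = (df["credit_score"] + rng.normal(0, 30, n) > 650).astype(int)

# Knock out values to simulate missing data (about 2% of age, 30% of credit_score)
df.loc[rng.random(n) < 0.02, "age"] = np.nan
df.loc[rng.random(n) < 0.30, "credit_score"] = np.nan

preprocess = ColumnTransformer([
    ("age", SimpleImputer(strategy="median"), ["age"]),
    # add_indicator=True appends a 0/1 column marking which rows were missing
    ("credit", SimpleImputer(strategy="median", add_indicator=True), ["credit_score"]),
])
model = Pipeline([
    ("impute", preprocess),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)
model.fit(X_train, y_train)        # medians are learned from training rows only
preds = model.predict(X_test)      # the same medians are reused at inference time
X_test_imputed = model.named_steps["impute"].transform(X_test)
print(X_test_imputed.shape, preds[:5])
```

<p>Because the imputers live inside the pipeline, passing <code>model</code> to cross-validation refits them on each training fold, so no test-fold statistics leak into training.</p>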
<hr>
</section>
</section>
<section id="q10-what-is-data-leakage-and-how-do-you-prevent-it" class="level2">
<h2 class="anchored" data-anchor-id="q10-what-is-data-leakage-and-how-do-you-prevent-it">Q10: What is data leakage and how do you prevent it?</h2>
<p><strong>Answer:</strong></p>
<p>Data leakage occurs when information that <strong>would not be available at prediction time</strong> is used during training. It inflates offline metrics, and the model then fails in production, where that information is missing.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    subgraph leakage["Data Leakage:&lt;br/&gt;What Happens"]
        direction LR
        L1["Training uses&lt;br/&gt;future/target info"] --&gt; L2["Model gets 99%&lt;br/&gt;accuracy offline"]
        L2 --&gt; L3["Deploy to production"]
        L3 --&gt; L4["Performance drops to 60%&lt;br/&gt;❌ FAILURE"]
    end

    style leakage fill:#ff7851,stroke:#333,color:#fff
    linkStyle default stroke:#000
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph LR
    subgraph clean["No Leakage:&lt;br/&gt;What Should Happen"]
        direction LR
        C1["Training uses&lt;br/&gt;only available info"] --&gt; C2["Model gets 85%&lt;br/&gt;accuracy offline"]
        C2 --&gt; C3["Deploy to production"]
        C3 --&gt; C4["Performance stays at 83%&lt;br/&gt;✅ SUCCESS"]
    end

    style clean fill:#56cc9d,stroke:#333,color:#fff
    linkStyle default stroke:#000
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<section id="intuitive-example-predicting-hospital-readmission" class="level3">
<h3 class="anchored" data-anchor-id="intuitive-example-predicting-hospital-readmission">Intuitive Example: Predicting Hospital Readmission</h3>
<p>Imagine you’re building a model to predict whether a patient will be <strong>readmitted within 30 days</strong>.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 37%">
<col style="width: 41%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Leakage?</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Patient age, diagnosis</td>
<td>✅ Safe</td>
<td>Available at discharge</td>
</tr>
<tr class="even">
<td>Length of stay</td>
<td>✅ Safe</td>
<td>Known when patient leaves</td>
</tr>
<tr class="odd">
<td>“Readmission scheduled” flag</td>
<td>❌ <strong>Leakage!</strong></td>
<td>Only recorded once a readmission is already planned, after the prediction point</td>
</tr>
<tr class="even">
<td>Discharge summary mentioning “follow-up in 2 weeks”</td>
<td>⚠️ Subtle leakage</td>
<td>Written by doctor who already decided on readmission plan</td>
</tr>
<tr class="odd">
<td>Number of future appointments booked</td>
<td>❌ <strong>Leakage!</strong></td>
<td>Created after the prediction point</td>
</tr>
</tbody>
</table>
<p><strong>The key question:</strong> “Would I have this feature at the moment I need to make the prediction?”</p>
<p>If the answer is no — it’s leakage. The model isn’t learning to <em>predict</em> the future; it’s learning to <em>read</em> the future.</p>
</section>
<section id="common-types-of-leakage" class="level3">
<h3 class="anchored" data-anchor-id="common-types-of-leakage">Common Types of Leakage</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    LEAK["Data Leakage Types"] --&gt; T1["Target Leakage&lt;br/&gt;(feature derived from target)"]
    LEAK --&gt; T2["Temporal Leakage&lt;br/&gt;(using future data)"]
    LEAK --&gt; T3["Train-Test Contamination&lt;br/&gt;(preprocessing on full data)"]
    LEAK --&gt; T4["Group Leakage&lt;br/&gt;(same entity in train &amp; test)"]

    T1 --&gt; T1_EX["Example: 'diagnosis_code'&lt;br/&gt;predicting 'has_disease'&lt;br/&gt;(code IS the diagnosis!)"]
    T2 --&gt; T2_EX["Example: Using tomorrow's&lt;br/&gt;stock price as a feature&lt;br/&gt;to predict today's"]
    T3 --&gt; T3_EX["Example: Scaling/encoding&lt;br/&gt;fit on full data before split"]
    T4 --&gt; T4_EX["Example: Same patient in&lt;br/&gt;train &amp; test&lt;br/&gt;(memorizes patient, not pattern)"]

    style T1 fill:#ff7851,stroke:#333,color:#fff
    style T2 fill:#ffce67,stroke:#333
    style T3 fill:#6cc3d5,stroke:#333,color:#fff
    style T4 fill:#56cc9d,stroke:#333,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
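<p>Group leakage is usually the easiest of the four to rule out in code. The sketch below uses scikit-learn's <code>GroupShuffleSplit</code> on synthetic data so that every visit from a given patient lands on one side of the split; the patient IDs and array shapes are made up for illustration.</p>

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic data: 12 visits from 4 hypothetical patients (3 visits each).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = rng.integers(0, 2, size=12)
patient_ids = np.repeat(["p1", "p2", "p3", "p4"], 3)

# Split by patient, not by row: 25% of the patients go to the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

train_patients = {str(p) for p in patient_ids[train_idx]}
test_patients = {str(p) for p in patient_ids[test_idx]}

print("train patients:", sorted(train_patients))
print("test patients:", sorted(test_patients))
print("overlap:", train_patients.intersection(test_patients))
```

<p>For cross-validation rather than a single split, <code>GroupKFold</code> applies the same idea across folds.</p>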
</section>
<section id="example-churn-prediction-with-leakage" class="level3">
<h3 class="anchored" data-anchor-id="example-churn-prediction-with-leakage">Example: Churn Prediction with Leakage</h3>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ❌ LEAKAGE: Feature "days_since_last_login" is computed AFTER the churn event</span></span>
<span id="cb9-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If someone churned 30 days ago, days_since_last_login = 30</span></span>
<span id="cb9-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The model is just detecting "they already churned" not "they will churn"</span></span>
<span id="cb9-4"></span>
<span id="cb9-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ❌ LEAKAGE: Scaling before splitting</span></span>
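<span id="cb9-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> train_test_split
<span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.preprocessing <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> StandardScaler</span>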
<span id="cb9-7">scaler <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> StandardScaler()</span>
<span id="cb9-8">X_scaled <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scaler.fit_transform(X)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># fit on ALL data including test</span></span>
<span id="cb9-9">X_train, X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_test_split(X_scaled)</span>
<span id="cb9-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Test data statistics leaked into scaler!</span></span>
<span id="cb9-11"></span>
<span id="cb9-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ✅ CORRECT: Split first, then preprocess</span></span>
<span id="cb9-13">X_train, X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_test_split(X)</span>
<span id="cb9-14">scaler <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> StandardScaler()</span>
<span id="cb9-15">X_train_scaled <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scaler.fit_transform(X_train)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># fit only on train</span></span>
<span id="cb9-16">X_test_scaled <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> scaler.transform(X_test)        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># transform only</span></span></code></pre></div></div>
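<p>The <code>days_since_last_login</code> problem has a matching fix: compute the feature as of the prediction cutoff, using only events on or before that date. The sketch below is one way to do that in plain Python; the login dates, the cutoff, and the helper function are hypothetical.</p>

```python
from datetime import date

# Hypothetical login history for one user.
login_dates = [
    date(2024, 1, 5),
    date(2024, 1, 20),
    date(2024, 2, 2),
    date(2024, 3, 15),  # happens after the cutoff below, so it must be ignored
]


def days_since_last_login(logins, as_of):
    """Days between `as_of` and the latest login on or before `as_of`.

    Returns None when the user has no login up to the cutoff.
    """
    past_logins = [d for d in logins if as_of >= d]
    if not past_logins:
        return None
    return (as_of - max(past_logins)).days


cutoff = date(2024, 2, 10)  # the moment the churn prediction is made
feature_value = days_since_last_login(login_dates, cutoff)
print(feature_value)
```

<p>The label would then be defined over a window that starts after the cutoff, so the feature and the target never overlap in time.</p>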
</section>
<section id="prevention-checklist" class="level3">
<h3 class="anchored" data-anchor-id="prevention-checklist">Prevention Checklist</h3>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TD
    A["Prevention Strategy"] --&gt; B["1. Split FIRST&lt;br/&gt;before any preprocessing"]
    B --&gt; C["2. Validate feature availability&lt;br/&gt;'Would I have this at inference time?'"]
    C --&gt; D["3. Use time-based splits&lt;br/&gt;for temporal data"]
    D --&gt; E["4. Group by entity&lt;br/&gt;(user, patient, store)"]
    E --&gt; F["5. Sanity check&lt;br/&gt;'Is accuracy suspiciously high?'"]
    F --&gt; G["6. Test with shuffled target&lt;br/&gt;(should give ~50% accuracy)"]

    style A fill:#56cc9d,stroke:#333,color:#000
    style B fill:#56cc9d,stroke:#333,color:#000
    style C fill:#56cc9d,stroke:#333,color:#000
    style D fill:#56cc9d,stroke:#333,color:#000
    style E fill:#56cc9d,stroke:#333,color:#000 
    style F fill:#ffce67,stroke:#333,color:#000
    style G fill:#ff7851,stroke:#333,color:#000
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
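<p>The shuffled-target check in step 6 can be sketched as follows. This is a minimal example on synthetic, balanced binary data, so chance level here is about 50%; on imbalanced data the chance-level score is the majority-class rate instead. The dataset parameters are arbitrary choices for illustration.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic balanced binary problem with a clear linear signal.
X, y = make_classification(
    n_samples=400,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)

model = LogisticRegression(max_iter=1000)

# Score with the real labels.
real_score = cross_val_score(model, X, y, cv=5).mean()

# Score after breaking the link between features and labels.
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)
shuffled_score = cross_val_score(model, X, y_shuffled, cv=5).mean()

print(f"real labels:     {real_score:.3f}")
print(f"shuffled labels: {shuffled_score:.3f}")
```

<p>If the shuffled score stays well above chance, something in the evaluation setup is carrying label information, for example duplicated rows across folds or an identifier column that encodes the row. scikit-learn also provides <code>permutation_test_score</code>, which repeats this shuffle many times.</p>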
</section>
<section id="red-flags-that-suggest-leakage" class="level3">
<h3 class="anchored" data-anchor-id="red-flags-that-suggest-leakage">Red Flags That Suggest Leakage</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 36%">
<col style="width: 63%">
</colgroup>
<thead>
<tr class="header">
<th>Signal</th>
<th>What to check</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Accuracy &gt; 95% on first attempt</td>
<td>Too good to be true — inspect features</td>
</tr>
<tr class="even">
<td>Single feature dominates importance</td>
<td>May be a proxy for the target</td>
</tr>
<tr class="odd">
<td>Train and test scores are nearly identical</td>
<td>Test information may be leaking into training</td>
</tr>
<tr class="even">
<td>Performance drops dramatically in production</td>
<td>Classic leakage symptom</td>
</tr>
<tr class="odd">
<td>Cross-validation scores are unstable</td>
<td>Leakage may be present in only some folds</td>
</tr>
</tbody>
</table>
</section>
<section id="application-9" class="level3">
<h3 class="anchored" data-anchor-id="application-9">Application</h3>
<ul>
<li><strong>Time series:</strong> Always use forward-chaining (train on past, predict future). Never shuffle temporal data.</li>
<li><strong>Medical studies:</strong> Ensure no patient appears in both train and test sets.</li>
<li><strong>Feature stores:</strong> Implement point-in-time correctness — features computed using only data available at prediction time.</li>
<li><strong>ML pipelines:</strong> Use scikit-learn's <code>Pipeline</code> to bundle preprocessing and the model, so that transforms are fit only on the training folds during cross-validation.</li>
</ul>
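<p>The time-series and pipeline points above can be combined in one sketch. It assumes rows are already sorted by time and uses synthetic data; the step names <code>"scale"</code> and <code>"clf"</code> are arbitrary labels.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, time-ordered data: row i is observed before row i + 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# The scaler lives inside the pipeline, so it is refit on each training fold
# and never sees that fold's test rows.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Forward-chaining folds: each test block comes strictly after its training block.
cv = TimeSeriesSplit(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv=cv)

folds_are_forward = all(
    test_idx.min() > train_idx.max() for train_idx, test_idx in cv.split(X)
)

print("fold scores:", np.round(scores, 3))
print("every fold trains on the past only:", folds_are_forward)
```

<p>Because <code>cross_val_score</code> clones and refits the whole pipeline per fold, the split-then-preprocess rule is enforced without any manual bookkeeping.</p>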
<hr>
</section>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 27%">
<col style="width: 35%">
<col style="width: 37%">
</colgroup>
<thead>
<tr class="header">
<th>Question</th>
<th>Core Concept</th>
<th>Key Takeaway</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Q1</td>
<td>Precision/Recall/F1</td>
<td>Choose metrics based on error costs, not defaults</td>
</tr>
<tr class="even">
<td>Q2</td>
<td>ROC-AUC</td>
<td>Good for ranking; use PR-AUC for imbalanced data</td>
</tr>
<tr class="odd">
<td>Q3</td>
<td>Imbalanced data</td>
<td>Fix metrics first, then weights, then resampling</td>
</tr>
<tr class="even">
<td>Q4</td>
<td>Feature engineering</td>
<td>Better features beat better models — invest here</td>
</tr>
<tr class="odd">
<td>Q5</td>
<td>Curse of dimensionality</td>
<td>High dimensions break distance; reduce or regularize</td>
</tr>
<tr class="even">
<td>Q6</td>
<td>PCA</td>
<td>Find maximum-variance directions; scale first</td>
</tr>
<tr class="odd">
<td>Q7</td>
<td>Generative vs.&nbsp;Discriminative</td>
<td>Discriminative for classification; generative for creation</td>
</tr>
<tr class="even">
<td>Q8</td>
<td>Gradient Boosting/XGBoost</td>
<td>Sequential error correction; king of tabular data</td>
</tr>
<tr class="odd">
<td>Q9</td>
<td>Missing data</td>
<td>Understand WHY it’s missing before choosing how to fix</td>
</tr>
<tr class="even">
<td>Q10</td>
<td>Data leakage</td>
<td>Split first; validate feature availability at inference time</td>
</tr>
</tbody>
</table>
<blockquote class="blockquote">
<p><strong>Previous:</strong> <a href="../../posts/ml-interview/ML-Interview-QA-1.html">ML Interview QA - 1</a> covers learning paradigms, bias-variance, overfitting, regularization, gradient descent, cross-validation, logistic regression, decision trees, Random Forest, and bagging vs.&nbsp;boosting.</p>
</blockquote>


</section>

 ]]></description>
  <guid>https://vectoringai.com/posts/ml-interview/ML-Interview-QA-2.html</guid>
  <pubDate>Sun, 17 May 2026 00:00:00 GMT</pubDate>
  <media:content url="https://vectoringai.com/images/ml-interview/thumb_ML_interview_qa_300.png" medium="image" type="image/png" height="96" width="144"/>
</item>
</channel>
</rss>
