Vectoring AI

LLM Interview QA - 1

Vectoring AI — Wed, 20 May 2026 00:00:00 GMT

Introduction

This is Part 1 of our LLM Interview QA series. It covers 10 foundational questions that appear in nearly every LLM Engineer, AI Engineer, and Applied ML interview — from startups to FAANG. Each answer goes beyond surface-level definitions with diagrams, concrete examples, and real-world applications.

This series complements our ML Interview series. For foundational machine learning concepts, see ML Interview QA - 1. For evaluation metrics and feature engineering, see ML Interview QA - 2.

Q1: What is the Transformer architecture and why did it replace RNNs/LSTMs?

Answer:

The Transformer is a neural network architecture introduced in the 2017 paper “Attention Is All You Need”. It relies entirely on self-attention mechanisms instead of recurrence or convolution to model dependencies in sequences.

graph LR
    subgraph Transformer["Transformer Architecture"]
        direction TB
        INPUT["Input Embeddings 
+ Positional Encoding"]
        ENC["Encoder Stack (N layers)"]
        DEC["Decoder Stack (N layers)"]
        OUTPUT["Output Probabilities"]

        INPUT --> ENC
        ENC --> DEC
        DEC --> OUTPUT
    end

    subgraph Encoder_Layer["Each Encoder Layer"]
        SA["Multi-Head Self-Attention"]
        FFN["Feed-Forward Network"]
        LN1["Layer Norm + Residual"]
        LN2["Layer Norm + Residual"]

        SA --> LN1 --> FFN --> LN2
    end

    subgraph Decoder_Layer["Each Decoder Layer"]
        MSA["Masked Multi-Head Self-Attention"]
        CA["Cross-Attention (to Encoder)"]
        FFN2["Feed-Forward Network"]

        MSA --> CA --> FFN2
    end

    style Transformer fill:#56cc9d,stroke:#333,color:#fff
    style Encoder_Layer fill:#6cc3d5,stroke:#333,color:#fff
    style Decoder_Layer fill:#ffce67,stroke:#333

Why Transformers replaced RNNs/LSTMs

Aspect	RNN/LSTM	Transformer
Parallelization	Sequential (word by word)	Fully parallel
Long-range dependencies	Struggles (vanishing gradient)	Handles via attention
Training speed	Slow	Much faster on GPUs
Context window	Limited by hidden state	Limited by memory (can be very large)
Positional info	Implicit in sequence order	Explicit positional encoding

Key Insight

RNNs process tokens sequentially — to understand the relationship between the first and last word in a sentence, information must pass through every intermediate hidden state. Transformers compute attention scores between all pairs of tokens simultaneously, making them vastly more efficient and effective at capturing long-range dependencies.

Q2: How does the Self-Attention mechanism work?

Answer:

Self-attention allows each token in a sequence to attend to every other token, computing a weighted sum of their representations based on relevance.

The QKV Framework

For each input token, three vectors are computed:

Query (Q): “What am I looking for?”
Key (K): “What do I contain?”
Value (V): “What information do I provide?”

The attention score is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors (the scaling factor prevents dot products from growing too large).

graph LR
    subgraph Self_Attention["Self-Attention Computation"]
        I["Input Embeddings"] --> Q["Q = X · W_Q"]
        I --> K["K = X · W_K"]
        I --> V["V = X · W_V"]
        Q --> DOT["Q · K^T"]
        K --> DOT
        DOT --> SCALE["÷ √d_k"]
        SCALE --> SOFT["Softmax"]
        SOFT --> MUL["× V"]
        V --> MUL
        MUL --> OUT["Output"]
    end

    style Self_Attention fill:#6cc3d5,stroke:#333,color:#fff
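
The computation in the diagram can be sketched in a few lines of NumPy. This is a minimal single-head, single-sequence version; the toy sizes and random weight matrices are assumptions for illustration, and real implementations add batching, masking, and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4           # toy sizes chosen for illustration
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Row `i` of `weights` is the distribution of token `i`'s attention over all tokens, and row `i` of `out` is the corresponding weighted sum of value vectors.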

Multi-Head Attention

Instead of one attention function, Transformers run multiple attention heads in parallel (e.g., 8 or 16 heads). Each head learns different relationships:

One head might learn syntactic relationships (subject-verb)
Another might learn coreference (pronouns to their antecedents)
Another might learn positional proximity

The outputs of all heads are concatenated and linearly projected.
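
As a rough sketch of "split into heads, attend, concatenate, project", the following NumPy function runs all heads in one batched matrix multiply. The dimensions and random weights are illustrative assumptions, and `d_model` is assumed to be divisible by `num_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (heads, seq, d_head)
    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Each head works on its own `d_head`-sized slice of the projections, so the total cost is similar to one full-width head while allowing several attention patterns at once.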

Example

For the sentence: “The cat sat on the mat because it was tired”

The self-attention mechanism helps the model understand that “it” refers to “the cat” — the attention weight between “it” and “cat” will be high, while the weight between “it” and “mat” will be lower.

Q3: What is tokenization and what are the main tokenization strategies used in LLMs?

Answer:

Tokenization is the process of splitting text into smaller units (tokens) that the model can process. Tokens are the fundamental input units for LLMs.

Main Tokenization Strategies

Strategy	Description	Example (“unhappiness”)	Used By
Word-level	Split by spaces/punctuation	[“unhappiness”]	Early models
Character-level	Each character is a token	[“u”,“n”,“h”,“a”,“p”,“p”,“i”,“n”,“e”,“s”,“s”]	Some small models
BPE (Byte Pair Encoding)	Iteratively merge frequent character pairs	[“un”, “happiness”]	GPT-2, GPT-3, GPT-4
WordPiece	Like BPE but maximizes likelihood	[“un”, “##happiness”]	BERT
SentencePiece/Unigram	Probabilistic subword model	[“▁un”, “happi”, “ness”]	T5 (Unigram); LLaMA 1/2 (SentencePiece BPE)
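
To make the BPE row concrete, here is a toy version of the merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, repeat. It is a character-level sketch on a made-up corpus, not the byte-level tokenizer that production GPT models use.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    # Start from characters: each word is a tuple of single-character symbols.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, vocab = learn_bpe("low low low lower lowest newer newer", num_merges=3)
```

On this corpus the first two merges build the frequent word "low" out of its characters, which is the mechanism that lets common words become single tokens while rare words stay split.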

graph TD
    TEXT["Raw Text: 'The cats are playing'"]
    TEXT --> WL["Word-level: ['The', 'cats', 'are', 'playing']"]
    TEXT --> BPE["BPE: ['The', ' c', 'ats', ' are', ' play', 'ing']"]
    TEXT --> WP["WordPiece: ['The', 'cats', 'are', 'play', '##ing']"]

    style TEXT fill:#56cc9d,stroke:#333,color:#fff
    style BPE fill:#6cc3d5,stroke:#333,color:#fff
    style WP fill:#ffce67,stroke:#333

Why Subword Tokenization?

Handles unknown words: Can represent any word by breaking it into known subwords
Efficient vocabulary: Balances vocabulary size with sequence length
Morphological awareness: Captures meaningful parts (prefixes, suffixes, roots)

Practical Considerations

1 token ≈ 4 characters (English) or ≈ 0.75 words
Vocabulary sizes: GPT-4 uses ~100k tokens, LLaMA 1/2 use ~32k tokens (LLaMA 3 uses ~128k)
Non-English languages and code often require more tokens per word

Q4: What is the difference between Encoder-only, Decoder-only, and Encoder-Decoder models?

Answer:

graph LR
    subgraph EO["Encoder-Only"]
        EO1["Bidirectional attention"]
        EO2["Sees all tokens at once"]
        EO3["Best for: Understanding"]
        EO4["Examples: BERT, RoBERTa"]
    end

    subgraph DO["Decoder-Only"]
        DO1["Causal (left-to-right) attention"]
        DO2["Each token sees only prior tokens"]
        DO3["Best for: Generation"]
        DO4["Examples: GPT-4, LLaMA, Claude"]
    end

    subgraph ED["Encoder-Decoder"]
        ED1["Encoder: bidirectional"]
        ED2["Decoder: causal + cross-attention"]
        ED3["Best for: Seq-to-Seq tasks"]
        ED4["Examples: T5, BART, Flan-T5"]
    end

    style EO fill:#56cc9d,stroke:#333,color:#fff
    style DO fill:#6cc3d5,stroke:#333,color:#fff
    style ED fill:#ffce67,stroke:#333

Detailed Comparison

Aspect	Encoder-Only	Decoder-Only	Encoder-Decoder
Attention	Bidirectional	Causal (masked)	Both
Pre-training	Masked Language Modeling	Next Token Prediction	Span corruption / denoising
Strengths	Classification, NER, embeddings	Text generation, reasoning	Translation, summarization
Context	Full input visibility	Only left context	Full input → sequential output
Scaling trend	Less common at scale	Dominant paradigm (GPT-4, Claude)	Used for specific tasks (T5)
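
The difference between bidirectional and causal attention comes down to a mask applied before the softmax. A small sketch, using uniform toy scores so the effect of the mask is easy to see:

```python
import numpy as np

def attention_mask(seq_len, causal):
    """Boolean mask: entry [i, j] is True when token i may attend to token j."""
    full = np.ones((seq_len, seq_len), dtype=bool)
    return np.tril(full) if causal else full

def masked_softmax(scores, mask):
    # Disallowed positions get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))   # uniform toy scores
encoder_w = masked_softmax(scores, attention_mask(4, causal=False))
decoder_w = masked_softmax(scores, attention_mask(4, causal=True))
```

With the bidirectional mask every token spreads its attention over all four positions; with the causal mask the first token can only attend to itself and the upper triangle of the weight matrix is zero.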

Why Decoder-Only Dominates Today

Most modern LLMs (GPT-4, Claude, LLaMA, Gemini) are decoder-only because:

Simplicity: One unified architecture for all tasks
Scalability: Easier to scale with more parameters
Generality: Can handle classification, generation, and reasoning via prompting
Emergent abilities: Larger decoder-only models exhibit chain-of-thought reasoning

Q5: What is fine-tuning and what are the main approaches for adapting LLMs?

Answer:

Fine-tuning is the process of further training a pre-trained LLM on a specific dataset or task to customize its behavior.

graph TD
    PT["Pre-trained LLM
(trained on internet-scale data)"]
    PT --> FFT["Full Fine-Tuning
Update ALL parameters"]
    PT --> PEFT["Parameter-Efficient Fine-Tuning
Update FEW parameters"]
    PT --> RLHF_node["RLHF / Alignment
Human preference training"]

    PEFT --> LORA["LoRA"]
    PEFT --> PREFIX["Prefix Tuning"]
    PEFT --> ADAPTER["Adapters"]
    PEFT --> QLORA["QLoRA"]

    style PT fill:#56cc9d,stroke:#333,color:#fff
    style FFT fill:#ff7851,stroke:#333,color:#fff
    style PEFT fill:#6cc3d5,stroke:#333,color:#fff
    style RLHF_node fill:#ffce67,stroke:#333

Fine-Tuning Approaches

Approach	What it does	Parameters Updated	Cost
Full Fine-Tuning	Updates all model weights	100%	Very high (multiple GPUs)
LoRA	Adds low-rank matrices to attention layers	~0.1-1%	Low
QLoRA	LoRA + 4-bit quantization	~0.1-1%	Very low
Prefix Tuning	Prepends trainable vectors to inputs	<1%	Low
Adapters	Inserts small trainable layers	~1-5%	Low

LoRA (Low-Rank Adaptation) — Most Popular

LoRA freezes the pre-trained weights and injects trainable low-rank decomposition matrices:

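$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.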
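
A minimal NumPy sketch of a LoRA-adapted linear layer: the frozen weight `W0` has shape `d × k`, and the trainable update is the product of `B` (`d × r`) and `A` (`r × k`). The `alpha / r` scaling and the zero initialization of `B` follow common LoRA practice but are not stated in the text above; the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                  # toy sizes; r is the LoRA rank
alpha = 8                            # scaling hyperparameter (assumed value)

W0 = rng.normal(size=(d, k))         # frozen pre-trained weight
B = np.zeros((d, r))                 # trainable, initialized to zero
A = rng.normal(size=(r, k)) * 0.01   # trainable, small random init

def lora_forward(x, W0, B, A, alpha, r):
    # Base path plus the scaled low-rank update; W0 itself is never modified.
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
y = lora_forward(x, W0, B, A, alpha, r)

full_params = d * k                  # parameters updated by full fine-tuning
lora_params = r * (d + k)            # parameters updated by LoRA
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the pre-trained one. Here LoRA trains 512 parameters instead of 4,096, and the ratio shrinks further as `d` and `k` grow.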

When to Use What

Prompt engineering first — no training needed, quick iteration
LoRA/QLoRA — when you need task-specific behavior with limited compute
Full fine-tuning — when you have large datasets and significant compute budget
RLHF — when aligning model outputs with human preferences

Q6: What is RLHF (Reinforcement Learning from Human Feedback) and how does it work?

Answer:

RLHF is a training technique that aligns LLM outputs with human preferences. It’s the key process that makes models like ChatGPT helpful, harmless, and honest.

graph TD
    subgraph Step1["Step 1: Supervised Fine-Tuning (SFT)"]
        SFT1["Pre-trained LLM"]
        SFT2["Human-written demonstrations"]
        SFT3["Fine-tuned model (SFT model)"]
        SFT1 --> SFT2 --> SFT3
    end

    subgraph Step2["Step 2: Reward Model Training"]
        RM1["SFT model generates multiple responses"]
        RM2["Humans rank responses by quality"]
        RM3["Train reward model on rankings"]
        RM1 --> RM2 --> RM3
    end

    subgraph Step3["Step 3: PPO Optimization"]
        PPO1["SFT model generates response"]
        PPO2["Reward model scores it"]
        PPO3["PPO updates policy to maximize reward"]
        PPO4["KL penalty prevents drift from SFT"]
        PPO1 --> PPO2 --> PPO3
        PPO3 --> PPO4
    end

    Step1 --> Step2 --> Step3

    style Step1 fill:#56cc9d,stroke:#333,color:#fff
    style Step2 fill:#6cc3d5,stroke:#333,color:#fff
    style Step3 fill:#ffce67,stroke:#333

The Three Steps

SFT (Supervised Fine-Tuning): Train the base model on high-quality human-written responses
Reward Model: Train a separate model to score responses based on human preference rankings
RL Optimization (PPO): Use the reward model as a signal to optimize the LLM’s outputs
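
Two pieces of this pipeline reduce to short formulas. The sketch below shows a pairwise preference loss commonly used to train reward models (a Bradley-Terry style objective) and a KL-penalized reward of the kind used during the PPO step. The function names, `beta`, and the use of a per-response log-probability difference as the KL estimate are assumptions for illustration.

```python
import math

def pairwise_preference_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = r_chosen - r_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def kl_shaped_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the SFT model."""
    return reward - beta * (logp_policy - logp_sft)

good_ranking = pairwise_preference_loss(r_chosen=2.0, r_rejected=0.0)
bad_ranking = pairwise_preference_loss(r_chosen=0.0, r_rejected=2.0)
shaped = kl_shaped_reward(reward=1.0, logp_policy=-0.5, logp_sft=-1.0)
```

The loss is small when the reward model scores the human-preferred response higher, and the shaped reward drops as the policy assigns its output a much higher log-probability than the SFT model would.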

Alternatives to RLHF

Method	Approach	Advantage
DPO (Direct Preference Optimization)	Directly optimize from preferences without a reward model	Simpler, more stable training
RLAIF	Use AI feedback instead of human feedback	Cheaper, more scalable
Constitutional AI	Self-critique against a set of principles	Less human annotation needed
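
The DPO row can be made concrete with its per-pair loss, which needs no separate reward model. This sketch assumes the summed log-probabilities of each response under the policy and under a frozen reference model are already available; `beta` and the example numbers are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = logp_chosen - ref_logp_chosen        # how much the policy upweights the preferred response
    rejected_ratio = logp_rejected - ref_logp_rejected  # how much it upweights the rejected one
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))

unchanged = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to the reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)     # policy shifted toward the preferred response
```

The loss starts at `log 2` when the policy equals the reference and falls as the policy raises the preferred response relative to the rejected one.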

Why RLHF Matters

Without RLHF, base LLMs tend to:

Continue text rather than answer questions
Generate toxic, biased, or harmful content
Hallucinate confidently
Ignore user instructions

Q7: What are hallucinations in LLMs and how can they be mitigated?

Answer:

Hallucinations are confident-sounding outputs that are factually incorrect, nonsensical, or unfaithful to the provided context. They are one of the biggest challenges in deploying LLMs.

Types of Hallucinations

graph TD
    H["LLM Hallucinations"]
    H --> INT["Intrinsic Hallucination
Contradicts the source input"]
    H --> EXT["Extrinsic Hallucination
Cannot be verified from source"]

    INT --> INT_EX["Example: Summary says 'John went to Paris'
when source says 'John went to London'"]
    EXT --> EXT_EX["Example: Model adds details
not present in any source"]

    style H fill:#ff7851,stroke:#333,color:#fff
    style INT fill:#ffce67,stroke:#333
    style EXT fill:#6cc3d5,stroke:#333,color:#fff

Causes

Cause	Explanation
Training data noise	Incorrect or contradictory information in pre-training corpus
Knowledge cutoff	Model generates outdated information
Pattern completion	Model prioritizes fluency over accuracy
Exposure bias	Errors compound during autoregressive generation
Lack of grounding	No mechanism to verify claims against facts

Mitigation Strategies

Strategy	How it helps
RAG (Retrieval-Augmented Generation)	Grounds responses in retrieved documents
Chain-of-thought prompting	Forces step-by-step reasoning, reduces logical errors
Temperature reduction	Lowers randomness, picks more likely tokens
Self-consistency	Generate multiple answers, pick the most common
Constrained decoding	Restrict outputs to valid formats
Citation requirements	Force model to cite sources
Fine-tuning on verified data	Teach the model to say “I don’t know”

Real-World Impact

Hallucinations are critical in high-stakes applications (legal, medical, financial). Production LLM systems almost always use RAG or other grounding techniques to minimize hallucinations.

Q8: What is Retrieval-Augmented Generation (RAG) and why is it important?

Answer:

RAG combines a retrieval system with a generative LLM to ground responses in external knowledge, reducing hallucinations and enabling access to up-to-date or domain-specific information.

graph LR
    Q["User Query"]
    Q --> EMB["Embed Query"]
    EMB --> SEARCH["Vector Search
(retrieve top-k documents)"]
    DB["Document Store
(vector database)"] --> SEARCH
    SEARCH --> CONTEXT["Retrieved Context"]
    CONTEXT --> PROMPT["Augmented Prompt
(query + context)"]
    Q --> PROMPT
    PROMPT --> LLM["LLM generates answer"]
    LLM --> ANS["Grounded Response"]

    style Q fill:#56cc9d,stroke:#333,color:#fff
    style SEARCH fill:#6cc3d5,stroke:#333,color:#fff
    style LLM fill:#ffce67,stroke:#333
    style ANS fill:#56cc9d,stroke:#333,color:#fff

RAG Pipeline Components

Component	Purpose	Common Tools
Document Loader	Ingest documents (PDF, web, DB)	LangChain, LlamaIndex
Chunking	Split documents into manageable pieces	Recursive, semantic splitting
Embedding Model	Convert text to dense vectors	OpenAI ada-002, BGE, E5
Vector Store	Store and search embeddings	Pinecone, Weaviate, ChromaDB, FAISS
Retriever	Find relevant chunks for a query	Similarity search, hybrid search
Generator (LLM)	Produce final answer from context	GPT-4, Claude, LLaMA
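
A stripped-down sketch of the retrieve-then-augment flow, using only the standard library. Word overlap stands in for embedding-based vector search, and the chunks, query, and prompt wording are made up for illustration; the final LLM call is left out.

```python
def tokenize(text):
    for ch in "?.,":
        text = text.replace(ch, " ")
    return set(text.lower().split())

def retrieve(query, chunks, top_k=2):
    """Rank chunks by word overlap with the query (stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_chunks):
    """Combine the retrieved context and the user query into one prompt."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    "The refund window is 30 days from the delivery date.",
    "Support is available Monday through Friday.",
    "Refunds are issued to the original payment method.",
]
query = "How many days is the refund window?"
top = retrieve(query, chunks, top_k=2)
prompt = build_prompt(query, top)
```

The numbered context entries are what make source citation possible: the model can refer to `[1]` and the application can map that back to a document.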

RAG vs. Fine-Tuning

Aspect	RAG	Fine-Tuning
Knowledge update	Instant (update document store)	Requires retraining
Cost	Lower (no GPU training)	Higher (compute for training)
Hallucination	Reduced (grounded in docs)	Can still hallucinate
Use case	Dynamic knowledge, Q&A	Style/behavior change
Transparency	Can cite sources	Black-box

When to Use RAG

Knowledge changes frequently (news, documentation)
Need verifiable, source-cited answers
Domain-specific knowledge not in pre-training data
Legal/compliance requirements for traceability

Q9: What is prompt engineering and what are the key techniques?

Answer:

Prompt engineering is the practice of designing inputs to LLMs to elicit desired outputs without modifying model weights. It’s the most accessible and cost-effective way to control LLM behavior.

Key Prompting Techniques

graph TD
    PE["Prompt Engineering Techniques"]
    PE --> ZS["Zero-Shot
'Classify this review as positive/negative'"]
    PE --> FS["Few-Shot
'Here are 3 examples, now do this one'"]
    PE --> COT["Chain-of-Thought
'Think step by step'"]
    PE --> SC["Self-Consistency
Sample multiple CoT paths, majority vote"]
    PE --> TOT["Tree-of-Thought
Explore multiple reasoning branches"]
    PE --> ROLE["Role Prompting
'You are an expert data scientist...'"]

    style PE fill:#56cc9d,stroke:#333,color:#fff
    style COT fill:#6cc3d5,stroke:#333,color:#fff
    style SC fill:#ffce67,stroke:#333

Comparison of Techniques

Technique	When to Use	Performance Boost
Zero-shot	Simple tasks, large models	Baseline
Few-shot	Need format guidance, smaller models	+10-30% on structured tasks
Chain-of-thought	Reasoning, math, logic	+20-50% on reasoning tasks
Self-consistency	High-accuracy requirements	+5-15% over single CoT
Tree-of-thought	Complex multi-step problems	Best for planning/search
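
Self-consistency is simple to express in code: sample several answers and take a majority vote. In this sketch `sample_answer` is a hypothetical callable standing in for an LLM sampled at non-zero temperature, and the canned answers are made up.

```python
from collections import Counter

def self_consistency(sample_answer, question, n_samples=5):
    """Sample several answers and return the most common one with its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples

# Hypothetical stand-in for a sampled LLM: returns canned final answers in order.
canned = iter(["42", "41", "42", "42", "40"])
answer, share = self_consistency(lambda q: next(canned), "What is 6 * 7?")
```

The vote share doubles as a rough confidence signal: a low share suggests the reasoning paths disagree and the answer deserves a second look.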

System Prompt Best Practices

Be specific: “Extract the person’s name, company, and role” > “Extract information”
Define format: Specify JSON, markdown, or other output structures
Set constraints: “Answer only based on the provided context”
Provide examples: Show input-output pairs for complex tasks
Assign a role: “You are a senior Python developer reviewing code”

Temperature and Sampling Parameters

Parameter	Effect	Use Case
Temperature (0-2)	Controls randomness. Lower = deterministic	0 for factual, 0.7-1.0 for creative
Top-p (nucleus sampling)	Considers tokens within cumulative probability p	0.9 for balanced generation
Top-k	Considers only top k most likely tokens	Limits vocabulary for generation
Frequency penalty	Reduces repetition	Longer outputs without loops
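
The first three parameters in the table can be sketched as one sampling function over raw logits. This is an illustrative NumPy version, not any provider's exact implementation; treating `temperature=0` as greedy decoding is a common convention assumed here.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))           # greedy decoding
    logits = logits / temperature               # <1 sharpens, >1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(-probs)                  # token ids, most likely first
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False             # drop everything past the top k
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                        # renormalize over surviving tokens
    return int(rng.choice(len(probs), p=probs))

logits = [1.0, 3.0, 2.0]
greedy = sample_token(logits, temperature=0)
nucleus = sample_token(logits, top_p=0.5, rng=np.random.default_rng(0))
```

With these logits the most likely token already holds about two thirds of the probability mass, so `top_p=0.5` keeps only that token and sampling becomes deterministic.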

Q10: What are the key challenges and considerations when deploying LLMs in production?

Answer:

Deploying LLMs in production involves challenges beyond model accuracy — including latency, cost, safety, and reliability.

graph TD
    PROD["LLM Production Challenges"]
    PROD --> PERF["Performance"]
    PROD --> COST["Cost"]
    PROD --> SAFETY["Safety & Guardrails"]
    PROD --> EVAL["Evaluation"]
    PROD --> OPS["Operations"]

    PERF --> P1["Latency (TTFT, TPS)"]
    PERF --> P2["Throughput"]
    PERF --> P3["Context window limits"]

    COST --> C1["Token costs"]
    COST --> C2["Infrastructure"]
    COST --> C3["Caching strategies"]

    SAFETY --> S1["Content filtering"]
    SAFETY --> S2["PII detection"]
    SAFETY --> S3["Prompt injection defense"]

    EVAL --> E1["Automated metrics"]
    EVAL --> E2["Human evaluation"]
    EVAL --> E3["A/B testing"]

    OPS --> O1["Monitoring & observability"]
    OPS --> O2["Version management"]
    OPS --> O3["Fallback strategies"]

    style PROD fill:#56cc9d,stroke:#333,color:#fff
    style PERF fill:#6cc3d5,stroke:#333,color:#fff
    style SAFETY fill:#ff7851,stroke:#333,color:#fff
    style COST fill:#ffce67,stroke:#333

Key Production Patterns

Pattern	Purpose	Implementation
Caching	Reduce cost & latency	Semantic cache (similar queries), exact cache
Streaming	Improve perceived latency	Server-sent events, token-by-token delivery
Guardrails	Prevent harmful outputs	Input/output validators, content filters
Fallbacks	Handle failures gracefully	Model cascading, rule-based backup
Rate limiting	Manage costs and abuse	Token budgets, per-user limits
Observability	Monitor quality over time	Log prompts/responses, track metrics

Optimization Techniques

Technique	Benefit
Quantization (4-bit, 8-bit)	2-4x memory reduction with minimal quality loss
KV-cache optimization	Faster inference for long contexts
Speculative decoding	2-3x speed improvement
Model distillation	Smaller, faster models that mimic larger ones
Prompt compression	Reduce token count while preserving meaning
Batching	Higher throughput for concurrent requests

Evaluation in Production

Automated: BLEU, ROUGE, BERTScore for generation quality
LLM-as-Judge: Use a stronger model to evaluate outputs
Human feedback: Thumbs up/down, preference ratings
Task-specific: Accuracy, F1, faithfulness scores
Safety: Toxicity rates, refusal rates, PII leakage

Security Considerations

Prompt injection: Adversarial inputs that override system instructions
Data leakage: Model revealing training data or system prompts
PII exposure: Generating or storing personally identifiable information
Jailbreaking: Users bypassing safety guardrails

Summary Table

#	Topic	Key Concept
1	Transformer Architecture	Self-attention replaces recurrence for parallel, long-range processing
2	Self-Attention	QKV mechanism computes token relevance scores
3	Tokenization	Subword strategies (BPE, WordPiece) balance vocabulary and sequence length
4	Model Types	Encoder-only, decoder-only, encoder-decoder serve different tasks
5	Fine-Tuning	LoRA/QLoRA enable efficient adaptation with minimal parameters
6	RLHF	Three-step alignment: SFT → Reward Model → PPO
7	Hallucinations	Confident wrong outputs; mitigated by RAG, CoT, temperature
8	RAG	Retrieval + generation for grounded, up-to-date responses
9	Prompt Engineering	Zero-shot, few-shot, CoT, and sampling parameters
10	Production Deployment	Latency, cost, safety, evaluation, and operational concerns

What’s Next?

This article covered the foundational LLM concepts most commonly tested in interviews. For deeper dives into specific topics:

ML fundamentals that underpin LLMs: ML Interview QA - 1
Evaluation metrics and data handling: ML Interview QA - 2

LLM Interview QA - 2

Vectoring AI — Wed, 20 May 2026 00:00:00 GMT

Introduction

This is Part 2 of our LLM Interview QA series. It covers 10 advanced questions on scaling, optimization, agents, and evaluation — the practical knowledge that separates candidates who use LLMs from those who engineer production LLM systems.

For foundational LLM concepts (transformers, attention, tokenization, RAG, RLHF), see LLM Interview QA - 1. For ML fundamentals, see ML Interview QA - 1 and ML Interview QA - 2.

Q1: What are scaling laws in LLMs and why do they matter?

Answer:

Scaling laws are empirical relationships describing how LLM performance improves predictably as you increase model size, dataset size, and compute budget.

The Chinchilla Scaling Law

The key finding (Hoffmann et al., 2022): for a given compute budget, model size and training tokens should be scaled in roughly equal proportion, which works out to about 20 training tokens per parameter.

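$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where:

$N$ = number of parameters
$D$ = number of training tokens
$L$ = loss (lower is better)
$E$ = irreducible loss (entropy of natural language)
$A$, $B$, $\alpha$, $\beta$ = constants fitted to training runs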

graph TD
    subgraph Scaling["Scaling Laws: Three Axes"]
        PARAMS["Model Parameters (N)
More params → lower loss"]
        DATA["Training Data (D)
More tokens → lower loss"]
        COMPUTE["Compute (C ≈ 6ND)
More FLOPs → lower loss"]
    end

    subgraph Tradeoff["Chinchilla Optimal"]
        OPT["For budget C:
Scale N and D equally
D ≈ 20 × N tokens"]
    end

    Scaling --> Tradeoff

    style Scaling fill:#56cc9d,stroke:#333,color:#fff
    style Tradeoff fill:#6cc3d5,stroke:#333,color:#fff

Practical Implications

Insight	Implication
Loss decreases as a power law	Returns diminish, but loss keeps falling toward the irreducible floor as data and compute grow
Chinchilla-optimal training	Many early LLMs were undertrained (GPT-3: 300B tokens for 175B params)
Compute-optimal models	LLaMA-70B outperforms GPT-3-175B by training on more tokens
Emergent abilities	Some capabilities appear only above certain scale thresholds
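
The two rules of thumb from the diagram, C ≈ 6ND and D ≈ 20N, are enough to estimate a compute-optimal model size. A small worked sketch, using the GPT-3 figures quoted in the table; the function names are illustrative.

```python
import math

def training_flops(n_params, n_tokens):
    """Approximate training compute: C = 6 * N * D."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

gpt3_flops = training_flops(175e9, 300e9)   # 175B params, 300B tokens
n_opt, d_opt = chinchilla_optimal(gpt3_flops)
print(f"{n_opt / 1e9:.0f}B params, {d_opt / 1e12:.2f}T tokens")
```

Under these approximations, GPT-3's compute budget would have been better spent on a model of roughly 51B parameters trained on about 1T tokens, which is the sense in which it was undertrained.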

Emergent Abilities

Certain capabilities appear suddenly at scale rather than improving gradually:

Chain-of-thought reasoning — emerges around 60-100B parameters
In-context learning — improves dramatically with scale
Multi-step math — requires large models to perform reliably

Q2: How do LLMs handle long context windows and what are the key challenges?

Answer:

The context window is the maximum number of tokens an LLM can process in a single forward pass. Modern models range from 4K to 1M+ tokens.

The Challenge: Quadratic Attention

Standard self-attention has $O(n^2)$ complexity in both time and memory with respect to sequence length $n$:

graph TD
    subgraph Problem["The Quadratic Problem"]
        S1["4K tokens → 16M attention entries"]
        S2["32K tokens → 1B attention entries"]
        S3["128K tokens → 16B attention entries"]
        S4["1M tokens → 1T attention entries"]
    end

    subgraph Solutions["Solutions"]
        SOL1["Efficient Attention
Flash Attention, Ring Attention"]
        SOL2["Position Extrapolation
RoPE, ALiBi, YaRN"]
        SOL3["Sparse Attention
Sliding window, dilated"]
        SOL4["Memory/Retrieval
Compress old context"]
    end

    Problem --> Solutions

    style Problem fill:#ff7851,stroke:#333,color:#fff
    style Solutions fill:#56cc9d,stroke:#333,color:#fff
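
The entry counts in the diagram are just the sequence length squared. A quick calculation, with the assumption that each score is stored in 2 bytes (FP16) and the matrix is materialized per head:

```python
def attention_entries(seq_len):
    """Number of entries in one seq_len x seq_len attention matrix."""
    return seq_len * seq_len

def attention_matrix_gib(seq_len, bytes_per_entry=2):
    return attention_entries(seq_len) * bytes_per_entry / 2**30

for n in (4_096, 32_768, 131_072, 1_048_576):
    print(f"{n:>9} tokens -> {attention_entries(n):.2e} entries, "
          f"{attention_matrix_gib(n):.2f} GiB per head in FP16")
```

This is the cost that techniques such as Flash Attention avoid paying in memory: they compute the same result in tiles without ever storing the full matrix.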

Key Techniques for Long Context

Technique	How it works	Used by
Flash Attention	Fused GPU kernels, tiled computation to reduce memory I/O	Most modern LLMs
RoPE (Rotary Position Embedding)	Encodes relative positions via rotation matrices	LLaMA, Mistral
ALiBi	Linear bias based on distance (no positional embedding)	BLOOM
Sliding Window Attention	Each token attends only to nearby tokens	Mistral
Ring Attention	Distributes sequence across multiple GPUs	Long-context training
YaRN	Extends RoPE to longer contexts via interpolation	Extended LLaMA
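
The RoPE row is easier to follow with a tiny example. The sketch below rotates consecutive pairs of a vector by position-dependent angles and checks the property that matters: the query-key dot product depends only on the relative offset between positions. The vectors are made up, and `base=10000.0` is the commonly used default, assumed here.

```python
import math

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of x by angles that grow with position."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]
score_near = dot(rope(q, 3), rope(k, 1))      # positions 3 and 1, offset 2
score_far = dot(rope(q, 103), rope(k, 101))   # same offset, shifted by 100
```

`score_near` and `score_far` agree up to floating-point error, which is why RoPE is described as encoding relative rather than absolute position.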

The “Lost in the Middle” Problem

Research shows LLMs tend to:

Remember information at the beginning and end of context well
Struggle with information in the middle of long contexts
Perform worse as relevant information is placed further from the query

Practical Considerations

Longer context ≠ better retrieval (RAG often outperforms naive long context)
KV cache memory grows linearly with sequence length
Inference cost increases with context length even if most tokens are irrelevant
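
The linear growth of the KV cache can be estimated directly. The formula below counts one key and one value vector per layer, per KV head, per token; the example configuration (32 layers, 32 KV heads, head dimension 128, FP16) is a hypothetical model chosen for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Memory for cached keys and values across all layers."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size

for n in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(n, n_layers=32, n_kv_heads=32, head_dim=128) / 2**30
    print(f"{n:>7} tokens -> {gib:.0f} GiB of KV cache")
```

For this configuration the cache costs 0.5 MiB per token, so a 128K-token context needs 64 GiB for a single sequence, often more than the weights themselves.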

Q3: What is model quantization and how does it enable LLM deployment?

Answer:

Quantization reduces the numerical precision of model weights (and sometimes activations) from high-precision formats (FP32/FP16) to lower-precision formats (INT8/INT4), dramatically reducing memory and improving inference speed.

graph LR
    subgraph Precision["Precision Formats"]
        FP32["FP32 (32-bit)
Full precision
4 bytes per param"]
        FP16["FP16/BF16 (16-bit)
Half precision
2 bytes per param"]
        INT8["INT8 (8-bit)
Quarter precision
1 byte per param"]
        INT4["INT4 (4-bit)
Eighth precision
0.5 bytes per param"]
    end

    FP32 -->|"2x compression"| FP16
    FP16 -->|"2x compression"| INT8
    INT8 -->|"2x compression"| INT4

    style FP32 fill:#ff7851,stroke:#333,color:#fff
    style FP16 fill:#ffce67,stroke:#333
    style INT8 fill:#6cc3d5,stroke:#333,color:#fff
    style INT4 fill:#56cc9d,stroke:#333,color:#fff

Memory Requirements Example (LLaMA-70B, weights only)

Precision	Memory Required	Hardware Needed
FP32	~280 GB	4× A100 80GB
FP16	~140 GB	2× A100 80GB
INT8	~70 GB	1× A100 80GB
INT4	~35 GB	1× A6000 48GB or 2× 24GB consumer GPUs
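
The memory column follows from parameters × bytes per parameter. A short calculation that reproduces it, counting weights only (activations and the KV cache need additional memory):

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for model weights in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(70e9, bits):.0f} GB")
```

In practice the hardware needs some headroom beyond these figures, which is why a 35 GB INT4 model does not fit on a single 24 GB card.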

Quantization Approaches

Method	Type	Description
PTQ (Post-Training Quantization)	After training	Quantize a trained model without retraining
QAT (Quantization-Aware Training)	During training	Simulate quantization during training for better accuracy
GPTQ	PTQ, weight-only	Layer-wise quantization using calibration data
AWQ (Activation-Aware)	PTQ, weight-only	Protects salient weights based on activation magnitude
GGUF (llama.cpp format)	PTQ, various bits	CPU-friendly format with mixed precision
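
The simplest form of post-training quantization is symmetric "absmax" rounding with one scale per tensor. The sketch below is that baseline, not GPTQ or AWQ, which add calibration data and per-group scales; the random weight matrix is a stand-in for a real layer.

```python
import numpy as np

def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1                   # 127 for INT8
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())
```

The INT8 tensor takes a quarter of the FP32 memory, and the round-trip error per weight is bounded by half the scale. A single outlier weight inflates the scale for the whole tensor, which is the problem that the activation-aware and group-wise methods in the table address.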

Quality vs. Compression Tradeoff

Quantization	Perplexity Impact	Speed Gain	Use Case
FP16 → INT8	Negligible (<0.1%)	1.5-2x	Production serving
FP16 → INT4	Small (1-3%)	2-3x	Edge deployment, personal use
FP16 → INT2	Significant (5-15%)	3-4x	Experimental only

Key Insight

4-bit quantization (GPTQ, AWQ) has become the standard for local LLM deployment because it offers near-lossless quality with 4x memory reduction — enabling 70B models to run on consumer hardware.

Q4: What are LLM Agents and how do they extend LLM capabilities?

Answer:

LLM Agents are systems where an LLM acts as a reasoning engine that can plan, use tools, and take actions to accomplish goals — going beyond simple text generation.

graph TD
    subgraph Agent["LLM Agent Architecture"]
        LLM_CORE["LLM (Brain)
Reasoning & Planning"]
        MEMORY["Memory
Short-term: conversation
Long-term: vector store"]
        TOOLS["Tools
Code execution, APIs,
Search, Databases"]
        PLANNING["Planning
Task decomposition,
Reflection, Replanning"]
    end

    USER["User Goal"] --> LLM_CORE
    LLM_CORE --> PLANNING
    PLANNING --> TOOLS
    TOOLS --> |"Observation"| LLM_CORE
    LLM_CORE --> MEMORY
    MEMORY --> LLM_CORE
    LLM_CORE --> RESULT["Final Result"]

    style Agent fill:#56cc9d,stroke:#333,color:#fff
    style USER fill:#6cc3d5,stroke:#333,color:#fff
    style RESULT fill:#ffce67,stroke:#333

Core Agent Patterns

Pattern	How it works	Example
ReAct	Reason → Act → Observe loop	“I need to search for X” → search → “I found Y, now I’ll…”
Plan-and-Execute	Create full plan first, then execute steps	Break complex task into subtasks
Reflection	Agent critiques its own output and improves	Self-check for errors before responding
Multi-Agent	Multiple specialized agents collaborate	Researcher + Coder + Reviewer
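
The ReAct pattern is essentially a loop with a step budget. In this sketch the "LLM" is a scripted function and the only tool is a dictionary lookup, both hypothetical stand-ins so the control flow can run without an API; a real agent would parse model output into the same kind of decision.

```python
def run_react(llm_step, tools, goal, max_steps=5):
    """Alternate between model decisions and tool calls until a final answer."""
    transcript = [f"Goal: {goal}"]
    for _ in range(max_steps):
        decision = llm_step(transcript)                           # Reason
        if decision["type"] == "final":
            return decision["answer"], transcript
        observation = tools[decision["tool"]](decision["input"])  # Act
        transcript.append(f"Action: {decision['tool']}({decision['input']!r})")
        transcript.append(f"Observation: {observation}")          # Observe
    return None, transcript                                       # step budget exhausted

def scripted_llm(transcript):
    """Hypothetical model: look up stock once, then answer from the observation."""
    last = transcript[-1]
    if not last.startswith("Observation: "):
        return {"type": "action", "tool": "stock_lookup", "input": "widget"}
    count = last.split(": ", 1)[1]
    return {"type": "final", "answer": f"There are {count} widgets in stock."}

tools = {"stock_lookup": lambda item: {"widget": 12}.get(item, 0)}
answer, transcript = run_react(scripted_llm, tools, "How many widgets are in stock?")
```

The `max_steps` cap is the minimal guard against an agent that keeps calling tools without ever producing a final answer.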

Tool Use

Tools transform LLMs from text generators into action-taking systems:

Tool Category	Examples	Why Needed
Code Execution	Python REPL, sandboxed environments	Precise computation, data analysis
Search	Web search, document retrieval	Access to current information
APIs	Weather, calendar, databases	Real-world interactions
File I/O	Read/write files, parse documents	Persistent data manipulation

Agent Frameworks

Framework	Approach	Strength
LangGraph	Graph-based state machines	Complex workflows, cycles
CrewAI	Role-based multi-agent	Team collaboration metaphor
AutoGen	Conversational agents	Multi-agent conversation
OpenAI Assistants	Managed agent platform	Easy deployment

Challenges

Reliability: Agents can go off-track or loop infinitely
Cost: Multiple LLM calls per task (tool reasoning is expensive)
Safety: Autonomous actions need guardrails
Evaluation: Hard to benchmark open-ended agent behavior

Q5: How do you evaluate LLM performance and what metrics are used?

Answer:

LLM evaluation is uniquely challenging because outputs are open-ended text. Different tasks require different evaluation approaches.

graph TD
    EVAL["LLM Evaluation"]
    EVAL --> AUTO["Automated Metrics"]
    EVAL --> HUMAN["Human Evaluation"]
    EVAL --> LLM_JUDGE["LLM-as-Judge"]
    EVAL --> BENCH["Benchmarks"]

    AUTO --> A1["Perplexity"]
    AUTO --> A2["BLEU / ROUGE"]
    AUTO --> A3["BERTScore"]
    AUTO --> A4["Exact Match / F1"]

    HUMAN --> H1["Preference ranking"]
    HUMAN --> H2["Likert scale ratings"]
    HUMAN --> H3["Task completion rate"]

    LLM_JUDGE --> L1["Pairwise comparison"]
    LLM_JUDGE --> L2["Rubric-based scoring"]
    LLM_JUDGE --> L3["Reference-free evaluation"]

    BENCH --> B1["MMLU (knowledge)"]
    BENCH --> B2["HumanEval (coding)"]
    BENCH --> B3["GSM8K (math)"]
    BENCH --> B4["TruthfulQA (honesty)"]

    style EVAL fill:#56cc9d,stroke:#333,color:#fff
    style AUTO fill:#6cc3d5,stroke:#333,color:#fff
    style LLM_JUDGE fill:#ffce67,stroke:#333
    style BENCH fill:#ff7851,stroke:#333,color:#fff

Metric Selection by Task

Task	Primary Metrics	Why
Summarization	ROUGE-L, faithfulness, BERTScore	Overlap + semantic similarity
Translation	BLEU, chrF, COMET	N-gram overlap + learned metrics
Code Generation	pass@k, HumanEval	Functional correctness
Question Answering	Exact Match, F1, faithfulness	Factual accuracy
Chat/Dialog	Human preference, LLM-judge	No single ground truth
Reasoning	Accuracy on benchmarks (GSM8K, MATH)	Verifiable correct answers
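
Three of the metrics in the table are short enough to write out. The sketch below uses whitespace tokenization and lowercase normalization for Exact Match and F1, which is a simplification of what QA benchmarks do, and the unbiased pass@k estimator, where `n` samples are generated per problem and `c` of them pass the tests.

```python
from collections import Counter
from math import comb

def exact_match(prediction, reference):
    return prediction.strip().lower() == reference.strip().lower()

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n (c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

em = exact_match(" Paris", "paris")
f1 = token_f1("the cat sat", "the cat")
p1 = pass_at_k(n=10, c=3, k=1)
```

Exact Match is all-or-nothing, while F1 gives partial credit for overlapping tokens; pass@k only cares whether generated code actually runs correctly.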

LLM-as-Judge

Using a stronger LLM to evaluate outputs has become standard:

System: You are an expert evaluator. Rate the following response on:
1. Helpfulness (1-5)
2. Accuracy (1-5)
3. Harmlessness (1-5)

Provide scores and brief justification.

Advantages: Scalable, consistent, correlates well with human judgment
Limitations: Biases (position bias, verbosity bias, self-preference)

Key Benchmarks (2024-2026)

Benchmark	What it tests	Top performers
MMLU	57 subjects, world knowledge	GPT-4, Claude 3.5
HumanEval / MBPP	Code generation	GPT-4, Claude 3.5, DeepSeek
GSM8K / MATH	Mathematical reasoning	o1, Claude 3.5
MT-Bench	Multi-turn conversation	GPT-4, Claude
GPQA	PhD-level science questions	o1, Gemini Ultra
SWE-Bench	Real-world software engineering	Claude 3.5, Devin

Evaluation Anti-Patterns

Benchmark contamination: Test data leaks into training
Over-optimizing for benchmarks: Gaming metrics without real improvement
Single metric reliance: Missing failure modes
Static evaluation: Not tracking performance over time in production

Q6: What are embeddings and how are they used in LLM applications?

Answer:

Embeddings are dense vector representations that capture semantic meaning. They convert text (words, sentences, documents) into numerical vectors where similar meanings are geometrically close.

graph LR
    subgraph Embedding_Space["Embedding Space"]
        direction TB
        K["'king' [0.2, 0.8, ...]"]
        Q["'queen' [0.3, 0.8, ...]"]
        M["'man' [0.2, 0.1, ...]"]
        W["'woman' [0.3, 0.1, ...]"]
    end

    subgraph Applications["Applications"]
        SEM["Semantic Search"]
        CLUST["Clustering"]
        CLASS["Classification"]
        RAG_APP["RAG Retrieval"]
        REC["Recommendations"]
    end

    Embedding_Space --> Applications

    style Embedding_Space fill:#6cc3d5,stroke:#333,color:#fff
    style Applications fill:#56cc9d,stroke:#333,color:#fff

Types of Embeddings

Type	Granularity	Models	Use Case
Word embeddings	Single words	Word2Vec, GloVe	Legacy, fast lookup
Contextual embeddings	Words in context	BERT, GPT hidden states	NER, classification
Sentence embeddings	Full sentences	E5, BGE, all-MiniLM	Semantic search, RAG
Document embeddings	Paragraphs/pages	Voyage, Cohere Embed	Document retrieval

Similarity Metrics

Metric	Range	Best for
Cosine similarity	[-1, 1]	Most NLP tasks
Euclidean distance	[0, ∞)	Clustering
Dot product	(-∞, ∞)	When magnitude matters
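The three metrics are easy to compare side by side. The following is a small pure-Python sketch (real systems would use a vector library); the vectors are made-up values chosen so that `b` points in the same direction as `a` with twice the magnitude.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def normalize(u):
    norm = math.sqrt(dot(u, u))
    return [x / norm for x in u]

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the magnitude

print(cosine(a, b))      # direction only: magnitude is ignored
print(dot(a, b))         # grows with magnitude
print(euclidean(a, b))   # nonzero even though the directions match
print(dot(normalize(a), normalize(b)))   # dot product of unit vectors equals cosine
```

The last line shows why many vector stores ask for normalized vectors: once vectors have unit length, the cheaper dot product gives the same ranking as cosine similarity.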

Modern Embedding Models (2024-2026)

Model	Dimensions	Context	Strength
OpenAI text-embedding-3-large	3072	8K	General purpose
BGE-M3	1024	8K	Multilingual, multi-granularity
E5-Mistral-7B	4096	32K	Long documents
Cohere Embed v3	1024	512	Multi-language, classification
Voyage-3	1024	32K	Code + text

Practical Embedding Pipeline

Chunk documents into semantically meaningful segments
Embed chunks using a sentence embedding model
Store vectors in a vector database (Pinecone, Weaviate, pgvector)
Query-time: embed the user query with the same model
Retrieve: find top-k nearest neighbors via ANN (approximate nearest neighbor)
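The pipeline steps can be sketched end to end with toy components. Everything here is a stand-in: `toy_embed` is a word-count "embedding" rather than a real model, the index is a plain list rather than a vector database, and retrieval is exact brute-force search rather than ANN.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def build_index(chunks):
    # Steps 2-3: embed every chunk and store (chunk, vector) pairs.
    return [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(index, query, k=2):
    # Step 4: embed the query with the SAME model used at indexing time.
    q = toy_embed(query)
    # Step 5: rank by similarity (exact search here, ANN in production).
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

chunks = [
    "the kv cache stores keys and values",
    "mixture of experts routes tokens to experts",
    "embeddings map text to vectors",
]
index = build_index(chunks)
print(retrieve(index, "how are tokens routed to experts", k=1))
```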

Common Pitfalls

Using different embedding models for indexing vs. querying
Chunks too large (loses specificity) or too small (loses context)
Not normalizing vectors when using dot product as a stand-in for cosine similarity
Ignoring embedding model’s max token limit

Q7: What is Mixture of Experts (MoE) and how does it enable efficient scaling?

Answer:

Mixture of Experts (MoE) is an architecture where only a subset of the model’s parameters are activated for each input token, enabling much larger models without proportional compute increase.

graph TD
    INPUT["Input Token"]
    INPUT --> ROUTER["Router/Gating Network
Learns which experts to activate"]
    ROUTER -->|"Top-k selection"| E1["Expert 1
(FFN)"]
    ROUTER -->|"Top-k selection"| E2["Expert 2
(FFN)"]
    ROUTER -.->|"Not selected"| E3["Expert 3
(FFN)"]
    ROUTER -.->|"Not selected"| E4["Expert 4
(FFN)"]
    ROUTER -.->|"Not selected"| EN["Expert N
(FFN)"]

    E1 --> COMBINE["Weighted Combination"]
    E2 --> COMBINE
    COMBINE --> OUTPUT["Output"]

    style INPUT fill:#56cc9d,stroke:#333,color:#fff
    style ROUTER fill:#ffce67,stroke:#333
    style E1 fill:#6cc3d5,stroke:#333,color:#fff
    style E2 fill:#6cc3d5,stroke:#333,color:#fff
    style COMBINE fill:#56cc9d,stroke:#333,color:#fff
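The routing step in the diagram can be sketched for a single token. This is a simplified illustration, not any particular model's implementation: the router logits are passed in directly (a real router computes them from the token with a learned linear layer), and the toy experts just scale their input instead of running a feed-forward network.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_logits, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    weight_sum = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        gate = probs[i] / weight_sum      # renormalize over the selected experts
        expert_out = experts[i](token)    # only the selected experts are evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen

# Toy experts: each scales the input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_forward([1.0, -1.0], router_logits=[0.1, 2.0, 0.3, 2.0], experts=experts)
print(chosen, out)
```

Experts 0 and 2 are never called for this token, which is where the compute saving comes from, even though all four must stay in memory.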

Dense vs. MoE Comparison

Aspect	Dense Model	MoE Model
Parameters activated per token	100%	~10-30% (top-k experts plus shared layers)
Total parameters	e.g., 70B	e.g., Mixtral 8×7B: ~47B total, ~13B active
Inference compute	Proportional to total params	Proportional to active params
Memory	Moderate	High (all experts in memory)
Training efficiency	Lower	Higher (more params per FLOP)

Notable MoE Models

Model	Architecture	Active Params	Total Params
Mixtral 8×7B	8 experts, top-2 routing	~13B	~47B
GPT-4 (rumored)	MoE architecture	Unknown	~1.8T
DeepSeek-V2	160 experts, top-6	~21B	~236B
Switch Transformer	Top-1 routing	Variable	Up to 1.6T

Challenges with MoE

Challenge	Description	Mitigation
Load balancing	Some experts get used more than others	Auxiliary loss, expert capacity limits
Memory	All experts must be in memory	Expert parallelism, offloading
Training instability	Routing can collapse to few experts	Noisy top-k, load balancing loss
Communication overhead	Experts on different GPUs need data transfer	Efficient all-to-all communication

Why MoE Matters

MoE allows training models with trillions of parameters while keeping inference costs manageable — it’s likely the architecture behind the largest frontier models.

Q8: What is the KV Cache and how does it affect LLM inference?

Answer:

The KV (Key-Value) Cache stores previously computed key and value vectors from the attention mechanism, avoiding redundant computation during autoregressive generation.

Why KV Cache is Necessary

During generation, the LLM produces one token at a time. Without caching, generating token $t$ requires recomputing the keys and values for all $t-1$ previous tokens from scratch.

graph LR
    subgraph Without_Cache["Without KV Cache"]
        W1["Token 1: compute K,V for [1]"]
        W2["Token 2: compute K,V for [1,2]"]
        W3["Token 3: compute K,V for [1,2,3]"]
        W4["Token N: compute K,V for [1,...,N]"]
        W1 --> W2 --> W3 --> W4
    end

    subgraph With_Cache["With KV Cache"]
        C1["Token 1: compute & store K₁,V₁"]
        C2["Token 2: compute K₂,V₂, reuse K₁,V₁"]
        C3["Token 3: compute K₃,V₃, reuse K₁,V₁,K₂,V₂"]
        C1 --> C2 --> C3
    end

    style Without_Cache fill:#ff7851,stroke:#333,color:#fff
    style With_Cache fill:#56cc9d,stroke:#333,color:#fff
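The two paths in the diagram can be compared directly in a toy single-head attention loop. This sketch makes several simplifying assumptions: the "projections" are scalar multiples instead of learned weight matrices, the query is the raw input vector, and there is only one layer and one head.

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def project(x, scale):
    # Stand-in for a learned projection; a real model multiplies by a weight matrix.
    return [scale * v for v in x]

def decode_without_cache(xs):
    """Recompute K and V for the whole prefix at every step."""
    outputs, kv_computations = [], 0
    for t in range(1, len(xs) + 1):
        keys = [project(x, 0.5) for x in xs[:t]]
        values = [project(x, 2.0) for x in xs[:t]]
        kv_computations += t
        outputs.append(attend(xs[t - 1], keys, values))
    return outputs, kv_computations

def decode_with_cache(xs):
    """Compute K and V once per token and append them to the cache."""
    outputs, kv_computations = [], 0
    key_cache, value_cache = [], []
    for x in xs:
        key_cache.append(project(x, 0.5))
        value_cache.append(project(x, 2.0))
        kv_computations += 1
        outputs.append(attend(x, key_cache, value_cache))
    return outputs, kv_computations
```

Both functions return the same attention outputs; only the number of key/value computations differs (the sum 1 + 2 + ... + n versus n), at the price of keeping the cache in memory.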

Complexity Comparison

Metric	Without KV Cache	With KV Cache
Compute per token	$O(n^2)$ (recompute all)	$O(n)$ (new token only)
Total generation	$O(n^3)$	$O(n^2)$
Memory	$O(1)$ extra (nothing stored)	$O(n)$, grows with sequence

KV Cache Memory Problem

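For a model with $L$ layers, $H$ heads, head dimension $d_h$, and sequence length $n$, each token stores one key and one value vector per head per layer:

$$\text{KV cache size} = 2 \times L \times H \times d_h \times n \times \text{bytes per element}$$

Example: LLaMA-70B with 128K context in FP16, taking $L = 80$, $H = 64$, $d_h = 128$ and no key/value sharing across heads: $2 \times 80 \times 64 \times 128 \times 131{,}072 \times 2$ bytes $\approx 344$ GB (320 GiB) per sequence, just for the cache!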
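The cache size is simply 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per element. The sketch below evaluates it for an assumed LLaMA-2-70B-like shape (80 layers, 64 query heads, head dimension 128), once with a separate K,V pair per head and once with 8 shared KV heads as in grouped-query attention; the function and variable names are illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values (the factor 2) for `batch` sequences."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch

GIB = 1024 ** 3
SEQ_LEN = 128 * 1024   # 128K tokens

# One K,V pair per attention head (no sharing).
full_mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=SEQ_LEN)
# Grouped-query attention with 8 KV heads shared by the 64 query heads.
gqa_8 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(full_mha / GIB, "GiB without sharing")   # 320.0
print(gqa_8 / GIB, "GiB with 8 KV heads")      # 40.0
```

Multiplying by the batch size shows why long-context serving is usually limited by cache memory rather than by weights.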

KV Cache Optimization Techniques

Technique	How it helps	Compression
Multi-Query Attention (MQA)	Share one K,V head across all query heads	Up to H× for H heads (e.g., 32× with 32 heads)
Grouped-Query Attention (GQA)	Share K,V across groups of heads	H/G× for G KV groups (e.g., 8× with 64 heads and 8 KV heads)
KV Cache Quantization	Store K,V in INT8/INT4	2-4x reduction
Paged Attention (vLLM)	Virtual memory for KV cache	Better memory utilization
Sliding Window	Only cache recent tokens	Bounded cache size

Prefill vs. Decode Phases

Phase	What happens	Bottleneck
Prefill	Process all input tokens in parallel, fill KV cache	Compute-bound
Decode	Generate one token at a time using cached KV	Memory-bandwidth-bound

The decode phase is typically memory-bandwidth-bound because each new token requires reading the model weights and the entire KV cache from memory while doing comparatively little arithmetic per byte read.

Q9: What is instruction tuning and how does it differ from pre-training and RLHF?

Answer:

Instruction tuning (also called supervised fine-tuning or SFT) trains a base LLM on (instruction, response) pairs to follow human instructions — it’s the critical bridge between a next-token predictor and a useful assistant.

graph LR
    subgraph Pipeline["LLM Training Pipeline"]
        direction LR
        PT["Pre-training
Next-token prediction
on internet text
(Trillions of tokens)"]
        IT["Instruction Tuning (SFT)
Train on (instruction, response)
pairs from humans
(10K-1M examples)"]
        RLHF_step["RLHF / DPO
Align with human preferences
via reward signals
(Preference pairs)"]
    end

    PT -->|"Base model"| IT
    IT -->|"SFT model"| RLHF_step
    RLHF_step -->|"Aligned model"| FINAL["Production Model"]

    style PT fill:#6cc3d5,stroke:#333,color:#fff
    style IT fill:#56cc9d,stroke:#333,color:#fff
    style RLHF_step fill:#ffce67,stroke:#333

Comparison of Training Stages

Aspect	Pre-training	Instruction Tuning	RLHF/DPO
Objective	Predict next token	Follow instructions	Align with preferences
Data	Raw text (web crawl)	(Instruction, Response) pairs	Ranked response pairs
Data size	Trillions of tokens	10K–1M examples	10K–100K comparisons
Compute	Massive (months on clusters)	Moderate (hours–days)	Moderate
Effect	General language understanding	Task following, formatting	Helpfulness, safety, style
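To make the instruction-tuning stage concrete, here is a sketch of how one (instruction, response) pair might be turned into a training example. It assumes a common but not universal choice: the loss is computed only on response tokens, so the prompt is masked out. The whitespace "tokenizer" and the `<|user|>`, `<|assistant|>`, and `<|end|>` markers are hypothetical placeholders, not any specific model's chat template.

```python
def build_sft_example(instruction, response):
    """Return (tokens, loss_mask); the mask is 1 only on response tokens."""
    # Whitespace splitting and the template markers are illustrative assumptions.
    prompt_tokens = ["<|user|>"] + instruction.split() + ["<|assistant|>"]
    response_tokens = response.split() + ["<|end|>"]
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, mask = build_sft_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
for token, m in zip(tokens, mask):
    print(m, token)
```

The training objective is still next-token prediction, as in pre-training; what changes is the data and which positions contribute to the loss.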

What Makes Good Instruction Tuning Data?

Quality Factor	Description
Diversity	Cover many task types (QA, coding, math, writing, analysis)
Complexity	Include both simple and multi-step instructions
Format variety	JSON, markdown, code, natural language responses
Correctness	Responses must be accurate and complete
Safety	Include refusal examples for harmful requests

Notable Instruction Datasets

Dataset	Size	Source	Used By
FLAN	1.8K tasks	Converted NLP tasks into instructions	Flan-T5, Flan-PaLM
Alpaca	52K	Generated with text-davinci-003 (self-instruct)	Stanford Alpaca
ShareGPT	90K+	User conversations with ChatGPT	Vicuna
OpenHermes	1M+	Curated multi-source	Many open models
UltraChat	1.5M	Multi-turn synthetic dialogues	Zephyr

Base Model vs. Instruction-Tuned Behavior

Prompt: “What is the capital of France?”

Base model: “What is the capital of Germany? What is the capital of Spain?…” (continues the pattern)
Instruction-tuned: “The capital of France is Paris.” (answers directly)

Q10: What are the key inference optimization techniques for serving LLMs at scale?

Answer:

Serving LLMs in production requires optimizing for latency (time to first token, tokens per second), throughput (requests per second), and cost (dollars per token).

graph TD
    OPT["LLM Inference Optimization"]
    OPT --> MODEL["Model-Level"]
    OPT --> SYSTEM["System-Level"]
    OPT --> SERVE["Serving-Level"]

    MODEL --> M1["Quantization (INT4/INT8)"]
    MODEL --> M2["Distillation"]
    MODEL --> M3["Pruning"]
    MODEL --> M4["Speculative Decoding"]

    SYSTEM --> S1["Flash Attention"]
    SYSTEM --> S2["Continuous Batching"]
    SYSTEM --> S3["Paged Attention (vLLM)"]
    SYSTEM --> S4["Tensor Parallelism"]

    SERVE --> SV1["KV Cache Management"]
    SERVE --> SV2["Request Scheduling"]
    SERVE --> SV3["Prefix Caching"]
    SERVE --> SV4["Model Routing"]

    style OPT fill:#56cc9d,stroke:#333,color:#fff
    style MODEL fill:#6cc3d5,stroke:#333,color:#fff
    style SYSTEM fill:#ffce67,stroke:#333
    style SERVE fill:#ff7851,stroke:#333,color:#fff

Speculative Decoding

Uses a small, fast draft model to predict multiple tokens, then the large model verifies them in parallel:

Step	What happens
1	Draft model generates k candidate tokens quickly
2	Large model verifies all k tokens in one forward pass
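3	Accept the longest prefix of draft tokens the large model agrees with; at the first mismatch, use the large model's token and discard the rest
Result	Typically 2-3x faster generation; the output distribution matches the large model's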
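A greedy-decoding version of the draft-and-verify loop can be sketched with toy models. Assumptions to note: `target_next` and `draft_next` are hypothetical callables that return the next token for a sequence, verification is emulated with a loop (a real system scores all k positions in one batched forward pass), and the sampling-based acceptance rule and the bonus token after a fully accepted draft are omitted.

```python
def speculative_generate(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number_of_target_passes)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < num_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target model checks every drafted position (one pass in practice).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(seq + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])   # 3a. draft token confirmed
            else:
                accepted.append(expected)   # 3b. fix the first mismatch, drop the rest
                break
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + num_tokens], target_passes

# Toy models: the target counts upward mod 10; the draft agrees except after a 4.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

tokens, passes = speculative_generate(target_next, draft_next, prompt=[0], num_tokens=8)
print(tokens, passes)
```

The generated tokens are exactly what the target model would produce on its own, but in this toy run it needs 3 verification passes instead of 8 sequential decode steps.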

Continuous Batching vs. Static Batching

Approach	Description	Efficiency
Static batching	Wait for all sequences to finish	Low (short sequences wait for long ones)
Continuous batching	Insert new requests as old ones finish	High (no idle GPU cycles)
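The efficiency gap can be seen in a small simulation that counts decode steps. It is a deliberately simplified model: every active sequence produces one token per step, prefill cost is ignored, and the workload of output lengths is made up.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A finished sequence frees its slot for the next queued request."""
    queue = list(lengths)
    slots = []      # remaining decode steps for each active sequence
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))      # admit waiting requests into free slots
        steps += 1                          # one decode step for all active sequences
        slots = [r - 1 for r in slots if r > 1]
    return steps

lengths = [2, 10, 3, 9, 2, 2]   # output tokens per request (made-up workload)
print(static_batching_steps(lengths, batch_size=2))      # 21
print(continuous_batching_steps(lengths, batch_size=2))  # 14
```

With static batching the short requests sit idle while their batch-mate finishes; with continuous batching both slots stay busy, so the same 28 tokens take 14 steps.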

Parallelism Strategies

Strategy	What it splits	When to use
Tensor Parallelism	Individual layers across GPUs	Single node, low latency
Pipeline Parallelism	Different layers on different GPUs	Multi-node, high throughput
Data Parallelism	Same model, different batches	Scaling throughput
Expert Parallelism	MoE experts across GPUs	MoE models

LLM Serving Frameworks

Framework	Key Feature	Best For
vLLM	Paged Attention, continuous batching	High-throughput serving
TensorRT-LLM	NVIDIA-optimized kernels	NVIDIA GPUs, lowest latency
llama.cpp	CPU/Metal inference, GGUF format	Edge/local deployment
TGI (Text Generation Inference)	Hugging Face integration	Quick deployment
SGLang	RadixAttention, structured generation	Complex prompting workflows

Key Metrics for Production Serving

Metric	Definition	Target
TTFT (Time to First Token)	Latency before generation starts	<500ms
TPS (Tokens Per Second)	Generation speed per request	30-100 TPS
Throughput	Total tokens/sec across all requests	Maximize
P99 latency	Latency that 99% of requests stay under (tail latency)	<2s TTFT
Cost per 1M tokens	Dollar efficiency	Minimize

Summary Table

#	Topic	Key Concept
1	Scaling Laws	Performance improves predictably as power law with model/data/compute
2	Long Context	Quadratic attention problem; solved by Flash Attention, RoPE, sparse methods
3	Quantization	4-bit weights enable 70B models on consumer GPUs with minimal quality loss
4	LLM Agents	LLMs + tools + memory + planning = autonomous task completion
5	Evaluation	Benchmarks, LLM-as-Judge, and task-specific metrics
6	Embeddings	Dense vectors for semantic search, RAG retrieval, clustering
7	Mixture of Experts	Activate subset of parameters per token for efficient scaling
8	KV Cache	Store computed keys/values to avoid redundant attention computation
9	Instruction Tuning	Transform base models into instruction-following assistants
10	Inference Optimization	Speculative decoding, continuous batching, parallelism for production

What’s Next?

This article covered advanced LLM engineering topics for interview preparation. For related content:

Foundational LLM concepts: LLM Interview QA - 1
ML fundamentals: ML Interview QA - 1
Metrics and feature engineering: ML Interview QA - 2

LLM Interview QA - 3

Vectoring AI — Wed, 20 May 2026 00:00:00 GMT

Introduction

This is Part 3 of our LLM Interview QA series, focused on LLM configuration and generation control. Understanding how to configure LLM parameters — temperature, sampling strategies, context windows, and decoding methods — is essential for building reliable AI systems.

For foundational LLM concepts (transformers, attention, RAG, RLHF), see LLM Interview QA - 1. For advanced topics (scaling, quantization, agents), see LLM Interview QA - 2. For ML fundamentals, see ML Interview QA - 1.

Q1: What are the main configurable parameters when calling an LLM API?

Answer:

When making an LLM API call, several parameters control the behavior, quality, and cost of the generated output.

graph TD
    CONFIG["LLM Configuration Parameters"]
    CONFIG --> GEN["Generation Control"]
    CONFIG --> SAMP["Sampling Parameters"]
    CONFIG --> OUT["Output Control"]
    CONFIG --> SYS["System Parameters"]

    GEN --> G1["temperature"]
    GEN --> G2["top_p (nucleus)"]
    GEN --> G3["top_k"]
    GEN --> G4["seed"]

    SAMP --> S1["frequency_penalty"]
    SAMP --> S2["presence_penalty"]
    SAMP --> S3["repetition_penalty"]
    SAMP --> S4["logit_bias"]

    OUT --> O1["max_tokens / max_new_tokens"]
    OUT --> O2["stop sequences"]
    OUT --> O3["n (num_return_sequences)"]
    OUT --> O4["stream"]

    SYS --> SY1["model"]
    SYS --> SY2["system prompt"]
    SYS --> SY3["response_format"]
    SYS --> SY4["tools / functions"]

    style CONFIG fill:#56cc9d,stroke:#333,color:#fff
    style GEN fill:#6cc3d5,stroke:#333,color:#fff
    style SAMP fill:#ffce67,stroke:#333
    style OUT fill:#ff7851,stroke:#333,color:#fff

Parameter Overview

Parameter	Range	Default	Purpose
`temperature`	0.0 – 2.0	1.0	Controls randomness of output
`top_p`	0.0 – 1.0	1.0	Nucleus sampling threshold
`top_k`	1 – vocab_size	50 (varies)	Limits token candidates
`max_tokens`	1 – context_limit	Model-specific	Maximum output length
`frequency_penalty`	-2.0 – 2.0	0.0	Penalizes repeated tokens
`presence_penalty`	-2.0 – 2.0	0.0	Encourages topic diversity
`seed`	Any integer	None	Enables deterministic output
`stop`	List of strings	None	Stops generation at specific tokens
`n`	1+	1	Number of completions to generate

Practical Configuration Examples

Use Case	temperature	top_p	max_tokens	Other
Code generation	0.0 – 0.2	1.0	2048	`stop=["\n\n"]`
Creative writing	0.8 – 1.2	0.95	4096	`frequency_penalty=0.5`
Data extraction	0.0	1.0	512	`response_format=json`
Chat conversation	0.7	0.9	1024	`presence_penalty=0.3`
Factual Q&A	0.0 – 0.3	1.0	256	—

Q2: What is temperature and how does it affect LLM output?

Answer:

Temperature controls the randomness of the probability distribution over the vocabulary at each generation step. It’s applied to the logits before the softmax function.

Mathematical Definition

Given logits $z_i$ for each token $i$ in the vocabulary:

$$P(x_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$

where $T$ is the temperature.
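Temperature scaling divides every logit by T before the softmax, so the probability of token i is exp(z_i / T) divided by the sum of exp(z_j / T) over all tokens. A minimal pure-Python sketch, with made-up logits for three tokens:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature); temperature must be > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # illustrative values for tokens A, B, C
for t in (0.2, 0.7, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate mass on token A, and higher ones flatten the distribution; the ranking of the tokens never changes.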

graph LR
    subgraph T0["Temperature = 0 (Greedy)"]
        T0_1["Token A: 99.9%"]
        T0_2["Token B: 0.1%"]
        T0_3["Token C: ~0%"]
    end

    subgraph T07["Temperature = 0.7"]
        T07_1["Token A: 75%"]
        T07_2["Token B: 20%"]
        T07_3["Token C: 5%"]
    end

    subgraph T1["Temperature = 1.0 (Default)"]
        T1_1["Token A: 60%"]
        T1_2["Token B: 25%"]
        T1_3["Token C: 15%"]
    end

    subgraph T2["Temperature = 2.0"]
        T2_1["Token A: 40%"]
        T2_2["Token B: 32%"]
        T2_3["Token C: 28%"]
    end

    style T0 fill:#56cc9d,stroke:#333,color:#fff
    style T07 fill:#6cc3d5,stroke:#333,color:#fff
    style T1 fill:#ffce67,stroke:#333
    style T2 fill:#ff7851,stroke:#333,color:#fff

Effect of Temperature

Temperature	Distribution	Behavior	Output Character
T → 0	Extremely peaked	Always picks highest-probability token	Deterministic, repetitive, safe
T = 0.3	Slightly softened	Mostly picks top tokens, rare surprises	Conservative, coherent
T = 0.7	Moderately spread	Balanced between likely and creative	Good default for most tasks
T = 1.0	Original distribution	Model’s “natural” uncertainty	Raw model behavior
T > 1.0	Flattened	Low-probability tokens become likely	Creative but potentially incoherent
T = 2.0	Nearly uniform	Almost random selection	Chaotic, nonsensical

Intuition

Think of temperature as a “creativity knob”:

Low temperature (0–0.3): The model is confident and focused — it picks the most obvious next word. Great for factual tasks, code, structured extraction.
Medium temperature (0.5–0.8): The model is balanced — it explores alternatives while staying coherent. Best for general chat and writing.
High temperature (1.0+): The model is adventurous — it considers unlikely words, producing surprising or creative outputs.

Common Interview Follow-Up: “What does temperature=0 actually mean?”

Setting temperature=0 is a shortcut for greedy decoding: the softmax is undefined at $T = 0$ (division by zero), so implementations special-case it and always select the single highest-probability token. However:

It’s still based on floating-point arithmetic, so minor non-determinism can occur across hardware
Most APIs interpret temperature=0 as “return the argmax token” deterministically
Some providers require setting a seed parameter for guaranteed reproducibility

Q3: What is the difference between Top-p (nucleus) sampling and Top-k sampling?

Answer:

Both Top-p and Top-k are token filtering strategies that limit which tokens are considered during generation, but they differ in how they determine the candidate set.

Top-k Sampling

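Select the $k$ most probable tokens and redistribute probability among them:

$$P'(x_i) = \begin{cases} \dfrac{P(x_i)}{\sum_{x_j \in V_k} P(x_j)} & \text{if } x_i \in V_k \\ 0 & \text{otherwise} \end{cases}$$

where $V_k$ is the set of the $k$ highest-probability tokens.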

Top-p (Nucleus) Sampling

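Select the smallest set of tokens whose cumulative probability reaches or exceeds $p$:

$$V_p = \underset{V' \subseteq V}{\arg\min}\ |V'| \quad \text{subject to} \quad \sum_{x_i \in V'} P(x_i) \geq p$$

Sampling then uses $P$ renormalized over $V_p$.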

graph TD
    subgraph TopK["Top-k = 3 (Fixed Size)"]
        direction LR
        K1["'the' (0.40) ✓"]
        K2["'a' (0.25) ✓"]
        K3["'my' (0.15) ✓"]
        K4["'his' (0.10) ✗"]
        K5["'our' (0.05) ✗"]
        K6["'their' (0.03) ✗"]
    end

    subgraph TopP["Top-p = 0.9 (Dynamic Size)"]
        direction LR
        P1["'the' (0.40) ✓ → cumulative: 0.40"]
        P2["'a' (0.25) ✓ → cumulative: 0.65"]
        P3["'my' (0.15) ✓ → cumulative: 0.80"]
        P4["'his' (0.10) ✓ → cumulative: 0.90 ≥ p"]
        P5["'our' (0.05) ✗"]
        P6["'their' (0.03) ✗"]
    end

    style TopK fill:#6cc3d5,stroke:#333,color:#fff
    style TopP fill:#56cc9d,stroke:#333,color:#fff

Key Differences

Aspect	Top-k	Top-p
Candidate set size	Fixed (always k tokens)	Dynamic (varies per step)
Adapts to distribution shape	No — same k regardless of certainty	Yes — fewer tokens when confident
Risk when distribution is peaked	Includes unlikely tokens unnecessarily	Naturally narrows to top few
Risk when distribution is flat	May exclude reasonable tokens	Naturally includes more candidates

Why Top-p is Generally Preferred

Consider two scenarios at different generation steps:

Step A (peaked distribution): Model is 95% sure the next word is “Paris”

Top-k=50: Considers 50 tokens (49 are noise)
Top-p=0.95: Considers only 1-2 tokens (adaptive!)

Step B (flat distribution): Model is uncertain, many tokens are equally likely

Top-k=50: Might miss some reasonable candidates if vocabulary is large
Top-p=0.95: Includes all tokens until 95% mass is covered (could be 100+ tokens)

Combining Top-k and Top-p

In practice, many systems use both simultaneously:

First apply Top-k to limit to k candidates
Then apply Top-p within those k candidates

This provides both an upper bound (Top-k) and adaptive filtering (Top-p).
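A sketch of the combined filter follows, using the word probabilities from the diagram above plus a made-up `<rest>` entry so the distribution sums to 1. Measuring the nucleus within the top-k survivors is one reasonable reading of "within those k candidates", not the only possible one, and the small tolerance in the comparison is there only to absorb floating-point rounding.

```python
def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens, then the smallest nucleus reaching top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        total = sum(p for _, p in ranked)    # nucleus is measured within the survivors
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p / total
            if cumulative >= top_p - 1e-12:  # tolerance for float rounding
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}   # renormalize before sampling

probs = {"the": 0.40, "a": 0.25, "my": 0.15, "his": 0.10,
         "our": 0.05, "their": 0.03, "<rest>": 0.02}

print(list(filter_top_k_top_p(probs, top_k=3)))
print(list(filter_top_k_top_p(probs, top_p=0.9)))
print(list(filter_top_k_top_p(probs, top_k=50, top_p=0.9)))
```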

Recommended Settings

Task	top_k	top_p	Rationale
Deterministic (code, facts)	1	1.0	Equivalent to greedy
Balanced (chat)	40-50	0.9	Diverse but coherent
Creative (stories)	100+	0.95	Wide exploration
Structured output (JSON)	5-10	0.8	Limited, safe choices

Q4: What is the context window and how does it constrain LLM behavior?

Answer:

The context window (also called context length or maximum sequence length) is the total number of tokens an LLM can process in a single inference call — this includes both input tokens and output tokens.

graph LR
    subgraph CW["Context Window (e.g., 128K tokens)"]
        direction LR
        SYS["System Prompt
(500 tokens)"]
        CTX["Retrieved Context / RAG
(10,000 tokens)"]
        HIST["Conversation History
(5,000 tokens)"]
        USER["User Message
(200 tokens)"]
        RESP["Model Response
(max_tokens: 4,096)"]
    end

    style CW fill:#56cc9d,stroke:#333,color:#fff
    style RESP fill:#ffce67,stroke:#333

Context Window Sizes (2024–2026)

Model	Context Window	Notes
GPT-3.5 Turbo	16K tokens	~12K words
GPT-4 Turbo	128K tokens	~96K words
GPT-4o	128K tokens	~96K words
Claude 3.5 Sonnet	200K tokens	~150K words
Gemini 1.5 Pro	1M–2M tokens	Longest available
LLaMA 3.1	128K tokens	Open-source
Mistral Large	128K tokens
DeepSeek-V3	128K tokens

What Happens When You Exceed the Context Window?

Behavior	Description
Truncation	Oldest tokens are dropped by the serving stack or by application code, so the model never sees them
Error	API rejects the request if input exceeds limit
Degraded performance	Even within the limit, recall drops for information placed in the middle of a long context

Context Window vs. Effective Context

Key insight for interviews: The advertised context window is not the same as effective context:

Concept	Meaning
Maximum context	Technical limit the model supports
Effective context	Length at which performance remains high
“Lost in the middle”	Information in the center of long contexts is often missed
Needle-in-a-haystack	Benchmark: can the model find a fact placed at position X?

Strategies for Context Window Management

Strategy	How it works
Chunking + RAG	Only retrieve relevant chunks, don’t stuff everything
Summarization	Compress conversation history into summaries
Sliding window	Keep recent messages + system prompt, drop old middle
Hierarchical context	Summary of old messages + full recent messages
Prompt compression	Use tools like LLMLingua to compress prompts
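As one example, the sliding-window strategy can be sketched as a function that always keeps the system prompt and then adds messages from newest to oldest until a token budget is used up. The word-count `count_tokens` is a crude stand-in; a real system would count with the model's own tokenizer.

```python
def count_tokens(text):
    # Crude stand-in: a real system would use the model's tokenizer.
    return len(text.split())

def trim_history(system_prompt, messages, budget):
    """Keep the system prompt plus as many of the most recent messages as fit."""
    remaining = budget - count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):      # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return [system_prompt] + list(reversed(kept))

history = ["a b c d", "e f", "g h i"]
print(trim_history("You are helpful", history, budget=9))
```

The hierarchical variant would replace the dropped messages with a summary instead of discarding them outright.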

Cost Implications

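Context window usage directly affects cost, because providers typically bill input and output tokens at separate per-token rates:

$$\text{Cost} = (\text{input tokens} \times \text{input price}) + (\text{output tokens} \times \text{output price})$$

Longer contexts mean higher costs, higher latency, and more KV-cache memory usage.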
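A quick way to reason about cost is to price input and output tokens separately. The sketch below uses hypothetical prices in USD per million tokens (check your provider's current price list) and the token counts from the context window diagram earlier in this answer: 500 + 10,000 + 5,000 + 200 input tokens and up to 4,096 output tokens.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one call given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical prices, for illustration only.
INPUT_PRICE = 3.0     # USD per 1M input tokens
OUTPUT_PRICE = 15.0   # USD per 1M output tokens

input_tokens = 500 + 10_000 + 5_000 + 200   # system + RAG context + history + user message
cost = request_cost(input_tokens, 4_096, INPUT_PRICE, OUTPUT_PRICE)
print(f"${cost:.5f} per request")
```

In a multi-turn chat the whole history is resent on every turn, so input cost grows with conversation length even when each new message is short.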

Q5: Is LLM generation deterministic? How do you achieve reproducible outputs?

Answer:

By default, LLM generation is non-deterministic — the same prompt can produce different outputs across calls. This is intentional but can be controlled.

graph TD
    subgraph NonDet["Non-Deterministic (Default)"]
        ND1["Same prompt"]
        ND2["Run 1: 'The capital is Paris.'"]
        ND3["Run 2: 'Paris is the capital of France.'"]
        ND4["Run 3: 'France's capital city is Paris.'"]
        ND1 --> ND2
        ND1 --> ND3
        ND1 --> ND4
    end

    subgraph Det["Deterministic (Configured)"]
        D1["Same prompt + seed + temp=0"]
        D2["Run 1: 'The capital is Paris.'"]
        D3["Run 2: 'The capital is Paris.'"]
        D4["Run 3: 'The capital is Paris.'"]
        D1 --> D2
        D1 --> D3
        D1 --> D4
    end

    style NonDet fill:#ffce67,stroke:#333
    style Det fill:#56cc9d,stroke:#333,color:#fff

Sources of Non-Determinism

Source	Explanation	Controllable?
Sampling (temperature > 0)	Random token selection from distribution	Yes — set temperature=0
Top-p / Top-k filtering	Random selection within candidate set	Yes: set top_k=1 (greedy); top_p=1 only disables nucleus filtering
Floating-point non-determinism	GPU parallel operations not strictly ordered	Partially — depends on hardware
Batching effects	Different batch compositions may affect computation	No (server-side)
Model updates	Provider may update model without notice	No (use versioned models)
System prompt caching	Some providers cache and may route differently	No

How to Achieve Deterministic Output

Method	What it does	Guarantee Level
`temperature=0`	Greedy decoding (argmax)	High — nearly deterministic
`seed` parameter	Fixes random state for sampling	High (API-dependent)
`temperature=0` + `seed`	Both greedy and fixed state	Highest available
Self-hosted + fixed seed + deterministic CUDA	Full control over hardware	True determinism
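The first two rows of the table can be illustrated locally with Python's own random number generator. This is only an analogy for what happens inside a sampler: the distribution is made up, and the hardware-level effects listed earlier are out of scope here.

```python
import random

def sample_tokens(probs, num_tokens, seed=None):
    """Draw tokens from a fixed distribution; a seed makes the draws repeatable."""
    rng = random.Random(seed)
    tokens, weights = zip(*probs.items())
    return [rng.choices(tokens, weights=weights)[0] for _ in range(num_tokens)]

def greedy_tokens(probs, num_tokens):
    """temperature=0 analogue: always take the argmax, no randomness involved."""
    best = max(probs, key=probs.get)
    return [best] * num_tokens

probs = {"Paris": 0.6, "France": 0.3, "The": 0.1}

print(sample_tokens(probs, 5))             # may differ from run to run
print(sample_tokens(probs, 5, seed=42))    # same list on every run
print(greedy_tokens(probs, 5))             # always the most likely token
```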

When Determinism Matters

Use Case	Need Deterministic?	Why
Unit testing	Yes	Reproducible test assertions
Evaluation/benchmarks	Yes	Fair comparison across models
Caching	Yes	Same input → cache hit
Audit/compliance	Yes	Reproducible decisions
Creative writing	No	Variety is desired
Chat conversations	No	Natural variation is expected

Important Caveat

Even with temperature=0 and a seed, exact determinism is not always guaranteed:

GPU floating-point operations may vary across hardware versions
API providers may route requests to different hardware
Model quantization can introduce slight variations
OpenAI states: “deterministic outputs are not guaranteed” even with seed (but are “mostly deterministic”)

Q6: What are the main decoding strategies and when should you use each?

Answer:

Decoding is the process of selecting which token to generate next given the probability distribution from the model. The choice of decoding strategy dramatically affects output quality.

graph TD
    DECODE["Decoding Strategies"]
    DECODE --> DETERM["Deterministic"]
    DECODE --> STOCH["Stochastic (Sampling)"]
    DECODE --> HYBRID["Hybrid / Advanced"]

    DETERM --> GREEDY["Greedy Search
Pick argmax at each step"]
    DETERM --> BEAM["Beam Search
Track top-n hypotheses"]

    STOCH --> PURE["Pure Sampling
Sample from full distribution"]
    STOCH --> TOPK["Top-k Sampling
Sample from top k tokens"]
    STOCH --> TOPP["Top-p Sampling
Sample from nucleus"]
    STOCH --> TEMP_SAMP["Temperature Sampling
Reshape distribution then sample"]

    HYBRID --> SPEC["Speculative Decoding
Draft + verify"]
    HYBRID --> CONTRAST["Contrastive Decoding
Subtract weak model's distribution"]
    HYBRID --> GUIDED["Guided/Constrained
Enforce output structure"]

    style DECODE fill:#56cc9d,stroke:#333,color:#fff
    style DETERM fill:#6cc3d5,stroke:#333,color:#fff
    style STOCH fill:#ffce67,stroke:#333
    style HYBRID fill:#ff7851,stroke:#333,color:#fff

Detailed Comparison

Strategy	How it works	Pros	Cons
Greedy	Always pick highest probability token	Fast, deterministic, simple	Repetitive, misses better sequences
Beam Search	Track top-n partial sequences	Finds higher-probability sequences	Still repetitive, expensive, poor for open-ended
Top-k Sampling	Sample from top k tokens	Reduces nonsense, some diversity	Fixed k not adaptive to distribution
Top-p Sampling	Sample from smallest set covering p mass	Adaptive to uncertainty, natural	Slightly less predictable
Temperature + Sampling	Reshape distribution then sample	Fine-grained control	Need to tune parameter
Speculative Decoding	Small model drafts, large model verifies	2-3x faster, same quality	Needs draft model
Contrastive Decoding	Subtract amateur model’s preferences	Reduces repetition, more coherent	Complex setup
Constrained Decoding	Force output to follow grammar/schema	Guarantees valid structure	Limits expressiveness

Greedy Search: The Simplest Strategy

At each step, pick the token with the highest probability:

$$x_t = \arg\max_{x} P(x \mid x_1, \ldots, x_{t-1})$$

Problem: Greedy search is locally optimal but not globally optimal. A low-probability token now might lead to a much better overall sequence.

Example: “The dog” (0.4) → “has” (0.9) gives sequence probability 0.36, while “The nice” (0.5) → “woman” (0.4) gives 0.20. Greedy picks “nice” first but misses the better path.

Beam Search: Exploring Multiple Paths

Maintains num_beams parallel hypotheses:

Beam 1: "The" → "dog" → "has" → "a"      (prob: 0.36 × ...)
Beam 2: "The" → "nice" → "woman" → "is"   (prob: 0.20 × ...)
Beam 3: "The" → "cat" → "sat" → "on"      (prob: 0.15 × ...)

When to use beam search:

Translation (known output length)
Summarization (structured output)
NOT for open-ended generation (causes repetition)
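The greedy-versus-beam example can be reproduced with a toy next-token table. The probabilities for "dog", "nice", "has", and "woman" follow the example in this answer; every other entry is made up to fill out each distribution. Real implementations sum log-probabilities instead of multiplying raw probabilities, which this sketch skips for readability.

```python
# Next-token probabilities keyed by the last token (partly made-up values).
MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}

def beam_search(start, steps, num_beams):
    """Return (tokens, probability) of the best sequence found."""
    beams = [([start], 1.0)]
    for _ in range(steps):
        candidates = [
            (tokens + [nxt], prob * p)
            for tokens, prob in beams
            for nxt, p in MODEL[tokens[-1]].items()
        ]
        # Keep only the num_beams most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]

greedy = beam_search("The", steps=2, num_beams=1)   # beam width 1 is greedy search
beam = beam_search("The", steps=2, num_beams=2)
print(greedy)
print(beam)
```

With one beam the search commits to "nice" and ends at probability 0.20; with two beams it keeps "dog" alive long enough to find the 0.36 path.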

When to Use Which Strategy

Task	Recommended Strategy	Why
Code generation	Greedy (temp=0)	Correctness over creativity
Translation	Beam search (beams=4-5)	Quality over diversity
Creative writing	Top-p=0.95, temp=0.8	Diversity and surprise
Chat/conversation	Top-p=0.9, temp=0.7	Natural but coherent
Structured extraction	Constrained decoding	Must follow schema
JSON output	Greedy + grammar constraints	Validity guaranteed
Fast inference	Speculative decoding	Speed with no quality loss

Q7: What are frequency penalty and presence penalty, and how do they reduce repetition?

Answer:

Frequency penalty and presence penalty are post-processing adjustments to token logits that discourage the model from repeating itself.

Mathematical Definitions

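The logit $z_i$ for token $i$ is adjusted before sampling:

$$z_i' = z_i - \alpha_{\text{freq}} \cdot c_i - \alpha_{\text{pres}} \cdot \mathbb{1}[c_i > 0]$$

where $c_i$ is how many times token $i$ has appeared in the output so far, $\alpha_{\text{freq}}$ is the frequency penalty, and $\alpha_{\text{pres}}$ is the presence penalty.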

graph TD
    subgraph FP["Frequency Penalty"]
        FP1["Penalizes proportionally to
how MANY times token appeared"]
        FP2["Token appeared 5× → big penalty"]
        FP3["Token appeared 1× → small penalty"]
        FP4["Effect: Reduces repetitive words"]
    end

    subgraph PP["Presence Penalty"]
        PP1["Penalizes equally if token
appeared AT ALL (binary)"]
        PP2["Token appeared 5× → same penalty as 1×"]
        PP3["Token never appeared → no penalty"]
        PP4["Effect: Encourages new topics"]
    end

    style FP fill:#6cc3d5,stroke:#333,color:#fff
    style PP fill:#ffce67,stroke:#333

Comparison

Aspect	Frequency Penalty	Presence Penalty
Scales with count?	Yes (proportional)	No (binary: appeared or not)
Range (OpenAI)	-2.0 to 2.0	-2.0 to 2.0
Primary effect	Reduces word-level repetition	Encourages topic diversity
Use case	Avoid saying “very very very…”	Avoid staying on same topic
Analogy	“Don’t repeat words”	“Talk about new things”
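Both penalties amount to subtracting from the logit of any token that has already been generated: the frequency penalty times the token's count, plus the presence penalty once if the count is nonzero. A minimal sketch with made-up logits and a word-level "vocabulary":

```python
from collections import Counter

def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Return adjusted logits: z - freq * count - pres * (1 if count > 0 else 0)."""
    counts = Counter(generated)
    adjusted = {}
    for token, z in logits.items():
        c = counts[token]
        adjusted[token] = z - frequency_penalty * c - presence_penalty * (1 if c > 0 else 0)
    return adjusted

logits = {"weather": 2.0, "sunshine": 1.5, "book": 1.0}
generated = ["weather", "weather", "weather", "sunshine"]

print(apply_penalties(logits, generated, frequency_penalty=0.5))
print(apply_penalties(logits, generated, presence_penalty=1.0))
```

With the frequency penalty, "weather" (seen three times) drops far more than "sunshine" (seen once); with the presence penalty, both drop by the same amount and the unseen "book" is untouched.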

Practical Examples

Without penalties (both = 0): > “The weather is nice. The weather is really nice. The weather makes me happy. The weather…”

With frequency_penalty = 0.8: > “The weather is nice. It’s a beautiful day. The sunshine makes me happy. I think I’ll go outside…”

With presence_penalty = 1.0: > “The weather is nice. I’ve been reading a great book lately. My garden is blooming. Tomorrow I plan to cook…”

Repetition Penalty (Hugging Face)

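Hugging Face uses a multiplicative repetition_penalty $\theta$ instead, applied to the logit of every token that has already appeared:

$$z_i' = \begin{cases} z_i / \theta & \text{if } z_i > 0 \\ z_i \cdot \theta & \text{otherwise} \end{cases}$$

repetition_penalty = 1.0: No effect (the library default)
repetition_penalty = 1.2: Moderate de-repetition (a commonly used value)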
repetition_penalty > 1.5: Strong — may cause incoherence
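The multiplicative rule divides positive logits and multiplies negative logits by the penalty, so a seen token always becomes less likely regardless of the logit's sign. The sketch below mirrors that rule on a plain dictionary; it is an illustration of the idea rather than the library's code, and the logits are made up.

```python
def apply_repetition_penalty(logits, seen_tokens, penalty=1.2):
    """Divide positive logits and multiply negative ones by `penalty` for seen tokens."""
    seen = set(seen_tokens)
    adjusted = {}
    for token, z in logits.items():
        if token in seen:
            adjusted[token] = z / penalty if z > 0 else z * penalty
        else:
            adjusted[token] = z     # unseen tokens are left unchanged
    return adjusted

logits = {"a": 2.4, "b": -1.0, "c": 1.0}
print(apply_repetition_penalty(logits, seen_tokens=["a", "b"], penalty=1.2))
```

Unlike the additive penalties, this one does not grow with how often a token appeared: one occurrence and ten occurrences are penalized equally.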

Q8: What is `max_tokens` and how does it interact with the context window?

Answer:

max_tokens (or max_new_tokens) sets the maximum number of tokens the model will generate in its response. It’s a hard cap — generation stops even if the response is incomplete.

graph LR
    subgraph Budget["Token Budget Allocation"]
        direction LR
        CW["Context Window: 128K"]
        INPUT["Input tokens used: 50K"]
        AVAILABLE["Available for output: 78K"]
        MAX["max_tokens set: 4096"]
        ACTUAL["Actual output: min(4096, until EOS)"]
    end

    CW --> INPUT --> AVAILABLE --> MAX --> ACTUAL

    style Budget fill:#56cc9d,stroke:#333,color:#fff

Key Relationships

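The input and the requested output must fit in the context window together:

$$\text{input tokens} + \text{max\_tokens} \leq \text{context window}$$

If you set max_tokens higher than the available space, the API will either: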

Silently cap it at the available space
Return an error
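Because providers differ on which of the two behaviors they choose, it can help to do the budget check on the client side. A minimal sketch, with a hypothetical `strict` flag to switch between clamping and failing:

```python
def effective_max_tokens(context_window, input_tokens, requested_max_tokens, strict=False):
    """Clamp max_tokens to the free space, or raise if strict=True."""
    available = context_window - input_tokens
    if available <= 0:
        raise ValueError("input already fills the context window")
    if requested_max_tokens > available:
        if strict:
            raise ValueError(
                f"max_tokens={requested_max_tokens} exceeds available space {available}"
            )
        return available
    return requested_max_tokens

print(effective_max_tokens(128_000, 50_000, 4_096))    # plenty of room: unchanged
print(effective_max_tokens(128_000, 125_000, 4_096))   # clamped to the 3,000 free tokens
```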

`max_tokens` vs. `max_new_tokens`

Parameter	Framework	What it means
`max_tokens`	OpenAI, Anthropic APIs	Max tokens in the completion
`max_new_tokens`	Hugging Face Transformers	Max new tokens to generate (same concept)
`max_length`	Hugging Face (older)	Max total length (input + output)

Why Generation Stops

Generation terminates when any of these conditions is met:

Condition	Description
`max_tokens` reached	Hard output length limit
EOS token generated	Model naturally finishes its response
Stop sequence matched	A specified string pattern is found
Context window full	Input + output fills the entire window
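
To see how these four conditions interact, here is a toy decoding loop. It is a sketch under simplifying assumptions: each "token" is a plain string, `scripted_model` stands in for a real model, and the finish-reason labels are illustrative rather than any provider's exact values.

```python
def scripted_model(tokens, eos="<eos>"):
    """Return a fake next-token function that replays `tokens`, then emits EOS."""
    def next_token(step):
        return tokens[step] if step < len(tokens) else eos
    return next_token


def generate(next_token, prompt_len, context_window, max_tokens, eos="<eos>", stop=()):
    """Toy decoding loop; returns (text, finish_reason)."""
    out = []
    while True:
        if len(out) >= max_tokens:
            return "".join(out), "length"
        if prompt_len + len(out) >= context_window:
            return "".join(out), "context_full"
        token = next_token(len(out))
        if token == eos:
            return "".join(out), "eos"
        out.append(token)
        text = "".join(out)
        for seq in stop:
            idx = text.find(seq)
            if idx != -1:
                # The matched stop sequence is excluded from the output.
                return text[:idx], "stop_sequence"


model = scripted_model(["Hello", " world", "\n", "User:", " hi"])
print(generate(model, prompt_len=10, context_window=100, max_tokens=2))
print(generate(model, prompt_len=10, context_window=100, max_tokens=50))
print(generate(model, prompt_len=10, context_window=100, max_tokens=50, stop=("User:",)))
print(generate(model, prompt_len=98, context_window=100, max_tokens=50))
```

Each call ends for a different reason, which is why it is worth checking the finish reason an API returns before assuming a response is complete.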

Practical Implications

Setting	Effect	Risk
Too low (e.g., 50)	Responses get cut off mid-sentence	Incomplete, incoherent outputs
Too high (e.g., 16384)	Model can write as much as it wants	Higher cost, potential rambling
Right-sized	Complete responses without waste	Requires knowing task needs

Cost Optimization

Since APIs charge per token:

Set max_tokens appropriate to the task (not arbitrarily high)
Use stop sequences to terminate early
Monitor actual token usage vs. max_tokens budget
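
A back-of-the-envelope sketch of the first and third points follows. The per-1K-token prices are hypothetical placeholders, not taken from any provider's price list, and both helper names are made up for this example.

```python
def worst_case_cost(input_tokens, max_tokens, input_price_per_1k, output_price_per_1k):
    """Upper bound on the cost of one request, in the same currency as the prices."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = max_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


def budget_utilization(completion_tokens, max_tokens):
    """Fraction of the output budget the model actually used."""
    return completion_tokens / max_tokens


# Hypothetical prices per 1K tokens, for illustration only.
INPUT_PRICE, OUTPUT_PRICE = 0.01, 0.03

loose = worst_case_cost(2000, 16384, INPUT_PRICE, OUTPUT_PRICE)
tight = worst_case_cost(2000, 512, INPUT_PRICE, OUTPUT_PRICE)
print(f"worst case with max_tokens=16384: {loose:.4f}")
print(f"worst case with max_tokens=512:   {tight:.4f}")
print(f"utilization: {budget_utilization(300, 512):.0%}")
```

If utilization stays low across many requests, the budget can probably be reduced. If it regularly hits 100%, responses are likely being truncated.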

Q9: What are stop sequences and how do they control generation?

Answer:

Stop sequences are strings that, when generated by the model, immediately terminate generation. They’re a powerful mechanism for controlling output format and length.

graph TD
    GEN["Model Generating..."]
    GEN --> CHECK{"Generated text
contains stop sequence?"}
    CHECK -->|"No"| CONT["Continue generating"]
    CONT --> GEN
    CHECK -->|"Yes"| STOP["Stop immediately
Return output (stop seq excluded)"]

    style GEN fill:#6cc3d5,stroke:#333,color:#fff
    style STOP fill:#56cc9d,stroke:#333,color:#fff
    style CHECK fill:#ffce67,stroke:#333
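
The check in the diagram can be mimicked client-side with a minimal sketch. APIs apply stop sequences during generation. This helper, whose name is an assumption made for the example, only shows the matching rule on already-generated text: cut at the earliest match and exclude the match itself.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest stop sequence; return (output, matched_sequence).

    The matched sequence is excluded from the output. If nothing matches,
    the text is returned unchanged and matched_sequence is None.
    """
    cut = len(text)
    matched = None
    for seq in stop_sequences:
        if not seq:
            continue  # an empty string would match at position 0
        idx = text.find(seq)
        if idx != -1 and idx < cut:
            cut, matched = idx, seq
    return text[:cut], matched


raw = "Paris\nQ: What is the capital of Spain?\nA: Madrid"
print(truncate_at_stop(raw, ["Q:", "Question:"]))
```

Because the match is excluded, any stop sequence that is also part of the desired output (a closing brace, for example) has to be added back by the caller.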

Common Stop Sequence Use Cases

Use Case	Stop Sequences	Purpose
Single-line answer	`["\n"]`	Prevent multi-line responses
Code function	`["\n\n", "def ", "class "]`	Stop after one function
Structured QA	`["Q:", "Question:"]`	Stop before generating next question
Chat role-play	`["User:", "Human:"]`	Prevent model from simulating user
JSON extraction	`["}"]` or `["}\n"]`	Stop at the closing brace (flat objects only; the excluded `}` must be re-appended)
Numbered list	`["11."]`	Limit to 10 items

Example: Limiting a Numbered List

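from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List 3 fruits"}],
    stop=["\n\n", "4."],  # stop at a blank line or before a fourth numbered item
    max_tokens=200,  # safety net in case no stop sequence matches
)
print(response.choices[0].message.content)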

Without stop sequences: The model might list more than three fruits or add commentary.

With stop sequences: Generation stops before a fourth item, provided the model formats its answer as a numbered list.

Stop Sequences vs. Other Stopping Mechanisms

Mechanism	How it works	Granularity
Stop sequences	Match specific text strings	Fine (exact strings)
max_tokens	Hard token count limit	Coarse (may cut mid-word)
EOS token	Model decides it’s done	Model-controlled
Constrained decoding	Grammar forces valid endings	Structural

Best Practices

Include the delimiter that separates outputs (e.g., "\n\n" between paragraphs)
Test with variations — models might generate "\n " instead of "\n\n"
Combine with max_tokens as a safety net
Don’t over-specify — too many stop sequences can cause premature truncation

Q10: How do you choose the right configuration for different LLM tasks?

Answer:

Choosing the right parameters is about matching the creativity-accuracy tradeoff to your specific task requirements.

graph LR
    subgraph Spectrum["Creativity ↔ Accuracy Spectrum"]
        direction LR
        DET["🎯 Deterministic
temp=0, top_p=1"]
        CON["🔒 Conservative
temp=0.2, top_p=0.9"]
        BAL["⚖️ Balanced
temp=0.7, top_p=0.9"]
        CRE["🎨 Creative
temp=1.0, top_p=0.95"]
        WILD["🌀 Wild
temp=1.5, top_p=1.0"]
    end

    DET --> CON --> BAL --> CRE --> WILD

    style DET fill:#56cc9d,stroke:#333,color:#fff
    style CON fill:#6cc3d5,stroke:#333,color:#fff
    style BAL fill:#ffce67,stroke:#333
    style CRE fill:#ff7851,stroke:#333,color:#fff

Decision Framework

Question	If Yes →	If No →
Does output need to be exactly correct?	temp=0, greedy	Consider sampling
Is creativity/variety valued?	temp=0.7-1.0	temp=0-0.3
Must output follow strict format?	Constrained decoding, low temp	Higher freedom
Running evaluations/benchmarks?	temp=0, seed set	No extra constraint
Is this user-facing chat?	temp=0.7, penalties for variety	Task-dependent
Generating multiple candidates?	Higher temp, n>1	Standard settings

Complete Configuration Recipes

Recipe 1: Code Generation

temperature: 0.0
top_p: 1.0
max_tokens: 2048
stop: ["\n\n\n", "```"]
frequency_penalty: 0.0

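Why: Code requires precision, and sampling randomness tends to introduce bugs. The triple-backtick stop assumes the prompt already opens the code fence, so the first fence the model emits is the closing one.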

Recipe 2: Customer Support Bot

temperature: 0.3
top_p: 0.9
max_tokens: 512
presence_penalty: 0.2
stop: ["Human:", "Customer:"]

Why: Slightly varied but consistent, professional responses.

Recipe 3: Creative Story Writing

temperature: 0.9
top_p: 0.95
max_tokens: 4096
frequency_penalty: 0.7
presence_penalty: 0.5

Why: Maximum variety, avoids repetition, explores narrative directions.

Recipe 4: Data Extraction (JSON)

temperature: 0.0
top_p: 1.0
max_tokens: 256
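response_format: {"type": "json_object"}

Why: Must produce valid, consistent structured output. No stop sequence is set, because the matched stop text is stripped from the output: stopping on "}\n" would drop the closing brace and cut nested objects short.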

Recipe 5: Brainstorming / Ideation

temperature: 1.2
top_p: 0.95
max_tokens: 1024
frequency_penalty: 1.0
presence_penalty: 1.5
n: 5

Why: Generate diverse ideas; high penalties force exploration of new territory.
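
One way to keep such recipes in code is a small preset table with per-call overrides. The sketch below copies the numeric values from the five recipes above. The preset names and the `build_params` helper are assumptions, and stop sequences and response_format are left out for brevity.

```python
PRESETS = {
    "code": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 2048,
             "frequency_penalty": 0.0},
    "support": {"temperature": 0.3, "top_p": 0.9, "max_tokens": 512,
                "presence_penalty": 0.2},
    "story": {"temperature": 0.9, "top_p": 0.95, "max_tokens": 4096,
              "frequency_penalty": 0.7, "presence_penalty": 0.5},
    "extraction": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 256},
    "brainstorm": {"temperature": 1.2, "top_p": 0.95, "max_tokens": 1024,
                   "frequency_penalty": 1.0, "presence_penalty": 1.5, "n": 5},
}


def build_params(task, **overrides):
    """Return a fresh parameter dict for `task`, with optional overrides applied."""
    if task not in PRESETS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(PRESETS)}")
    params = dict(PRESETS[task])  # copy so the preset itself is never mutated
    params.update(overrides)
    return params


print(build_params("support"))
print(build_params("story", max_tokens=1000))
```

Treat the presets as starting points: the values are defaults to tune against your own evaluation data, not fixed rules.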

Common Mistakes

Mistake	Problem	Fix
`temperature=0` for creative tasks	Bland, repetitive output	Increase to 0.7-1.0
`temperature=1.0` for factual tasks	Hallucinations, wrong facts	Decrease to 0-0.3
Ignoring `max_tokens`	Unexpected costs, truncation	Always set an appropriate limit
Setting both `temperature` and `top_p` low	Over-constrained, degenerate	Usually modify one and keep the other at its default
No stop sequences in agentic loops	Model generates beyond intended boundary	Add role/delimiter stops

Summary Table

#	Topic	Key Concept
1	API Parameters	temperature, top_p, max_tokens, penalties, stop sequences
2	Temperature	Controls distribution sharpness: 0=greedy, 1=natural, >1=chaotic
3	Top-p vs. Top-k	Fixed-size (k) vs. adaptive probability mass (p) filtering
4	Context Window	Total input+output token budget; affects cost, latency, quality
5	Determinism	temp=0 + seed for reproducibility; true determinism is hard
6	Decoding Strategies	Greedy, beam search, sampling, speculative, constrained
7	Penalties	frequency_penalty (proportional) vs. presence_penalty (binary)
8	max_tokens	Hard output cap; interacts with context window budget
9	Stop Sequences	String patterns that terminate generation cleanly
10	Configuration Recipes	Match creativity-accuracy tradeoff to task requirements

What’s Next?

This article covered the practical configuration knowledge tested in LLM engineering interviews. For related content:

Core LLM concepts (transformers, RAG, RLHF): LLM Interview QA - 1
Advanced topics (scaling, agents, inference): LLM Interview QA - 2
ML fundamentals: ML Interview QA - 1 and ML Interview QA - 2