graph LR
subgraph Transformer["Transformer Architecture"]
direction TB
INPUT["Input Embeddings <br/>+ Positional Encoding"]
ENC["Encoder Stack (N layers)"]
DEC["Decoder Stack (N layers)"]
OUTPUT["Output Probabilities"]
INPUT --> ENC
ENC --> DEC
DEC --> OUTPUT
end
subgraph Encoder_Layer["Each Encoder Layer"]
SA["Multi-Head Self-Attention"]
FFN["Feed-Forward Network"]
LN1["Layer Norm + Residual"]
LN2["Layer Norm + Residual"]
SA --> LN1 --> FFN --> LN2
end
subgraph Decoder_Layer["Each Decoder Layer"]
MSA["Masked Multi-Head Self-Attention"]
CA["Cross-Attention (to Encoder)"]
FFN2["Feed-Forward Network"]
MSA --> CA --> FFN2
end
style Transformer fill:#56cc9d,stroke:#333,color:#fff
style Encoder_Layer fill:#6cc3d5,stroke:#333,color:#fff
style Decoder_Layer fill:#ffce67,stroke:#333
LLM Interview QA - 1
Introduction
This is Part 1 of our LLM Interview QA series. It covers 10 foundational questions that appear in nearly every LLM Engineer, AI Engineer, and Applied ML interview — from startups to FAANG. Each answer goes beyond surface-level definitions with diagrams, concrete examples, and real-world applications.
This series complements our ML Interview series. For foundational machine learning concepts, see ML Interview QA - 1. For evaluation metrics and feature engineering, see ML Interview QA - 2.
Q1: What is the Transformer architecture and why did it replace RNNs/LSTMs?
Answer:
The Transformer is a neural network architecture introduced in the 2017 paper “Attention Is All You Need”. It relies entirely on self-attention mechanisms instead of recurrence or convolution to model dependencies in sequences.
Why Transformers replaced RNNs/LSTMs
| Aspect | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | Sequential (token by token) | Parallel across all positions during training |
| Long-range dependencies | Struggles (vanishing gradients) | Handled directly via attention |
| Training speed | Slow | Much faster on GPUs |
| Context window | Limited by what the hidden state can retain | Limited by compute and memory (attention cost grows quadratically with length) |
| Positional info | Implicit in sequence order | Explicit positional encoding |
Key Insight
RNNs process tokens sequentially: to relate the first and last word of a sentence, information must pass through every intermediate hidden state. Transformers compute attention scores between all pairs of tokens at once, so any two tokens are connected in a single step. This makes training far more parallelizable and long-range dependencies much easier to capture, at the cost of compute that grows quadratically with sequence length.
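Because attention itself is order-agnostic, the "explicit positional encoding" row in the table above matters in practice. The following is a minimal sketch of the sinusoidal scheme from the original paper; the function name `sinusoidal_positional_encoding` and the toy sizes are assumptions for illustration, and many modern models use learned or rotary encodings instead.

```python
import math

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Toy sizes chosen for illustration only.
pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe[0])  # position 0: sin(0) on even dims, cos(0) on odd dims
```

Each row is added to the token embedding at that position, which gives the model a distinct, smoothly varying signal for every position.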
Q2: How does the Self-Attention mechanism work?
Answer:
Self-attention allows each token in a sequence to attend to every other token, computing a weighted sum of their representations based on relevance.
The QKV Framework
For each input token, three vectors are computed:
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information do I provide?”
The attention score is computed as:
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
where d_k is the dimension of the key vectors. Dividing by \sqrt{d_k} keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.
graph LR
subgraph Self_Attention["Self-Attention Computation"]
I["Input Embeddings"] --> Q["Q = X · W_Q"]
I --> K["K = X · W_K"]
I --> V["V = X · W_V"]
Q --> DOT["Q · K^T"]
K --> DOT
DOT --> SCALE["÷ √d_k"]
SCALE --> SOFT["Softmax"]
SOFT --> MUL["× V"]
V --> MUL
MUL --> OUT["Output"]
end
style Self_Attention fill:#6cc3d5,stroke:#333,color:#fff
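The computation in the formula and diagram above can be sketched in plain Python. This is an illustrative sketch, not an optimized implementation: it assumes Q, K, and V have already been produced by the learned projections (the `X · W_Q` step is skipped), and the toy matrices are made up.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One scaled dot-product score per key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Hypothetical 3-token sequence with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, attn = attention(Q, K, V)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of `attn` sums to 1 and shows how much one token draws from every other token.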
Multi-Head Attention
Instead of one attention function, Transformers run multiple attention heads in parallel (e.g., 8 or 16 heads). Each head learns different relationships:
- One head might learn syntactic relationships (subject-verb)
- Another might learn coreference (pronouns to their antecedents)
- Another might learn positional proximity
The outputs of all heads are concatenated and linearly projected.
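A rough sketch of the split, attend, and concatenate pattern follows. To stay short it makes a strong simplifying assumption: each head's Q/K/V "projection" is just a slice of the input columns and the final linear projection is the identity, whereas real models learn all of these matrices. The names `attend` and `multi_head_attention` are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    scale = math.sqrt(len(K[0]))
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def multi_head_attention(X, num_heads):
    """Split features into heads, attend per head, then concatenate.

    Assumption: per-head projections are column slices of X and the output
    projection is the identity. Trained models learn these matrices.
    """
    d_model = len(X[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        Xh = [row[h * d_head:(h + 1) * d_head] for row in X]
        heads.append(attend(Xh, Xh, Xh))
    # Concatenate the per-head outputs for each token.
    return [[x for head in heads for x in head[t]] for t in range(len(X))]

# Hypothetical 3-token input with d_model = 4.
X = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, -0.5],
     [1.0, 1.0, 0.0, 1.0]]
out = multi_head_attention(X, num_heads=2)
```

Because each head sees a different subspace, the two heads produce different attention patterns over the same three tokens.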
Example
For the sentence: “The cat sat on the mat because it was tired”
The self-attention mechanism helps the model understand that “it” refers to “the cat”: in a trained model, the attention weight between “it” and “cat” is typically high in at least some heads, while the weight between “it” and “mat” is lower.
Q3: What is tokenization and what are the main tokenization strategies used in LLMs?
Answer:
Tokenization is the process of splitting text into smaller units (tokens) that the model can process. Tokens are the fundamental input units for LLMs.
Main Tokenization Strategies
| Strategy | Description | Example (“unhappiness”) | Used By |
|---|---|---|---|
| Word-level | Split by spaces/punctuation | [“unhappiness”] | Early models |
| Character-level | Each character is a token | [“u”,“n”,“h”,“a”,“p”,“p”,“i”,“n”,“e”,“s”,“s”] | Some small models |
| BPE (Byte Pair Encoding) | Iteratively merge the most frequent adjacent symbol pairs | [“un”, “happiness”] | GPT-2, GPT-3, GPT-4 (byte-level BPE), LLaMA |
| WordPiece | Like BPE, but chooses merges that most increase training-data likelihood | [“un”, “##happiness”] | BERT |
| SentencePiece (Unigram) | Probabilistic subword model that prunes a large candidate vocabulary | [“▁un”, “happi”, “ness”] | T5, ALBERT, XLNet |
graph TD
TEXT["Raw Text: 'The cats are playing'"]
TEXT --> WL["Word-level: ['The', 'cats', 'are', 'playing']"]
TEXT --> BPE["BPE: ['The', ' c', 'ats', ' are', ' play', 'ing']"]
TEXT --> WP["WordPiece: ['The', 'cats', 'are', 'play', '##ing']"]
style TEXT fill:#56cc9d,stroke:#333,color:#fff
style BPE fill:#6cc3d5,stroke:#333,color:#fff
style WP fill:#ffce67,stroke:#333
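The BPE merge loop described in the table can be sketched as follows. This is a minimal character-level sketch under stated assumptions: the word counts are invented, ties go to the first pair seen, and production tokenizers (for example byte-level BPE) add pre-tokenization and byte handling that are omitted here.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dictionary."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first seen wins ties
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

# Hypothetical toy word counts.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = train_bpe(word_freqs, num_merges=3)
print(merges)
```

On this toy corpus the shared suffix "est" is learned within the first two merges, which is the kind of morphological reuse subword tokenizers rely on.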
Why Subword Tokenization?
- Handles unknown words: Can represent any word by breaking it into known subwords
- Efficient vocabulary: Balances vocabulary size with sequence length
- Morphological awareness: Captures meaningful parts (prefixes, suffixes, roots)
Practical Considerations
- 1 token ≈ 4 characters (English) or ≈ 0.75 words
- Vocabulary sizes: GPT-4 uses ~100k tokens; LLaMA 1 and 2 use ~32k tokens, while LLaMA 3 uses ~128k
- Non-English languages and code often require more tokens per word
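The rules of thumb above translate into a quick budget estimate. This is only a back-of-the-envelope sketch for English text; the function name `estimate_tokens` is an assumption, and an exact count requires running the target model's own tokenizer.

```python
def estimate_tokens(text: str) -> dict:
    """Estimate token count two ways: ~4 characters or ~0.75 words per token."""
    chars = len(text)
    words = len(text.split())
    return {
        "by_chars": round(chars / 4),
        "by_words": round(words / 0.75),
    }

print(estimate_tokens("The cats are playing"))
```

The two estimates usually land close together for English prose and diverge for code or non-English text, which is a useful signal that the heuristic no longer applies.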
Q4: What is the difference between Encoder-only, Decoder-only, and Encoder-Decoder models?
Answer:
graph LR
subgraph EO["Encoder-Only"]
EO1["Bidirectional attention"]
EO2["Sees all tokens at once"]
EO3["Best for: Understanding"]
EO4["Examples: BERT, RoBERTa"]
end
subgraph DO["Decoder-Only"]
DO1["Causal (left-to-right) attention"]
DO2["Each token sees only prior tokens"]
DO3["Best for: Generation"]
DO4["Examples: GPT-4, LLaMA, Claude"]
end
subgraph ED["Encoder-Decoder"]
ED1["Encoder: bidirectional"]
ED2["Decoder: causal + cross-attention"]
ED3["Best for: Seq-to-Seq tasks"]
ED4["Examples: T5, BART, Flan-T5"]
end
style EO fill:#56cc9d,stroke:#333,color:#fff
style DO fill:#6cc3d5,stroke:#333,color:#fff
style ED fill:#ffce67,stroke:#333
Detailed Comparison
| Aspect | Encoder-Only | Decoder-Only | Encoder-Decoder |
|---|---|---|---|
| Attention | Bidirectional | Causal (masked) | Both |
| Pre-training | Masked Language Modeling | Next Token Prediction | Span corruption / denoising |
| Strengths | Classification, NER, embeddings | Text generation, reasoning | Translation, summarization |
| Context | Full input visibility | Only left context | Full input → sequential output |
| Scaling trend | Less common at scale | Dominant paradigm (GPT-4, Claude) | Used for specific tasks (T5) |
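The difference between bidirectional and causal attention in the table above comes down to the attention mask. A minimal sketch, with the helper name `attention_mask` chosen for illustration:

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """mask[i][j] == 1 means token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

bidirectional = attention_mask(4, causal=False)  # encoder-style: every token sees all tokens
causal = attention_mask(4, causal=True)          # decoder-style: token i sees tokens 0..i

for row in causal:
    print(row)
```

In an implementation, positions with a 0 have their attention scores set to negative infinity before the softmax, so they receive zero weight.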
Why Decoder-Only Dominates Today
Most modern LLMs (GPT-4, Claude, LLaMA, Gemini) are decoder-only because:
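- Simplicity: One unified architecture and one objective (next-token prediction) for all tasks
- Scalability: The next-token objective provides a training signal at every position and has scaled predictably with more parameters and data
- Generality: Can handle classification, generation, and reasoning via prompting
- Emergent abilities: At sufficient scale, decoder-only models exhibit in-context learning and chain-of-thought reasoning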
Q5: What is fine-tuning and what are the main approaches for adapting LLMs?
Answer:
Fine-tuning is the process of further training a pre-trained LLM on a specific dataset or task to customize its behavior.
graph TD
PT["Pre-trained LLM<br/>(trained on internet-scale data)"]
PT --> FFT["Full Fine-Tuning<br/>Update ALL parameters"]
PT --> PEFT["Parameter-Efficient Fine-Tuning<br/>Update FEW parameters"]
PT --> RLHF_node["RLHF / Alignment<br/>Human preference training"]
PEFT --> LORA["LoRA"]
PEFT --> PREFIX["Prefix Tuning"]
PEFT --> ADAPTER["Adapters"]
PEFT --> QLORA["QLoRA"]
style PT fill:#56cc9d,stroke:#333,color:#fff
style FFT fill:#ff7851,stroke:#333,color:#fff
style PEFT fill:#6cc3d5,stroke:#333,color:#fff
style RLHF_node fill:#ffce67,stroke:#333
Fine-Tuning Approaches
| Approach | What it does | Parameters Updated | Cost |
|---|---|---|---|
| Full Fine-Tuning | Updates all model weights | 100% | Very high (multiple GPUs) |
| LoRA | Adds low-rank matrices to attention layers | ~0.1-1% | Low |
| QLoRA | Trains LoRA adapters on top of a frozen, 4-bit quantized base model | ~0.1-1% | Very low |
| Prefix Tuning | Prepends trainable vectors to the attention keys and values at each layer | <1% | Low |
| Adapters | Inserts small trainable layers | ~1-5% | Low |
LoRA (Low-Rank Adaptation) — Most Popular
LoRA freezes the pre-trained weights and injects trainable low-rank decomposition matrices:
W' = W + \Delta W = W + BA
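where W \in \mathbb{R}^{d \times k}, B \in \mathbb{R}^{d \times r} and A \in \mathbb{R}^{r \times k}, with rank r \ll \min(d, k). Only A and B are trained, so the update has r(d + k) parameters instead of dk.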
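The parameter savings follow directly from the shapes. The sketch below assumes a general d × k weight matrix (k = d recovers the square case) and uses illustrative sizes; the helper names are not from any library.

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Compare trainable parameters: full update (d*k) vs. LoRA (d*r + r*k)."""
    full = d * k
    lora = d * r + r * k
    return {"full": full, "lora": lora, "fraction": lora / full}

def matmul(A, B):
    """Plain-Python matrix product for small examples."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_effective_weight(W, B, A):
    """Return W' = W + B A, with W frozen and only B, A trainable."""
    BA = matmul(B, A)
    return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]

# Illustrative sizes: one 4096 x 4096 projection with rank 8.
counts = lora_param_counts(d=4096, k=4096, r=8)
print(counts)

# Tiny rank-1 example with made-up numbers.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d x r
A = [[0.5, 0.5]]     # r x k
W_prime = lora_effective_weight(W, B, A)
```

For this layer the adapter trains well under 1% of the parameters a full update would touch, which matches the range quoted in the table above.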
When to Use What
- Prompt engineering first — no training needed, quick iteration
- LoRA/QLoRA — when you need task-specific behavior with limited compute
- Full fine-tuning — when you have large datasets and significant compute budget
- RLHF — when aligning model outputs with human preferences
Q6: What is RLHF (Reinforcement Learning from Human Feedback) and how does it work?
Answer:
RLHF is a training technique that aligns LLM outputs with human preferences. It is a key part of what makes models like ChatGPT follow instructions and aim to be helpful, harmless, and honest.
graph TD
subgraph Step1["Step 1: Supervised Fine-Tuning (SFT)"]
SFT1["Pre-trained LLM"]
SFT2["Human-written demonstrations"]
SFT3["Fine-tuned model (SFT model)"]
SFT1 --> SFT2 --> SFT3
end
subgraph Step2["Step 2: Reward Model Training"]
RM1["SFT model generates multiple responses"]
RM2["Humans rank responses by quality"]
RM3["Train reward model on rankings"]
RM1 --> RM2 --> RM3
end
subgraph Step3["Step 3: PPO Optimization"]
PPO1["Policy (initialized from SFT model) generates response"]
PPO2["Reward model scores it"]
PPO3["PPO updates policy to maximize reward"]
PPO4["KL penalty prevents drift from SFT"]
PPO1 --> PPO2 --> PPO3
PPO3 --> PPO4
end
Step1 --> Step2 --> Step3
style Step1 fill:#56cc9d,stroke:#333,color:#fff
style Step2 fill:#6cc3d5,stroke:#333,color:#fff
style Step3 fill:#ffce67,stroke:#333
The Three Steps
- SFT (Supervised Fine-Tuning): Train the base model on high-quality human-written responses
- Reward Model: Train a separate model to score responses based on human preference rankings
- RL Optimization (PPO): Use the reward model’s score as the reward signal to optimize the LLM, with a KL penalty that keeps it close to the SFT model
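Steps 2 and 3 can be made concrete with two small formulas. The sketch below assumes the commonly used pairwise (Bradley-Terry style) reward-model loss and a per-sample KL penalty; the article does not specify either, so treat the function names, the `beta` value, and the numbers as illustrative.

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model loss for one comparison: -log(sigmoid(r_chosen - r_rejected))."""
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

def kl_penalized_reward(reward: float, logp_policy: float, logp_sft: float,
                        beta: float = 0.1) -> float:
    """Reward used in the RL step: reward-model score minus a KL penalty estimate.

    logp_policy - logp_sft is a single-sample estimate of how far the policy
    has drifted from the SFT model on this response.
    """
    return reward - beta * (logp_policy - logp_sft)

# The loss is small when the reward model already prefers the human-chosen response.
print(pairwise_reward_loss(r_chosen=2.0, r_rejected=0.0))
print(pairwise_reward_loss(r_chosen=0.0, r_rejected=2.0))

# Made-up log-probabilities: the policy is more confident than the SFT model, so it is penalized.
print(kl_penalized_reward(reward=1.0, logp_policy=-1.0, logp_sft=-1.5))
```

The KL term is what stops the policy from exploiting quirks of the reward model by drifting into text the SFT model would never produce.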
Alternatives to RLHF
| Method | Approach | Advantage |
|---|---|---|
| DPO (Direct Preference Optimization) | Directly optimize from preferences without a reward model | Simpler, more stable training |
| RLAIF | Use AI feedback instead of human feedback | Cheaper, more scalable |
| Constitutional AI | Self-critique against a set of principles | Less human annotation needed |
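To show why DPO needs no separate reward model, here is a sketch of its per-example loss, which works directly on log-probabilities from the policy and a frozen reference model. The argument names and the `beta` default are assumptions for illustration.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    # How much more the policy favors the chosen response than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-0.5, -2.5, -1.0, -2.0))
```

The loss has the same shape as a pairwise reward-model loss, with the policy's own log-probability ratios standing in for reward scores.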
Why RLHF Matters
Without alignment steps such as SFT and RLHF, base LLMs tend to:
- Continue text rather than answer questions
- Generate toxic, biased, or harmful content
- Hallucinate confidently
- Ignore user instructions
Q7: What are hallucinations in LLMs and how can they be mitigated?
Answer:
Hallucinations are confident-sounding outputs that are factually incorrect, nonsensical, or unfaithful to the provided context. They are one of the biggest challenges in deploying LLMs.
Types of Hallucinations
graph TD
H["LLM Hallucinations"]
H --> INT["Intrinsic Hallucination<br/>Contradicts the source input"]
H --> EXT["Extrinsic Hallucination<br/>Cannot be verified from source"]
INT --> INT_EX["Example: Summary says 'John went to Paris'<br/>when source says 'John went to London'"]
EXT --> EXT_EX["Example: Model adds details<br/>not present in any source"]
style H fill:#ff7851,stroke:#333,color:#fff
style INT fill:#ffce67,stroke:#333
style EXT fill:#6cc3d5,stroke:#333,color:#fff
Causes
| Cause | Explanation |
|---|---|
| Training data noise | Incorrect or contradictory information in pre-training corpus |
| Knowledge cutoff | Model generates outdated information |
| Pattern completion | Model prioritizes fluency over accuracy |
| Exposure bias | The model trains on ground-truth prefixes but generates from its own outputs, so early errors compound during autoregressive generation |
| Lack of grounding | No mechanism to verify claims against facts |
Mitigation Strategies
| Strategy | How it helps |
|---|---|
| RAG (Retrieval-Augmented Generation) | Grounds responses in retrieved documents |
| Chain-of-thought prompting | Forces step-by-step reasoning, reduces logical errors |
| Temperature reduction | Lowers randomness, picks more likely tokens |
| Self-consistency | Generate multiple answers, pick the most common |
| Constrained decoding | Restrict outputs to valid formats |
| Citation requirements | Force model to cite sources |
| Fine-tuning on verified data | Reduces errors learned from noisy data and can teach the model to say “I don’t know” |
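Of the strategies in the table, self-consistency is the easiest to sketch: sample several answers and keep the most common one. The example assumes the final answers have already been extracted from the sampled generations; the sample values and the function name are hypothetical.

```python
from collections import Counter

def self_consistent_answer(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the share of samples that agree with it."""
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)

# Hypothetical final answers from five sampled generations of the same prompt.
samples = ["42", "42 ", "41", "42", "40"]
answer, agreement = self_consistent_answer(samples)
print(answer, agreement)
```

A low agreement share is itself useful: it can trigger a fallback such as retrieval, a stronger model, or an explicit "not sure" response.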
Real-World Impact
Hallucinations are especially costly in high-stakes applications (legal, medical, financial). Production LLM systems in these domains commonly use RAG or other grounding techniques to reduce, though not eliminate, hallucinations.
Q8: What is Retrieval-Augmented Generation (RAG) and why is it important?
Answer:
RAG combines a retrieval system with a generative LLM to ground responses in external knowledge, reducing hallucinations and enabling access to up-to-date or domain-specific information.
graph LR
Q["User Query"]
Q --> EMB["Embed Query"]
EMB --> SEARCH["Vector Search<br/>(retrieve top-k documents)"]
DB["Document Store<br/>(vector database)"] --> SEARCH
SEARCH --> CONTEXT["Retrieved Context"]
CONTEXT --> PROMPT["Augmented Prompt<br/>(query + context)"]
Q --> PROMPT
PROMPT --> LLM["LLM generates answer"]
LLM --> ANS["Grounded Response"]
style Q fill:#56cc9d,stroke:#333,color:#fff
style SEARCH fill:#6cc3d5,stroke:#333,color:#fff
style LLM fill:#ffce67,stroke:#333
style ANS fill:#56cc9d,stroke:#333,color:#fff
RAG Pipeline Components
| Component | Purpose | Common Tools |
|---|---|---|
| Document Loader | Ingest documents (PDF, web, DB) | LangChain, LlamaIndex |
| Chunking | Split documents into manageable pieces | Recursive, semantic splitting |
| Embedding Model | Convert text to dense vectors | OpenAI text-embedding-ada-002, BGE, E5 |
| Vector Store | Store and search embeddings | Pinecone, Weaviate, ChromaDB, FAISS |
| Retriever | Find relevant chunks for a query | Similarity search, hybrid search |
| Generator (LLM) | Produce final answer from context | GPT-4, Claude, LLaMA |
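The components above fit together as embed, retrieve, then prompt. The following toy sketch uses a bag-of-words counter as a stand-in for a real embedding model and a Python list as a stand-in for a vector store; `embed`, `retrieve`, `build_prompt`, and the sample chunks are all hypothetical, and the final LLM call is left out.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts (a real system would use a dense embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Combine retrieved context and the user query into one augmented prompt."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical document chunks.
chunks = [
    "LoRA adds trainable low-rank matrices to frozen weights.",
    "RAG retrieves documents and adds them to the prompt.",
    "Temperature controls randomness during sampling.",
]
query = "How does RAG use retrieved documents?"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Swapping the toy pieces for a real embedding model and vector store changes the quality of retrieval but not the shape of the pipeline.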
RAG vs. Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge update | Instant (update document store) | Requires retraining |
| Cost | Lower (no GPU training) | Higher (compute for training) |
| Hallucination | Reduced (grounded in docs) | Can still hallucinate |
| Use case | Dynamic knowledge, Q&A | Style/behavior change |
| Transparency | Can cite sources | Black-box |
When to Use RAG
- Knowledge changes frequently (news, documentation)
- Need verifiable, source-cited answers
- Domain-specific knowledge not in pre-training data
- Legal/compliance requirements for traceability
Q9: What is prompt engineering and what are the key techniques?
Answer:
Prompt engineering is the practice of designing inputs to LLMs to elicit desired outputs without modifying model weights. It is usually the most accessible and cheapest way to control LLM behavior, which is why it is typically tried before fine-tuning.
Key Prompting Techniques
graph TD
PE["Prompt Engineering Techniques"]
PE --> ZS["Zero-Shot<br/>'Classify this review as positive/negative'"]
PE --> FS["Few-Shot<br/>'Here are 3 examples, now do this one'"]
PE --> COT["Chain-of-Thought<br/>'Think step by step'"]
PE --> SC["Self-Consistency<br/>Sample multiple CoT paths, majority vote"]
PE --> TOT["Tree-of-Thought<br/>Explore multiple reasoning branches"]
PE --> ROLE["Role Prompting<br/>'You are an expert data scientist...'"]
style PE fill:#56cc9d,stroke:#333,color:#fff
style COT fill:#6cc3d5,stroke:#333,color:#fff
style SC fill:#ffce67,stroke:#333
Comparison of Techniques
| Technique | When to Use | Performance Boost (indicative; varies by model and task) |
|---|---|---|
| Zero-shot | Simple tasks, large models | Baseline |
| Few-shot | Need format guidance, smaller models | +10-30% on structured tasks |
| Chain-of-thought | Reasoning, math, logic | +20-50% on reasoning tasks |
| Self-consistency | High-accuracy requirements | +5-15% over single CoT |
| Tree-of-thought | Complex multi-step problems | Best for planning/search |
System Prompt Best Practices
- Be specific: “Extract the person’s name, company, and role” > “Extract information”
- Define format: Specify JSON, markdown, or other output structures
- Set constraints: “Answer only based on the provided context”
- Provide examples: Show input-output pairs for complex tasks
- Assign a role: “You are a senior Python developer reviewing code”
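Several of these practices (role, explicit format, examples) can be combined in a small prompt builder. This is a sketch for a sentiment task; the function name, field labels, and example reviews are assumptions, not a prescribed template.

```python
def build_few_shot_prompt(role: str, instruction: str,
                          examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a role, an instruction, labeled examples, and the new input."""
    lines = [role, instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # End with the unanswered case so the model completes the label.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    role="You are a careful sentiment classifier.",
    instruction="Answer with exactly one word: positive or negative.",
    examples=[
        ("Great battery life and a sharp screen.", "positive"),
        ("Stopped working after two days.", "negative"),
    ],
    query="The keyboard feels cheap and the fan is loud.",
)
print(prompt)
```

Keeping the example format identical to the final slot is what makes few-shot prompting effective as format guidance.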
Temperature and Sampling Parameters
| Parameter | Effect | Use Case |
|---|---|---|
| Temperature (0-2) | Scales logits before the softmax. Lower = more deterministic; 0 is typically greedy decoding | 0 for factual, 0.7-1.0 for creative |
| Top-p (nucleus sampling) | Samples from the smallest set of tokens whose cumulative probability reaches p | 0.9 for balanced generation |
| Top-k | Samples only from the k most likely tokens | Cuts off the unlikely tail of the distribution |
| Frequency penalty | Lowers the score of tokens in proportion to how often they have already appeared | Longer outputs without loops |
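The first three parameters in the table can be sketched as transformations of the next-token distribution. The logits and probabilities below are made up, and treating temperature 0 as greedy argmax is a common convention assumed here rather than a universal rule.

```python
import math

def apply_temperature(logits, temperature):
    """Turn logits into probabilities; temperature 0 is treated as greedy (argmax)."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def _renormalize(probs, keep):
    """Zero out tokens outside `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)

logits = [2.0, 1.0, 0.0, -1.0]          # hypothetical next-token logits
sharp = apply_temperature(logits, 0.5)  # lower temperature: more peaked
flat = apply_temperature(logits, 1.5)   # higher temperature: flatter

probs = [0.5, 0.3, 0.15, 0.05]          # hypothetical next-token probabilities
top2 = top_k_filter(probs, 2)
nucleus = top_p_filter(probs, 0.9)
```

To draw an actual token from any of these distributions, `random.choices(range(len(p)), weights=p)` from the standard library is enough.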
Q10: What are the key challenges and considerations when deploying LLMs in production?
Answer:
Deploying LLMs in production involves challenges beyond model accuracy — including latency, cost, safety, and reliability.
graph TD
PROD["LLM Production Challenges"]
PROD --> PERF["Performance"]
PROD --> COST["Cost"]
PROD --> SAFETY["Safety & Guardrails"]
PROD --> EVAL["Evaluation"]
PROD --> OPS["Operations"]
PERF --> P1["Latency (time to first token, tokens per second)"]
PERF --> P2["Throughput"]
PERF --> P3["Context window limits"]
COST --> C1["Token costs"]
COST --> C2["Infrastructure"]
COST --> C3["Caching strategies"]
SAFETY --> S1["Content filtering"]
SAFETY --> S2["PII detection"]
SAFETY --> S3["Prompt injection defense"]
EVAL --> E1["Automated metrics"]
EVAL --> E2["Human evaluation"]
EVAL --> E3["A/B testing"]
OPS --> O1["Monitoring & observability"]
OPS --> O2["Version management"]
OPS --> O3["Fallback strategies"]
style PROD fill:#56cc9d,stroke:#333,color:#fff
style PERF fill:#6cc3d5,stroke:#333,color:#fff
style SAFETY fill:#ff7851,stroke:#333,color:#fff
style COST fill:#ffce67,stroke:#333
Key Production Patterns
| Pattern | Purpose | Implementation |
|---|---|---|
| Caching | Reduce cost & latency | Semantic cache (similar queries), exact cache |
| Streaming | Improve perceived latency | Server-sent events, token-by-token delivery |
| Guardrails | Prevent harmful outputs | Input/output validators, content filters |
| Fallbacks | Handle failures gracefully | Model cascading, rule-based backup |
| Rate limiting | Manage costs and abuse | Token budgets, per-user limits |
| Observability | Monitor quality over time | Log prompts/responses, track metrics |
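Two of these patterns, exact-match caching and fallbacks, fit in a few lines. In the sketch below `fake_llm` and `failing_llm` are stand-ins for real model calls, and the class and function names are hypothetical; a semantic cache would replace the hash lookup with an embedding similarity search.

```python
import hashlib

class ExactCache:
    """Exact-match response cache keyed on model name and prompt."""

    def __init__(self):
        self._store = {}

    def _key(self, model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call_fn):
        key = self._key(model, prompt)
        if key not in self._store:
            # Cache miss: pay for one model call and remember the result.
            self._store[key] = call_fn(prompt)
        return self._store[key]

def with_fallback(prompt: str, primary, backup):
    """Try the primary model; on any failure, use the backup instead."""
    try:
        return primary(prompt)
    except Exception:
        return backup(prompt)

# Stand-ins for real model calls (hypothetical).
calls = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"echo: {prompt}"

def failing_llm(prompt: str) -> str:
    raise TimeoutError("simulated outage")

cache = ExactCache()
first = cache.get_or_call("demo-model", "hello", fake_llm)
second = cache.get_or_call("demo-model", "hello", fake_llm)  # served from cache
rescued = with_fallback("hello", failing_llm, lambda p: "Sorry, please try again later.")
```

Including the model name in the cache key avoids serving one model's answer for another after a version change.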
Optimization Techniques
| Technique | Benefit |
|---|---|
| Quantization (4-bit, 8-bit) | 2-4x less weight memory than 16-bit, usually with small quality loss |
| KV-cache optimization | Faster inference for long contexts |
| Speculative decoding | Often around 2-3x faster decoding, depending on how often the draft model’s tokens are accepted |
| Model distillation | Smaller, faster models that mimic larger ones |
| Prompt compression | Reduce token count while preserving meaning |
| Batching | Higher throughput for concurrent requests |
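The memory reduction from quantization is simple arithmetic on the weights. The sketch below uses an illustrative 7-billion-parameter model and counts weights only; activations, the KV cache, and quantization metadata add to the real footprint.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Illustrative model size: 7 billion parameters.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Going from 16-bit to 8-bit halves the weight memory and 4-bit quarters it, which is where the 2-4x range comes from.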
Evaluation in Production
- Automated: BLEU, ROUGE, BERTScore for reference-based generation quality (weaker signals for open-ended outputs)
- LLM-as-Judge: Use a stronger model to evaluate outputs
- Human feedback: Thumbs up/down, preference ratings
- Task-specific: Accuracy, F1, faithfulness scores
- Safety: Toxicity rates, refusal rates, PII leakage
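As a flavor of what the automated metrics measure, here is a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplified sketch with whitespace tokenization and no stemming, not the official ROUGE implementation, and the function name is an assumption.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over overlapping word counts between a candidate and a reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat", "the cat sat down"))
```

Overlap metrics like this reward matching words, not correctness, which is why they are usually paired with LLM-as-judge or human review for open-ended answers.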
Security Considerations
- Prompt injection: Adversarial inputs that override system instructions
- Data leakage: Model revealing training data or system prompts
- PII exposure: Generating or storing personally identifiable information
- Jailbreaking: Users bypassing safety guardrails
Summary Table
| # | Topic | Key Concept |
|---|---|---|
| 1 | Transformer Architecture | Self-attention replaces recurrence for parallel, long-range processing |
| 2 | Self-Attention | QKV mechanism computes token relevance scores |
| 3 | Tokenization | Subword strategies (BPE, WordPiece) balance vocabulary and sequence length |
| 4 | Model Types | Encoder-only, decoder-only, encoder-decoder serve different tasks |
| 5 | Fine-Tuning | LoRA/QLoRA enable efficient adaptation with minimal parameters |
| 6 | RLHF | Three-step alignment: SFT → Reward Model → PPO |
| 7 | Hallucinations | Confident wrong outputs; mitigated by RAG, CoT, lower temperature |
| 8 | RAG | Retrieval + generation for grounded, up-to-date responses |
| 9 | Prompt Engineering | Zero-shot, few-shot, CoT, and sampling parameters |
| 10 | Production Deployment | Latency, cost, safety, evaluation, and operational concerns |
What’s Next?
This article covered the foundational LLM concepts most commonly tested in interviews. For deeper dives into specific topics:
- ML fundamentals that underpin LLMs: ML Interview QA - 1
- Evaluation metrics and data handling: ML Interview QA - 2