graph TD
CONFIG["LLM Configuration Parameters"]
CONFIG --> GEN["Generation Control"]
CONFIG --> SAMP["Sampling Parameters"]
CONFIG --> OUT["Output Control"]
CONFIG --> SYS["System Parameters"]
GEN --> G1["temperature"]
GEN --> G2["top_p (nucleus)"]
GEN --> G3["top_k"]
GEN --> G4["seed"]
SAMP --> S1["frequency_penalty"]
SAMP --> S2["presence_penalty"]
SAMP --> S3["repetition_penalty"]
SAMP --> S4["logit_bias"]
OUT --> O1["max_tokens / max_new_tokens"]
OUT --> O2["stop sequences"]
OUT --> O3["n (num_return_sequences)"]
OUT --> O4["stream"]
SYS --> SY1["model"]
SYS --> SY2["system prompt"]
SYS --> SY3["response_format"]
SYS --> SY4["tools / functions"]
style CONFIG fill:#56cc9d,stroke:#333,color:#fff
style GEN fill:#6cc3d5,stroke:#333,color:#fff
style SAMP fill:#ffce67,stroke:#333
style OUT fill:#ff7851,stroke:#333,color:#fff
LLM Interview QA - 3
LLM interview, LLM parameters, temperature, top-p nucleus sampling, top-k sampling, context window, decoding strategies, greedy decoding, beam search, deterministic generation, max tokens, frequency penalty, stop sequences
Introduction
This is Part 3 of our LLM Interview QA series, focused on LLM configuration and generation control. Understanding how to configure LLM parameters — temperature, sampling strategies, context windows, and decoding methods — is essential for building reliable AI systems.
For foundational LLM concepts (transformers, attention, RAG, RLHF), see LLM Interview QA - 1. For advanced topics (scaling, quantization, agents), see LLM Interview QA - 2. For ML fundamentals, see ML Interview QA - 1.
Q1: What are the main configurable parameters when calling an LLM API?
Answer:
When making an LLM API call, several parameters control the behavior, quality, and cost of the generated output.
Parameter Overview
| Parameter | Range | Default | Purpose |
|---|---|---|---|
| temperature | 0.0 – 2.0 | 1.0 | Controls randomness of output |
| top_p | 0.0 – 1.0 | 1.0 | Nucleus sampling threshold |
| top_k | 1 – vocab_size | 50 (varies) | Limits token candidates |
| max_tokens | 1 – context_limit | Model-specific | Maximum output length |
| frequency_penalty | -2.0 – 2.0 | 0.0 | Penalizes repeated tokens |
| presence_penalty | -2.0 – 2.0 | 0.0 | Encourages topic diversity |
| seed | Any integer | None | Enables deterministic output |
| stop | List of strings | None | Stops generation at specific tokens |
| n | 1+ | 1 | Number of completions to generate |
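As a minimal sketch, the ranges in the table can be turned into a pre-flight check before a request is sent. The function name `validate_params` and the `PARAM_RANGES` dictionary are hypothetical helpers, and the ranges are copied from the table rather than from any particular provider, so real limits may differ.

```python
# Ranges copied from the parameter table; actual providers may use other limits.
PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
}


def validate_params(params: dict) -> list[str]:
    """Return a list of human-readable problems found in the sampling parameters."""
    problems = []
    for name, (low, high) in PARAM_RANGES.items():
        if name in params and not (low <= params[name] <= high):
            problems.append(f"{name}={params[name]} is outside [{low}, {high}]")
    # max_tokens and n are counts, so they must be at least 1 when present.
    for name in ("max_tokens", "n"):
        if name in params and params[name] < 1:
            problems.append(f"{name} must be at least 1")
    return problems


print(validate_params({"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024}))
print(validate_params({"temperature": 2.5, "presence_penalty": -3, "n": 0}))
```

Catching out-of-range values locally is cheaper than waiting for the API to reject the request.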
Practical Configuration Examples
| Use Case | temperature | top_p | max_tokens | Other |
|---|---|---|---|---|
| Code generation | 0.0 – 0.2 | 1.0 | 2048 | stop=["\n\n"] |
| Creative writing | 0.8 – 1.2 | 0.95 | 4096 | frequency_penalty=0.5 |
| Data extraction | 0.0 | 1.0 | 512 | response_format=json |
| Chat conversation | 0.7 | 0.9 | 1024 | presence_penalty=0.3 |
| Factual Q&A | 0.0 – 0.3 | 1.0 | 256 | — |
Q2: What is temperature and how does it affect LLM output?
Answer:
Temperature controls the randomness of the probability distribution over the vocabulary at each generation step. It’s applied to the logits before the softmax function.
Mathematical Definition
Given logits z_i for each token i in the vocabulary:
P(w_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}
where T is the temperature.
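The formula can be checked numerically with a short sketch. The function name `softmax_with_temperature` and the three logits are illustrative assumptions; the logits are chosen so that T = 1 reproduces a 60% / 25% / 15% split.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Compute softmax(logits / T). T must be positive; T = 0 is handled as argmax by convention."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T=0 as greedy argmax")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three tokens (log-probabilities of 0.60, 0.25, 0.15).
logits = [math.log(0.60), math.log(0.25), math.log(0.15)]
for t in (0.3, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower values of T concentrate mass on the top token, and higher values move mass toward the tail.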
graph LR
subgraph T0["Temperature = 0 (Greedy)"]
T0_1["Token A: 99.9%"]
T0_2["Token B: 0.1%"]
T0_3["Token C: ~0%"]
end
subgraph T07["Temperature = 0.7"]
T07_1["Token A: 70%"]
T07_2["Token B: 20%"]
T07_3["Token C: 10%"]
end
subgraph T1["Temperature = 1.0 (Default)"]
T1_1["Token A: 60%"]
T1_2["Token B: 25%"]
T1_3["Token C: 15%"]
end
subgraph T2["Temperature = 2.0"]
T2_1["Token A: 47%"]
T2_2["Token B: 30%"]
T2_3["Token C: 23%"]
end
style T0 fill:#56cc9d,stroke:#333,color:#fff
style T07 fill:#6cc3d5,stroke:#333,color:#fff
style T1 fill:#ffce67,stroke:#333
style T2 fill:#ff7851,stroke:#333,color:#fff
Effect of Temperature
| Temperature | Distribution | Behavior | Output Character |
|---|---|---|---|
| T → 0 | Extremely peaked | Always picks highest-probability token | Deterministic, repetitive, safe |
| T = 0.3 | Slightly softened | Mostly picks top tokens, rare surprises | Conservative, coherent |
| T = 0.7 | Moderately spread | Balanced between likely and creative | Good default for most tasks |
| T = 1.0 | Original distribution | Model’s “natural” uncertainty | Raw model behavior |
| T > 1.0 | Flattened | Low-probability tokens become likely | Creative but potentially incoherent |
| T = 2.0 | Strongly flattened | Low-probability tokens are sampled often | Frequently incoherent |
Intuition
Think of temperature as a “creativity knob”:
- Low temperature (0–0.3): The model is confident and focused — it picks the most obvious next word. Great for factual tasks, code, structured extraction.
- Medium temperature (0.5–0.8): The model is balanced — it explores alternatives while staying coherent. Best for general chat and writing.
- High temperature (1.0+): The model is adventurous — it considers unlikely words, producing surprising or creative outputs.
Common Interview Follow-Up: “What does temperature=0 actually mean?”
Setting temperature=0 is a shortcut for greedy decoding: the model always selects the single highest-probability token. Because the formula divides by T, T = 0 cannot be evaluated literally, so implementations treat it as a special case. However:
- The logits still come from floating-point arithmetic, so minor non-determinism can occur across hardware
- Most APIs interpret temperature=0 as “return the argmax token”
- Some providers also offer a seed parameter to improve reproducibility, usually on a best-effort basis
Q3: What is the difference between Top-p (nucleus) sampling and Top-k sampling?
Answer:
Both Top-p and Top-k are token filtering strategies that limit which tokens are considered during generation, but they differ in how they determine the candidate set.
Top-k Sampling
Select the k most probable tokens and redistribute probability among them:
V_{\text{top-k}} = \{w_1, w_2, \ldots, w_k\} \quad \text{(ordered by probability)}
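A minimal sketch of this filtering step follows. The function name `top_k_filter` and the six-token distribution are illustrative assumptions; the listed probabilities sum to less than 1 because the remaining mass is assumed to belong to other tokens.

```python
def top_k_filter(probs: dict, k: int) -> dict:
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution.
probs = {"the": 0.40, "a": 0.25, "my": 0.15, "his": 0.10, "our": 0.05, "their": 0.03}
print(top_k_filter(probs, k=3))
```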
Top-p (Nucleus) Sampling
Select the smallest set of tokens whose cumulative probability reaches or exceeds p:
V_{\text{top-p}} = \text{smallest } V' \text{ such that } \sum_{w \in V'} P(w) \geq p
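The definition translates into a short loop over tokens sorted by probability. This is a sketch under assumptions: `top_p_filter` is a hypothetical name, the distributions are made up, and the small tolerance is only there to absorb floating-point rounding in the running sum.

```python
def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest high-probability prefix whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p - 1e-12:  # tolerance for floating-point rounding
            break
    return {token: prob / cumulative for token, prob in kept}


# Hypothetical flat-ish and peaked distributions.
flat = {"the": 0.40, "a": 0.25, "my": 0.15, "his": 0.10, "our": 0.05, "their": 0.03}
peaked = {"Paris": 0.95, "Lyon": 0.03, "Rome": 0.02}
print(top_p_filter(flat, p=0.9))
print(top_p_filter(peaked, p=0.9))
```

The same threshold keeps four candidates for the flat distribution and one for the peaked distribution, which is the adaptive behavior that distinguishes top-p from top-k.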
graph TD
subgraph TopK["Top-k = 3 (Fixed Size)"]
direction LR
K1["'the' (0.40) ✓"]
K2["'a' (0.25) ✓"]
K3["'my' (0.15) ✓"]
K4["'his' (0.10) ✗"]
K5["'our' (0.05) ✗"]
K6["'their' (0.03) ✗"]
end
subgraph TopP["Top-p = 0.9 (Dynamic Size)"]
direction LR
P1["'the' (0.40) ✓ → cumulative: 0.40"]
P2["'a' (0.25) ✓ → cumulative: 0.65"]
P3["'my' (0.15) ✓ → cumulative: 0.80"]
P4["'his' (0.10) ✓ → cumulative: 0.90 ≥ p"]
P5["'our' (0.05) ✗"]
P6["'their' (0.03) ✗"]
end
style TopK fill:#6cc3d5,stroke:#333,color:#fff
style TopP fill:#56cc9d,stroke:#333,color:#fff
Key Differences
| Aspect | Top-k | Top-p |
|---|---|---|
| Candidate set size | Fixed (always k tokens) | Dynamic (varies per step) |
| Adapts to distribution shape | No — same k regardless of certainty | Yes — fewer tokens when confident |
| Risk when distribution is peaked | Includes unlikely tokens unnecessarily | Naturally narrows to top few |
| Risk when distribution is flat | May exclude reasonable tokens | Naturally includes more candidates |
Why Top-p is Generally Preferred
Consider two scenarios at different generation steps:
Step A (peaked distribution): Model is 95% sure the next word is “Paris”
- Top-k=50: Considers 50 tokens (49 are noise)
- Top-p=0.95: Considers only 1-2 tokens (adaptive!)
Step B (flat distribution): Model is uncertain, many tokens are equally likely
- Top-k=50: Might miss some reasonable candidates if vocabulary is large
- Top-p=0.95: Includes all tokens until 95% mass is covered (could be 100+ tokens)
Combining Top-k and Top-p
In practice, many systems use both simultaneously:
- First apply Top-k to limit to k candidates
- Then apply Top-p within those k candidates
This provides both an upper bound (Top-k) and adaptive filtering (Top-p).
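The two-stage filter described above can be sketched as follows. The function names, the distribution, and the choice to renormalize after the top-k step are assumptions made for illustration; specific inference libraries may order or normalize these steps differently.

```python
import random


def filter_top_k_then_top_p(probs: dict, k: int, p: float) -> dict:
    """Apply top-k first, renormalize, then apply top-p within the survivors."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total_k = sum(prob for _, prob in ranked)
    ranked = [(tok, prob / total_k) for tok, prob in ranked]
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p - 1e-12:  # tolerance for floating-point rounding
            break
    return {tok: prob / cumulative for tok, prob in kept}


def sample(probs: dict, rng: random.Random) -> str:
    """Draw one token according to the filtered probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token distribution.
probs = {"the": 0.40, "a": 0.25, "my": 0.15, "his": 0.10, "our": 0.05, "their": 0.03}
candidates = filter_top_k_then_top_p(probs, k=5, p=0.9)
print(candidates)
print(sample(candidates, random.Random(0)))
```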
Recommended Settings
| Task | top_k | top_p | Rationale |
|---|---|---|---|
| Deterministic (code, facts) | 1 | 1.0 | Equivalent to greedy |
| Balanced (chat) | 40-50 | 0.9 | Diverse but coherent |
| Creative (stories) | 100+ | 0.95 | Wide exploration |
| Structured output (JSON) | 5-10 | 0.8 | Limited, safe choices |
Q4: What is the context window and how does it constrain LLM behavior?
Answer:
The context window (also called context length or maximum sequence length) is the total number of tokens an LLM can process in a single inference call — this includes both input tokens and output tokens.
\text{Context Window} = \text{Input Tokens (prompt)} + \text{Output Tokens (completion)}
graph LR
subgraph CW["Context Window (e.g., 128K tokens)"]
direction LR
SYS["System Prompt<br/>(500 tokens)"]
CTX["Retrieved Context / RAG<br/>(10,000 tokens)"]
HIST["Conversation History<br/>(5,000 tokens)"]
USER["User Message<br/>(200 tokens)"]
RESP["Model Response<br/>(max_tokens: 4,096)"]
end
style CW fill:#56cc9d,stroke:#333,color:#fff
style RESP fill:#ffce67,stroke:#333
Context Window Sizes (2024–2026)
| Model | Context Window | Notes |
|---|---|---|
| GPT-3.5 Turbo | 16K tokens | ~12K words |
| GPT-4 Turbo | 128K tokens | ~96K words; the original GPT-4 offered 8K and 32K variants |
| GPT-4o | 128K tokens | ~96K words |
| Claude 3.5 Sonnet | 200K tokens | ~150K words |
| Gemini 1.5 Pro | 1M–2M tokens | Longest available |
| LLaMA 3.1 | 128K tokens | Open-source |
| Mistral Large 2 | 128K tokens | |
| DeepSeek-V3 | 128K tokens | |
What Happens When You Exceed the Context Window?
| Behavior | Description |
|---|---|
| Truncation | Some clients and frameworks silently drop the oldest tokens so the input fits |
| Error | API rejects the request if input exceeds limit |
| Degraded performance | Even within the limit, recall of details placed in the middle of a long input tends to drop |
Context Window vs. Effective Context
Key insight for interviews: The advertised context window is not the same as effective context:
| Concept | Meaning |
|---|---|
| Maximum context | Technical limit the model supports |
| Effective context | Length at which performance remains high |
| “Lost in the middle” | Information in the center of long contexts is often missed |
| Needle-in-a-haystack | Benchmark: can the model find a fact placed at position X? |
Strategies for Context Window Management
| Strategy | How it works |
|---|---|
| Chunking + RAG | Only retrieve relevant chunks, don’t stuff everything |
| Summarization | Compress conversation history into summaries |
| Sliding window | Keep recent messages + system prompt, drop old middle |
| Hierarchical context | Summary of old messages + full recent messages |
| Prompt compression | Use tools like LLMLingua to compress prompts |
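As one illustration, the sliding-window strategy from the table can be sketched in a few lines. Everything here is an assumption for demonstration: `count_tokens` is a crude whitespace counter standing in for a real tokenizer, and `sliding_window` is a hypothetical helper, not a library function.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def sliding_window(system_prompt: str, messages: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus as many of the most recent messages as fit in the budget."""
    remaining = budget - count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order


history = [
    "first question here",
    "a long first answer with many words",
    "second question",
    "short answer",
]
print(sliding_window("You are helpful.", history, budget=8))
```

A production version would count tokens with the model's own tokenizer and would usually reserve room for the response as well.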
Cost Implications
Context window directly affects cost:
\text{Cost} = (\text{Input tokens} \times \text{price/input token}) + (\text{Output tokens} \times \text{price/output token})
Longer contexts mean higher costs, higher latency, and more KV-cache memory usage.
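A worked instance of the cost formula, as a sketch: the token counts follow the context-window example earlier in this answer (500 + 10,000 + 5,000 + 200 input tokens and 4,096 output tokens), while the prices per million tokens are placeholders, not any provider's actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float, output_price_per_million: float) -> float:
    """Cost of one request when input and output tokens are priced separately."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Placeholder prices in dollars per million tokens (hypothetical).
cost = request_cost(15_700, 4_096, input_price_per_million=2.50, output_price_per_million=10.00)
print(f"${cost:.5f} per request")
```

With these placeholder prices, the output tokens cost slightly more than the input even though there are far fewer of them, which is why capping output length matters.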
Q5: Is LLM generation deterministic? How do you achieve reproducible outputs?
Answer:
By default, LLM generation is non-deterministic — the same prompt can produce different outputs across calls. This is intentional but can be controlled.
graph TD
subgraph NonDet["Non-Deterministic (Default)"]
ND1["Same prompt"]
ND2["Run 1: 'The capital is Paris.'"]
ND3["Run 2: 'Paris is the capital of France.'"]
ND4["Run 3: 'France's capital city is Paris.'"]
ND1 --> ND2
ND1 --> ND3
ND1 --> ND4
end
subgraph Det["Deterministic (Configured)"]
D1["Same prompt + seed + temp=0"]
D2["Run 1: 'The capital is Paris.'"]
D3["Run 2: 'The capital is Paris.'"]
D4["Run 3: 'The capital is Paris.'"]
D1 --> D2
D1 --> D3
D1 --> D4
end
style NonDet fill:#ffce67,stroke:#333
style Det fill:#56cc9d,stroke:#333,color:#fff
Sources of Non-Determinism
| Source | Explanation | Controllable? |
|---|---|---|
| Sampling (temperature > 0) | Random token selection from distribution | Yes — set temperature=0 |
| Top-p / Top-k filtering | Random selection within candidate set | Yes: set top_k=1 to force greedy (top_p=1 only disables nucleus filtering) |
| Floating-point non-determinism | GPU parallel operations not strictly ordered | Partially — depends on hardware |
| Batching effects | Different batch compositions may affect computation | No (server-side) |
| Model updates | Provider may update model without notice | No (use versioned models) |
| System prompt caching | Some providers cache and may route differently | No |
How to Achieve Deterministic Output
| Method | What it does | Guarantee Level |
|---|---|---|
| temperature=0 | Greedy decoding (argmax) | High (nearly deterministic) |
| seed parameter | Fixes random state for sampling | High (API-dependent) |
| temperature=0 + seed | Both greedy and fixed state | Highest available |
| Self-hosted + fixed seed + deterministic CUDA | Full control over hardware | True determinism |
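The effect of a fixed seed and of greedy selection can be demonstrated with a toy sampler. This sketch assumes a fixed three-token distribution in place of a real model, and `generate` is a hypothetical helper. It shows only the sampling side of reproducibility, not hardware-level effects.

```python
import random


def generate(probs: dict, steps: int, seed=None, greedy=False) -> list:
    """Draw `steps` tokens from a fixed distribution, either by argmax or by seeded sampling."""
    rng = random.Random(seed)  # seed=None uses system entropy, so runs differ
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    output = []
    for _ in range(steps):
        if greedy:
            output.append(max(probs, key=probs.get))
        else:
            output.append(rng.choices(tokens, weights=weights, k=1)[0])
    return output


# Hypothetical distribution standing in for model output.
probs = {"Paris": 0.6, "France": 0.3, "capital": 0.1}
print(generate(probs, 5, seed=42))
print(generate(probs, 5, seed=42))   # same seed, same draws
print(generate(probs, 5, greedy=True))
```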
When Determinism Matters
| Use Case | Need Deterministic? | Why |
|---|---|---|
| Unit testing | Yes | Reproducible test assertions |
| Evaluation/benchmarks | Yes | Fair comparison across models |
| Caching | Yes | Same input → cache hit |
| Audit/compliance | Yes | Reproducible decisions |
| Creative writing | No | Variety is desired |
| Chat conversations | No | Natural variation is expected |
Important Caveat
Even with temperature=0 and a seed, exact determinism is not always guaranteed:
- GPU floating-point operations may vary across hardware versions
- API providers may route requests to different hardware
- Model quantization can introduce slight variations
- OpenAI documents seeded sampling as best effort: determinism is not guaranteed, and the system_fingerprint response field exists so you can detect backend changes
Q6: What are the main decoding strategies and when should you use each?
Answer:
Decoding is the process of selecting which token to generate next given the probability distribution from the model. The choice of decoding strategy dramatically affects output quality.
graph TD
DECODE["Decoding Strategies"]
DECODE --> DETERM["Deterministic"]
DECODE --> STOCH["Stochastic (Sampling)"]
DECODE --> HYBRID["Hybrid / Advanced"]
DETERM --> GREEDY["Greedy Search<br/>Pick argmax at each step"]
DETERM --> BEAM["Beam Search<br/>Track top-n hypotheses"]
STOCH --> PURE["Pure Sampling<br/>Sample from full distribution"]
STOCH --> TOPK["Top-k Sampling<br/>Sample from top k tokens"]
STOCH --> TOPP["Top-p Sampling<br/>Sample from nucleus"]
STOCH --> TEMP_SAMP["Temperature Sampling<br/>Reshape distribution then sample"]
HYBRID --> SPEC["Speculative Decoding<br/>Draft + verify"]
HYBRID --> CONTRAST["Contrastive Decoding<br/>Subtract weak model's distribution"]
HYBRID --> GUIDED["Guided/Constrained<br/>Enforce output structure"]
style DECODE fill:#56cc9d,stroke:#333,color:#fff
style DETERM fill:#6cc3d5,stroke:#333,color:#fff
style STOCH fill:#ffce67,stroke:#333
style HYBRID fill:#ff7851,stroke:#333,color:#fff
Detailed Comparison
| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Greedy | Always pick highest probability token | Fast, deterministic, simple | Repetitive, misses better sequences |
| Beam Search | Track top-n partial sequences | Finds higher-probability sequences | Still repetitive, expensive, poor for open-ended |
| Top-k Sampling | Sample from top k tokens | Reduces nonsense, some diversity | Fixed k not adaptive to distribution |
| Top-p Sampling | Sample from smallest set covering p mass | Adaptive to uncertainty, natural | Slightly less predictable |
| Temperature + Sampling | Reshape distribution then sample | Fine-grained control | Need to tune parameter |
| Speculative Decoding | Small model drafts, large model verifies | 2-3x faster, same quality | Needs draft model |
| Contrastive Decoding | Subtract amateur model’s preferences | Reduces repetition, more coherent | Complex setup |
| Constrained Decoding | Force output to follow grammar/schema | Guarantees valid structure | Limits expressiveness |
Greedy Search: The Simplest Strategy
At each step, pick the token with the highest probability:
w_t = \arg\max_{w} P(w | w_{1:t-1})
Problem: Greedy search is locally optimal but not globally optimal. A low-probability token now might lead to a much better overall sequence.
Example: “The dog” (0.4) → “has” (0.9) gives sequence probability 0.36, while “The nice” (0.5) → “woman” (0.4) gives 0.20. Greedy picks “nice” first but misses the better path.
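The example can be reproduced with a toy next-token table. The probabilities for “nice”, “dog”, “woman”, and “has” come from the example; every other token and probability in `MODEL` is made up to fill out the tree, and both function names are hypothetical.

```python
# Toy next-token model: prefix -> {token: probability}. Extra entries are illustrative.
MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def greedy(prefix, steps):
    """Pick the most probable token at every step."""
    seq, prob = tuple(prefix), 1.0
    for _ in range(steps):
        token, p = max(MODEL[seq].items(), key=lambda item: item[1])
        seq, prob = seq + (token,), prob * p
    return seq, prob


def best_path(prefix, steps):
    """Exhaustive search; feasible only because the toy tree is tiny."""
    if steps == 0:
        return tuple(prefix), 1.0
    best = (None, 0.0)
    for token, p in MODEL[tuple(prefix)].items():
        seq, rest = best_path(tuple(prefix) + (token,), steps - 1)
        if p * rest > best[1]:
            best = (seq, p * rest)
    return best


print(greedy(("The",), 2))
print(best_path(("The",), 2))
```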
Beam Search: Exploring Multiple Paths
Maintains num_beams parallel hypotheses:
Beam 1: "The" → "dog" → "has" → "a" (prob: 0.36 × ...)
Beam 2: "The" → "nice" → "woman" → "is" (prob: 0.20 × ...)
Beam 3: "The" → "cat" → "sat" → "on" (prob: 0.15 × ...)
When to use beam search:
- Translation (known output length)
- Summarization (structured output)
- NOT for open-ended generation (causes repetition)
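A minimal beam search over a toy next-token table is sketched below. The table is hypothetical apart from the “dog has” and “nice woman” probabilities used earlier in this answer, and real implementations typically work with log-probabilities and length normalization, which this sketch omits.

```python
# Toy next-token model: prefix -> {token: probability}. Extra entries are illustrative.
MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(prefix, steps, num_beams):
    """Return the num_beams most probable sequences after `steps` expansions."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        candidates = [
            (seq + (token,), prob * p)
            for seq, prob in beams
            for token, p in MODEL[seq].items()
        ]
        # Keep only the num_beams most probable partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams


print(beam_search(("The",), steps=2, num_beams=2))
print(beam_search(("The",), steps=2, num_beams=1))  # one beam behaves like greedy search
```

With two beams the search keeps “The dog” alive long enough to find the higher-probability continuation that a single beam discards.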
When to Use Which Strategy
| Task | Recommended Strategy | Why |
|---|---|---|
| Code generation | Greedy (temp=0) | Correctness over creativity |
| Translation | Beam search (beams=4-5) | Quality over diversity |
| Creative writing | Top-p=0.95, temp=0.8 | Diversity and surprise |
| Chat/conversation | Top-p=0.9, temp=0.7 | Natural but coherent |
| Structured extraction | Constrained decoding | Must follow schema |
| JSON output | Greedy + grammar constraints | Validity guaranteed |
| Fast inference | Speculative decoding | Speed with no quality loss |
Q7: What are frequency penalty and presence penalty, and how do they reduce repetition?
Answer:
Frequency penalty and presence penalty are post-processing adjustments to token logits that discourage the model from repeating itself.
Mathematical Definitions
The logit for token i is adjusted before sampling:
z_i' = z_i - (\text{frequency\_penalty} \times \text{count}(i)) - (\text{presence\_penalty} \times \mathbb{1}[\text{count}(i) > 0])
where \text{count}(i) is how many times token i has appeared in the output so far.
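The adjustment can be sketched directly from the formula. The function name `apply_penalties`, the logits, and the generated tokens are illustrative assumptions, and tokens are represented as words rather than token IDs for readability.

```python
from collections import Counter


def apply_penalties(logits: dict, generated: list,
                    frequency_penalty: float = 0.0, presence_penalty: float = 0.0) -> dict:
    """Subtract count-scaled and presence-based penalties from each token's logit."""
    counts = Counter(generated)
    adjusted = {}
    for token, z in logits.items():
        c = counts[token]  # Counter returns 0 for tokens not yet generated
        adjusted[token] = z - frequency_penalty * c - presence_penalty * (1 if c > 0 else 0)
    return adjusted


# Hypothetical logits and output so far.
logits = {"weather": 2.0, "sunshine": 1.5, "garden": 1.0}
generated = ["weather", "weather", "weather", "sunshine"]
print(apply_penalties(logits, generated, frequency_penalty=0.5, presence_penalty=0.5))
```

In this example the thrice-repeated token drops from the highest logit to the lowest, while the unseen token is untouched.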
graph TD
subgraph FP["Frequency Penalty"]
FP1["Penalizes proportionally to<br/>how MANY times token appeared"]
FP2["Token appeared 5× → big penalty"]
FP3["Token appeared 1× → small penalty"]
FP4["Effect: Reduces repetitive words"]
end
subgraph PP["Presence Penalty"]
PP1["Penalizes equally if token<br/>appeared AT ALL (binary)"]
PP2["Token appeared 5× → same penalty as 1×"]
PP3["Token never appeared → no penalty"]
PP4["Effect: Encourages new topics"]
end
style FP fill:#6cc3d5,stroke:#333,color:#fff
style PP fill:#ffce67,stroke:#333
Comparison
| Aspect | Frequency Penalty | Presence Penalty |
|---|---|---|
| Scales with count? | Yes (proportional) | No (binary: appeared or not) |
| Range (OpenAI) | -2.0 to 2.0 | -2.0 to 2.0 |
| Primary effect | Reduces word-level repetition | Encourages topic diversity |
| Use case | Avoid saying “very very very…” | Avoid staying on same topic |
| Analogy | “Don’t repeat words” | “Talk about new things” |
Practical Examples
Without penalties (both = 0):

> “The weather is nice. The weather is really nice. The weather makes me happy. The weather…”

With frequency_penalty = 0.8:

> “The weather is nice. It’s a beautiful day. The sunshine makes me happy. I think I’ll go outside…”

With presence_penalty = 1.0:

> “The weather is nice. I’ve been reading a great book lately. My garden is blooming. Tomorrow I plan to cook…”
Repetition Penalty (Hugging Face)
Hugging Face uses a multiplicative repetition_penalty instead:
z_i' = \begin{cases} z_i / \text{repetition\_penalty} & \text{if } z_i > 0 \text{ and token appeared} \\ z_i \times \text{repetition\_penalty} & \text{if } z_i < 0 \text{ and token appeared} \end{cases}
- repetition_penalty = 1.0: No effect
- repetition_penalty = 1.2: Moderate de-repetition (a commonly used value)
- repetition_penalty > 1.5: Strong, may cause incoherence
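The piecewise rule can be written as a small function. This is a plain-Python sketch of the formula, not the Transformers source code; `apply_repetition_penalty` and the sample logits are hypothetical.

```python
def apply_repetition_penalty(logits: dict, seen: set, penalty: float) -> dict:
    """Divide positive logits and multiply negative logits of already-seen tokens."""
    adjusted = {}
    for token, z in logits.items():
        if token in seen:
            # Both branches push the logit down when penalty > 1.
            z = z / penalty if z > 0 else z * penalty
        adjusted[token] = z
    return adjusted


# Hypothetical logits; "nice" and "day" have already appeared.
logits = {"nice": 3.0, "day": -1.0, "sun": 2.0}
print(apply_repetition_penalty(logits, {"nice", "day"}, penalty=1.2))
```

The two branches exist because dividing a negative logit by a value above 1 would raise it, which is the opposite of a penalty.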
Q8: What is max_tokens and how does it interact with the context window?
Answer:
max_tokens (or max_new_tokens) sets the maximum number of tokens the model will generate in its response. It’s a hard cap — generation stops even if the response is incomplete.
graph LR
subgraph Budget["Token Budget Allocation"]
direction LR
CW["Context Window: 128K"]
INPUT["Input tokens used: 50K"]
AVAILABLE["Available for output: 78K"]
MAX["max_tokens set: 4096"]
ACTUAL["Actual output: min(4096, until EOS)"]
end
CW --> INPUT --> AVAILABLE --> MAX --> ACTUAL
style Budget fill:#56cc9d,stroke:#333,color:#fff
Key Relationships
\text{max\_tokens} \leq \text{context\_window} - \text{input\_tokens}
If you set max_tokens higher than available space, the API will either:
- Silently cap it at the available space
- Return an error
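To avoid depending on either behavior, a client can clamp the value itself before sending the request. A minimal sketch, assuming the input token count is already known; `effective_max_tokens` is a hypothetical helper.

```python
def effective_max_tokens(context_window: int, input_tokens: int, requested: int) -> int:
    """Clamp the requested output length to the space left in the context window."""
    available = context_window - input_tokens
    if available <= 0:
        raise ValueError("input already fills the context window")
    return min(requested, available)


print(effective_max_tokens(128_000, 50_000, 4_096))   # plenty of room left
print(effective_max_tokens(128_000, 126_000, 4_096))  # only 2,000 tokens remain
```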
max_tokens vs. max_new_tokens
| Parameter | Framework | What it means |
|---|---|---|
| max_tokens | OpenAI, Anthropic APIs | Max tokens in the completion |
| max_new_tokens | Hugging Face Transformers | Max new tokens to generate (same concept) |
| max_length | Hugging Face (older) | Max total length (input + output) |
Why Generation Stops
Generation terminates when any of these conditions is met:
| Condition | Description |
|---|---|
| max_tokens reached | Hard output length limit |
| EOS token generated | Model naturally finishes its response |
| Stop sequence matched | A specified string pattern is found |
| Context window full | Input + output fills the entire window |
Practical Implications
| Setting | Effect | Risk |
|---|---|---|
| Too low (e.g., 50) | Responses get cut off mid-sentence | Incomplete, incoherent outputs |
| Too high (e.g., 16384) | Model can write as much as it wants | Higher cost, potential rambling |
| Right-sized | Complete responses without waste | Requires knowing task needs |
Cost Optimization
Since APIs charge per token:
- Set max_tokens appropriate to the task (not arbitrarily high)
- Use stop sequences to terminate early
- Monitor actual token usage vs. max_tokens budget
Q9: What are stop sequences and how do they control generation?
Answer:
Stop sequences are strings that, when generated by the model, immediately terminate generation. They’re a powerful mechanism for controlling output format and length.
graph TD
GEN["Model Generating..."]
GEN --> CHECK{"Generated text<br/>contains stop sequence?"}
CHECK -->|"No"| CONT["Continue generating"]
CONT --> GEN
CHECK -->|"Yes"| STOP["Stop immediately<br/>Return output (stop seq excluded)"]
style GEN fill:#6cc3d5,stroke:#333,color:#fff
style STOP fill:#56cc9d,stroke:#333,color:#fff
style CHECK fill:#ffce67,stroke:#333
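The check in the flowchart can be approximated on a finished string. This is a sketch under assumptions: `apply_stop_sequences` is a hypothetical helper, and a real server performs the check incrementally while tokens are being generated rather than after the fact.

```python
def apply_stop_sequences(text: str, stop: list[str]) -> tuple[str, bool]:
    """Cut text at the earliest stop sequence; the stop sequence itself is excluded."""
    positions = [text.find(s) for s in stop if s in text]
    if not positions:
        return text, False
    return text[:min(positions)], True


# Hypothetical raw model output.
generated = "1. Apple\n2. Banana\n3. Cherry\n4. Date\n5. Elderberry"
print(apply_stop_sequences(generated, ["\n\n", "4."]))
print(apply_stop_sequences("no stop here", ["\n\n"]))
```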
Common Stop Sequence Use Cases
| Use Case | Stop Sequences | Purpose |
|---|---|---|
| Single-line answer | ["\n"] | Prevent multi-line responses |
| Code function | ["\n\n", "def ", "class "] | Stop after one function |
| Structured QA | ["Q:", "Question:"] | Stop before generating next question |
| Chat role-play | ["User:", "Human:"] | Prevent model from simulating user |
| JSON extraction | ["}"] or ["}\n"] | Stop at the closing brace (the brace is not returned, so re-append it; unsuitable for nested objects) |
| Numbered list | ["11."] | Limit to 10 items |
Example: Controlling Multi-Turn Chat
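from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "List 3 fruits"}],
    stop=["\n\n", "4."],  # stop at a blank line or before a fourth item
    max_tokens=200,
)
print(response.choices[0].message.content)

Without stop sequences: Model might continue listing dozens of fruits or add commentary.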
With stop sequences: Generation stops cleanly after the third item.
Stop Sequences vs. Other Stopping Mechanisms
| Mechanism | How it works | Granularity |
|---|---|---|
| Stop sequences | Match specific text strings | Fine (exact strings) |
| max_tokens | Hard token count limit | Coarse (may cut mid-word) |
| EOS token | Model decides it’s done | Model-controlled |
| Constrained decoding | Grammar forces valid endings | Structural |
Best Practices
- Include the delimiter that separates outputs (e.g., "\n\n" between paragraphs)
- Test with variations: models might generate "\n " instead of "\n\n"
- Combine with max_tokens as a safety net
- Don’t over-specify — too many stop sequences can cause premature truncation
Q10: How do you choose the right configuration for different LLM tasks?
Answer:
Choosing the right parameters is about matching the creativity-accuracy tradeoff to your specific task requirements.
graph LR
subgraph Spectrum["Creativity ↔ Accuracy Spectrum"]
direction LR
DET["🎯 Deterministic<br/>temp=0, top_p=1"]
CON["🔒 Conservative<br/>temp=0.2, top_p=0.9"]
BAL["⚖️ Balanced<br/>temp=0.7, top_p=0.9"]
CRE["🎨 Creative<br/>temp=1.0, top_p=0.95"]
WILD["🌀 Wild<br/>temp=1.5, top_p=1.0"]
end
DET --> CON --> BAL --> CRE --> WILD
style DET fill:#56cc9d,stroke:#333,color:#fff
style CON fill:#6cc3d5,stroke:#333,color:#fff
style BAL fill:#ffce67,stroke:#333
style CRE fill:#ff7851,stroke:#333,color:#fff
Decision Framework
| Question | If Yes → | If No → |
|---|---|---|
| Does output need to be exactly correct? | temp=0, greedy | Consider sampling |
| Is creativity/variety valued? | temp=0.7-1.0 | temp=0-0.3 |
| Must output follow strict format? | Constrained decoding, low temp | Higher freedom |
| Running evaluations/benchmarks? | temp=0, seed set | Doesn’t matter |
| Is this user-facing chat? | temp=0.7, penalties for variety | Task-dependent |
| Generating multiple candidates? | Higher temp, n>1 | Standard settings |
Complete Configuration Recipes
Recipe 1: Code Generation
temperature: 0.0
top_p: 1.0
max_tokens: 2048
stop: ["\n\n\n", "```"]
frequency_penalty: 0.0
Why: Code requires precision. Any “creativity” means bugs.
Recipe 2: Customer Support Bot
temperature: 0.3
top_p: 0.9
max_tokens: 512
presence_penalty: 0.2
stop: ["Human:", "Customer:"]
Why: Slightly varied but consistent, professional responses.
Recipe 3: Creative Story Writing
temperature: 0.9
top_p: 0.95
max_tokens: 4096
frequency_penalty: 0.7
presence_penalty: 0.5
Why: Maximum variety, avoids repetition, explores narrative directions.
Recipe 4: Data Extraction (JSON)
temperature: 0.0
top_p: 1.0
max_tokens: 256
response_format: {"type": "json_object"}
stop: []  # a "}\n" stop sequence is excluded from the output and can also fire on a nested object's brace
Why: Must produce valid, consistent structured output.
Recipe 5: Brainstorming / Ideation
temperature: 1.2
top_p: 0.95
max_tokens: 1024
frequency_penalty: 1.0
presence_penalty: 1.5
n: 5
Why: Generate diverse ideas; high penalties force exploration of new territory.
Common Mistakes
| Mistake | Problem | Fix |
|---|---|---|
| temperature=0 for creative tasks | Bland, repetitive output | Increase to 0.7-1.0 |
| temperature=1.0 for factual tasks | Hallucinations, wrong facts | Decrease to 0-0.3 |
| Ignoring max_tokens | Unexpected costs, truncation | Always set appropriate limit |
| Setting both temperature and top_p low | Over-constrained, degenerate | Usually modify one, keep other default |
| No stop sequences in agentic loops | Model generates beyond intended boundary | Add role/delimiter stops |
Summary Table
| # | Topic | Key Concept |
|---|---|---|
| 1 | API Parameters | temperature, top_p, max_tokens, penalties, stop sequences |
| 2 | Temperature | Controls distribution sharpness: 0=greedy, 1=natural, >1=chaotic |
| 3 | Top-p vs. Top-k | Fixed-size (k) vs. adaptive probability mass (p) filtering |
| 4 | Context Window | Total input+output token budget; affects cost, latency, quality |
| 5 | Determinism | temp=0 + seed for reproducibility; true determinism is hard |
| 6 | Decoding Strategies | Greedy, beam search, sampling, speculative, constrained |
| 7 | Penalties | frequency_penalty (proportional) vs. presence_penalty (binary) |
| 8 | max_tokens | Hard output cap; interacts with context window budget |
| 9 | Stop Sequences | String patterns that terminate generation cleanly |
| 10 | Configuration Recipes | Match creativity-accuracy tradeoff to task requirements |
What’s Next?
This article covered the practical configuration knowledge tested in LLM engineering interviews. For related content:
- Core LLM concepts (transformers, RAG, RLHF): LLM Interview QA - 1
- Advanced topics (scaling, agents, inference): LLM Interview QA - 2
- ML fundamentals: ML Interview QA - 1 and ML Interview QA - 2