!pip install -q transformers torch
Decoding Methods for Text Generation with LLMs
A hands-on comparison of greedy search, beam search, sampling, top-k, top-p, and contrastive search using small pretrained models
Table of Contents
1. Setup & Installation
2. Load Model
3. Greedy Search
4. Beam Search
5. Pure Sampling
6. Temperature Scaling
7. Top-K Sampling
8. Top-p (Nucleus) Sampling
9. Contrastive Search
10. Full Comparison
1. Setup & Installation
Install the required packages for text generation with Hugging Face Transformers.
2. Load Model
We use GPT-2 (124M parameters) — small enough to run on CPU, yet large enough to demonstrate clear differences between decoding strategies.
Auto-regressive generation produces one token at a time:
P(w_{1:T} | W_0) = \prod_{t=1}^{T} P(w_t | w_{1:t-1}, W_0)
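As a toy numeric sketch of the factorization above (the probabilities are invented for illustration, not real model outputs), the probability of a whole continuation is the product of per-step conditionals; in practice we sum log-probabilities to avoid underflow on long sequences:

```python
import math

# Made-up per-step conditional probabilities P(w_t | w_{1:t-1}, W_0), t = 1..3
step_probs = [0.40, 0.25, 0.10]

# Chain rule: the sequence probability is the product of the conditionals
sequence_prob = math.prod(step_probs)
print(f"P(sequence) = {sequence_prob:.4f}")  # 0.0100

# Summing log-probabilities is numerically safer for long sequences
log_prob = sum(math.log(p) for p in step_probs)
print(f"log P(sequence) = {log_prob:.4f}")
```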
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
"gpt2", pad_token_id=tokenizer.eos_token_id
).to(device)
prompt = "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print(f"Model loaded on {device}")3. Greedy Search
The simplest strategy — at each step, select the token with the highest probability:
w_t = \arg\max_w P(w | w_{1:t-1})
Pros: Fast and deterministic. Good for short, factual outputs.
Cons: Quickly falls into repetitive loops. Misses high-probability sequences hidden behind lower-probability initial tokens.
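Stripped to its essentials, greedy search is just an argmax over the next-token distribution. A toy sketch (the vocabulary and logits are invented for illustration):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy next-token distribution over a 4-word vocabulary
vocab = ["the", "a", "cat", "dog"]
logits = [2.0, 1.5, 0.3, 0.1]

probs = softmax(logits)
# Greedy search: always take the argmax
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best])  # "the"
```

Because the argmax is the same every time the model reaches an identical state, greedy decoding readily falls into loops.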
output = model.generate(**inputs, max_new_tokens=60)
print("[Greedy Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
4. Beam Search
Beam search keeps track of the top n most likely partial sequences (beams) at each step, then selects the sequence with the highest overall probability.
| Parameter | Description |
|---|---|
| `num_beams` | Number of beams to track (higher = more exploration, slower) |
| `no_repeat_ngram_size` | Prevents any n-gram from appearing twice |
| `early_stopping` | Stops when all beams reach the EOS token |
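A minimal beam search over a hand-built toy next-token model (all tokens and probabilities are invented) shows how it can recover a high-probability sequence that greedy search misses:

```python
import math

# Toy next-token model: maps the last token to (token, prob) candidates.
# Chosen so the greedy path <s> -> A -> A (prob 0.30) is NOT the best
# two-step sequence: <s> -> B -> C has prob 0.36.
NEXT = {
    "<s>": [("A", 0.6), ("B", 0.4)],
    "A":   [("A", 0.5), ("C", 0.5)],
    "B":   [("C", 0.9), ("A", 0.1)],
    "C":   [("A", 1.0)],
}

def beam_search(start, steps, num_beams):
    beams = [([start], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in NEXT[seq[-1]]:
                candidates.append((seq + [tok], score + math.log(p)))
        # Keep only the num_beams best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]

seq, score = beam_search("<s>", steps=2, num_beams=2)
print(seq, round(math.exp(score), 2))  # ['<s>', 'B', 'C'] 0.36
```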
output = model.generate(
**inputs,
max_new_tokens=60,
num_beams=5,
no_repeat_ngram_size=2,
early_stopping=True
)
print("[Beam Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
5. Pure Sampling
Sampling randomly picks the next token according to its probability distribution:
w_t \sim P(w | w_{1:t-1})
This introduces randomness, breaking the repetition patterns of deterministic methods.
Pros: Eliminates repetition, produces diverse outputs.
Cons: Can produce incoherent or nonsensical text.
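The mechanics reduce to a weighted random draw over the full distribution. A toy sketch (vocabulary and probabilities invented for illustration):

```python
import random

random.seed(0)

vocab = ["the", "a", "cat", "dog"]
probs = [0.5, 0.3, 0.15, 0.05]

# Pure sampling: draw the next token from the full distribution.
# Even the 5%-probability token is occasionally picked, which is where
# both the diversity and the occasional incoherence come from.
samples = random.choices(vocab, weights=probs, k=1000)
for w in vocab:
    print(f"{w:>4}: {samples.count(w) / 1000:.3f}")
```

Over many draws, the empirical frequencies approach the underlying probabilities.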
set_seed(42)
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
top_k=0 # disable top-k to use full vocabulary
)
print("[Pure Sampling]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
6. Temperature Scaling
Temperature \tau reshapes the probability distribution before sampling:
P(w_i) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}
| Temperature | Effect |
|---|---|
| \tau < 1 | Sharpens the distribution — more deterministic |
| \tau = 1 | No change — original distribution |
| \tau > 1 | Flattens the distribution — more random |
| \tau \to 0 | Equivalent to greedy search |
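Applying the formula above to a fixed set of toy logits (invented for illustration) makes the sharpening/flattening effect concrete:

```python
import math

def softmax_with_temperature(logits, tau):
    # Divide logits by tau before the softmax, per the formula above
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
results = {}
for tau in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, tau)
    results[tau] = probs
    print(tau, [round(p, 3) for p in probs])
# As tau shrinks, mass concentrates on the argmax (toward greedy);
# as tau grows, the distribution flattens toward uniform.
```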
set_seed(42)
# Low temperature → more focused
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
temperature=0.3,
top_k=0
)
print("[Low temp (0.3)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
set_seed(42)
# High temperature → more creative
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
temperature=1.5,
top_k=0
)
print("\n[High temp (1.5)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
7. Top-K Sampling
Top-K sampling (Fan et al., 2018) filters the vocabulary to only the K most likely tokens, then redistributes the probability mass among them.
Pros: Eliminates nonsensical low-probability tokens. GPT-2 used top-k=40.
Cons: Fixed K does not adapt to the shape of the distribution.
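A toy implementation of the filtering step (vocabulary and logits invented for illustration) shows how truncating to the top K tokens removes the implausible tail before sampling:

```python
import math
import random

random.seed(0)

def top_k_sample(vocab, logits, k):
    # Keep only the k highest-scoring tokens, renormalize, then sample
    top = sorted(zip(vocab, logits), key=lambda t: t[1], reverse=True)[:k]
    exps = [math.exp(z) for _, z in top]
    s = sum(exps)
    words = [w for w, _ in top]
    probs = [e / s for e in exps]
    return random.choices(words, weights=probs, k=1)[0]

vocab = ["the", "a", "cat", "dog", "xylophone"]
logits = [3.0, 2.5, 1.0, 0.5, -4.0]
# With k=3, "dog" and the implausible "xylophone" can never be sampled
draws = [top_k_sample(vocab, logits, k=3) for _ in range(200)]
print(set(draws))
```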
set_seed(42)
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
top_k=50
)
print("[Top-K Sampling (k=50)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
8. Top-p (Nucleus) Sampling
Top-p sampling (Holtzman et al., 2019) dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p.
Key advantage over top-K: Adapts the candidate set size dynamically. When the model is confident, few tokens suffice. When uncertain, more tokens are included.
Combining top-k and top-p is common practice: top-k removes the long tail, then top-p refines dynamically.
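The adaptive behaviour is easy to see in a toy nucleus-selection sketch (vocabulary and logits invented for illustration): a peaked distribution yields a tiny candidate set, a flat one yields a large set.

```python
import math

def nucleus_set(vocab, logits, p):
    # Smallest set of tokens whose cumulative probability exceeds p
    exps = [math.exp(z) for z in logits]
    s = sum(exps)
    ranked = sorted(zip(vocab, (e / s for e in exps)),
                    key=lambda t: t[1], reverse=True)
    kept, cum = [], 0.0
    for w, prob in ranked:
        kept.append(w)
        cum += prob
        if cum >= p:
            break
    return kept

vocab = ["the", "a", "cat", "dog"]
# Confident model: almost all mass on one token -> tiny nucleus
print(nucleus_set(vocab, [5.0, 1.0, 0.5, 0.1], p=0.9))
# Uncertain model: near-flat logits -> the nucleus grows
print(nucleus_set(vocab, [1.0, 1.0, 0.9, 0.9], p=0.9))
```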
set_seed(42)
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
top_p=0.92,
top_k=0 # disable top-k to let top-p work alone
)
print("[Top-p Sampling (p=0.92)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Combined: top-k + top-p + temperature
set_seed(42)
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
top_k=50,
top_p=0.95,
temperature=0.8
)
print("[Combined: top-k=50, top-p=0.95, temp=0.8]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
9. Contrastive Search
Contrastive search (Su et al., NeurIPS 2022) is a deterministic method designed to avoid both the repetition of greedy/beam search and the incoherence of sampling, producing notably human-like text.
At each step, it selects the token that maximizes:
(1 - \alpha) \times \underbrace{P(v | x_{<t})}_{\text{model confidence}} - \alpha \times \underbrace{\max_{j < t} \cos(h_v, h_{x_j})}_{\text{degeneration penalty}}
| Parameter | Description |
|---|---|
| `penalty_alpha` | The α hyperparameter (typically 0.5–0.6) |
| `top_k` | Number of candidate tokens at each step (typically 4–10) |
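The scoring rule above can be sketched numerically on toy hidden states (all vectors and probabilities invented for illustration, and much lower-dimensional than real GPT-2 states):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def contrastive_score(prob, h_candidate, context_states, alpha):
    # (1 - alpha) * model confidence  -  alpha * degeneration penalty
    penalty = max(cosine(h_candidate, h) for h in context_states)
    return (1 - alpha) * prob - alpha * penalty

# Toy hidden states: candidate A is most probable but nearly identical to a
# context state (high penalty); candidate B is less probable but novel.
context = [[1.0, 0.0], [0.7, 0.7]]
candidates = {
    "A": (0.60, [0.99, 0.05]),  # high prob, but echoes the context
    "B": (0.30, [0.0, -1.0]),   # lower prob, dissimilar hidden state
}

alpha = 0.6
scores = {tok: contrastive_score(p, h, context, alpha)
          for tok, (p, h) in candidates.items()}
print(scores)  # B wins despite its lower raw probability
```

The penalty term is what steers the search away from tokens whose representations merely echo the context, which is how repetition is suppressed without randomness.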
output = model.generate(
**inputs,
max_new_tokens=60,
penalty_alpha=0.6,
top_k=4
)
print("[Contrastive Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
10. Full Comparison
Generate text using all six methods for side-by-side comparison.
| Method | Deterministic | Repetition | Coherence | Diversity | Speed |
|---|---|---|---|---|---|
| Greedy Search | Yes | High | Medium | Low | Fast |
| Beam Search | Yes | Medium (lower with `no_repeat_ngram_size`) | Medium-High | Low | Medium |
| Pure Sampling | No | Low | Low | High | Fast |
| Top-K Sampling | No | Low | Medium-High | Medium-High | Fast |
| Top-p Sampling | No | Low | High | Medium-High | Fast |
| Contrastive Search | Yes | Low | Very High | Medium | Slow |
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
max_tokens = 80
print("=" * 70)
print("PROMPT:", prompt)
print("=" * 70)
# 1. Greedy Search
output = model.generate(**inputs, max_new_tokens=max_tokens)
print("\n[Greedy Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 2. Beam Search
output = model.generate(
**inputs, max_new_tokens=max_tokens,
num_beams=5, no_repeat_ngram_size=2, early_stopping=True
)
print("\n[Beam Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 3. Pure Sampling
set_seed(42)
output = model.generate(
**inputs, max_new_tokens=max_tokens,
do_sample=True, top_k=0
)
print("\n[Pure Sampling]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 4. Top-K Sampling
set_seed(42)
output = model.generate(
**inputs, max_new_tokens=max_tokens,
do_sample=True, top_k=50
)
print("\n[Top-K Sampling (k=50)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 5. Top-p Sampling
set_seed(42)
output = model.generate(
**inputs, max_new_tokens=max_tokens,
do_sample=True, top_p=0.92, top_k=0
)
print("\n[Top-p Sampling (p=0.92)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 6. Contrastive Search
output = model.generate(
**inputs, max_new_tokens=max_tokens,
penalty_alpha=0.6, top_k=4
)
print("\n[Contrastive Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))