!pip install -q transformers torch
Decoding Methods for Text Generation with LLMs
A hands-on comparison of greedy search, beam search, sampling, top-k, top-p, and contrastive search using small pretrained models
Table of Contents
1. Setup & Installation
2. Load Model
3. Greedy Search
4. Beam Search
5. Pure Sampling
6. Temperature Scaling
7. Top-K Sampling
8. Top-p (Nucleus) Sampling
9. Contrastive Search
10. Full Comparison
1. Setup & Installation
Install the required packages for text generation with Hugging Face Transformers.
2. Load Model
We use GPT-2 (124M parameters) — small enough to run on CPU, yet large enough to demonstrate clear differences between decoding strategies.
Auto-regressive generation produces one token at a time:
P(w_{1:T} | W_0) = \prod_{t=1}^{T} P(w_t | w_{1:t-1}, W_0)
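As a toy numeric sketch of the factorization above (the probabilities are invented for illustration, not real model outputs), the probability of a whole continuation is the product of per-step conditionals; in practice we sum log-probabilities to avoid underflow on long sequences:

```python
import math

# Made-up per-step conditional probabilities P(w_t | w_{1:t-1}, W_0), t = 1..3
step_probs = [0.40, 0.25, 0.10]

# Chain rule: the sequence probability is the product of the conditionals
sequence_prob = math.prod(step_probs)
print(f"P(sequence) = {sequence_prob:.4f}")  # 0.0100

# Summing log-probabilities is numerically safer for long sequences
log_prob = sum(math.log(p) for p in step_probs)
print(f"log P(sequence) = {log_prob:.4f}")
```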
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
"gpt2", pad_token_id=tokenizer.eos_token_id
).to(device)
prompt = "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print(f"Model loaded on {device}")3. Greedy Search
The simplest strategy — at each step, select the token with the highest probability:
w_t = \arg\max_w P(w | w_{1:t-1})
Pros: Fast and deterministic. Good for short, factual outputs.
Cons: Quickly falls into repetitive loops. Misses high-probability sequences hidden behind lower-probability initial tokens.
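Stripped to its essentials, greedy search is just an argmax over the next-token distribution. A toy sketch (the vocabulary and logits are invented for illustration):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy next-token distribution over a 4-word vocabulary
vocab = ["the", "a", "cat", "dog"]
logits = [2.0, 1.5, 0.3, 0.1]

probs = softmax(logits)
# Greedy search: always take the argmax
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best])  # "the"
```

Because the argmax is the same every time the model reaches an identical state, greedy decoding readily falls into loops.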
output = model.generate(**inputs, max_new_tokens=60)
print("[Greedy Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
4. Beam Search
Beam search keeps track of the top n most likely partial sequences (beams) at each step, then selects the sequence with the highest overall probability.
| Parameter | Description |
|---|---|
| `num_beams` | Number of beams to track (higher = more exploration, slower) |
| `no_repeat_ngram_size` | Prevents any n-gram from appearing twice |
| `early_stopping` | Stops when all beams reach the EOS token |
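A minimal beam search over a hand-built toy next-token model (all tokens and probabilities are invented) shows how it can recover a high-probability sequence that greedy search misses:

```python
import math

# Toy next-token model: maps the last token to (token, prob) candidates.
# Chosen so the greedy path <s> -> A -> A (prob 0.30) is NOT the best
# two-step sequence: <s> -> B -> C has prob 0.36.
NEXT = {
    "<s>": [("A", 0.6), ("B", 0.4)],
    "A":   [("A", 0.5), ("C", 0.5)],
    "B":   [("C", 0.9), ("A", 0.1)],
    "C":   [("A", 1.0)],
}

def beam_search(start, steps, num_beams):
    beams = [([start], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, p in NEXT[seq[-1]]:
                candidates.append((seq + [tok], score + math.log(p)))
        # Keep only the num_beams best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]

seq, score = beam_search("<s>", steps=2, num_beams=2)
print(seq, round(math.exp(score), 2))  # ['<s>', 'B', 'C'] 0.36
```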
output = model.generate(
**inputs,
max_new_tokens=60,
num_beams=5,
no_repeat_ngram_size=2,
early_stopping=True
)
print("[Beam Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
5. Pure Sampling
Sampling randomly picks the next token according to its probability distribution:
w_t \sim P(w | w_{1:t-1})
This introduces randomness, breaking the repetition patterns of deterministic methods.
Pros: Eliminates repetition, produces diverse outputs.
Cons: Can produce incoherent or nonsensical text.
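The mechanics reduce to a weighted random draw over the full distribution. A toy sketch (vocabulary and probabilities invented for illustration):

```python
import random

random.seed(0)

vocab = ["the", "a", "cat", "dog"]
probs = [0.5, 0.3, 0.15, 0.05]

# Pure sampling: draw the next token from the full distribution.
# Even the 5%-probability token is occasionally picked, which is where
# both the diversity and the occasional incoherence come from.
samples = random.choices(vocab, weights=probs, k=1000)
for w in vocab:
    print(f"{w:>4}: {samples.count(w) / 1000:.3f}")
```

Over many draws, the empirical frequencies approach the underlying probabilities.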
set_seed(42)
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
top_k=0 # disable top-k to use full vocabulary
)
print("[Pure Sampling]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
6. Temperature Scaling
Temperature \tau reshapes the probability distribution before sampling:
P(w_i) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}
| Temperature | Effect |
|---|---|
| \tau < 1 | Sharpens the distribution — more deterministic |
| \tau = 1 | No change — original distribution |
| \tau > 1 | Flattens the distribution — more random |
| \tau \to 0 | Equivalent to greedy search |
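Applying the formula above to a fixed set of toy logits (invented for illustration) makes the sharpening/flattening effect concrete:

```python
import math

def softmax_with_temperature(logits, tau):
    # Divide logits by tau before the softmax, per the formula above
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
results = {}
for tau in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, tau)
    results[tau] = probs
    print(tau, [round(p, 3) for p in probs])
# As tau shrinks, mass concentrates on the argmax (toward greedy);
# as tau grows, the distribution flattens toward uniform.
```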
set_seed(42)
# Low temperature → more focused
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
temperature=0.3,
top_k=0
)
print("[Low temp (0.3)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
set_seed(42)
# High temperature → more creative
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
temperature=1.5,
top_k=0
)
print("\n[High temp (1.5)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
7. Top-K Sampling
Top-K sampling (Fan et al., 2018) filters the vocabulary to only the K most likely tokens, then redistributes the probability mass among them.
Pros: Eliminates nonsensical low-probability tokens. GPT-2 used top-k=40.
Cons: Fixed K does not adapt to the shape of the distribution.
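A toy implementation of the filtering step (vocabulary and logits invented for illustration) shows how truncating to the top K tokens removes the implausible tail before sampling:

```python
import math
import random

random.seed(0)

def top_k_sample(vocab, logits, k):
    # Keep only the k highest-scoring tokens, renormalize, then sample
    top = sorted(zip(vocab, logits), key=lambda t: t[1], reverse=True)[:k]
    exps = [math.exp(z) for _, z in top]
    s = sum(exps)
    words = [w for w, _ in top]
    probs = [e / s for e in exps]
    return random.choices(words, weights=probs, k=1)[0]

vocab = ["the", "a", "cat", "dog", "xylophone"]
logits = [3.0, 2.5, 1.0, 0.5, -4.0]
# With k=3, "dog" and the implausible "xylophone" can never be sampled
draws = [top_k_sample(vocab, logits, k=3) for _ in range(200)]
print(set(draws))
```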
set_seed(42)
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
top_k=50
)
print("[Top-K Sampling (k=50)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
8. Top-p (Nucleus) Sampling
Top-p sampling (Holtzman et al., 2019) dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p.
Key advantage over top-K: Adapts the candidate set size dynamically. When the model is confident, few tokens suffice. When uncertain, more tokens are included.
Combining top-k and top-p is common practice: top-k removes the long tail, then top-p refines dynamically.
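The adaptive behaviour is easy to see in a toy nucleus-selection sketch (vocabulary and logits invented for illustration): a peaked distribution yields a tiny candidate set, a flat one yields a large set.

```python
import math

def nucleus_set(vocab, logits, p):
    # Smallest set of tokens whose cumulative probability exceeds p
    exps = [math.exp(z) for z in logits]
    s = sum(exps)
    ranked = sorted(zip(vocab, (e / s for e in exps)),
                    key=lambda t: t[1], reverse=True)
    kept, cum = [], 0.0
    for w, prob in ranked:
        kept.append(w)
        cum += prob
        if cum >= p:
            break
    return kept

vocab = ["the", "a", "cat", "dog"]
# Confident model: almost all mass on one token -> tiny nucleus
print(nucleus_set(vocab, [5.0, 1.0, 0.5, 0.1], p=0.9))
# Uncertain model: near-flat logits -> the nucleus grows
print(nucleus_set(vocab, [1.0, 1.0, 0.9, 0.9], p=0.9))
```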
set_seed(42)
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
top_p=0.92,
top_k=0 # disable top-k to let top-p work alone
)
print("[Top-p Sampling (p=0.92)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Combined: top-k + top-p + temperature
set_seed(42)
output = model.generate(
**inputs,
max_new_tokens=60,
do_sample=True,
top_k=50,
top_p=0.95,
temperature=0.8
)
print("[Combined: top-k=50, top-p=0.95, temp=0.8]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
9. Contrastive Search
Contrastive search (Su et al., NeurIPS 2022) is a deterministic method designed to avoid both the repetition of greedy/beam search and the incoherence of sampling, producing notably human-like text.
At each step, it selects the token that maximizes:
(1 - \alpha) \times \underbrace{P(v | x_{<t})}_{\text{model confidence}} - \alpha \times \underbrace{\max_{j < t} \cos(h_v, h_{x_j})}_{\text{degeneration penalty}}
| Parameter | Description |
|---|---|
| `penalty_alpha` | The α hyperparameter (typically 0.5–0.6) |
| `top_k` | Number of candidate tokens at each step (typically 4–10) |
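The scoring rule above can be sketched numerically on toy hidden states (all vectors and probabilities invented for illustration, and much lower-dimensional than real GPT-2 states):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def contrastive_score(prob, h_candidate, context_states, alpha):
    # (1 - alpha) * model confidence  -  alpha * degeneration penalty
    penalty = max(cosine(h_candidate, h) for h in context_states)
    return (1 - alpha) * prob - alpha * penalty

# Toy hidden states: candidate A is most probable but nearly identical to a
# context state (high penalty); candidate B is less probable but novel.
context = [[1.0, 0.0], [0.7, 0.7]]
candidates = {
    "A": (0.60, [0.99, 0.05]),  # high prob, but echoes the context
    "B": (0.30, [0.0, -1.0]),   # lower prob, dissimilar hidden state
}

alpha = 0.6
scores = {tok: contrastive_score(p, h, context, alpha)
          for tok, (p, h) in candidates.items()}
print(scores)  # B wins despite its lower raw probability
```

The penalty term is what steers the search away from tokens whose representations merely echo the context, which is how repetition is suppressed without randomness.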
output = model.generate(
**inputs,
max_new_tokens=60,
penalty_alpha=0.6,
top_k=4
)
print("[Contrastive Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
10. Full Comparison
Generate text using all six methods for side-by-side comparison.
| Method | Deterministic | Repetition | Coherence | Diversity | Speed |
|---|---|---|---|---|---|
| Greedy Search | Yes | High | Medium | Low | Fast |
| Beam Search | Yes | Medium (lower with `no_repeat_ngram_size`) | Medium-High | Low | Medium |
| Pure Sampling | No | Low | Low | High | Fast |
| Top-K Sampling | No | Low | Medium-High | Medium-High | Fast |
| Top-p Sampling | No | Low | High | Medium-High | Fast |
| Contrastive Search | Yes | Low | Very High | Medium | Slow |
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
max_tokens = 80
print("=" * 70)
print("PROMPT:", prompt)
print("=" * 70)
# 1. Greedy Search
output = model.generate(**inputs, max_new_tokens=max_tokens)
print("\n[Greedy Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 2. Beam Search
output = model.generate(
**inputs, max_new_tokens=max_tokens,
num_beams=5, no_repeat_ngram_size=2, early_stopping=True
)
print("\n[Beam Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 3. Pure Sampling
set_seed(42)
output = model.generate(
**inputs, max_new_tokens=max_tokens,
do_sample=True, top_k=0
)
print("\n[Pure Sampling]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 4. Top-K Sampling
set_seed(42)
output = model.generate(
**inputs, max_new_tokens=max_tokens,
do_sample=True, top_k=50
)
print("\n[Top-K Sampling (k=50)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 5. Top-p Sampling
set_seed(42)
output = model.generate(
**inputs, max_new_tokens=max_tokens,
do_sample=True, top_p=0.92, top_k=0
)
print("\n[Top-p Sampling (p=0.92)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
# 6. Contrastive Search
output = model.generate(
**inputs, max_new_tokens=max_tokens,
penalty_alpha=0.6, top_k=4
)
print("\n[Contrastive Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))