Decoding Methods for Text Generation with LLMs

A hands-on comparison of greedy search, beam search, sampling, top-k, top-p, and contrastive search using small pretrained models

Open In Colab

📖 Read the full article


Table of Contents

  1. Setup & Installation
  2. Load Model
  3. Greedy Search
  4. Beam Search
  5. Pure Sampling
  6. Temperature Scaling
  7. Top-K Sampling
  8. Top-p (Nucleus) Sampling
  9. Contrastive Search
  10. Full Comparison

1. Setup & Installation

Install the required packages for text generation with Hugging Face Transformers.

!pip install -q transformers torch

2. Load Model

We use GPT-2 (124M parameters) — small enough to run on CPU, yet large enough to demonstrate clear differences between decoding strategies.

Auto-regressive generation produces one token at a time:

P(w_{1:T} | W_0) = \prod_{t=1}^{T} P(w_t | w_{1:t-1}, W_0)

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", pad_token_id=tokenizer.eos_token_id
).to(device)

prompt = "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print(f"Model loaded on {device}")
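To make the factorization above concrete, here is a minimal sketch of the auto-regressive loop. The toy vocabulary and stub logits function are assumptions for illustration only; in a real run the logits come from the model's forward pass.

```python
import torch

# Toy vocabulary and a hand-written stand-in for the model's next-token
# logits — an assumption for illustration, not GPT-2 itself.
vocab = ["intelligence", "is", "changing", "the", "world", "."]

def next_token_logits(prefix_ids):
    # Deterministic toy logits that strongly favor the token after the last one.
    logits = torch.zeros(len(vocab))
    logits[(prefix_ids[-1] + 1) % len(vocab)] = 5.0
    return logits

# P(w_{1:T} | W_0) = prod_t P(w_t | w_{1:t-1}, W_0):
# each step conditions on the growing prefix, and the sequence
# log-probability is the sum of per-step log-probabilities.
prefix = [0]  # start from "intelligence"
log_prob = 0.0
for _ in range(3):
    probs = torch.softmax(next_token_logits(prefix), dim=-1)
    next_id = int(torch.argmax(probs))  # greedy pick, for this sketch
    log_prob += float(torch.log(probs[next_id]))
    prefix.append(next_id)

print([vocab[i] for i in prefix])
print(f"sequence log-probability: {log_prob:.3f}")
```

Every decoding method in this notebook differs only in how `next_id` is chosen from `probs` at each step.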

5. Pure Sampling

Sampling picks the next token at random from the model's conditional probability distribution:

w_t \sim P(w | w_{1:t-1}, W_0)

This introduces randomness, breaking the repetition patterns of deterministic methods.

Pros: Eliminates repetition, produces diverse outputs.

Cons: Can produce incoherent or nonsensical text.

set_seed(42)
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_k=0  # disable top-k to use full vocabulary
)
print("[Pure Sampling]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
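Under the hood, pure sampling is one `torch.multinomial` draw per step. A minimal sketch on a hand-written toy distribution (the probabilities are assumptions for illustration; in practice they come from a softmax over the model's logits):

```python
import torch

torch.manual_seed(42)

# Toy next-token distribution over a 5-token vocabulary.
probs = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.05])

# Pure sampling: w_t ~ P(w | w_{1:t-1}) — every token can be chosen,
# with frequency proportional to its probability. We draw many times
# to show the empirical frequencies match the distribution.
draws = torch.multinomial(probs, num_samples=10_000, replacement=True)
counts = torch.bincount(draws, minlength=len(probs)).float() / len(draws)
print("empirical frequencies:", [round(c, 3) for c in counts.tolist()])
```

Note that even the 5%-probability tail token gets sampled regularly, which is exactly why pure sampling can wander into incoherent continuations.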

6. Temperature Scaling

Temperature \tau rescales the logits z_i before the softmax, reshaping the distribution we sample from:

P(w_i) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}

| Temperature | Effect |
| --- | --- |
| \tau < 1 | Sharpens the distribution (more deterministic) |
| \tau = 1 | No change (original distribution) |
| \tau > 1 | Flattens the distribution (more random) |
| \tau \to 0 | Equivalent to greedy search |

set_seed(42)
# Low temperature → more focused
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    temperature=0.3,
    top_k=0
)
print("[Low temp (0.3)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

set_seed(42)
# High temperature → more creative
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    temperature=1.5,
    top_k=0
)
print("\n[High temp (1.5)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
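The table above can be verified directly by applying the softmax at different temperatures to a hand-written toy logits vector (values are assumptions for illustration):

```python
import torch

# Toy logits for a 4-token vocabulary.
logits = torch.tensor([4.0, 2.0, 1.0, 0.5])

top_prob = {}
for tau in (0.3, 1.0, 1.5):
    # Temperature scaling: divide the logits by tau before the softmax.
    probs = torch.softmax(logits / tau, dim=-1)
    top_prob[tau] = float(probs[0])
    print(f"tau={tau}: {[round(p, 3) for p in probs.tolist()]}")
```

At \tau = 0.3 nearly all the mass concentrates on the top token (approaching greedy search); at \tau = 1.5 the distribution flattens and low-probability tokens become much more likely to be sampled.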

7. Top-K Sampling

Top-K sampling (Fan et al., 2018) filters the vocabulary to only the K most likely tokens, then redistributes the probability mass among them.

Pros: Eliminates nonsensical low-probability tokens. The GPT-2 paper used k = 40 for its samples.

Cons: Fixed K does not adapt to the shape of the distribution.

set_seed(42)
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_k=50
)
print("[Top-K Sampling (k=50)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
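A minimal sketch of what the `top_k` filter does at a single step, on hand-written toy logits (values are assumptions for illustration):

```python
import torch

# Toy next-token logits over a 6-token vocabulary.
logits = torch.tensor([3.0, 2.5, 1.0, 0.2, -1.0, -2.0])
k = 3

# Top-K filtering: keep the K highest logits, mask the rest to -inf,
# then softmax — masked tokens get exactly zero probability and the
# remaining mass is redistributed among the top K.
topk_vals, topk_idx = torch.topk(logits, k)
filtered = torch.full_like(logits, float("-inf"))
filtered[topk_idx] = logits[topk_idx]
probs = torch.softmax(filtered, dim=-1)
print("probabilities after top-k:", [round(p, 3) for p in probs.tolist()])
```

The fixed cutoff is the weakness named above: whether the true distribution is peaked or flat, exactly K candidates survive.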

8. Top-p (Nucleus) Sampling

Top-p sampling (Holtzman et al., 2019) dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p.

Key advantage over top-K: Adapts the candidate set size dynamically. When the model is confident, few tokens suffice. When uncertain, more tokens are included.

Combining top-k and top-p is common practice: top-k removes the long tail, then top-p refines dynamically.

set_seed(42)
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_p=0.92,
    top_k=0  # disable top-k to let top-p work alone
)
print("[Top-p Sampling (p=0.92)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Combined: top-k + top-p + temperature
set_seed(42)
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8
)
print("[Combined: top-k=50, top-p=0.95, temp=0.8]")
print(tokenizer.decode(output[0], skip_special_tokens=True))
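The dynamic candidate set can be sketched by hand. This is a minimal single-step implementation of nucleus filtering on a toy distribution (the probabilities are assumptions for illustration):

```python
import torch

# Toy next-token probabilities over a 6-token vocabulary.
probs = torch.tensor([0.45, 0.25, 0.15, 0.08, 0.04, 0.03])
p = 0.92

# Top-p filtering: sort descending, keep the smallest prefix whose
# cumulative probability reaches p, zero out the rest, renormalize.
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=0)
# Index of the first token at which the cumulative mass reaches p,
# inclusive — so the kept set always crosses the threshold.
cutoff = int(torch.searchsorted(cumulative, torch.tensor(p))) + 1
keep = sorted_idx[:cutoff]
filtered = torch.zeros_like(probs)
filtered[keep] = probs[keep]
filtered /= filtered.sum()
print(f"kept {cutoff} of {len(probs)} tokens:",
      [round(q, 3) for q in filtered.tolist()])
```

With a flatter distribution the same threshold keeps more tokens, and with a peaked one it keeps fewer — the adaptivity that fixed-K filtering lacks.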

10. Full Comparison

Generate text using all six methods for side-by-side comparison.

| Method | Deterministic | Repetition | Coherence | Diversity | Speed |
| --- | --- | --- | --- | --- | --- |
| Greedy Search | Yes | High | Medium | Low | Fast |
| Beam Search | Yes | Medium* | Medium-High | Low | Medium |
| Pure Sampling | No | Low | Low | High | Fast |
| Top-K Sampling | No | Low | Medium-High | Medium-High | Fast |
| Top-p Sampling | No | Low | High | Medium-High | Fast |
| Contrastive Search | Yes | Low | Very High | Medium | Slow |

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
max_tokens = 80

print("=" * 70)
print("PROMPT:", prompt)
print("=" * 70)

# 1. Greedy Search
output = model.generate(**inputs, max_new_tokens=max_tokens)
print("\n[Greedy Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 2. Beam Search
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    num_beams=5, no_repeat_ngram_size=2, early_stopping=True
)
print("\n[Beam Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 3. Pure Sampling
set_seed(42)
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    do_sample=True, top_k=0
)
print("\n[Pure Sampling]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 4. Top-K Sampling
set_seed(42)
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    do_sample=True, top_k=50
)
print("\n[Top-K Sampling (k=50)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 5. Top-p Sampling
set_seed(42)
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    do_sample=True, top_p=0.92, top_k=0
)
print("\n[Top-p Sampling (p=0.92)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 6. Contrastive Search
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    penalty_alpha=0.6, top_k=4
)
print("\n[Contrastive Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))