Post-Training LLMs for Human Alignment

A practical guide to SFT, DPO, KTO, ORPO, and GRPO — aligning language models with human preferences using the TRL library


📖 Read the full article


Table of Contents

  1. Setup & Installation
  2. SFT (Supervised Fine-Tuning)
  3. DPO (Direct Preference Optimization)
  4. KTO (Kahneman-Tversky Optimization)
  5. ORPO (Odds Ratio Preference Optimization)
  6. GRPO (Group Relative Policy Optimization)
  7. Method Comparison
  8. Practical Recommendations

1. Setup & Installation

Install the required packages for post-training alignment methods.

!pip install -q trl transformers datasets peft accelerate bitsandbytes

2. SFT (Supervised Fine-Tuning)

SFT is the first step in post-training alignment. It teaches the model to follow instructions by training on high-quality (instruction, response) pairs.

Aspect            | Detail
------------------|-------------------------------------------
Goal              | Teach the model to follow instructions
Data              | (instruction, response) pairs
Loss              | Standard cross-entropy on response tokens
Models in memory  | 1 (policy model)

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load a high-quality instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure SFT training
sft_config = SFTConfig(
    output_dir="./sft_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
    max_seq_length=1024,
    fp16=True,
)

# Initialize SFT trainer
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=sft_config,
    train_dataset=dataset,
)

# Start training
# trainer.train()
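The "cross-entropy on response tokens" loss can be illustrated with a toy sketch (plain Python, not TRL's implementation): prompt tokens are masked out, and only response tokens contribute to the mean negative log-likelihood.

```python
def sft_loss(token_logprobs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    response_mask:  1 for response tokens, 0 for prompt tokens (ignored).
    """
    nlls = [-lp for lp, m in zip(token_logprobs, response_mask) if m]
    return sum(nlls) / len(nlls)

# Two prompt tokens (mask 0) contribute nothing; only the response is scored.
logprobs = [-0.1, -0.2, -1.0, -2.0]
mask = [0, 0, 1, 1]
print(sft_loss(logprobs, mask))  # 1.5
```

Masking the prompt keeps the model from wasting capacity on reproducing instructions it will always be given at inference time.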

3. DPO (Direct Preference Optimization)

DPO skips the reward model entirely and directly optimizes the policy using preference pairs (chosen vs. rejected).

DPO Loss Formula

The DPO objective reparameterizes the RLHF reward function to derive a loss that depends only on the policy:

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right) \right]

Where:

  • \pi_\theta is the policy being trained
  • \pi_{\text{ref}} is the frozen reference model
  • y_w is the preferred (chosen) response
  • y_l is the rejected response
  • \beta controls how far the policy can deviate from the reference
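To make the formula concrete, here is a minimal numeric sketch of the per-example loss (plain Python; TRL's DPOTrainer computes the same quantity from batched token log-probs):

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """-log sigmoid(beta * log-ratio margin) for one (chosen, rejected) pair.

    Each argument is a sequence log-probability, e.g. pi_w = log pi_theta(y_w | x).
    """
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy identical to the reference: margin is 0, loss is log(2) ~ 0.693.
print(round(dpo_loss(-5.0, -9.0, -5.0, -9.0), 3))  # 0.693
# Policy favors the chosen response more than the reference does: loss drops.
print(dpo_loss(-4.0, -9.0, -5.0, -9.0) < math.log(2))  # True
```

The loss only depends on how the policy's chosen/rejected log-ratios move relative to the reference, which is exactly what lets DPO drop the explicit reward model.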

Aspect            | Detail
------------------|-----------------------------------------------------------
Goal              | Align outputs to human preferences without a reward model
Data              | (prompt, chosen, rejected) triples
Models in memory  | 2 (policy + frozen reference)

from trl import DPOTrainer, DPOConfig
from peft import LoraConfig

# Load preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# LoRA configuration to reduce memory
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# DPO training configuration
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    beta=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=10,
    max_length=1024,
    max_prompt_length=512,
    fp16=True,
)

# Initialize DPO trainer
trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=dpo_config,
    train_dataset=dataset,
    peft_config=peft_config,
)

# Start training
# trainer.train()

4. KTO (Kahneman-Tversky Optimization)

KTO works with unpaired feedback — each response is independently labeled as 👍 or 👎, without requiring a corresponding preferred/rejected pair.

This is inspired by Kahneman & Tversky’s Prospect Theory: humans feel losses more strongly than equivalent gains. KTO applies different loss weights to desirable and undesirable outputs.

When to Use KTO

Scenario                                          | Use KTO?        | Why
--------------------------------------------------|-----------------|------------------------------------------------
Only thumbs-up/down ratings available             | ✅ Yes          | KTO is designed for unpaired binary feedback
Responses are independently scored                | ✅ Yes          | No need to construct artificial pairs
You have paired preferences (chosen vs. rejected) | ❌ Use DPO      | DPO uses richer signal from direct comparisons
Collecting preference data is expensive           | ✅ Yes          | Binary feedback is cheaper to collect
You need maximum alignment quality                | ❌ Use DPO/RLHF | Paired signals provide more information

from trl import KTOTrainer, KTOConfig

# KTO expects a dataset with columns: prompt, completion, label (True/False)
# Example dataset structure:
# {"prompt": "Explain gravity", "completion": "Gravity is...", "label": True}
# {"prompt": "Explain gravity", "completion": "I don't know", "label": False}

kto_config = KTOConfig(
    output_dir="./kto_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=10,
    max_length=1024,
    max_prompt_length=512,
    fp16=True,
)

# Initialize KTO trainer
# trainer = KTOTrainer(
#     model="Qwen/Qwen2.5-0.5B",
#     args=kto_config,
#     train_dataset=kto_dataset,
# )

# trainer.train()
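A sketch of how raw thumbs-up/down logs might be turned into the (prompt, completion, label) rows shown above. The input field names (`response`, `rating`) are hypothetical; only the output column names come from KTO's expected format.

```python
def to_kto_rows(feedback_log):
    """Convert raw binary-feedback records into KTO-format rows.

    Input fields (`response`, `rating`) are made up for this example;
    the output columns (prompt, completion, label) are what KTOTrainer expects.
    """
    return [
        {
            "prompt": item["prompt"],
            "completion": item["response"],
            "label": item["rating"] == "up",
        }
        for item in feedback_log
    ]

log = [
    {"prompt": "Explain gravity", "response": "Gravity is...", "rating": "up"},
    {"prompt": "Explain gravity", "response": "I don't know", "rating": "down"},
]
rows = to_kto_rows(log)
print(rows[0]["label"], rows[1]["label"])  # True False
```

`datasets.Dataset.from_list(rows)` then yields the `kto_dataset` passed to the trainer. For the Prospect-Theory-style asymmetry between gains and losses, KTOConfig exposes `desirable_weight` and `undesirable_weight`; check the docs for your TRL version before relying on their defaults.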

5. ORPO (Odds Ratio Preference Optimization)

ORPO combines SFT and preference alignment into a single training stage. It adds an odds-ratio-based penalty to the standard SFT loss, eliminating the need for a separate reference model.

Aspect            | Detail
------------------|------------------------------------
Goal              | SFT + alignment in one pass
Data              | (prompt, chosen, rejected) triples
Models in memory  | 1 (no reference model needed)
Key advantage     | Simpler pipeline, lower memory

from trl import ORPOTrainer, ORPOConfig

# Load preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# ORPO configuration
orpo_config = ORPOConfig(
    output_dir="./orpo_output",
    beta=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=10,
    max_length=1024,
    max_prompt_length=512,
    fp16=True,
)

# Initialize ORPO trainer
trainer = ORPOTrainer(
    model="Qwen/Qwen2.5-0.5B",  # base model: ORPO does SFT and alignment together
    args=orpo_config,
    train_dataset=dataset,
)

# Start training
# trainer.train()

6. GRPO (Group Relative Policy Optimization)

GRPO (used to train DeepSeek-R1) replaces PPO's learned critic model with group-level scoring. For each prompt, it samples multiple completions, scores them with reward functions, and normalizes the scores within the group to obtain relative advantages.

Aspect            | Detail
------------------|--------------------------------------------------------
Goal              | RL-based alignment without a critic model
Data              | Prompts only — generations and rewards computed online
Models in memory  | 1 (policy model)
Key advantage     | Custom reward functions, no paired data needed

from trl import GRPOTrainer, GRPOConfig
import re

# Load a math reasoning dataset
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")


# Custom reward: check if the answer matches the expected format
def format_reward(completions, **kwargs):
    """Reward for following the expected <answer>...</answer> format."""
    pattern = r"<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c) else 0.0 for c in completions]


# Custom reward: prefer concise responses
def length_reward(completions, **kwargs):
    """Reward shorter completions (normalized by max reasonable length)."""
    max_len = 500
    return [max(0, 1.0 - len(c) / max_len) for c in completions]


# GRPO configuration
grpo_config = GRPOConfig(
    output_dir="./grpo_output",
    num_generations=8,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-6,
    logging_steps=10,
    max_completion_length=512,
    fp16=True,
)

# Initialize GRPO trainer with custom reward functions
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=grpo_config,
    train_dataset=dataset,
    reward_funcs=[format_reward, length_reward],
)

# Start training
# trainer.train()
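The group normalization step can be sketched as follows (plain Python; GRPOTrainer performs this internally for each prompt's group of `num_generations` completions):

```python
def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r_i - mean) / (std + eps), computed
    within one prompt's group of sampled completions."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Summed rewards for 4 completions of one prompt (e.g. format + length):
# above-average completions get positive advantages, below-average negative.
advs = group_advantages([1.5, 0.5, 1.0, 0.0])
print([round(a, 2) for a in advs])
```

Normalizing within the group means only *relative* quality matters, so reward functions on different scales can be combined without a learned value baseline.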

7. Method Comparison

Summary of Post-Training Alignment Methods

Method | Type                   | Models in Memory       | Data Requirement                      | TRL Class
-------|------------------------|------------------------|---------------------------------------|------------
SFT    | Supervised             | 1 (policy)             | (instruction, response) pairs         | SFTTrainer
DPO    | Preference             | 2 (policy + reference) | (prompt, chosen, rejected) triples    | DPOTrainer
KTO    | Preference             | 2 (policy + reference) | (prompt, completion, label), unpaired | KTOTrainer
ORPO   | Hybrid (SFT + pref.)   | 1 (no reference)       | (prompt, chosen, rejected) triples    | ORPOTrainer
GRPO   | Reinforcement learning | 1 (policy)             | Prompts only + reward functions       | GRPOTrainer

Key Trade-offs

  • SFT is the foundation — always start here
  • DPO offers strong preference alignment but requires paired data and roughly 2x the memory (policy plus frozen reference)
  • KTO is the best option when you only have thumbs-up/down feedback
  • ORPO saves a training stage by combining SFT and alignment
  • GRPO is the most flexible — define any reward function, no paired data needed

8. Practical Recommendations

When to Use Each Method

  1. Start with SFT — Always fine-tune on high-quality instruction data first
  2. Use DPO if you have paired preference data and enough GPU memory for two models
  3. Use KTO if you only have binary (good/bad) labels without pairing
  4. Use ORPO if you want a simpler single-stage pipeline
  5. Use GRPO if you need custom reward functions or verifiable outputs (math, code)

Memory Optimization with LoRA / QLoRA

All methods support LoRA and QLoRA to dramatically reduce memory requirements:

from peft import LoraConfig
from transformers import BitsAndBytesConfig
import torch

# LoRA config — adds <1% trainable parameters
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# QLoRA config — 4-bit quantized base model + LoRA adapters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

print("LoRA config:", peft_config)
print("\nQLoRA quantization config:", bnb_config)