Post-Training LLMs for Human Alignment
A practical guide to SFT, DPO, KTO, ORPO, and GRPO — aligning language models with human preferences using the TRL library
Table of Contents
1. Setup & Installation
2. SFT (Supervised Fine-Tuning)
3. DPO (Direct Preference Optimization)
4. KTO (Kahneman-Tversky Optimization)
5. ORPO (Odds Ratio Preference Optimization)
6. GRPO (Group Relative Policy Optimization)
7. Method Comparison
8. Practical Recommendations
1. Setup & Installation
Install the required packages for post-training alignment methods.
!pip install -q trl transformers datasets peft accelerate bitsandbytes
2. SFT (Supervised Fine-Tuning)
SFT is the first step in post-training alignment. It teaches the model to follow instructions by training on high-quality (instruction, response) pairs.
| Aspect | Detail |
|---|---|
| Goal | Teach the model to follow instructions |
| Data | (instruction, response) pairs |
| Loss | Standard cross-entropy on response tokens |
| Models in memory | 1 (policy model) |
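The "cross-entropy on response tokens" row deserves a concrete illustration: only the response part of each sequence contributes to the loss. A minimal sketch in plain Python, with made-up token ids; -100 is the label value that PyTorch-style cross-entropy (and hence Hugging Face trainers) ignores:

```python
# Sketch of SFT loss masking. Token ids below are hypothetical.
prompt_ids = [101, 2054, 2003]    # instruction tokens (no loss)
response_ids = [3437, 2005, 102]  # response tokens (loss computed here)

input_ids = prompt_ids + response_ids
# -100 is the ignore_index for cross-entropy, so prompt tokens
# contribute nothing to the training signal
labels = [-100] * len(prompt_ids) + response_ids

print(input_ids)
print(labels)
```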
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load a high-quality instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
# Configure SFT training
sft_config = SFTConfig(
output_dir="./sft_output",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=1,
learning_rate=2e-5,
logging_steps=10,
max_seq_length=1024,
fp16=True,
)
# Initialize SFT trainer
trainer = SFTTrainer(
model="Qwen/Qwen2.5-0.5B",
args=sft_config,
train_dataset=dataset,
)
# Start training
# trainer.train()
3. DPO (Direct Preference Optimization)
DPO skips the reward model entirely and directly optimizes the policy using preference pairs (chosen vs. rejected).
DPO Loss Formula
The DPO objective reparameterizes the RLHF reward function to derive a loss that depends only on the policy:
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right) \right]
Where:
- \pi_\theta is the policy being trained
- \pi_{\text{ref}} is the frozen reference model
- y_w is the preferred (chosen) response
- y_l is the rejected response
- \beta controls how far the policy can deviate from the reference
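As a quick numeric check of the objective, here is a minimal sketch using hypothetical summed log-probabilities for one (prompt, chosen, rejected) triple:

```python
import math

# Hypothetical per-sequence log-probabilities (sums of token log-probs)
logp_chosen_policy = -12.0    # log pi_theta(y_w | x)
logp_chosen_ref = -14.0       # log pi_ref(y_w | x)
logp_rejected_policy = -20.0  # log pi_theta(y_l | x)
logp_rejected_ref = -18.0     # log pi_ref(y_l | x)
beta = 0.1

# beta-scaled difference of the two log-ratios from the formula above
margin = beta * ((logp_chosen_policy - logp_chosen_ref)
                 - (logp_rejected_policy - logp_rejected_ref))

# L_DPO = -log sigma(margin); a larger margin (policy favors y_w more
# than the reference does) drives the loss toward zero
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
print(round(margin, 2), round(loss, 4))
```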
| Aspect | Detail |
|---|---|
| Goal | Align outputs to human preferences without a reward model |
| Data | (prompt, chosen, rejected) triples |
| Models in memory | 2 (policy + frozen reference) |
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
# Load preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# LoRA configuration to reduce memory
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
# DPO training configuration
dpo_config = DPOConfig(
output_dir="./dpo_output",
beta=0.1,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=1,
learning_rate=5e-5,
logging_steps=10,
max_length=1024,
max_prompt_length=512,
fp16=True,
)
# Initialize DPO trainer
trainer = DPOTrainer(
model="Qwen/Qwen2.5-0.5B",
args=dpo_config,
train_dataset=dataset,
peft_config=peft_config,
)
# Start training
# trainer.train()
4. KTO (Kahneman-Tversky Optimization)
KTO works with unpaired feedback — each response is independently labeled as 👍 or 👎, without requiring a corresponding preferred/rejected pair.
This is inspired by Kahneman & Tversky’s Prospect Theory: humans feel losses more strongly than equivalent gains. KTO applies different loss weights to desirable and undesirable outputs.
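A simplified sketch of that asymmetry, not TRL's exact implementation: the lambda weights and the reference point z0 below are illustrative, but the shape of the loss (sigmoid pushed in opposite directions, with undesirable outputs weighted more heavily) mirrors the KTO formulation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(reward, desirable, z0=0.0, lambda_d=1.0, lambda_u=1.5):
    """Simplified KTO-style value function (illustrative weights).

    reward: implicit reward beta * log(pi_theta / pi_ref) for one completion
    z0: reference point (in KTO, an estimate of the policy/ref KL)
    """
    if desirable:
        # thumbs-up: push the reward above the reference point
        return lambda_d * (1.0 - sigmoid(reward - z0))
    # thumbs-down: push the reward below the reference point, weighted
    # more heavily, mirroring loss aversion in prospect theory
    return lambda_u * (1.0 - sigmoid(z0 - reward))

# Same |reward|, but the undesirable case is penalized harder
print(round(kto_loss(1.0, True), 4), round(kto_loss(1.0, False), 4))
```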
When to Use KTO
| Scenario | Use KTO? | Why |
|---|---|---|
| Only thumbs-up/down ratings available | ✅ Yes | KTO is designed for unpaired binary feedback |
| Responses are independently scored | ✅ Yes | No need to construct artificial pairs |
| You have paired preferences (chosen vs rejected) | ❌ Use DPO | DPO uses richer signal from direct comparisons |
| Collecting preference data is expensive | ✅ Yes | Binary feedback is cheaper to collect |
| You need maximum alignment quality | ❌ Use DPO/RLHF | Paired signals provide more information |
from trl import KTOTrainer, KTOConfig
# KTO expects a dataset with columns: prompt, completion, label (True/False)
# Example dataset structure:
# {"prompt": "Explain gravity", "completion": "Gravity is...", "label": True}
# {"prompt": "Explain gravity", "completion": "I don't know", "label": False}
# Load an unpaired preference dataset in this format
kto_dataset = load_dataset("trl-lib/kto-mix-14k", split="train")
kto_config = KTOConfig(
output_dir="./kto_output",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=1,
learning_rate=5e-5,
logging_steps=10,
max_length=1024,
max_prompt_length=512,
fp16=True,
)
# Initialize KTO trainer
# trainer = KTOTrainer(
# model="Qwen/Qwen2.5-0.5B",
# args=kto_config,
# train_dataset=kto_dataset,
# )
# trainer.train()
5. ORPO (Odds Ratio Preference Optimization)
ORPO combines SFT and preference alignment into a single training stage. It adds an odds-ratio-based penalty to the standard SFT loss, eliminating the need for a separate reference model.
| Aspect | Detail |
|---|---|
| Goal | SFT + alignment in one pass |
| Data | (prompt, chosen, rejected) triples |
| Models in memory | 1 (no reference model needed) |
| Key advantage | Simpler pipeline, lower memory |
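The odds-ratio penalty can be sketched numerically. The log-probabilities below are hypothetical length-averaged values, and the beta scaling is illustrative; the point is that the penalty shrinks as the policy's odds for the chosen response outgrow its odds for the rejected one:

```python
import math

def log_odds(logp):
    # log odds(y|x) = log(p / (1 - p)), with p the length-averaged
    # sequence probability
    p = math.exp(logp)
    return math.log(p / (1.0 - p))

# Hypothetical length-averaged log-probabilities under the current policy
logp_chosen, logp_rejected = -0.5, -2.0

# Odds-ratio term: -log sigma(log odds(y_w) - log odds(y_l))
log_or = log_odds(logp_chosen) - log_odds(logp_rejected)
l_or = -math.log(1.0 / (1.0 + math.exp(-log_or)))

# ORPO adds this penalty, scaled by beta, to the plain SFT loss
# on the chosen response
beta = 0.1
l_sft = -logp_chosen
loss = l_sft + beta * l_or
print(round(l_or, 4), round(loss, 4))
```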
from trl import ORPOTrainer, ORPOConfig
# Load preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# ORPO configuration
orpo_config = ORPOConfig(
output_dir="./orpo_output",
beta=0.1,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=1,
learning_rate=5e-5,
logging_steps=10,
max_length=1024,
max_prompt_length=512,
fp16=True,
)
# Initialize ORPO trainer
trainer = ORPOTrainer(
model="Qwen/Qwen2-0.5B-Instruct",
args=orpo_config,
train_dataset=dataset,
)
# Start training
# trainer.train()
6. GRPO (Group Relative Policy Optimization)
GRPO (used in DeepSeek-R1) replaces the critic model with group-level scoring. For each prompt, it generates multiple completions, scores them with reward functions, and normalizes scores within the group.
| Aspect | Detail |
|---|---|
| Goal | RL-based alignment without a critic model |
| Data | Prompts only — generations and rewards computed online |
| Models in memory | 1 (policy model) |
| Key advantage | Custom reward functions, no paired data needed |
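The group-level normalization can be sketched as follows. The rewards are hypothetical scores for eight completions of a single prompt; standardizing within the group (one common choice is the population standard deviation) yields the per-completion advantages that replace a critic's value estimates:

```python
# Hypothetical rewards for num_generations=8 completions of one prompt
rewards = [0.0, 1.0, 1.0, 0.5, 0.0, 1.0, 0.5, 0.0]

mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5

# Each completion's advantage is its reward standardized within its group
advantages = [(r - mean) / (std + 1e-8) for r in rewards]
print([round(a, 3) for a in advantages])
```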
from trl import GRPOTrainer, GRPOConfig
import re
# Load a math reasoning dataset
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
# Custom reward: check if the answer matches the expected format
def format_reward(completions, **kwargs):
    """Reward for following the expected <answer>...</answer> format."""
    pattern = r"<answer>.*?</answer>"
    # re.DOTALL lets the pattern match answers that span multiple lines
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
# Custom reward: prefer concise responses
def length_reward(completions, **kwargs):
    """Reward shorter completions (normalized by max reasonable length)."""
    max_len = 500
    return [max(0.0, 1.0 - len(c) / max_len) for c in completions]
# GRPO configuration
grpo_config = GRPOConfig(
output_dir="./grpo_output",
num_generations=8,
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=1,
learning_rate=5e-6,
logging_steps=10,
max_completion_length=512,
fp16=True,
)
# Initialize GRPO trainer with custom reward functions
trainer = GRPOTrainer(
model="Qwen/Qwen2.5-0.5B",
args=grpo_config,
train_dataset=dataset,
reward_funcs=[format_reward, length_reward],
)
# Start training
# trainer.train()
7. Method Comparison
Summary of Post-Training Alignment Methods
| Method | Type | Models in Memory | Data Requirement | Key Library |
|---|---|---|---|---|
| SFT | Supervised | 1 (policy) | (instruction, response) pairs | SFTTrainer |
| DPO | Preference | 2 (policy + reference) | (prompt, chosen, rejected) triples | DPOTrainer |
| KTO | Preference | 2 (policy + reference) | (prompt, completion, label) — unpaired | KTOTrainer |
| ORPO | Hybrid (SFT + Pref) | 1 (no reference) | (prompt, chosen, rejected) triples | ORPOTrainer |
| GRPO | Reinforcement Learning | 1 (policy) | Prompts only + reward functions | GRPOTrainer |
Key Trade-offs
- SFT is the foundation — always start here
- DPO gives the strongest preference alignment but requires paired data and 2x memory
- KTO is the best option when you only have thumbs-up/down feedback
- ORPO saves a training stage by combining SFT and alignment
- GRPO is the most flexible — define any reward function, no paired data needed
8. Practical Recommendations
When to Use Each Method
- Start with SFT — Always fine-tune on high-quality instruction data first
- Use DPO if you have paired preference data and enough GPU memory for two models
- Use KTO if you only have binary (good/bad) labels without pairing
- Use ORPO if you want a simpler single-stage pipeline
- Use GRPO if you need custom reward functions or verifiable outputs (math, code)
Memory Optimization with LoRA / QLoRA
All methods support LoRA and QLoRA to dramatically reduce memory requirements:
from peft import LoraConfig
from transformers import BitsAndBytesConfig
import torch
# LoRA config — adds <1% trainable parameters
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# QLoRA config — 4-bit quantized base model + LoRA adapters
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
print("LoRA config:", peft_config)
print("\nQLoRA quantization config:", bnb_config)