Post-Training LLMs for Human Alignment

A practical guide to SFT, DPO, KTO, ORPO, and GRPO — aligning language models with human preferences using the TRL library


📖 Read the full article


Table of Contents

  1. Setup & Installation
  2. SFT (Supervised Fine-Tuning)
  3. DPO (Direct Preference Optimization)
  4. KTO (Kahneman-Tversky Optimization)
  5. ORPO (Odds Ratio Preference Optimization)
  6. GRPO (Group Relative Policy Optimization)
  7. Method Comparison
  8. Practical Recommendations

1. Setup & Installation

Install the required packages for post-training alignment methods.

!pip install -q trl transformers datasets peft accelerate bitsandbytes

2. SFT (Supervised Fine-Tuning)

SFT is the first step in post-training alignment. It teaches the model to follow instructions by training on high-quality (instruction, response) pairs.

Aspect            | Detail
------------------|-------------------------------------------
Goal              | Teach the model to follow instructions
Data              | (instruction, response) pairs
Loss              | Standard cross-entropy on response tokens
Models in memory  | 1 (policy model)

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load a high-quality instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure SFT training
sft_config = SFTConfig(
    output_dir="./sft_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
    max_seq_length=1024,
    fp16=True,
)

# Initialize SFT trainer
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=sft_config,
    train_dataset=dataset,
)

# Start training
# trainer.train()
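The "cross-entropy on response tokens" loss can be illustrated with a toy sketch (plain Python, not TRL's implementation): prompt tokens are masked out, and only response tokens contribute to the mean negative log-likelihood.

```python
def sft_loss(token_logprobs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    response_mask:  1 for response tokens, 0 for prompt tokens (ignored).
    """
    nlls = [-lp for lp, m in zip(token_logprobs, response_mask) if m]
    return sum(nlls) / len(nlls)

# Two prompt tokens (mask 0) contribute nothing; only the response is scored.
logprobs = [-0.1, -0.2, -1.0, -2.0]
mask = [0, 0, 1, 1]
print(sft_loss(logprobs, mask))  # 1.5
```

Masking the prompt keeps the model from wasting capacity on reproducing instructions it will always be given at inference time.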

3. DPO (Direct Preference Optimization)

DPO skips the reward model entirely and directly optimizes the policy using preference pairs (chosen vs. rejected).

DPO Loss Formula

The DPO objective reparameterizes the RLHF reward function to derive a loss that depends only on the policy:

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\text{ref}}(y_l | x)} \right) \right]

Where:

  • \pi_\theta is the policy being trained
  • \pi_{\text{ref}} is the frozen reference model
  • y_w is the preferred (chosen) response
  • y_l is the rejected response
  • \beta controls how far the policy can deviate from the reference
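To make the formula concrete, here is a minimal numeric sketch of the per-example loss (plain Python; TRL's DPOTrainer computes the same quantity from batched token log-probs):

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """-log sigmoid(beta * log-ratio margin) for one (chosen, rejected) pair.

    Each argument is a sequence log-probability, e.g. pi_w = log pi_theta(y_w | x).
    """
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy identical to the reference: margin is 0, loss is log(2) ~ 0.693.
print(round(dpo_loss(-5.0, -9.0, -5.0, -9.0), 3))  # 0.693
# Policy favors the chosen response more than the reference does: loss drops.
print(dpo_loss(-4.0, -9.0, -5.0, -9.0) < math.log(2))  # True
```

The loss only depends on how the policy's chosen/rejected log-ratios move relative to the reference, which is exactly what lets DPO drop the explicit reward model.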

Aspect            | Detail
------------------|-----------------------------------------------------------
Goal              | Align outputs to human preferences without a reward model
Data              | (prompt, chosen, rejected) triples
Models in memory  | 2 (policy + frozen reference)

from trl import DPOTrainer, DPOConfig
from peft import LoraConfig

# Load preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# LoRA configuration to reduce memory
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# DPO training configuration
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    beta=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=10,
    max_length=1024,
    max_prompt_length=512,
    fp16=True,
)

# Initialize DPO trainer
trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=dpo_config,
    train_dataset=dataset,
    peft_config=peft_config,
)

# Start training
# trainer.train()

4. KTO (Kahneman-Tversky Optimization)

KTO works with unpaired feedback — each response is independently labeled as 👍 or 👎, without requiring a corresponding preferred/rejected pair.

This is inspired by Kahneman & Tversky’s Prospect Theory: humans feel losses more strongly than equivalent gains. KTO applies different loss weights to desirable and undesirable outputs.

When to Use KTO

Scenario                                          | Use KTO?        | Why
--------------------------------------------------|-----------------|------------------------------------------------
Only thumbs-up/down ratings available             | ✅ Yes          | KTO is designed for unpaired binary feedback
Responses are independently scored                | ✅ Yes          | No need to construct artificial pairs
You have paired preferences (chosen vs. rejected) | ❌ Use DPO      | DPO uses richer signal from direct comparisons
Collecting preference data is expensive           | ✅ Yes          | Binary feedback is cheaper to collect
You need maximum alignment quality                | ❌ Use DPO/RLHF | Paired signals provide more information

from trl import KTOTrainer, KTOConfig

# KTO expects a dataset with columns: prompt, completion, label (True/False)
# Example dataset structure:
# {"prompt": "Explain gravity", "completion": "Gravity is...", "label": True}
# {"prompt": "Explain gravity", "completion": "I don't know", "label": False}

kto_config = KTOConfig(
    output_dir="./kto_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=10,
    max_length=1024,
    max_prompt_length=512,
    fp16=True,
)

# Initialize KTO trainer
# trainer = KTOTrainer(
#     model="Qwen/Qwen2.5-0.5B",
#     args=kto_config,
#     train_dataset=kto_dataset,
# )

# trainer.train()
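A sketch of how raw thumbs-up/down logs might be turned into the (prompt, completion, label) rows shown above. The input field names (`response`, `rating`) are hypothetical; only the output column names come from KTO's expected format.

```python
def to_kto_rows(feedback_log):
    """Convert raw binary-feedback records into KTO-format rows.

    Input fields (`response`, `rating`) are made up for this example;
    the output columns (prompt, completion, label) are what KTOTrainer expects.
    """
    return [
        {
            "prompt": item["prompt"],
            "completion": item["response"],
            "label": item["rating"] == "up",
        }
        for item in feedback_log
    ]

log = [
    {"prompt": "Explain gravity", "response": "Gravity is...", "rating": "up"},
    {"prompt": "Explain gravity", "response": "I don't know", "rating": "down"},
]
rows = to_kto_rows(log)
print(rows[0]["label"], rows[1]["label"])  # True False
```

`datasets.Dataset.from_list(rows)` then yields the `kto_dataset` passed to the trainer. For the Prospect-Theory-style asymmetry between gains and losses, KTOConfig exposes `desirable_weight` and `undesirable_weight`; check the docs for your TRL version before relying on their defaults.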

5. ORPO (Odds Ratio Preference Optimization)

ORPO combines SFT and preference alignment into a single training stage. It adds an odds-ratio-based penalty to the standard SFT loss, eliminating the need for a separate reference model.

Aspect            | Detail
------------------|------------------------------------
Goal              | SFT + alignment in one pass
Data              | (prompt, chosen, rejected) triples
Models in memory  | 1 (no reference model needed)
Key advantage     | Simpler pipeline, lower memory

from trl import ORPOTrainer, ORPOConfig

# Load preference dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# ORPO configuration
orpo_config = ORPOConfig(
    output_dir="./orpo_output",
    beta=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=10,
    max_length=1024,
    max_prompt_length=512,
    fp16=True,
)

# Initialize ORPO trainer
trainer = ORPOTrainer(
    model="Qwen/Qwen2.5-0.5B",  # base model: ORPO does SFT and alignment together
    args=orpo_config,
    train_dataset=dataset,
)

# Start training
# trainer.train()

6. GRPO (Group Relative Policy Optimization)

GRPO (used to train DeepSeek-R1) replaces PPO's learned critic model with group-level scoring. For each prompt, it samples multiple completions, scores them with reward functions, and normalizes the scores within the group to obtain relative advantages.

Aspect            | Detail
------------------|--------------------------------------------------------
Goal              | RL-based alignment without a critic model
Data              | Prompts only — generations and rewards computed online
Models in memory  | 1 (policy model)
Key advantage     | Custom reward functions, no paired data needed

from trl import GRPOTrainer, GRPOConfig
import re

# Load a math reasoning dataset
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")


# Custom reward: check if the answer matches the expected format
def format_reward(completions, **kwargs):
    """Reward for following the expected <answer>...</answer> format."""
    pattern = r"<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c) else 0.0 for c in completions]


# Custom reward: prefer concise responses
def length_reward(completions, **kwargs):
    """Reward shorter completions (normalized by max reasonable length)."""
    max_len = 500
    return [max(0, 1.0 - len(c) / max_len) for c in completions]


# GRPO configuration
grpo_config = GRPOConfig(
    output_dir="./grpo_output",
    num_generations=8,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-6,
    logging_steps=10,
    max_completion_length=512,
    fp16=True,
)

# Initialize GRPO trainer with custom reward functions
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=grpo_config,
    train_dataset=dataset,
    reward_funcs=[format_reward, length_reward],
)

# Start training
# trainer.train()
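The group normalization step can be sketched as follows (plain Python; GRPOTrainer performs this internally for each prompt's group of `num_generations` completions):

```python
def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r_i - mean) / (std + eps), computed
    within one prompt's group of sampled completions."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Summed rewards for 4 completions of one prompt (e.g. format + length):
# above-average completions get positive advantages, below-average negative.
advs = group_advantages([1.5, 0.5, 1.0, 0.0])
print([round(a, 2) for a in advs])
```

Normalizing within the group means only *relative* quality matters, so reward functions on different scales can be combined without a learned value baseline.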

7. Method Comparison

Summary of Post-Training Alignment Methods

Method | Type                   | Models in Memory       | Data Requirement                      | TRL Class
-------|------------------------|------------------------|---------------------------------------|------------
SFT    | Supervised             | 1 (policy)             | (instruction, response) pairs         | SFTTrainer
DPO    | Preference             | 2 (policy + reference) | (prompt, chosen, rejected) triples    | DPOTrainer
KTO    | Preference             | 2 (policy + reference) | (prompt, completion, label), unpaired | KTOTrainer
ORPO   | Hybrid (SFT + pref.)   | 1 (no reference)       | (prompt, chosen, rejected) triples    | ORPOTrainer
GRPO   | Reinforcement learning | 1 (policy)             | Prompts only + reward functions       | GRPOTrainer

Key Trade-offs

  • SFT is the foundation — always start here
  • DPO offers strong preference alignment but requires paired data and roughly 2x the memory (policy plus frozen reference)
  • KTO is the best option when you only have thumbs-up/down feedback
  • ORPO saves a training stage by combining SFT and alignment
  • GRPO is the most flexible — define any reward function, no paired data needed

8. Practical Recommendations

When to Use Each Method

  1. Start with SFT — Always fine-tune on high-quality instruction data first
  2. Use DPO if you have paired preference data and enough GPU memory for two models
  3. Use KTO if you only have binary (good/bad) labels without pairing
  4. Use ORPO if you want a simpler single-stage pipeline
  5. Use GRPO if you need custom reward functions or verifiable outputs (math, code)

Memory Optimization with LoRA / QLoRA

All methods support LoRA and QLoRA to dramatically reduce memory requirements:

from peft import LoraConfig
from transformers import BitsAndBytesConfig
import torch

# LoRA config — adds <1% trainable parameters
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# QLoRA config — 4-bit quantized base model + LoRA adapters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

print("LoRA config:", peft_config)
print("\nQLoRA quantization config:", bnb_config)