graph LR
A["Pretraining<br/>(next-token prediction<br/>on large corpus)"] --> B["SFT<br/>(instruction<br/>fine-tuning)"]
B --> C["Preference Alignment<br/>(RLHF / DPO / ORPO<br/>/ KTO / GRPO)"]
C --> D["Deployment<br/>(quantization,<br/>serving)"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
Post-Training LLMs for Human Alignment
A practical comparison of SFT, RLHF, DPO, ORPO, KTO, and GRPO for aligning pretrained language models with human preferences
Keywords: alignment, RLHF, DPO, ORPO, KTO, GRPO, SFT, PPO, preference optimization, human feedback, reward model, TRL, fine-tuning, small models, transformers

Introduction
Pretrained language models learn broad knowledge from massive text corpora, but they don’t inherently follow instructions or behave safely. Post-training alignment bridges this gap — teaching models to produce helpful, harmless, and honest responses that match human expectations.
The alignment pipeline has evolved rapidly. Early methods like RLHF required training a separate reward model and running complex reinforcement learning. Newer approaches like DPO, ORPO, and GRPO simplify this process significantly, making alignment accessible even on consumer hardware with small models.
This article compares six key alignment methods: SFT, RLHF (PPO), DPO, KTO, ORPO, and GRPO. All code examples use small models (0.5B–1B parameters) with the TRL library.
For fine-tuning fundamentals, see Fine-tuning an LLM with Unsloth and Serving with Ollama. For model compression after alignment, see Quantization Methods for LLMs. For decoding strategies during inference, see Decoding Methods for Text Generation with LLMs.
The Alignment Pipeline Overview
Before diving into individual methods, here is how post-training fits into the LLM lifecycle:
Most alignment methods require two stages: first SFT to teach the model to follow instructions, then preference optimization to refine behavior. ORPO is unique in merging both stages into one.
1. Supervised Fine-Tuning (SFT)
SFT is the foundational first step. The model is trained on (instruction, response) pairs using standard cross-entropy loss, learning to follow instructions and produce structured outputs.
graph TD
A["Pretrained Base Model"] --> B["Instruction Dataset<br/>(prompt → response pairs)"]
B --> C["Cross-Entropy Loss<br/>on target tokens"]
C --> D["SFT Model<br/>(follows instructions)"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
How It Works
- The model receives a prompt (instruction) and is trained to predict the expected response token by token.
- Loss is computed only on the response tokens, not the prompt tokens.
- Common datasets: Alpaca, OpenAssistant, UltraChat.
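The loss masking described above can be sketched with toy token ids (illustrative values, not output of a real tokenizer): Hugging Face trainers skip the cross-entropy at every position whose label is -100.

```python
# Sketch of SFT label masking (toy token ids, for illustration only).
prompt_ids = [15, 42, 7]       # tokens of the instruction
response_ids = [99, 23, 2]     # tokens of the target response

input_ids = prompt_ids + response_ids
# Loss is computed only where labels != -100, i.e. on the response tokens:
labels = [-100] * len(prompt_ids) + response_ids

print(labels)  # [-100, -100, -100, 99, 23, 2]
```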
Code Example with TRL
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
training_args = SFTConfig(
output_dir="Qwen2.5-0.5B-SFT",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=2e-5,
max_seq_length=1024,
)
trainer = SFTTrainer(
model="Qwen/Qwen2.5-0.5B",
args=training_args,
train_dataset=dataset,
)
trainer.train()
Limitations
SFT teaches the model what to say but not how to discriminate between good and bad outputs. It increases the probability of both preferred and undesired response patterns. This is why a preference alignment stage is needed.
2. RLHF with PPO (Reinforcement Learning from Human Feedback)
RLHF is the classic alignment method, famously used to build InstructGPT and ChatGPT. It involves training a separate reward model on human preference data, then using Proximal Policy Optimization (PPO) to maximize the reward while staying close to the original model.
graph TD
subgraph Stage1["Stage 1: Reward Model Training"]
direction TB
A1["Human Annotators<br/>rank responses"] --> A2["Preference Dataset<br/>(prompt, chosen, rejected)"]
A2 --> A3["Train Reward Model<br/>(Bradley-Terry)"]
end
subgraph Stage2["Stage 2: PPO Fine-Tuning"]
direction TB
B1["SFT Model generates<br/>responses to prompts"] --> B2["Reward Model<br/>scores responses"]
B2 --> B3["PPO updates policy<br/>maximize reward - β·KL"]
end
Stage1 --> Stage2
style A1 fill:#4a90d9,color:#fff,stroke:#333
style A2 fill:#f5a623,color:#fff,stroke:#333
style A3 fill:#e74c3c,color:#fff,stroke:#333
style B1 fill:#4a90d9,color:#fff,stroke:#333
style B2 fill:#f5a623,color:#fff,stroke:#333
style B3 fill:#27ae60,color:#fff,stroke:#333
How It Works
- Reward Model: Trained on pairs of (chosen, rejected) responses. It learns to assign higher scores to human-preferred outputs using the Bradley-Terry ranking model.
- PPO Optimization: The policy (SFT model) generates responses, the reward model scores them, and PPO updates the policy to maximize reward while a KL divergence penalty prevents the model from drifting too far from the reference (SFT) model.
The objective is:
\max_\pi \mathbb{E}_{x \sim D, y \sim \pi}[R(x, y)] - \beta \cdot D_{KL}[\pi \| \pi_{\text{ref}}]
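The Stage 1 reward model is trained with the Bradley-Terry pairwise loss, -log σ(r(x, y⁺) − r(x, y⁻)). A minimal numeric sketch with toy reward scores (assumed values, standard library only):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Bradley-Terry loss for one preference pair: -log sigma(r_chosen - r_rejected).
# Toy reward-model scores (assumed values).
r_chosen, r_rejected = 2.0, 0.5
loss = -math.log(sigmoid(r_chosen - r_rejected))
print(round(loss, 4))  # 0.2014
```

The larger the margin between the chosen and rejected scores, the smaller the loss, which is exactly what pushes the reward model to rank human-preferred outputs higher.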
Key Components
| Component | Role |
|---|---|
| Policy model | The LLM being optimized |
| Reference model | Copy of SFT model (frozen), prevents reward hacking |
| Reward model | Scores generated outputs |
| Value model | Estimates expected future rewards for PPO |
Limitations
- Requires 3–4 models in memory simultaneously (policy, reference, reward, value)
- Training is unstable — sensitive to hyperparameters
- Reward model can be gamed (reward hacking)
- Complex engineering pipeline
3. DPO (Direct Preference Optimization)
DPO eliminates the need for a separate reward model by directly optimizing the policy on preference data. The key insight: the optimal RL policy can be expressed in closed form given the reward function, so we can reparametrize the reward model loss as a policy loss.
graph TD
A["SFT Model<br/>(policy + reference)"] --> B["Preference Dataset<br/>(prompt, chosen, rejected)"]
B --> C["DPO Loss<br/>binary cross-entropy<br/>on log-probability ratios"]
C --> D["Aligned Model<br/>(no reward model needed)"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
How It Works
DPO defines the loss directly on preference pairs:
\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-)} \left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y^+ | x)}{\pi_{\text{ref}}(y^+ | x)} - \log \frac{\pi_\theta(y^- | x)}{\pi_{\text{ref}}(y^- | x)}\right)\right)\right]
In practice, DPO increases the relative probability of the chosen response and decreases that of the rejected one, all while staying close to the reference model. The hyperparameter \beta controls the strength of the preference signal (typical values: 0.1–0.5).
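To make the loss concrete, here is a direct numeric transcription of the formula above for a single preference pair (toy log-probabilities, assumed values, standard library only):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

beta = 0.1
# Toy summed log-probabilities of whole responses (assumed values).
logp_chosen, ref_logp_chosen = -12.0, -14.0      # policy upweights chosen
logp_rejected, ref_logp_rejected = -15.0, -13.0  # policy downweights rejected

margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
loss = -math.log(sigmoid(beta * margin))
print(round(margin, 1), round(loss, 3))  # 4.0 0.513
```

As the policy separates chosen from rejected relative to the reference, the margin grows and the loss shrinks; β scales how hard that separation is pushed.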
Code Example with TRL
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
from peft import LoraConfig
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(
output_dir="Qwen2.5-0.5B-DPO",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=1e-6,
beta=0.1,
max_length=1024,
)
trainer = DPOTrainer(
model="Qwen/Qwen2.5-0.5B-Instruct",
args=training_args,
train_dataset=dataset,
peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()
Advantages over RLHF
- No reward model needed — only 2 models in memory (policy + reference)
- Stable training with simple classification loss
- Much simpler to implement
- Comparable or better results on benchmarks
Dataset Format
DPO expects preference data with three fields:
# Standard format
{"prompt": "What is AI?",
"chosen": "AI is a branch of computer science...",
"rejected": "AI is when computers become sentient..."}
# Conversational format
{"prompt": [{"role": "user", "content": "What is AI?"}],
"chosen": [{"role": "assistant", "content": "AI is a branch of..."}],
"rejected": [{"role": "assistant", "content": "AI is when..."}]}4. KTO (Kahneman-Tversky Optimization)
KTO removes the requirement of paired preference data. Instead of needing (chosen, rejected) pairs for the same prompt, KTO works with individual examples labeled as simply “good” or “bad” — like a thumbs up/thumbs down signal.
graph TD
A["SFT Model"] --> B["Unpaired Feedback<br/>👍 good examples<br/>👎 bad examples"]
B --> C["KTO Loss<br/>(Kahneman-Tversky<br/>value function)"]
C --> D["Aligned Model"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
How It Works
KTO is based on prospect theory from behavioral economics (Kahneman & Tversky). It models the human tendency to weigh losses more heavily than equivalent gains. The loss function treats “good” and “bad” examples independently:
- For good examples: maximize the utility of the model’s improvement over the reference
- For bad examples: penalize using a loss-averse weighting (losses hurt more than gains feel good)
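A rough numeric sketch of this asymmetry, following the shape of the KTO value terms (a σ of the β-scaled log-ratio around a reference point, with separate weights for desirable and undesirable examples; variable names and values are illustrative, not TRL's API):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

beta, z_ref = 0.1, 0.0     # z_ref: reference point (a KL estimate in the paper)
w_good, w_bad = 1.0, 1.33  # loss aversion: undesirable examples weigh more

def kto_loss(log_ratio: float, desirable: bool) -> float:
    """log_ratio = log pi_theta(y|x) - log pi_ref(y|x) for one example."""
    if desirable:
        return w_good * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    return w_bad * (1.0 - sigmoid(beta * (z_ref - log_ratio)))

# The same deviation is penalized more when the example is labeled "bad":
print(round(kto_loss(2.0, True), 3))   # 0.450
print(round(kto_loss(2.0, False), 3))  # 0.731
```

Note that, unlike DPO, each example contributes on its own: no paired (chosen, rejected) response for the same prompt is needed.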
When to Use KTO
| Scenario | Recommended? |
|---|---|
| You have paired preference data | DPO is generally better |
| You only have thumbs up/down labels | KTO is ideal |
| Production chatbot with user feedback | KTO is ideal |
| Creating preference data is expensive | KTO is ideal |
KTO is particularly practical when collecting production feedback — user thumbs up/down ratings are much easier to collect than pairwise comparisons.
5. ORPO (Odds Ratio Preference Optimization)
ORPO is unique: it combines SFT and preference alignment into a single training step. Instead of first doing SFT then DPO, ORPO adds an odds ratio penalty to the standard NLL (negative log-likelihood) loss, achieving both instruction-following and preference alignment simultaneously.
graph TD
A["Pretrained Base Model"] --> B["Preference Dataset<br/>(prompt, chosen, rejected)"]
B --> C["ORPO Loss<br/>= NLL + λ·OR Loss"]
subgraph LossComponents["Loss Components"]
direction LR
D["NLL Loss<br/>(SFT signal on<br/>chosen response)"]
E["Odds Ratio Loss<br/>(penalize rejected,<br/>reward chosen)"]
end
C --> LossComponents
LossComponents --> F["Aligned Model<br/>(single stage!)"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#4a90d9,color:#fff,stroke:#333
style E fill:#f5a623,color:#fff,stroke:#333
style F fill:#27ae60,color:#fff,stroke:#333
How It Works
The ORPO objective is:
\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x, y^+, y^-)} \left[\mathcal{L}_{\text{SFT}}(x, y^+) + \lambda \cdot \mathcal{L}_{\text{OR}}(x, y^+, y^-)\right]
Where \mathcal{L}_{\text{OR}} is the odds ratio loss that contrasts the likelihood of chosen vs. rejected responses. The NLL component handles instruction following (like SFT), while the OR component handles preference alignment.
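The odds ratio term can be sketched numerically: with odds(p) = p / (1 − p), the OR loss is -log σ(log(odds(p⁺)/odds(p⁻))). Toy length-normalized sequence probabilities below are assumed values, standard library only:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def odds(p: float) -> float:
    return p / (1.0 - p)

# Toy length-normalized sequence probabilities (assumed values).
p_chosen, p_rejected = 0.6, 0.3

or_loss = -math.log(sigmoid(math.log(odds(p_chosen) / odds(p_rejected))))
nll = -math.log(p_chosen)  # SFT term on the chosen response
lam = 0.1
total = nll + lam * or_loss
print(round(or_loss, 4))   # 0.2513
```

Both terms pull on the same forward pass, which is why no separate SFT stage or reference model is needed.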
Key Advantages
- No reference model required — saves 50% memory compared to DPO
- Single stage — no separate SFT step
- Computationally efficient — fewer total training steps
- Tested from 125M to 7B parameters
Code Example with TRL
from trl.experimental.orpo import ORPOTrainer, ORPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = ORPOConfig(
output_dir="Qwen2-0.5B-ORPO",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=8e-6,
beta=0.1, # λ: weight of the OR loss
max_length=1024,
)
trainer = ORPOTrainer(
model=model,
args=training_args,
processing_class=tokenizer,
train_dataset=dataset,
)
trainer.train()
6. GRPO (Group Relative Policy Optimization)
GRPO was introduced by DeepSeek for enhancing mathematical reasoning. Unlike DPO which uses offline preference data, GRPO is an online RL method that generates multiple completions per prompt, scores them with a reward function, and uses the relative ranking within each group to compute advantages — all without a separate value model.
graph TD
A["Policy Model"] --> B["Generate G completions<br/>per prompt"]
B --> C["Score with<br/>Reward Function"]
C --> D["Compute Group-Relative<br/>Advantages<br/>A = (r - mean) / std"]
D --> E["PPO-style Update<br/>with clipped objective"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
How It Works
- For each prompt, generate G completions (e.g., G=8)
- Score each completion with a reward function (can be a model or a rule-based function)
- Compute group-relative advantage: normalize rewards within the group to get relative quality
- Update the policy using a clipped surrogate objective (like PPO, but without a value model)
The advantage for completion i is:
\hat{A}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}
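A numeric sketch of this normalization with toy binary rewards (here using the sample standard deviation from the standard library; implementations may use the population one):

```python
from statistics import mean, stdev

# Toy binary rewards for G = 8 completions of one prompt (assumed values).
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]

mu, sigma = mean(rewards), stdev(rewards)
advantages = [(r - mu) / sigma for r in rewards]
# Correct completions get a positive advantage, incorrect ones a negative one:
print(round(advantages[0], 3), round(advantages[1], 3))  # 0.935 -0.935
```

Because the advantages are relative within the group, only how a completion compares to its siblings matters, not the absolute reward scale.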
Key Innovation
GRPO replaces the value model used in PPO with group-relative normalization, making it:
- Memory-efficient: No value model needed
- Self-improving: Uses model’s own generations for training (online RL)
- Flexible: Works with any reward function — including rule-based rewards (no neural reward model required)
Code Example with TRL
from trl import GRPOTrainer, GRPOConfig
from trl.rewards import accuracy_reward
from datasets import load_dataset
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
training_args = GRPOConfig(
output_dir="Qwen2.5-0.5B-GRPO",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=1e-6,
num_generations=8,
max_completion_length=256,
)
trainer = GRPOTrainer(
model="Qwen/Qwen2.5-0.5B-Instruct",
reward_funcs=accuracy_reward,
args=training_args,
train_dataset=dataset,
)
trainer.train()
Custom Reward Functions
One of GRPO’s strengths is its support for rule-based reward functions — no neural reward model needed:
import re
def format_reward(completions, **kwargs):
"""Reward for structured <think>...</think><answer>...</answer> format."""
pattern = r"^<think>.*?</think><answer>.*?</answer>$"
# re.DOTALL lets .*? span multi-line reasoning traces
return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]
def length_reward(completions, **kwargs):
"""Reward longer, more detailed responses."""
return [min(len(c) / 500, 1.0) for c in completions]
# Combine multiple reward functions
trainer = GRPOTrainer(
model="Qwen/Qwen2.5-0.5B-Instruct",
reward_funcs=[format_reward, length_reward],
reward_weights=[1.0, 0.5],
...
)
Method Comparison
graph TD
A{{"Do you have<br/>preference data?"}}
A -->|"No, only instructions"| B["SFT"]
A -->|"Yes"| C{{"Paired or<br/>unpaired?"}}
C -->|"Unpaired<br/>(thumbs up/down)"| D["KTO"]
C -->|"Paired<br/>(chosen/rejected)"| E{{"Want single-stage<br/>training?"}}
E -->|"Yes"| F["ORPO"]
E -->|"No"| G{{"Online or<br/>offline RL?"}}
G -->|"Offline<br/>(fixed dataset)"| H["DPO"]
G -->|"Online<br/>(model generates)"| I{{"Need rule-based<br/>rewards?"}}
I -->|"Yes"| J["GRPO"]
I -->|"No, have<br/>reward model"| K["RLHF (PPO)"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style E fill:#e74c3c,color:#fff,stroke:#333
style F fill:#27ae60,color:#fff,stroke:#333
style G fill:#e74c3c,color:#fff,stroke:#333
style H fill:#27ae60,color:#fff,stroke:#333
style I fill:#e74c3c,color:#fff,stroke:#333
style J fill:#27ae60,color:#fff,stroke:#333
style K fill:#27ae60,color:#fff,stroke:#333
Summary Table
| Method | Type | Models in Memory | Needs Reward Model | Needs Reference Model | Data Requirement | Key Library |
|---|---|---|---|---|---|---|
| SFT | Supervised | 1 | No | No | (instruction, response) | TRL SFTTrainer |
| RLHF (PPO) | Online RL | 3–4 | Yes | Yes | Preference pairs + reward model | TRL PPOTrainer |
| DPO | Offline | 2 | No | Yes | Preference pairs | TRL DPOTrainer |
| KTO | Offline | 2 | No | Yes | Unpaired good/bad labels | TRL KTOTrainer |
| ORPO | Offline | 1 | No | No | Preference pairs | TRL ORPOTrainer |
| GRPO | Online RL | 1–2 | Optional | Optional | Prompts + reward function | TRL GRPOTrainer |
Hyperparameter Sensitivity
The \beta parameter is critical across methods. Empirical studies show:
| Method | Recommended β Range | Notes |
|---|---|---|
| DPO | 0.01 – 0.5 | Lower β often works best; 0.1 is a common default |
| KTO | 0.01 – 0.3 | Similar trends to DPO |
| ORPO | 0.1 (λ) | Controls the weight of the odds ratio loss |
| GRPO | 0.0 – 0.001 | Recent work suggests β=0 (no KL penalty) works well |
Practical Recommendations
Resource-Constrained Settings
For training on a single consumer GPU (16–24 GB VRAM) with small models:
- Start with SFT using QLoRA to teach instruction following
- Apply DPO or ORPO with LoRA adapters for preference alignment
- Use 4-bit quantization (bitsandbytes) to fit both policy and reference models
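A typical 4-bit loading setup for the last bullet might look like the following sketch of the common NF4 configuration (verify argument names against your installed transformers version):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4-bit
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    quantization_config=bnb_config,
)
```

The quantized base stays frozen while LoRA adapters are trained in higher precision on top of it (the QLoRA recipe).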
When to Use Each Method
- SFT only: When you have high-quality instruction data and just need a helpful assistant
- DPO: Best general-purpose alignment method — simple, stable, well-tested
- ORPO: When compute is limited and you want a single-stage pipeline
- KTO: When you only have binary feedback (production chatbot settings)
- GRPO: For reasoning tasks (math, code) where you can define verifiable reward functions
- RLHF: When you have the infrastructure and need maximum control over the reward signal
Training with LoRA/QLoRA
All methods support parameter-efficient fine-tuning:
from peft import LoraConfig
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
task_type="CAUSAL_LM",
)
# Pass to any TRL trainer
trainer = DPOTrainer(
model="Qwen/Qwen2.5-0.5B-Instruct",
peft_config=peft_config,
...
)
Evolution of Alignment Methods
graph LR
A["RLHF + PPO<br/>(2022)<br/>InstructGPT"] --> B["DPO<br/>(2023)<br/>No reward model"]
B --> C["KTO<br/>(2024)<br/>Unpaired data"]
B --> D["IPO<br/>(2023)<br/>Regularized DPO"]
A --> E["ORPO<br/>(2024)<br/>Single-stage"]
A --> F["GRPO<br/>(2024)<br/>DeepSeek-Math"]
F --> G["DeepSeek-R1<br/>(2025)<br/>Reasoning RL"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#e67e22,color:#fff,stroke:#333
style G fill:#1abc9c,color:#fff,stroke:#333
Conclusion
Human alignment has rapidly evolved from the complex RLHF pipeline to simpler, more efficient methods. DPO remains the most popular general-purpose method due to its stability and simplicity. ORPO offers an attractive single-stage alternative. GRPO is emerging as the method of choice for reasoning tasks, especially after its success in DeepSeek-R1.
The choice of method depends on your data, compute, and use case. For most practitioners starting out, the recommended path is:
- SFT with LoRA on instruction data
- DPO with LoRA on preference data
- Quantize and deploy
For serving your aligned model, see Run LLM locally with Ollama or Deploying and Serving LLM with vLLM.
References
- Ouyang et al., Training language models to follow instructions with human feedback (InstructGPT), 2022. arXiv:2203.02155
- Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023. arXiv:2305.18290
- Ethayarajh et al., KTO: Model Alignment as Prospect Theoretic Optimization, 2024. arXiv:2402.01306
- Azar et al., A General Theoretical Paradigm to Understand Learning from Human Feedback (IPO), 2023. arXiv:2310.12036
- Hong et al., ORPO: Monolithic Preference Optimization without Reference Model, 2024. arXiv:2403.07691
- Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO), 2024. arXiv:2402.03300
- DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025. arXiv:2501.12948
- von Werra et al., TRL: Transformer Reinforcement Learning. GitHub
Read More
- Explore TRL documentation for the latest trainers and features
- Try the Hugging Face alignment-handbook for production recipes
- Experiment with GRPO + custom reward functions for domain-specific reasoning tasks