graph TD
A["Base / Instruct Model"] --> B{{"Training Approach"}}
B -->|"Path 1: Distillation"| C["SFT on reasoning traces<br/>from a stronger model"]
B -->|"Path 2: RL from scratch"| D["GRPO / PPO with<br/>verifiable reward functions"]
C --> E["Reasoning Model"]
D --> E
B -->|"Path 3: Hybrid"| F["SFT warm-start<br/>→ RL refinement"]
F --> E
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#e67e22,color:#fff,stroke:#333
Training LLMs for Reasoning
A practical guide to building reasoning capabilities in language models using RL, distillation, and chain-of-thought training
Keywords: reasoning, chain-of-thought, GRPO, reinforcement learning, distillation, DeepSeek-R1, STaR, reward function, math reasoning, code reasoning, small models, TRL, transformers

Introduction
Standard LLMs answer directly, with no intermediate deliberation: fast, but limited for complex problems. Reasoning models break this pattern by “thinking” step-by-step before answering, producing a chain of thought (CoT) that dramatically improves performance on math, code, and logic tasks.
The breakthrough came from OpenAI’s o1 and DeepSeek-R1, which showed that pure reinforcement learning can teach models to reason without any human-annotated reasoning traces. Even more remarkably, the reasoning patterns learned by large models can be distilled into much smaller ones.
This article covers the key methods for building reasoning LLMs: chain-of-thought SFT, reward-based RL (GRPO), reasoning distillation, and self-improvement loops. All examples target small models (0.5B–3B parameters) using TRL.
For alignment fundamentals, see Post-Training LLMs for Human Alignment. For fine-tuning basics, see Fine-tuning an LLM with Unsloth and Serving with Ollama. For inference, see Decoding Methods for Text Generation with LLMs.
The Reasoning Training Landscape
There are fundamentally three paths to building a reasoning model:
| Path | Pros | Cons |
|---|---|---|
| Distillation | Simple, stable, fast | Bounded by teacher quality |
| RL from scratch | Can surpass teacher, emergent behaviors | Unstable, needs reward design |
| Hybrid | Best of both worlds | More complex pipeline |
DeepSeek-R1 demonstrated all three: R1-Zero used pure RL, R1 used a hybrid pipeline, and the R1-Distill family transferred R1’s reasoning to smaller models via SFT.
1. Chain-of-Thought SFT (Reasoning Distillation)
The simplest approach: train your model on (question, reasoning_trace, answer) examples generated by a stronger reasoning model. This is how DeepSeek released their R1-Distill models (1.5B to 70B) and how SmolLM3 acquired reasoning via mid-training.
graph LR
A["Teacher Model<br/>(e.g. DeepSeek-R1,<br/>Qwen3-32B)"] -->|"Generate reasoning<br/>traces"| B["Reasoning Dataset<br/><think>...</think><br/><answer>...</answer>"]
B -->|"SFT"| C["Student Model<br/>(e.g. Qwen2.5-0.5B)"]
C --> D["Small Reasoning<br/>Model"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
Reasoning Trace Format
Models are trained to produce structured thinking before the final answer:
<think>
The problem asks for the sum of all even numbers from 1 to 100.
Even numbers: 2, 4, 6, ..., 100
This is an arithmetic series with first term a=2, last term l=100, common difference d=2.
Number of terms: n = (100-2)/2 + 1 = 50
Sum = n × (a + l) / 2 = 50 × (2 + 100) / 2 = 50 × 51 = 2550
</think>
<answer>2550</answer>
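A trace in this format is straightforward to parse programmatically. A minimal sketch (the tag names follow the format above; `parse_trace` is an illustrative helper, not part of any library):

```python
import re

def parse_trace(text: str) -> dict:
    """Split a completion into its thinking and answer parts."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "think": think.group(1).strip() if think else "",
        "answer": answer.group(1).strip() if answer else "",
    }

trace = "<think>50 × 51 = 2550</think>\n<answer>2550</answer>"
parsed = parse_trace(trace)
# parsed["answer"] == "2550"
```

The same extraction logic reappears in the reward functions later in this article.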
Key Datasets
| Dataset | Source | Size | Domain |
|---|---|---|---|
| OpenThoughts3-1.2M | Open-R1 project | 1.2M | Math, code, science |
| OpenMathReasoning | NVIDIA | 3.2M | Math |
| Bespoke-Stratos-17k | Bespoke Labs | 17k | General reasoning |
| R1-Distill data | DeepSeek | 800k | Math, code, STEM |
Code Example: Distillation SFT
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from peft import LoraConfig
# Load a reasoning trace dataset
dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")
training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-Reasoning-SFT",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    max_seq_length=4096,  # reasoning traces are long
    gradient_checkpointing=True,
    bf16=True,
)
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
trainer.train()
SmolLM3’s Mid-Training Approach
HuggingFace’s SmolLM3 (3B) used a two-phase approach for reasoning:
- Reasoning mid-training: Trained for 4 epochs (~140B tokens) on OpenThoughts3-1.2M and NVIDIA’s Nemotron reasoning subset, using ChatML and wrapped packing
- Dual-mode SFT: Balanced 1B non-reasoning tokens + 0.8B reasoning tokens across 22 datasets, with synthetic data from Qwen3-32B to fill coverage gaps
This produced a model that supports /think and /no_think modes.
2. RL-Based Reasoning (GRPO)
The more powerful but complex approach: use reinforcement learning to teach the model to reason by rewarding correct final answers. DeepSeek-R1-Zero proved this works with pure RL — no human demonstrations at all.
graph TD
subgraph Generation["Generation Phase"]
direction TB
A1["Policy Model"] --> A2["Generate G completions<br/>per math/code prompt"]
end
subgraph Reward["Reward Phase"]
direction TB
B1["Accuracy Reward<br/>(is answer correct?)"]
B2["Format Reward<br/>(<think>...</think><br/><answer>...</answer>)"]
end
subgraph Update["Update Phase"]
direction TB
C1["Group-Relative<br/>Advantage<br/>A = (r - mean)/std"]
C2["Policy Gradient<br/>with clipping"]
end
Generation --> Reward
Reward --> Update
Update -->|"repeat"| Generation
style A1 fill:#4a90d9,color:#fff,stroke:#333
style A2 fill:#4a90d9,color:#fff,stroke:#333
style B1 fill:#f5a623,color:#fff,stroke:#333
style B2 fill:#f5a623,color:#fff,stroke:#333
style C1 fill:#e74c3c,color:#fff,stroke:#333
style C2 fill:#e74c3c,color:#fff,stroke:#333
Why GRPO for Reasoning?
GRPO (Group Relative Policy Optimization) is the algorithm of choice for reasoning tasks because:
- No value model: Unlike PPO, GRPO uses group-relative normalization instead of a learned value function, saving memory
- Works with rule-based rewards: Math and code have verifiable answers — no neural reward model needed
- Self-play: The model generates multiple attempts, learns from its own successes and failures
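The group-relative advantage from the diagram above can be made concrete. A minimal sketch, assuming a group of G = 8 completions with binary accuracy rewards (the helper function is illustrative, not TRL's internal implementation):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within a group: A_i = (r_i - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# 8 completions for one prompt: 3 correct (reward 1.0), 5 incorrect (0.0)
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
# Correct completions get a positive advantage, incorrect ones negative;
# the advantages sum to zero within the group
```

This is why GRPO needs no value model: the other completions in the group serve as the baseline.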
Reward Functions for Reasoning
The key insight from DeepSeek-R1: reasoning rewards decompose into two simple signals:
import re
def accuracy_reward(completions, ground_truth, **kwargs):
    """Check if the final answer matches the ground truth."""
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # Extract the answer from <answer> tags or \boxed{}
        match = re.search(r"<answer>(.*?)</answer>", completion)
        if not match:
            match = re.search(r"\\boxed\{(.*?)\}", completion)
        extracted = match.group(1).strip() if match else ""
        rewards.append(1.0 if extracted == gt.strip() else 0.0)
    return rewards

def format_reward(completions, **kwargs):
    """Reward the structured thinking format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
Code Example: GRPO for Math Reasoning
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
training_args = GRPOConfig(
    output_dir="Qwen2.5-0.5B-GRPO-Math",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=1e-6,
    num_generations=8,  # generate 8 attempts per prompt
    max_completion_length=2048,  # allow long reasoning chains
    temperature=0.7,
    reward_weights=[1.0, 0.5],  # weight accuracy above format
    gradient_checkpointing=True,
    bf16=True,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[accuracy_reward, format_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
Emergent Behaviors from RL
DeepSeek-R1-Zero demonstrated remarkable emergent capabilities from pure RL:
graph LR
A["Pure RL Training<br/>(no demonstrations)"] --> B["Self-verification<br/>'Let me check...'"]
A --> C["Backtracking<br/>'Wait, that's wrong...'"]
A --> D["Exploration<br/>'Another approach...'"]
A --> E["Extended thinking<br/>(longer CoT = better)"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
These behaviors were not trained explicitly — the model discovered them because they led to higher accuracy rewards.
3. The DeepSeek-R1 Recipe
DeepSeek-R1 is the most comprehensive open recipe for training a reasoning model. Here’s the full pipeline:
graph TD
A["DeepSeek-V3 Base<br/>(671B MoE)"] --> B["R1-Zero<br/>Pure GRPO on<br/>math & code"]
B --> C["Cold Start SFT<br/>Small curated set<br/>for readability"]
C --> D["RL Phase 2<br/>Expanded rewards<br/>(helpfulness, safety)"]
D --> E["Rejection Sampling<br/>Filter best outputs"]
E --> F["Final SFT<br/>Polish on filtered data"]
F --> G["DeepSeek-R1"]
G -->|"Distill reasoning<br/>traces"| H["R1-Distill-Qwen-1.5B"]
G -->|"Distill"| I["R1-Distill-Qwen-7B"]
G -->|"Distill"| J["R1-Distill-Llama-8B"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
style E fill:#e67e22,color:#fff,stroke:#333
style F fill:#1abc9c,color:#fff,stroke:#333
style G fill:#27ae60,color:#fff,stroke:#333
style H fill:#3498db,color:#fff,stroke:#333
style I fill:#3498db,color:#fff,stroke:#333
style J fill:#3498db,color:#fff,stroke:#333
Key Stages Explained
| Stage | Purpose | Key Detail |
|---|---|---|
| R1-Zero | Prove RL alone can develop reasoning | GRPO with accuracy + format rewards only |
| Cold Start SFT | Fix readability issues from pure RL | Small set of human-curated examples |
| RL Phase 2 | Expand beyond math/code | Add helpfulness and safety rewards |
| Rejection Sampling | Create high-quality SFT data | Use reward model to filter model outputs |
| Final SFT | Polish and stabilize | Train on best-of-N filtered data |
| Distillation | Transfer to small models | SFT smaller models on R1’s reasoning traces |
R1-Zero’s “Aha Moment”
One of the most fascinating findings: during training, R1-Zero spontaneously learned to re-evaluate its approach mid-reasoning:
<think>
Wait, let me reconsider. The approach I was taking doesn't account for
the boundary condition. Let me try a different method...
Hmm, actually I think I need to use dynamic programming here instead
of the greedy approach. Let me restart...
</think>
This self-correction emerged purely from the RL reward signal — no human ever demonstrated this behavior.
4. Self-Improvement and Rejection Sampling
An intermediate approach between pure SFT and RL: generate many completions, keep only the correct ones, and retrain. This is related to the STaR (Self-Taught Reasoner) paradigm.
graph TD
A["Current Model"] --> B["Generate N responses<br/>per prompt<br/>(e.g., N=64)"]
B --> C["Verify answers<br/>(math: check result,<br/>code: run tests)"]
C --> D["Keep correct<br/>responses only"]
D --> E["SFT on filtered<br/>dataset"]
E --> F["Improved Model"]
F -->|"Iterate"| A
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#1abc9c,color:#fff,stroke:#333
Code Example: Rejection Sampling Pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import re
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
def generate_and_filter(prompt, ground_truth, n_samples=16):
    """Generate multiple responses and keep only the correct ones."""
    correct_responses = []
    for _ in range(n_samples):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs, max_new_tokens=2048,
            temperature=0.7, do_sample=True,
        )
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Check whether the final answer is correct
        match = re.search(r"\\boxed\{(.*?)\}", response)
        if match and match.group(1).strip() == ground_truth.strip():
            correct_responses.append(response)
    return correct_responses

# Build a filtered dataset from your math problems
# (math_dataset: any iterable of {"prompt": ..., "answer": ...} examples)
filtered_data = []
for example in math_dataset:
    correct = generate_and_filter(example["prompt"], example["answer"])
    if correct:
        filtered_data.append({
            "prompt": example["prompt"],
            "completion": correct[0],  # take the first correct response
        })

# Fine-tune on the filtered data
filtered_dataset = Dataset.from_list(filtered_data)
Comparison: Rejection Sampling vs RL
| Aspect | Rejection Sampling | GRPO |
|---|---|---|
| Learning signal | Binary (correct/incorrect) | Continuous reward + relative ranking |
| Computational cost | Generate once, train once | Generate and update every step |
| Exploration | Limited to current policy | Active exploration via temperature |
| Scaling | Improves with more samples | Improves with more training steps |
| Stability | Very stable (just SFT) | Can be unstable, needs tuning |
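The “Scaling” row can be quantified: if the model solves a problem with probability p per sample, the probability that at least one of N independent samples is correct is 1 − (1 − p)^N. A quick check of the numbers:

```python
def pass_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

# Even a 10% per-sample solve rate yields usable data with enough samples
print(round(pass_at_least_one(0.10, 1), 3))   # 0.1
print(round(pass_at_least_one(0.10, 16), 3))  # 0.815
print(round(pass_at_least_one(0.10, 64), 3))  # 0.999
```

This independence assumption is optimistic (samples from one policy are correlated), but it explains why rejection sampling can bootstrap even weak models on hard problems.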
5. Reward Design for Reasoning
The quality of reward functions is critical for RL-based reasoning training. Here are the main types:
graph TD
A{{"Reward Type"}} --> B["Outcome-Based<br/>(ORM)"]
A --> C["Process-Based<br/>(PRM)"]
A --> D["Rule-Based<br/>(Verifiers)"]
B --> B1["Score final answer<br/>correct = 1, wrong = 0"]
C --> C1["Score each<br/>reasoning step"]
D --> D1["Run code tests,<br/>check math equality"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#4a90d9,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style B1 fill:#4a90d9,color:#fff,stroke:#333
style C1 fill:#f5a623,color:#fff,stroke:#333
style D1 fill:#27ae60,color:#fff,stroke:#333
Types of Rewards
Outcome-Based Rewards (ORM): Score only the final answer. Simple but can reward lucky guesses with wrong reasoning.
def outcome_reward(completions, ground_truth, **kwargs):
    """Binary reward: is the final answer correct?"""
    rewards = []
    for c, gt in zip(completions, ground_truth):
        match = re.search(r"<answer>(.*?)</answer>", c)
        answer = match.group(1).strip() if match else ""
        rewards.append(1.0 if answer == gt else 0.0)
    return rewards
Process-Based Rewards (PRM): Score intermediate reasoning steps. More informative, but requires step-level annotations.
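A process reward in this style scores each step rather than just the outcome. A minimal sketch, where `score_step` stands in for a learned step verifier (hypothetical here; real PRMs are trained on step-level labels):

```python
def process_reward(completion: str, score_step) -> float:
    """Average per-step scores over the lines of a reasoning trace."""
    steps = [line for line in completion.split("\n") if line.strip()]
    if not steps:
        return 0.0
    return sum(score_step(s) for s in steps) / len(steps)

# Toy verifier: rewards steps that show explicit arithmetic
toy_verifier = lambda step: 1.0 if "=" in step else 0.0
r = process_reward("n = 50\nSum = 2550", toy_verifier)
# r == 1.0
```

Averaging over steps means a trace with one bad step still earns partial credit, which is exactly the denser signal that distinguishes PRMs from outcome rewards.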
Rule-Based Verifiers: For code, just run the tests. For math, use symbolic solvers. This is the most reliable approach.
def code_execution_reward(completions, test_cases, **kwargs):
    """Execute generated code against test cases."""
    rewards = []
    for completion, tests in zip(completions, test_cases):
        # Extract the code block
        match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
        if not match:
            rewards.append(0.0)
            continue
        code = match.group(1)
        passed = 0
        for test in tests:
            try:
                # NOTE: stripping __builtins__ is not a real sandbox;
                # run untrusted model code in a subprocess or container
                exec(code + "\n" + test, {"__builtins__": {}})
                passed += 1
            except Exception:
                pass
        rewards.append(passed / len(tests))
    return rewards
Multi-Reward Composition in GRPO
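The composition below references a `length_penalty` reward that isn't defined in this article. One plausible sketch (the character budget and the name itself are assumptions for illustration):

```python
def length_penalty(completions, max_chars=8000, **kwargs):
    """Small negative reward for overly long completions, 0 otherwise."""
    return [-1.0 if len(c) > max_chars else 0.0 for c in completions]
```

With a low weight (0.1 below), this nudges the model away from runaway chains of thought without overriding the accuracy signal.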
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[accuracy_reward, format_reward, length_penalty],
    args=GRPOConfig(num_generations=8, reward_weights=[1.0, 0.3, 0.1], ...),  # accuracy dominates
    train_dataset=dataset,
)
6. Building a Dual-Mode Model (Think / No-Think)
Modern reasoning models support two modes — direct answering when reasoning isn’t needed, and extended thinking for complex problems. SmolLM3 demonstrated a complete open recipe for this.
graph TD
A["User Query"] --> B{{"Mode?"}}
B -->|"/think"| C["Generate:<br/><think>step-by-step reasoning</think><br/>Final answer"]
B -->|"/no_think"| D["Generate:<br/><think></think><br/>Direct answer"]
subgraph Training["Training Pipeline"]
direction TB
E["Mid-training<br/>(35B reasoning tokens)"] --> F["Dual-mode SFT<br/>(1B no-think + 0.8B think)"]
F --> G["Preference Alignment<br/>(APO/DPO)"]
end
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#f5a623,color:#fff,stroke:#333
style E fill:#9b59b6,color:#fff,stroke:#333
style F fill:#e67e22,color:#fff,stroke:#333
style G fill:#1abc9c,color:#fff,stroke:#333
Training a Dual-Mode Reasoner
The SmolLM3 recipe:
- Reasoning mid-training (general purpose): 4 epochs on reasoning traces from OpenThoughts3 + Nemotron
- Dual-mode SFT: Carefully balanced data with both thinking and non-thinking examples, synthetic data from Qwen3-32B to fill gaps
- APO alignment: Anchored Preference Optimization with chosen/rejected pairs for both modes
- Model merging: Combine APO checkpoint with mid-training checkpoint to recover long-context performance
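Step 4 (model merging) can be sketched as weight interpolation between checkpoints. A minimal sketch over plain parameter dicts (real merges operate on full model state dicts, often with dedicated tooling; the helper and toy checkpoints below are illustrative only):

```python
def merge_params(a: dict, b: dict, alpha: float = 0.5) -> dict:
    """Linear interpolation of two checkpoints: alpha*a + (1 - alpha)*b."""
    assert a.keys() == b.keys(), "checkpoints must share the same parameters"
    return {k: alpha * a[k] + (1.0 - alpha) * b[k] for k in a}

apo_ckpt = {"w": 1.0, "b": 0.0}       # stand-in for the APO checkpoint
midtrain_ckpt = {"w": 3.0, "b": 2.0}  # stand-in for the mid-training checkpoint
merged = merge_params(apo_ckpt, midtrain_ckpt, alpha=0.5)
# merged == {"w": 2.0, "b": 1.0}
```

The `alpha` knob trades off the two checkpoints' strengths, which is how merging can recover long-context performance lost during alignment.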
Inference with Thinking Mode
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
# Reasoning mode
messages_think = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "What is 17 × 23?"},
]
# Direct mode
messages_direct = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "What is the capital of France?"},
]
text = tokenizer.apply_chat_template(messages_think, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=4096, temperature=0.6, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Method Comparison
Decision Flowchart
graph TD
A{{"What resources<br/>do you have?"}}
A -->|"Limited GPU,<br/>quick results"| B{{"Have a teacher<br/>model?"}}
B -->|"Yes"| C["Reasoning Distillation<br/>(CoT SFT)"]
B -->|"No"| D["Use existing datasets<br/>(OpenThoughts, etc.)"]
A -->|"Multi-GPU,<br/>longer training"| E{{"Verifiable<br/>tasks?"}}
E -->|"Yes (math, code)"| F["GRPO with<br/>rule-based rewards"]
E -->|"No (general)"| G["Hybrid: SFT warm-start<br/>+ RL refinement"]
A -->|"Moderate GPU,<br/>iterative"| H["Rejection Sampling<br/>+ SFT loop"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style E fill:#e74c3c,color:#fff,stroke:#333
style F fill:#27ae60,color:#fff,stroke:#333
style G fill:#27ae60,color:#fff,stroke:#333
style H fill:#27ae60,color:#fff,stroke:#333
Summary Table
| Method | Compute | Stability | Quality Ceiling | Best For |
|---|---|---|---|---|
| CoT SFT (Distillation) | Low | High | Bounded by teacher | Quick reasoning capability |
| GRPO (RL) | High | Medium | Can surpass teacher | Math, code with verifiable rewards |
| Rejection Sampling | Medium | High | Improves iteratively | Bootstrapping with limited compute |
| Hybrid (SFT → RL) | High | Medium-High | Highest | Production reasoning models |
| Dual-mode training | High | Medium | High | Models needing both fast + deep modes |
Practical Recommendations
For single consumer GPU (16–24 GB):
- Start with CoT SFT using LoRA on a reasoning dataset like Bespoke-Stratos-17k
- Use rejection sampling to iteratively improve on your target domain
- Quantize with bitsandbytes or GGUF for deployment
For multi-GPU cluster:
- Run GRPO with rule-based rewards on verifiable domains (math, code)
- Prefer the hybrid pipeline: SFT warm-start on distilled traces, then RL refinement
- Scale num_generations and max_completion_length as memory allows
Timeline of Reasoning Models
graph LR
A["Chain-of-Thought<br/>Prompting<br/>(Wei et al., 2022)"] --> B["STaR<br/>Self-Taught<br/>Reasoner<br/>(Zelikman, 2022)"]
B --> C["OpenAI o1<br/>(Sep 2024)"]
C --> D["DeepSeek-R1<br/>(Jan 2025)"]
D --> E["Open-R1 Project<br/>(Jan 2025)"]
D --> F["QwQ-32B<br/>(2025)"]
E --> G["SmolLM3<br/>(Jul 2025)"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#e67e22,color:#fff,stroke:#333
style G fill:#1abc9c,color:#fff,stroke:#333
Conclusion
Training LLMs for reasoning has become increasingly accessible. The key insight from DeepSeek-R1 is that pure RL with simple rewards can develop sophisticated reasoning — but in practice, combining distillation with RL produces the best results.
For most practitioners, the fastest path to a reasoning model is:
- Distill: SFT on reasoning traces from a strong teacher
- Refine: GRPO or rejection sampling on your target domain
- Deploy: Quantize and serve with dual-mode support
The field is evolving rapidly, with new training recipes and datasets appearing regularly. The Open-R1 project and TRL library remain the best starting points for building your own reasoning models.
For deploying your reasoning model, see Deploying and Serving LLM with Llama.cpp or Deploying and Serving LLM with vLLM.
References
- Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022. arXiv:2201.11903
- Zelikman et al., STaR: Bootstrapping Reasoning With Reasoning, 2022. arXiv:2203.14465
- Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO), 2024. arXiv:2402.03300
- DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025. arXiv:2501.12948
- Bakouch et al., SmolLM3: smol, multilingual, long-context reasoner, 2025. HuggingFace Blog
- Ahmadian et al., Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback (RLOO), 2024. arXiv:2402.14740
- von Werra et al., TRL: Transformer Reinforcement Learning. GitHub
Read More
- Explore Open-R1 for community-driven reasoning model recipes
- Try TRL’s GRPO trainer with custom reward functions
- Experiment with OpenThoughts3 for distillation on your domain
- Read the SmolLM3 blog post for a complete open dual-mode training recipe