Training LLMs for Reasoning

A practical guide to building reasoning capabilities in language models using RL, distillation, and chain-of-thought training

Published: January 20, 2025

Keywords: reasoning, chain-of-thought, GRPO, reinforcement learning, distillation, DeepSeek-R1, STaR, reward function, math reasoning, code reasoning, small models, TRL, transformers

Introduction

Standard LLMs answer immediately, without explicit intermediate reasoning — fast, but limited for complex problems. Reasoning models break this pattern by “thinking” step-by-step before answering, producing a chain of thought (CoT) that dramatically improves performance on math, code, and logic tasks.

The breakthrough came from OpenAI’s o1 and DeepSeek-R1, which showed that pure reinforcement learning can teach models to reason without any human-annotated reasoning traces. Even more remarkably, the reasoning patterns learned by large models can be distilled into much smaller ones.

This article covers the key methods for building reasoning LLMs: chain-of-thought SFT, reward-based RL (GRPO), reasoning distillation, and self-improvement loops. All examples target small models (0.5B–3B parameters) using TRL.

For alignment fundamentals, see Post-Training LLMs for Human Alignment. For fine-tuning basics, see Fine-tuning an LLM with Unsloth and Serving with Ollama. For inference, see Decoding Methods for Text Generation with LLMs.

The Reasoning Training Landscape

There are fundamentally two paths to build a reasoning model:

graph TD
    A["Base / Instruct Model"] --> B{{"Training Approach"}}
    B -->|"Path 1: Distillation"| C["SFT on reasoning traces<br/>from a stronger model"]
    B -->|"Path 2: RL from scratch"| D["GRPO / PPO with<br/>verifiable reward functions"]
    C --> E["Reasoning Model"]
    D --> E
    B -->|"Path 3: Hybrid"| F["SFT warm-start<br/>→ RL refinement"]
    F --> E

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333

| Path | Pros | Cons |
|---|---|---|
| Distillation | Simple, stable, fast | Bounded by teacher quality |
| RL from scratch | Can surpass teacher, emergent behaviors | Unstable, needs reward design |
| Hybrid | Best of both worlds | More complex pipeline |

DeepSeek-R1 demonstrated all three: R1-Zero used pure RL, R1 used a hybrid pipeline, and the R1-Distill family transferred R1’s reasoning to smaller models via SFT.

1. Chain-of-Thought SFT (Reasoning Distillation)

The simplest approach: train your model on (question, reasoning_trace, answer) examples generated by a stronger reasoning model. This is how DeepSeek released their R1-Distill models (1.5B to 70B) and how SmolLM3 acquired reasoning via mid-training.

graph LR
    A["Teacher Model<br/>(e.g. DeepSeek-R1,<br/>Qwen3-32B)"] -->|"Generate reasoning<br/>traces"| B["Reasoning Dataset<br/>&lt;think&gt;...&lt;/think&gt;<br/>&lt;answer&gt;...&lt;/answer&gt;"]
    B -->|"SFT"| C["Student Model<br/>(e.g. Qwen2.5-0.5B)"]
    C --> D["Small Reasoning<br/>Model"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333

Reasoning Trace Format

Models are trained to produce structured thinking before the final answer:

<think>
The problem asks for the sum of all even numbers from 1 to 100.
Even numbers: 2, 4, 6, ..., 100
This is an arithmetic series with first term a=2, last term l=100, common difference d=2.
Number of terms: n = (100-2)/2 + 1 = 50
Sum = n × (a + l) / 2 = 50 × (2 + 100) / 2 = 50 × 51 = 2550
</think>
<answer>2550</answer>
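Downstream code (reward functions, evaluation harnesses) needs to extract these spans reliably. A minimal parser for this format might look like the following (`parse_trace` is an illustrative helper, not a library function):

```python
import re

def parse_trace(text: str):
    """Split a completion into (reasoning, answer) using the tag format above.

    Returns (None, None) if either tag pair is missing.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if not (think and answer):
        return None, None
    return think.group(1).strip(), answer.group(1).strip()

reasoning, answer = parse_trace("<think>50 x 51 = 2550</think>\n<answer>2550</answer>")
```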

Key Datasets

| Dataset | Source | Size | Domain |
|---|---|---|---|
| OpenThoughts3-1.2M | Open-R1 project | 1.2M | Math, code, science |
| OpenMathReasoning | NVIDIA | 3.2M | Math |
| Bespoke-Stratos-17k | Bespoke Labs | 17k | General reasoning |
| R1-Distill data | DeepSeek | 800k | Math, code, STEM |

Code Example: Distillation SFT

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from peft import LoraConfig

# Load a reasoning trace dataset
dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-Reasoning-SFT",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    max_seq_length=4096,  # reasoning traces are long
    gradient_checkpointing=True,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
trainer.train()

SmolLM3’s Mid-Training Approach

HuggingFace’s SmolLM3 (3B) used a two-phase approach for reasoning:

  1. Reasoning mid-training: Trained for 4 epochs (~140B tokens) on OpenThoughts3-1.2M and NVIDIA’s Nemotron reasoning subset, using ChatML and wrapped packing
  2. Dual-mode SFT: Balanced 1B non-reasoning tokens + 0.8B reasoning tokens across 22 datasets, with synthetic data from Qwen3-32B to fill coverage gaps

This produced a model that supports /think and /no_think modes.

2. RL-Based Reasoning (GRPO)

The more powerful but complex approach: use reinforcement learning to teach the model to reason by rewarding correct final answers. DeepSeek-R1-Zero proved this works with pure RL — no human demonstrations at all.

graph TD
    subgraph Generation["Generation Phase"]
        direction TB
        A1["Policy Model"] --> A2["Generate G completions<br/>per math/code prompt"]
    end

    subgraph Reward["Reward Phase"]
        direction TB
        B1["Accuracy Reward<br/>(is answer correct?)"] 
        B2["Format Reward<br/>(&lt;think&gt;...&lt;/think&gt;<br/>&lt;answer&gt;...&lt;/answer&gt;)"]
    end

    subgraph Update["Update Phase"]
        direction TB
        C1["Group-Relative<br/>Advantage<br/>A = (r - mean)/std"]
        C2["Policy Gradient<br/>with clipping"]
    end

    Generation --> Reward
    Reward --> Update
    Update -->|"repeat"| Generation

    style A1 fill:#4a90d9,color:#fff,stroke:#333
    style A2 fill:#4a90d9,color:#fff,stroke:#333
    style B1 fill:#f5a623,color:#fff,stroke:#333
    style B2 fill:#f5a623,color:#fff,stroke:#333
    style C1 fill:#e74c3c,color:#fff,stroke:#333
    style C2 fill:#e74c3c,color:#fff,stroke:#333

Why GRPO for Reasoning?

GRPO (Group Relative Policy Optimization) is the algorithm of choice for reasoning tasks because:

  1. No value model: Unlike PPO, GRPO uses group-relative normalization instead of a learned value function, saving memory
  2. Works with rule-based rewards: Math and code have verifiable answers — no neural reward model needed
  3. Self-play: The model generates multiple attempts, learns from its own successes and failures
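The group-relative normalization from the diagram can be sketched in a few lines (a simplified illustration of what GRPO computes internally; TRL handles this for you during training):

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group of G completions for one prompt.

    A_i = (r_i - mean(r)) / (std(r) + eps), as in GRPO.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 8 attempts at one prompt: 3 correct (reward 1.0), 5 wrong (0.0).
# Correct completions get a positive advantage, wrong ones negative.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
```

Because advantages are relative within the group, a prompt where every attempt fails (or every attempt succeeds) contributes no learning signal — which is why dataset difficulty matters for GRPO.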

Reward Functions for Reasoning

The key insight from DeepSeek-R1: reasoning rewards decompose into two simple signals:

import re

def accuracy_reward(completions, ground_truth, **kwargs):
    """Check if the final answer matches ground truth."""
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # Extract answer from <answer> tags or \boxed{}
        match = re.search(r"<answer>(.*?)</answer>", completion)
        if not match:
            match = re.search(r"\\boxed\{(.*?)\}", completion)
        extracted = match.group(1).strip() if match else ""
        rewards.append(1.0 if extracted == gt.strip() else 0.0)
    return rewards

def format_reward(completions, **kwargs):
    """Reward structured thinking format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
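Reward functions are easy to get subtly wrong (regex escapes, stray whitespace), so it pays to sanity-check them on hand-written completions before launching a run. A self-contained check, restating the two functions above:

```python
import re

def accuracy_reward(completions, ground_truth, **kwargs):
    """Check if the final answer matches ground truth."""
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        match = re.search(r"<answer>(.*?)</answer>", completion)
        if not match:
            match = re.search(r"\\boxed\{(.*?)\}", completion)
        extracted = match.group(1).strip() if match else ""
        rewards.append(1.0 if extracted == gt.strip() else 0.0)
    return rewards

def format_reward(completions, **kwargs):
    """Reward structured thinking format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

good = "<think>50 * 51 = 2550</think>\n<answer>2550</answer>"
bad = "The answer is 2550."

assert accuracy_reward([good, bad], ["2550", "2550"]) == [1.0, 0.0]
assert format_reward([good, bad]) == [0.5, 0.0]
```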

Code Example: GRPO for Math Reasoning

from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

training_args = GRPOConfig(
    output_dir="Qwen2.5-0.5B-GRPO-Math",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=1e-6,
    num_generations=8,           # generate 8 attempts per prompt
    max_completion_length=2048,  # allow long reasoning chains
    temperature=0.7,
    reward_weights=[1.0, 0.5],   # accuracy dominates; format is a shaping bonus
    gradient_checkpointing=True,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[accuracy_reward, format_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Emergent Behaviors from RL

DeepSeek-R1-Zero demonstrated remarkable emergent capabilities from pure RL:

graph LR
    A["Pure RL Training<br/>(no demonstrations)"] --> B["Self-verification<br/>'Let me check...'"]
    A --> C["Backtracking<br/>'Wait, that's wrong...'"]
    A --> D["Exploration<br/>'Another approach...'"]
    A --> E["Extended thinking<br/>(longer CoT = better)"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333

These behaviors were not trained explicitly — the model discovered them because they led to higher accuracy rewards.
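A rough way to observe this in your own runs is to count how often sampled chains of thought contain reflection markers. The phrase list below is an illustrative assumption, not taken from the R1 paper:

```python
import re

# Illustrative markers of self-correction; the actual phrases vary by model.
REFLECTION_PHRASES = ["wait", "let me check", "let me reconsider", "another approach"]

def reflection_rate(completions):
    """Fraction of completions whose <think> block contains a reflection phrase."""
    hits = 0
    for c in completions:
        m = re.search(r"<think>(.*?)</think>", c, re.DOTALL)
        thinking = (m.group(1) if m else c).lower()
        if any(p in thinking for p in REFLECTION_PHRASES):
            hits += 1
    return hits / len(completions) if completions else 0.0
```

Tracking this rate over training steps is one cheap proxy for whether self-verification behavior is emerging.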

3. The DeepSeek-R1 Recipe

DeepSeek-R1 is the most comprehensive open recipe for training a reasoning model. Here’s the full pipeline:

graph TD
    A["DeepSeek-V3 Base<br/>(671B MoE)"] --> B["R1-Zero<br/>Pure GRPO on<br/>math & code"]
    B --> C["Cold Start SFT<br/>Small curated set<br/>for readability"]
    C --> D["RL Phase 2<br/>Expanded rewards<br/>(helpfulness, safety)"]
    D --> E["Rejection Sampling<br/>Filter best outputs"]
    E --> F["Final SFT<br/>Polish on filtered data"]
    F --> G["DeepSeek-R1"]

    G -->|"Distill reasoning<br/>traces"| H["R1-Distill-Qwen-1.5B"]
    G -->|"Distill"| I["R1-Distill-Qwen-7B"]
    G -->|"Distill"| J["R1-Distill-Llama-8B"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333
    style F fill:#1abc9c,color:#fff,stroke:#333
    style G fill:#27ae60,color:#fff,stroke:#333
    style H fill:#3498db,color:#fff,stroke:#333
    style I fill:#3498db,color:#fff,stroke:#333
    style J fill:#3498db,color:#fff,stroke:#333

Key Stages Explained

| Stage | Purpose | Key Detail |
|---|---|---|
| R1-Zero | Prove RL alone can develop reasoning | GRPO with accuracy + format rewards only |
| Cold Start SFT | Fix readability issues from pure RL | Small set of human-curated examples |
| RL Phase 2 | Expand beyond math/code | Add helpfulness and safety rewards |
| Rejection Sampling | Create high-quality SFT data | Use reward model to filter model outputs |
| Final SFT | Polish and stabilize | Train on best-of-N filtered data |
| Distillation | Transfer to small models | SFT smaller models on R1’s reasoning traces |

R1-Zero’s “Aha Moment”

One of the most fascinating findings: during training, R1-Zero spontaneously learned to re-evaluate its approach mid-reasoning:

<think>
Wait, let me reconsider. The approach I was taking doesn't account for
the boundary condition. Let me try a different method...

Hmm, actually I think I need to use dynamic programming here instead
of the greedy approach. Let me restart...
</think>

This self-correction emerged purely from the RL reward signal — no human ever demonstrated this behavior.

4. Self-Improvement and Rejection Sampling

An intermediate approach between pure SFT and RL: generate many completions, keep only the correct ones, and retrain. This is related to the STaR (Self-Taught Reasoner) paradigm.

graph TD
    A["Current Model"] --> B["Generate N responses<br/>per prompt<br/>(e.g., N=64)"]
    B --> C["Verify answers<br/>(math: check result,<br/>code: run tests)"]
    C --> D["Keep correct<br/>responses only"]
    D --> E["SFT on filtered<br/>dataset"]
    E --> F["Improved Model"]
    F -->|"Iterate"| A

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#1abc9c,color:#fff,stroke:#333

Code Example: Rejection Sampling Pipeline

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import re

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def generate_and_filter(prompt, ground_truth, n_samples=16):
    """Generate multiple responses and keep correct ones."""
    correct_responses = []
    for _ in range(n_samples):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs, max_new_tokens=2048,
            temperature=0.7, do_sample=True,
        )
        response = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],  # keep only the completion, not the prompt
            skip_special_tokens=True,
        )

        # Check if answer is correct
        match = re.search(r"\\boxed\{(.*?)\}", response)
        if match and match.group(1).strip() == ground_truth.strip():
            correct_responses.append(response)

    return correct_responses

# Build filtered dataset from your math problems
filtered_data = []
for example in math_dataset:
    correct = generate_and_filter(example["prompt"], example["answer"])
    if correct:
        filtered_data.append({
            "prompt": example["prompt"],
            "completion": correct[0],  # take first correct response
        })

# Fine-tune on the filtered data
filtered_dataset = Dataset.from_list(filtered_data)

Comparison: Rejection Sampling vs RL

| Aspect | Rejection Sampling | GRPO |
|---|---|---|
| Learning signal | Binary (correct/incorrect) | Continuous reward + relative ranking |
| Computational cost | Generate once, train once | Generate and update every step |
| Exploration | Limited to current policy | Active exploration via temperature |
| Scaling | Improves with more samples | Improves with more training steps |
| Stability | Very stable (just SFT) | Can be unstable, needs tuning |
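The “improves with more samples” point follows from simple probability: if the model solves a prompt with probability p per attempt, then N independent samples yield at least one correct completion with probability 1 − (1 − p)^N. A quick sketch:

```python
def hit_rate(p, n):
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

# For a problem the model solves 10% of the time per attempt,
# more samples sharply raise the chance of finding a usable trace:
rates = {n: hit_rate(0.10, n) for n in (1, 16, 64)}
```

This is why rejection sampling pipelines often use large N (e.g. 64) on hard prompts: even a weak model eventually produces correct traces to train on.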

5. Reward Design for Reasoning

The quality of reward functions is critical for RL-based reasoning training. Here are the main types:

graph TD
    A{{"Reward Type"}} --> B["Outcome-Based<br/>(ORM)"]
    A --> C["Process-Based<br/>(PRM)"]
    A --> D["Rule-Based<br/>(Verifiers)"]

    B --> B1["Score final answer<br/>correct = 1, wrong = 0"]
    C --> C1["Score each<br/>reasoning step"]
    D --> D1["Run code tests,<br/>check math equality"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#4a90d9,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style B1 fill:#4a90d9,color:#fff,stroke:#333
    style C1 fill:#f5a623,color:#fff,stroke:#333
    style D1 fill:#27ae60,color:#fff,stroke:#333

Types of Rewards

Outcome-Based Rewards (ORM): Score only the final answer. Simple but can reward lucky guesses with wrong reasoning.

def outcome_reward(completions, ground_truth, **kwargs):
    """Binary reward: is the final answer correct?"""
    rewards = []
    for c, gt in zip(completions, ground_truth):
        match = re.search(r"<answer>(.*?)</answer>", c)
        answer = match.group(1).strip() if match else ""
        rewards.append(1.0 if answer == gt else 0.0)
    return rewards

Process-Based Rewards (PRM): Score intermediate reasoning steps. More informative but requires step-level annotations.
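The shape of a process reward is easy to sketch even without a trained scorer: split the trace into steps and aggregate per-step scores. In the sketch below, `score_step` is a placeholder for what would in practice be a trained PRM returning the probability that a step is valid:

```python
def score_step(step: str) -> float:
    """Placeholder step scorer; a real PRM is a trained model."""
    return 0.0 if "error" in step.lower() else 1.0

def process_reward(reasoning: str) -> float:
    """Average step-level scores over newline-separated reasoning steps."""
    steps = [s for s in reasoning.split("\n") if s.strip()]
    if not steps:
        return 0.0
    return sum(score_step(s) for s in steps) / len(steps)
```

Other aggregations (minimum step score, product of step probabilities) are also common, since a single invalid step can invalidate the whole chain.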

Rule-Based Verifiers: For code, just run the tests. For math, use symbolic solvers. This is the most reliable approach.

def code_execution_reward(completions, test_cases, **kwargs):
    """Execute generated code against test cases."""
    rewards = []
    for completion, tests in zip(completions, test_cases):
        # Extract code block
        match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
        if not match or not tests:
            rewards.append(0.0)
            continue
        code = match.group(1)
        passed = 0
        for test in tests:
            try:
                # NOTE: in-process exec is NOT a sandbox. In practice, run
                # generated code in an isolated subprocess with a timeout.
                exec(code + "\n" + test, {})
                passed += 1
            except Exception:
                pass
        rewards.append(passed / len(tests))
    return rewards

Multi-Reward Composition in GRPO

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[accuracy_reward, format_reward, length_penalty],
    args=GRPOConfig(
        num_generations=8,
        reward_weights=[1.0, 0.3, 0.1],  # accuracy dominates
        # ... other arguments as in the earlier GRPOConfig
    ),
    train_dataset=dataset,
)

6. Building a Dual-Mode Model (Think / No-Think)

Modern reasoning models support two modes — direct answering when reasoning isn’t needed, and extended thinking for complex problems. SmolLM3 demonstrated a complete open recipe for this.

graph TD
    A["User Query"] --> B{{"Mode?"}}
    B -->|"/think"| C["Generate:<br/>&lt;think&gt;step-by-step reasoning&lt;/think&gt;<br/>Final answer"]
    B -->|"/no_think"| D["Generate:<br/>&lt;think&gt;&lt;/think&gt;<br/>Direct answer"]

    subgraph Training["Training Pipeline"]
        direction TB
        E["Mid-training<br/>(35B reasoning tokens)"] --> F["Dual-mode SFT<br/>(1B no-think + 0.8B think)"]
        F --> G["Preference Alignment<br/>(APO/DPO)"]
    end

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#f5a623,color:#fff,stroke:#333
    style E fill:#9b59b6,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333
    style G fill:#1abc9c,color:#fff,stroke:#333

Training a Dual-Mode Reasoner

The SmolLM3 recipe:

  1. Reasoning mid-training (general purpose): 4 epochs on reasoning traces from OpenThoughts3 + Nemotron
  2. Dual-mode SFT: Carefully balanced data with both thinking and non-thinking examples, synthetic data from Qwen3-32B to fill gaps
  3. APO alignment: Anchored Preference Optimization with chosen/rejected pairs for both modes
  4. Model merging: Combine APO checkpoint with mid-training checkpoint to recover long-context performance
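Step 4 is, at its simplest, a weighted average of parameters from two checkpoints. A toy sketch over plain lists of weights (a real merge operates on model state dicts, and the 0.9/0.1 split here is illustrative, not SmolLM3's exact recipe):

```python
def merge_state_dicts(sd_a, sd_b, alpha=0.9):
    """Linear merge: alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter."""
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share an architecture"
    merged = {}
    for name in sd_a:
        merged[name] = [alpha * a + (1.0 - alpha) * b
                        for a, b in zip(sd_a[name], sd_b[name])]
    return merged

# e.g. blend an aligned checkpoint with a mid-training checkpoint
merged = merge_state_dicts({"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}, alpha=0.9)
```

Linear merging only makes sense between checkpoints fine-tuned from the same base model, where the weights remain close enough to interpolate.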

Inference with Thinking Mode

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# Reasoning mode
messages_think = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "What is 17 × 23?"}
]

# Direct mode
messages_direct = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "What is the capital of France?"}
]

text = tokenizer.apply_chat_template(messages_think, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Method Comparison

Decision Flowchart

graph TD
    A{{"What resources<br/>do you have?"}}
    A -->|"Limited GPU,<br/>quick results"| B{{"Have a teacher<br/>model?"}}
    B -->|"Yes"| C["Reasoning Distillation<br/>(CoT SFT)"]
    B -->|"No"| D["Use existing datasets<br/>(OpenThoughts, etc.)"]
    A -->|"Multi-GPU,<br/>longer training"| E{{"Verifiable<br/>tasks?"}}
    E -->|"Yes (math, code)"| F["GRPO with<br/>rule-based rewards"]
    E -->|"No (general)"| G["Hybrid: SFT warm-start<br/>+ RL refinement"]
    A -->|"Moderate GPU,<br/>iterative"| H["Rejection Sampling<br/>+ SFT loop"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#e74c3c,color:#fff,stroke:#333
    style F fill:#27ae60,color:#fff,stroke:#333
    style G fill:#27ae60,color:#fff,stroke:#333
    style H fill:#27ae60,color:#fff,stroke:#333

Summary Table

| Method | Compute | Stability | Quality Ceiling | Best For |
|---|---|---|---|---|
| CoT SFT (Distillation) | Low | High | Bounded by teacher | Quick reasoning capability |
| GRPO (RL) | High | Medium | Can surpass teacher | Math, code with verifiable rewards |
| Rejection Sampling | Medium | High | Improves iteratively | Bootstrapping with limited compute |
| Hybrid (SFT → RL) | High | Medium-High | Highest | Production reasoning models |
| Dual-mode training | High | Medium | High | Models needing both fast + deep modes |

Practical Recommendations

For single consumer GPU (16–24 GB):

  1. Start with CoT SFT using LoRA on a reasoning dataset like Bespoke-Stratos-17k
  2. Use rejection sampling to iteratively improve on your target domain
  3. Quantize with bitsandbytes or GGUF for deployment

For multi-GPU cluster:

  1. SFT warm-start on reasoning traces
  2. GRPO with accuracy + format rewards
  3. Rejection sampling to create a polished final SFT dataset
  4. Serve with vLLM or Ollama

Timeline of Reasoning Models

graph LR
    A["Chain-of-Thought<br/>Prompting<br/>(Wei et al., 2022)"] --> B["STaR<br/>Self-Taught<br/>Reasoner<br/>(Zelikman, 2022)"]
    B --> C["OpenAI o1<br/>(Sep 2024)"]
    C --> D["DeepSeek-R1<br/>(Jan 2025)"]
    D --> E["Open-R1 Project<br/>(Jan 2025)"]
    D --> F["QwQ-32B<br/>(2025)"]
    E --> G["SmolLM3<br/>(Jul 2025)"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333
    style G fill:#1abc9c,color:#fff,stroke:#333

Conclusion

Training LLMs for reasoning has become increasingly accessible. The key insight from DeepSeek-R1 is that pure RL with simple rewards can develop sophisticated reasoning — but in practice, combining distillation with RL produces the best results.

For most practitioners, the fastest path to a reasoning model is:

  1. Distill: SFT on reasoning traces from a strong teacher
  2. Refine: GRPO or rejection sampling on your target domain
  3. Deploy: Quantize and serve with dual-mode support

The field is evolving rapidly, with new training recipes and datasets appearing regularly. The Open-R1 project and TRL library remain the best starting points for building your own reasoning models.

For deploying your reasoning model, see Deploying and Serving LLM with Llama.cpp or Deploying and Serving LLM with vLLM.

References

  • Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022. arXiv:2201.11903
  • Zelikman et al., STaR: Bootstrapping Reasoning With Reasoning, 2022. arXiv:2203.14465
  • Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO), 2024. arXiv:2402.03300
  • DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025. arXiv:2501.12948
  • Bakouch et al., SmolLM3: smol, multilingual, long-context reasoner, 2025. HuggingFace Blog
  • Ahmadian et al., Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback (RLOO), 2024. arXiv:2402.14740
  • von Werra et al., TRL: Transformer Reinforcement Learning. GitHub
