Training LLMs for Reasoning

A practical guide to building reasoning capabilities in language models using RL, distillation, and chain-of-thought training

Published: January 20, 2025

Keywords: reasoning, chain-of-thought, GRPO, reinforcement learning, distillation, DeepSeek-R1, STaR, reward function, math reasoning, code reasoning, small models, TRL, transformers

Introduction

Standard LLMs answer immediately, without explicit intermediate reasoning — fast, but limited for complex problems. Reasoning models break this pattern by “thinking” step-by-step before answering, producing a chain of thought (CoT) that dramatically improves performance on math, code, and logic tasks.

The breakthrough came from OpenAI’s o1 and DeepSeek-R1, which showed that pure reinforcement learning can teach models to reason without any human-annotated reasoning traces. Even more remarkably, the reasoning patterns learned by large models can be distilled into much smaller ones.

This article covers the key methods for building reasoning LLMs: chain-of-thought SFT, reward-based RL (GRPO), reasoning distillation, and self-improvement loops. All examples target small models (0.5B–3B parameters) using TRL.

For alignment fundamentals, see Post-Training LLMs for Human Alignment. For fine-tuning basics, see Fine-tuning an LLM with Unsloth and Serving with Ollama. For inference, see Decoding Methods for Text Generation with LLMs.

The Reasoning Training Landscape

There are fundamentally two paths to build a reasoning model:

graph TD
    A["Base / Instruct Model"] --> B{{"Training Approach"}}
    B -->|"Path 1: Distillation"| C["SFT on reasoning traces<br/>from a stronger model"]
    B -->|"Path 2: RL from scratch"| D["GRPO / PPO with<br/>verifiable reward functions"]
    C --> E["Reasoning Model"]
    D --> E
    B -->|"Path 3: Hybrid"| F["SFT warm-start<br/>→ RL refinement"]
    F --> E

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333

| Path | Pros | Cons |
|---|---|---|
| Distillation | Simple, stable, fast | Bounded by teacher quality |
| RL from scratch | Can surpass teacher, emergent behaviors | Unstable, needs reward design |
| Hybrid | Best of both worlds | More complex pipeline |

DeepSeek-R1 demonstrated all three: R1-Zero used pure RL, R1 used a hybrid pipeline, and the R1-Distill family transferred R1’s reasoning to smaller models via SFT.

1. Chain-of-Thought SFT (Reasoning Distillation)

The simplest approach: train your model on (question, reasoning_trace, answer) examples generated by a stronger reasoning model. This is how DeepSeek released their R1-Distill models (1.5B to 70B) and how SmolLM3 acquired reasoning via mid-training.

graph LR
    A["Teacher Model<br/>(e.g. DeepSeek-R1,<br/>Qwen3-32B)"] -->|"Generate reasoning<br/>traces"| B["Reasoning Dataset<br/>&lt;think&gt;...&lt;/think&gt;<br/>&lt;answer&gt;...&lt;/answer&gt;"]
    B -->|"SFT"| C["Student Model<br/>(e.g. Qwen2.5-0.5B)"]
    C --> D["Small Reasoning<br/>Model"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333

Reasoning Trace Format

Models are trained to produce structured thinking before the final answer:

<think>
The problem asks for the sum of all even numbers from 1 to 100.
Even numbers: 2, 4, 6, ..., 100
This is an arithmetic series with first term a=2, last term l=100, common difference d=2.
Number of terms: n = (100-2)/2 + 1 = 50
Sum = n × (a + l) / 2 = 50 × (2 + 100) / 2 = 50 × 51 = 2550
</think>
<answer>2550</answer>
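Downstream code (reward functions, evaluation harnesses) needs to extract these spans reliably. A minimal parser for this format might look like the following (`parse_trace` is an illustrative helper, not a library function):

```python
import re

def parse_trace(text: str):
    """Split a completion into (reasoning, answer) using the tag format above.

    Returns (None, None) if either tag pair is missing.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if not (think and answer):
        return None, None
    return think.group(1).strip(), answer.group(1).strip()

reasoning, answer = parse_trace("<think>50 x 51 = 2550</think>\n<answer>2550</answer>")
```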

Key Datasets

| Dataset | Source | Size | Domain |
|---|---|---|---|
| OpenThoughts3-1.2M | Open-R1 project | 1.2M | Math, code, science |
| OpenMathReasoning | NVIDIA | 3.2M | Math |
| Bespoke-Stratos-17k | Bespoke Labs | 17k | General reasoning |
| R1-Distill data | DeepSeek | 800k | Math, code, STEM |

Code Example: Distillation SFT

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from peft import LoraConfig

# Load a reasoning trace dataset
dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-Reasoning-SFT",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    max_seq_length=4096,  # reasoning traces are long
    gradient_checkpointing=True,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
trainer.train()

SmolLM3’s Mid-Training Approach

HuggingFace’s SmolLM3 (3B) used a two-phase approach for reasoning:

  1. Reasoning mid-training: Trained for 4 epochs (~140B tokens) on OpenThoughts3-1.2M and NVIDIA’s Nemotron reasoning subset, using ChatML and wrapped packing
  2. Dual-mode SFT: Balanced 1B non-reasoning tokens + 0.8B reasoning tokens across 22 datasets, with synthetic data from Qwen3-32B to fill coverage gaps

This produced a model that supports /think and /no_think modes.

2. RL-Based Reasoning (GRPO)

The more powerful but complex approach: use reinforcement learning to teach the model to reason by rewarding correct final answers. DeepSeek-R1-Zero proved this works with pure RL — no human demonstrations at all.

graph TD
    subgraph Generation["Generation Phase"]
        direction TB
        A1["Policy Model"] --> A2["Generate G completions<br/>per math/code prompt"]
    end

    subgraph Reward["Reward Phase"]
        direction TB
        B1["Accuracy Reward<br/>(is answer correct?)"] 
        B2["Format Reward<br/>(&lt;think&gt;...&lt;/think&gt;<br/>&lt;answer&gt;...&lt;/answer&gt;)"]
    end

    subgraph Update["Update Phase"]
        direction TB
        C1["Group-Relative<br/>Advantage<br/>A = (r - mean)/std"]
        C2["Policy Gradient<br/>with clipping"]
    end

    Generation --> Reward
    Reward --> Update
    Update -->|"repeat"| Generation

    style A1 fill:#4a90d9,color:#fff,stroke:#333
    style A2 fill:#4a90d9,color:#fff,stroke:#333
    style B1 fill:#f5a623,color:#fff,stroke:#333
    style B2 fill:#f5a623,color:#fff,stroke:#333
    style C1 fill:#e74c3c,color:#fff,stroke:#333
    style C2 fill:#e74c3c,color:#fff,stroke:#333

Why GRPO for Reasoning?

GRPO (Group Relative Policy Optimization) is the algorithm of choice for reasoning tasks because:

  1. No value model: Unlike PPO, GRPO uses group-relative normalization instead of a learned value function, saving memory
  2. Works with rule-based rewards: Math and code have verifiable answers — no neural reward model needed
  3. Self-play: The model generates multiple attempts, learns from its own successes and failures
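The group-relative normalization from the diagram can be sketched in a few lines (a simplified illustration of what GRPO computes internally; TRL handles this for you during training):

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group of G completions for one prompt.

    A_i = (r_i - mean(r)) / (std(r) + eps), as in GRPO.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 8 attempts at one prompt: 3 correct (reward 1.0), 5 wrong (0.0).
# Correct completions get a positive advantage, wrong ones negative.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
```

Because advantages are relative within the group, a prompt where every attempt fails (or every attempt succeeds) contributes no learning signal — which is why dataset difficulty matters for GRPO.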

Reward Functions for Reasoning

The key insight from DeepSeek-R1: reasoning rewards decompose into two simple signals:

import re

def accuracy_reward(completions, ground_truth, **kwargs):
    """Check if the final answer matches ground truth."""
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        # Extract answer from <answer> tags or \boxed{}
        match = re.search(r"<answer>(.*?)</answer>", completion)
        if not match:
            match = re.search(r"\\boxed\{(.*?)\}", completion)
        extracted = match.group(1).strip() if match else ""
        rewards.append(1.0 if extracted == gt.strip() else 0.0)
    return rewards

def format_reward(completions, **kwargs):
    """Reward structured thinking format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
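Reward functions are easy to get subtly wrong (regex escapes, stray whitespace), so it pays to sanity-check them on hand-written completions before launching a run. A self-contained check, restating the two functions above:

```python
import re

def accuracy_reward(completions, ground_truth, **kwargs):
    """Check if the final answer matches ground truth."""
    rewards = []
    for completion, gt in zip(completions, ground_truth):
        match = re.search(r"<answer>(.*?)</answer>", completion)
        if not match:
            match = re.search(r"\\boxed\{(.*?)\}", completion)
        extracted = match.group(1).strip() if match else ""
        rewards.append(1.0 if extracted == gt.strip() else 0.0)
    return rewards

def format_reward(completions, **kwargs):
    """Reward structured thinking format."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

good = "<think>50 * 51 = 2550</think>\n<answer>2550</answer>"
bad = "The answer is 2550."

assert accuracy_reward([good, bad], ["2550", "2550"]) == [1.0, 0.0]
assert format_reward([good, bad]) == [0.5, 0.0]
```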

Code Example: GRPO for Math Reasoning

from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

training_args = GRPOConfig(
    output_dir="Qwen2.5-0.5B-GRPO-Math",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=1e-6,
    num_generations=8,           # generate 8 attempts per prompt
    max_completion_length=2048,  # allow long reasoning chains
    temperature=0.7,
    reward_weights=[1.0, 0.5],   # accuracy dominates; format is a shaping bonus
    gradient_checkpointing=True,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[accuracy_reward, format_reward],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

Emergent Behaviors from RL

DeepSeek-R1-Zero demonstrated remarkable emergent capabilities from pure RL:

graph LR
    A["Pure RL Training<br/>(no demonstrations)"] --> B["Self-verification<br/>'Let me check...'"]
    A --> C["Backtracking<br/>'Wait, that's wrong...'"]
    A --> D["Exploration<br/>'Another approach...'"]
    A --> E["Extended thinking<br/>(longer CoT = better)"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333

These behaviors were not trained explicitly — the model discovered them because they led to higher accuracy rewards.
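A rough way to observe this in your own runs is to count how often sampled chains of thought contain reflection markers. The phrase list below is an illustrative assumption, not taken from the R1 paper:

```python
import re

# Illustrative markers of self-correction; the actual phrases vary by model.
REFLECTION_PHRASES = ["wait", "let me check", "let me reconsider", "another approach"]

def reflection_rate(completions):
    """Fraction of completions whose <think> block contains a reflection phrase."""
    hits = 0
    for c in completions:
        m = re.search(r"<think>(.*?)</think>", c, re.DOTALL)
        thinking = (m.group(1) if m else c).lower()
        if any(p in thinking for p in REFLECTION_PHRASES):
            hits += 1
    return hits / len(completions) if completions else 0.0
```

Tracking this rate over training steps is one cheap proxy for whether self-verification behavior is emerging.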

3. The DeepSeek-R1 Recipe

DeepSeek-R1 is the most comprehensive open recipe for training a reasoning model. Here’s the full pipeline:

graph TD
    A["DeepSeek-V3 Base<br/>(671B MoE)"] --> B["R1-Zero<br/>Pure GRPO on<br/>math & code"]
    B --> C["Cold Start SFT<br/>Small curated set<br/>for readability"]
    C --> D["RL Phase 2<br/>Expanded rewards<br/>(helpfulness, safety)"]
    D --> E["Rejection Sampling<br/>Filter best outputs"]
    E --> F["Final SFT<br/>Polish on filtered data"]
    F --> G["DeepSeek-R1"]

    G -->|"Distill reasoning<br/>traces"| H["R1-Distill-Qwen-1.5B"]
    G -->|"Distill"| I["R1-Distill-Qwen-7B"]
    G -->|"Distill"| J["R1-Distill-Llama-8B"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333
    style F fill:#1abc9c,color:#fff,stroke:#333
    style G fill:#27ae60,color:#fff,stroke:#333
    style H fill:#3498db,color:#fff,stroke:#333
    style I fill:#3498db,color:#fff,stroke:#333
    style J fill:#3498db,color:#fff,stroke:#333

Key Stages Explained

| Stage | Purpose | Key Detail |
|---|---|---|
| R1-Zero | Prove RL alone can develop reasoning | GRPO with accuracy + format rewards only |
| Cold Start SFT | Fix readability issues from pure RL | Small set of human-curated examples |
| RL Phase 2 | Expand beyond math/code | Add helpfulness and safety rewards |
| Rejection Sampling | Create high-quality SFT data | Use reward model to filter model outputs |
| Final SFT | Polish and stabilize | Train on best-of-N filtered data |
| Distillation | Transfer to small models | SFT smaller models on R1’s reasoning traces |

R1-Zero’s “Aha Moment”

One of the most fascinating findings: during training, R1-Zero spontaneously learned to re-evaluate its approach mid-reasoning:

<think>
Wait, let me reconsider. The approach I was taking doesn't account for
the boundary condition. Let me try a different method...

Hmm, actually I think I need to use dynamic programming here instead
of the greedy approach. Let me restart...
</think>

This self-correction emerged purely from the RL reward signal — no human ever demonstrated this behavior.

4. Self-Improvement and Rejection Sampling

An intermediate approach between pure SFT and RL: generate many completions, keep only the correct ones, and retrain. This is related to the STaR (Self-Taught Reasoner) paradigm.

graph TD
    A["Current Model"] --> B["Generate N responses<br/>per prompt<br/>(e.g., N=64)"]
    B --> C["Verify answers<br/>(math: check result,<br/>code: run tests)"]
    C --> D["Keep correct<br/>responses only"]
    D --> E["SFT on filtered<br/>dataset"]
    E --> F["Improved Model"]
    F -->|"Iterate"| A

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#1abc9c,color:#fff,stroke:#333

Code Example: Rejection Sampling Pipeline

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import Dataset
import re

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def generate_and_filter(prompt, ground_truth, n_samples=16):
    """Generate multiple responses and keep correct ones."""
    correct_responses = []
    for _ in range(n_samples):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs, max_new_tokens=2048,
            temperature=0.7, do_sample=True,
        )
        response = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],  # keep only the completion, not the prompt
            skip_special_tokens=True,
        )

        # Check if answer is correct
        match = re.search(r"\\boxed\{(.*?)\}", response)
        if match and match.group(1).strip() == ground_truth.strip():
            correct_responses.append(response)

    return correct_responses

# Build filtered dataset from your math problems
filtered_data = []
for example in math_dataset:
    correct = generate_and_filter(example["prompt"], example["answer"])
    if correct:
        filtered_data.append({
            "prompt": example["prompt"],
            "completion": correct[0],  # take first correct response
        })

# Fine-tune on the filtered data
filtered_dataset = Dataset.from_list(filtered_data)

Comparison: Rejection Sampling vs RL

| Aspect | Rejection Sampling | GRPO |
|---|---|---|
| Learning signal | Binary (correct/incorrect) | Continuous reward + relative ranking |
| Computational cost | Generate once, train once | Generate and update every step |
| Exploration | Limited to current policy | Active exploration via temperature |
| Scaling | Improves with more samples | Improves with more training steps |
| Stability | Very stable (just SFT) | Can be unstable, needs tuning |
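The “improves with more samples” point follows from simple probability: if the model solves a prompt with probability p per attempt, then N independent samples yield at least one correct completion with probability 1 − (1 − p)^N. A quick sketch:

```python
def hit_rate(p, n):
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

# For a problem the model solves 10% of the time per attempt,
# more samples sharply raise the chance of finding a usable trace:
rates = {n: hit_rate(0.10, n) for n in (1, 16, 64)}
```

This is why rejection sampling pipelines often use large N (e.g. 64) on hard prompts: even a weak model eventually produces correct traces to train on.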

5. Reward Design for Reasoning

The quality of reward functions is critical for RL-based reasoning training. Here are the main types:

graph TD
    A{{"Reward Type"}} --> B["Outcome-Based<br/>(ORM)"]
    A --> C["Process-Based<br/>(PRM)"]
    A --> D["Rule-Based<br/>(Verifiers)"]

    B --> B1["Score final answer<br/>correct = 1, wrong = 0"]
    C --> C1["Score each<br/>reasoning step"]
    D --> D1["Run code tests,<br/>check math equality"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#4a90d9,color:#fff,stroke:#333
    style C fill:#f5a623,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style B1 fill:#4a90d9,color:#fff,stroke:#333
    style C1 fill:#f5a623,color:#fff,stroke:#333
    style D1 fill:#27ae60,color:#fff,stroke:#333

Types of Rewards

Outcome-Based Rewards (ORM): Score only the final answer. Simple but can reward lucky guesses with wrong reasoning.

def outcome_reward(completions, ground_truth, **kwargs):
    """Binary reward: is the final answer correct?"""
    rewards = []
    for c, gt in zip(completions, ground_truth):
        match = re.search(r"<answer>(.*?)</answer>", c)
        answer = match.group(1).strip() if match else ""
        rewards.append(1.0 if answer == gt else 0.0)
    return rewards

Process-Based Rewards (PRM): Score intermediate reasoning steps. More informative but requires step-level annotations.
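The shape of a process reward is easy to sketch even without a trained scorer: split the trace into steps and aggregate per-step scores. In the sketch below, `score_step` is a placeholder for what would in practice be a trained PRM returning the probability that a step is valid:

```python
def score_step(step: str) -> float:
    """Placeholder step scorer; a real PRM is a trained model."""
    return 0.0 if "error" in step.lower() else 1.0

def process_reward(reasoning: str) -> float:
    """Average step-level scores over newline-separated reasoning steps."""
    steps = [s for s in reasoning.split("\n") if s.strip()]
    if not steps:
        return 0.0
    return sum(score_step(s) for s in steps) / len(steps)
```

Other aggregations (minimum step score, product of step probabilities) are also common, since a single invalid step can invalidate the whole chain.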

Rule-Based Verifiers: For code, just run the tests. For math, use symbolic solvers. This is the most reliable approach.

def code_execution_reward(completions, test_cases, **kwargs):
    """Execute generated code against test cases."""
    rewards = []
    for completion, tests in zip(completions, test_cases):
        # Extract code block
        match = re.search(r"```python\n(.*?)```", completion, re.DOTALL)
        if not match or not tests:
            rewards.append(0.0)
            continue
        code = match.group(1)
        passed = 0
        for test in tests:
            try:
                # NOTE: in-process exec is NOT a sandbox. In practice, run
                # generated code in an isolated subprocess with a timeout.
                exec(code + "\n" + test, {})
                passed += 1
            except Exception:
                pass
        rewards.append(passed / len(tests))
    return rewards

Multi-Reward Composition in GRPO

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[accuracy_reward, format_reward, length_penalty],
    args=GRPOConfig(
        num_generations=8,
        reward_weights=[1.0, 0.3, 0.1],  # accuracy dominates
        # ... other arguments as in the earlier GRPOConfig
    ),
    train_dataset=dataset,
)

6. Building a Dual-Mode Model (Think / No-Think)

Modern reasoning models support two modes — direct answering when reasoning isn’t needed, and extended thinking for complex problems. SmolLM3 demonstrated a complete open recipe for this.

graph TD
    A["User Query"] --> B{{"Mode?"}}
    B -->|"/think"| C["Generate:<br/>&lt;think&gt;step-by-step reasoning&lt;/think&gt;<br/>Final answer"]
    B -->|"/no_think"| D["Generate:<br/>&lt;think&gt;&lt;/think&gt;<br/>Direct answer"]

    subgraph Training["Training Pipeline"]
        direction TB
        E["Mid-training<br/>(35B reasoning tokens)"] --> F["Dual-mode SFT<br/>(1B no-think + 0.8B think)"]
        F --> G["Preference Alignment<br/>(APO/DPO)"]
    end

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#f5a623,color:#fff,stroke:#333
    style E fill:#9b59b6,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333
    style G fill:#1abc9c,color:#fff,stroke:#333

Training a Dual-Mode Reasoner

The SmolLM3 recipe:

  1. Reasoning mid-training (general purpose): 4 epochs on reasoning traces from OpenThoughts3 + Nemotron
  2. Dual-mode SFT: Carefully balanced data with both thinking and non-thinking examples, synthetic data from Qwen3-32B to fill gaps
  3. APO alignment: Anchored Preference Optimization with chosen/rejected pairs for both modes
  4. Model merging: Combine APO checkpoint with mid-training checkpoint to recover long-context performance
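Step 4 is, at its simplest, a weighted average of parameters from two checkpoints. A toy sketch over plain lists of weights (a real merge operates on model state dicts, and the 0.9/0.1 split here is illustrative, not SmolLM3's exact recipe):

```python
def merge_state_dicts(sd_a, sd_b, alpha=0.9):
    """Linear merge: alpha * sd_a + (1 - alpha) * sd_b, parameter by parameter."""
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share an architecture"
    merged = {}
    for name in sd_a:
        merged[name] = [alpha * a + (1.0 - alpha) * b
                        for a, b in zip(sd_a[name], sd_b[name])]
    return merged

# e.g. blend an aligned checkpoint with a mid-training checkpoint
merged = merge_state_dicts({"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}, alpha=0.9)
```

Linear merging only makes sense between checkpoints fine-tuned from the same base model, where the weights remain close enough to interpolate.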

Inference with Thinking Mode

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

# Reasoning mode
messages_think = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "What is 17 × 23?"}
]

# Direct mode
messages_direct = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "What is the capital of France?"}
]

text = tokenizer.apply_chat_template(messages_think, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Method Comparison

Decision Flowchart

graph TD
    A{{"What resources<br/>do you have?"}}
    A -->|"Limited GPU,<br/>quick results"| B{{"Have a teacher<br/>model?"}}
    B -->|"Yes"| C["Reasoning Distillation<br/>(CoT SFT)"]
    B -->|"No"| D["Use existing datasets<br/>(OpenThoughts, etc.)"]
    A -->|"Multi-GPU,<br/>longer training"| E{{"Verifiable<br/>tasks?"}}
    E -->|"Yes (math, code)"| F["GRPO with<br/>rule-based rewards"]
    E -->|"No (general)"| G["Hybrid: SFT warm-start<br/>+ RL refinement"]
    A -->|"Moderate GPU,<br/>iterative"| H["Rejection Sampling<br/>+ SFT loop"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#27ae60,color:#fff,stroke:#333
    style E fill:#e74c3c,color:#fff,stroke:#333
    style F fill:#27ae60,color:#fff,stroke:#333
    style G fill:#27ae60,color:#fff,stroke:#333
    style H fill:#27ae60,color:#fff,stroke:#333

Summary Table

| Method | Compute | Stability | Quality Ceiling | Best For |
|---|---|---|---|---|
| CoT SFT (Distillation) | Low | High | Bounded by teacher | Quick reasoning capability |
| GRPO (RL) | High | Medium | Can surpass teacher | Math, code with verifiable rewards |
| Rejection Sampling | Medium | High | Improves iteratively | Bootstrapping with limited compute |
| Hybrid (SFT → RL) | High | Medium-High | Highest | Production reasoning models |
| Dual-mode training | High | Medium | High | Models needing both fast + deep modes |

Practical Recommendations

For single consumer GPU (16–24 GB):

  1. Start with CoT SFT using LoRA on a reasoning dataset like Bespoke-Stratos-17k
  2. Use rejection sampling to iteratively improve on your target domain
  3. Quantize with bitsandbytes or GGUF for deployment

For multi-GPU cluster:

  1. SFT warm-start on reasoning traces
  2. GRPO with accuracy + format rewards
  3. Rejection sampling to create a polished final SFT dataset
  4. Serve with vLLM or Ollama

Timeline of Reasoning Models

graph LR
    A["Chain-of-Thought<br/>Prompting<br/>(Wei et al., 2022)"] --> B["STaR<br/>Self-Taught<br/>Reasoner<br/>(Zelikman, 2022)"]
    B --> C["OpenAI o1<br/>(Sep 2024)"]
    C --> D["DeepSeek-R1<br/>(Jan 2025)"]
    D --> E["Open-R1 Project<br/>(Jan 2025)"]
    D --> F["QwQ-32B<br/>(2025)"]
    E --> G["SmolLM3<br/>(Jul 2025)"]

    style A fill:#4a90d9,color:#fff,stroke:#333
    style B fill:#f5a623,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#9b59b6,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style F fill:#e67e22,color:#fff,stroke:#333
    style G fill:#1abc9c,color:#fff,stroke:#333

Conclusion

Training LLMs for reasoning has become increasingly accessible. The key insight from DeepSeek-R1 is that pure RL with simple rewards can develop sophisticated reasoning — but in practice, combining distillation with RL produces the best results.

For most practitioners, the fastest path to a reasoning model is:

  1. Distill: SFT on reasoning traces from a strong teacher
  2. Refine: GRPO or rejection sampling on your target domain
  3. Deploy: Quantize and serve with dual-mode support

The field is evolving rapidly, with new training recipes and datasets appearing regularly. The Open-R1 project and TRL library remain the best starting points for building your own reasoning models.

For deploying your reasoning model, see Deploying and Serving LLM with Llama.cpp or Deploying and Serving LLM with vLLM.

References

  • Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022. arXiv:2201.11903
  • Zelikman et al., STaR: Bootstrapping Reasoning With Reasoning, 2022. arXiv:2203.14465
  • Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (GRPO), 2024. arXiv:2402.03300
  • DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025. arXiv:2501.12948
  • Bakouch et al., SmolLM3: smol, multilingual, long-context reasoner, 2025. HuggingFace Blog
  • Ahmadian et al., Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback (RLOO), 2024. arXiv:2402.14740
  • von Werra et al., TRL: Transformer Reinforcement Learning. GitHub
