graph LR
A["Traditional Code Benchmarks<br/>(HumanEval, SWE-bench)<br/>Code generation focus"] --> B["Limited to<br/>code editing"]
B --> C["Terminal Bench 2.0<br/>89 real terminal tasks<br/>Full system mastery"]
C --> D["Measures true<br/>agent autonomy"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
Terminal Bench 2.0
A Harbor-native benchmark of 89 expert-crafted tasks measuring how well AI agents master real terminal environments across SWE, ML, security, and data science
Keywords: Terminal Bench, terminal-bench 2.0, AI agent benchmark, terminal mastery, Harbor framework, Stanford, Laude, software engineering, machine learning, security, data science, coding agent evaluation, LLM agent leaderboard

Introduction
Most AI benchmarks test what models know. Terminal Bench tests what agents can do — inside a real terminal, with real tools, on real tasks.
Terminal Bench 2.0 is a Harbor-native benchmark of 89 high-quality tasks that measure AI agents’ ability to operate autonomously in terminal environments. Tasks span software engineering, machine learning, security, data science, scientific computing, and system administration — requiring agents to install software, debug code, crack hashes, train models, configure servers, and much more.
“Terminal-bench: benchmarks for AI agents in terminal environments. Harbor-native benchmarks to quantify agents’ terminal mastery.” — tbench.ai
What Is Terminal Bench 2.0?
Terminal Bench 2.0 is the second major version of the terminal-bench benchmark suite — a Stanford x Laude collaboration — designed to evaluate AI agents’ ability to solve complex, real-world tasks inside terminal environments. Unlike benchmarks that test isolated coding ability, Terminal Bench drops agents into full Linux environments and asks them to accomplish goals that a skilled software engineer or system administrator would handle.
Each task provides:
- A detailed natural language description of the goal
- A Docker container with the required environment pre-configured
- Automated verification scripts that check whether the task was completed correctly
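To make the third component concrete, here is a minimal sketch of what a task's automated verification might look like. The path and expected content below are invented for illustration; real Terminal Bench tasks ship their own checks (file contents, service health, test outcomes).

```python
# Hypothetical sketch of a Terminal Bench-style verification check.
# The output path and expected value are placeholders, not taken from
# any real task.
import pathlib


def verify(output_path: str, expected: str = "42") -> bool:
    """Return True if the agent produced the expected artifact."""
    p = pathlib.Path(output_path)
    return p.exists() and p.read_text().strip() == expected
```

A harness would run such a check after the agent finishes and map its boolean result to the task's binary pass/fail outcome.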
Key Characteristics
| Feature | Details |
|---|---|
| Total tasks | 89 |
| Categories | Software engineering, ML, security, data science, scientific computing, system administration, debugging, and more |
| Difficulty levels | Easy, Medium, Hard |
| Evaluation | Harbor-native (via Harbor framework) |
| Metric | % Resolved — percentage of the 89 tasks fully completed |
| Anti-contamination | Canary string embedded in benchmark data |
| Versions | 1.0 (legacy), 2.0 (live), 3.0 (in development), Science 1.0 (in development) |
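The anti-contamination row deserves a brief illustration. The idea of a canary string is that any training corpus containing it has likely ingested the benchmark, so providers can screen for it. The canary value below is a made-up placeholder, not the real Terminal Bench canary:

```python
# Sketch of a canary-string contamination check. The canary value is a
# placeholder for illustration only.
CANARY = "BENCHMARK-CANARY-PLACEHOLDER"


def is_contaminated(corpus_text: str, canary: str = CANARY) -> bool:
    """True if the corpus appears to contain the benchmark's canary."""
    return canary in corpus_text
```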
How Evaluation Works
Terminal Bench uses the Harbor framework for evaluation. Agents are given access to a terminal environment and must complete each task autonomously. The evaluation command is straightforward:
harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5
Each task has automated verification that checks the agent’s work against precise success criteria — file contents, service availability, test outcomes, or computed results.
graph TD
A["Task Description<br/>(natural language)"] --> B["AI Agent"]
C["Docker Container<br/>(pre-configured environment)"] --> B
B --> D["Agent works in<br/>terminal autonomously"]
D --> E["Automated Verification<br/>(success criteria checks)"]
E --> F["Pass / Fail"]
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#f39c12,color:#fff,stroke:#333
style D fill:#8e44ad,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style F fill:#2c3e50,color:#fff,stroke:#333
Who Built It?
Terminal Bench is a Stanford x Laude collaboration. The benchmark tasks were crafted by experts including:
- Nicholas Carlini — Google DeepMind researcher, prolific task creator (security, software engineering, creative challenges)
- Jan-Lucas Uslu — Task creator (security, hardware)
- Junhong Shen — Task creator (system administration, data science)
- Karl Krauth — Task creator (biology, scientific computing)
- jeffreywpli, dwahdany — Task creators (data processing, ML)
- And many other contributors from Stanford, Google, and the broader research community
| Resource | Link |
|---|---|
| Website | tbench.ai |
| Leaderboard | tbench.ai/leaderboard/terminal-bench/2.0 |
| Submission instructions | HuggingFace: harborframework/terminal-bench-2-leaderboard |
| Harbor Framework | harborframework.com |
What Skills Does It Test?
Terminal Bench 2.0 tests a remarkably diverse set of real-world terminal skills — far beyond what traditional coding benchmarks cover:
graph TD
TB["Terminal Bench 2.0<br/>89 tasks"] --> SWE["Software Engineering<br/>Build, compile, debug"]
TB --> ML["Machine Learning<br/>Train models, inference"]
TB --> SEC["Security<br/>Crack hashes, find vulns"]
TB --> DS["Data Science<br/>Process, query, analyze"]
TB --> SCI["Scientific Computing<br/>Statistics, biology, physics"]
TB --> SYS["System Administration<br/>Servers, VMs, configs"]
style TB fill:#e74c3c,color:#fff,stroke:#333
style SWE fill:#3498db,color:#fff,stroke:#333
style ML fill:#27ae60,color:#fff,stroke:#333
style SEC fill:#8e44ad,color:#fff,stroke:#333
style DS fill:#f39c12,color:#fff,stroke:#333
style SCI fill:#e67e22,color:#fff,stroke:#333
style SYS fill:#6cc3d5,color:#fff,stroke:#333
| Category | Example Tasks | Difficulty |
|---|---|---|
| Software engineering | Build POV-Ray from source, write a MIPS interpreter, implement pipeline parallelism in PyTorch | Easy–Hard |
| Machine learning | Train a FastText model on Yelp data, implement LLM inference batching scheduler, recover PyTorch model architecture | Medium–Hard |
| Security | Crack a 7z hash, exploit XSS filter bypasses, extract secrets from binaries, perform differential cryptanalysis | Medium–Hard |
| Data science | Reshard C4 dataset, merge multi-source data, optimize SQL queries, set up HuggingFace model inference | Medium |
| Scientific computing | DNA assembly primer design, Raman spectrum fitting, MCMC sampling with Stan, adaptive rejection sampling | Medium–Hard |
| System administration | Configure git webserver with auto-deploy, run Windows 3.11 in QEMU, set up mailing list servers, compile CompCert | Medium–Hard |
| Debugging | Fix OCaml garbage collector, resolve C++ heap crashes, recover corrupted SQLite databases | Medium–Hard |
What Makes These Tasks Hard?
Unlike isolated coding challenges, Terminal Bench tasks require agents to:
- Navigate complex environments — install dependencies, configure build systems, manage services
- Chain multiple skills — a single task might require downloading, compiling, configuring, and verifying
- Handle real-world messiness — legacy code (COBOL modernization), obscure formats (G-code), corrupted data (WAL recovery)
- Demonstrate deep domain knowledge — from molecular biology (DNA assembly) to cryptography (FEAL attacks) to retro computing (Windows 3.11)
Current Leaderboard
The leaderboard below shows the top-performing agent–model combinations on Terminal Bench 2.0, ranked by % Resolved (percentage of 89 tasks completed successfully).
Source: Terminal Bench 2.0 Leaderboard (consulted July 2025). 120 total entries. Results verified by Terminal Bench team members.
| Rank | Agent | Model | Organization | % Resolved |
|---|---|---|---|---|
| 1 | ForgeCode | Claude Opus 4.6 | ForgeCode / Anthropic | 81.8 ± 1.7 |
| 1 | ForgeCode | GPT-5.4 | ForgeCode / OpenAI | 81.8 ± 2.0 |
| 3 | TongAgents | Gemini 3.1 Pro | BIGAI / Google | 80.2 ± 2.6 |
| 4 | ForgeCode | Gemini 3.1 Pro | ForgeCode / Google | 78.4 ± 1.8 |
| 5 | SageAgent | GPT-5.3-Codex | OpenSage / OpenAI | 78.4 ± 2.2 |
| 6 | Droid | GPT-5.3-Codex | Factory / OpenAI | 77.3 ± 2.2 |
| 7 | Capy | Claude Opus 4.6 | Capy / Anthropic | 75.3 ± 2.4 |
| 8 | Simple Codex | GPT-5.3-Codex | OpenAI | 75.1 ± 2.4 |
| 9 | Terminus-KIRA | Gemini 3.1 Pro | KRAFTON AI / Google | 74.8 ± 2.6 |
| 10 | Terminus-KIRA | Claude Opus 4.6 | KRAFTON AI / Anthropic | 74.7 ± 2.6 |
| 11 | Mux | GPT-5.3-Codex | Coder / OpenAI | 74.6 ± 2.5 |
| 12 | MAYA-V2 | Claude Opus 4.6 | ADYA / Anthropic | 72.1 ± 2.2 |
| 13 | TongAgents | Claude Opus 4.6 | BIGAI / Anthropic | 71.9 ± 2.7 |
| 14 | Junie CLI | Multiple | JetBrains | 71.0 ± 2.9 |
| 15 | CodeBrain-1 | GPT-5.3-Codex | Feeling AI / OpenAI | 70.3 ± 2.6 |
| 16 | Droid | Claude Opus 4.6 | Factory / Anthropic | 69.9 ± 2.5 |
| 17 | Ante | Gemini 3 Pro | Antigma Labs / Google | 69.4 ± 2.1 |
| 18 | IndusAGI | GPT-5.3-Codex | SoloVpx / OpenAI | 69.1 ± 2.3 |
| 19 | Crux | Claude Opus 4.6 | Roam / Anthropic | 66.9 |
| 20 | Mux | Claude Opus 4.6 | Coder / Anthropic | 66.5 ± 2.5 |
Key takeaway: The top agents now solve over 80% of Terminal Bench 2.0 tasks, but the hardest ~20% — involving deep domain expertise in cryptography, biology, and complex systems — remain largely unsolved. The leaderboard features 120 entries from diverse organizations, with specialized agent frameworks (ForgeCode, Droid, TongAgents) consistently outperforming general-purpose CLI tools.
For the full, up-to-date leaderboard, visit the links in the next section.
Where to Explore the Benchmark
Dashboards and Leaderboards
| Resource | Description | Link |
|---|---|---|
| Terminal Bench 2.0 Leaderboard | Full ranked leaderboard with all 120 entries, agents, and models | tbench.ai/leaderboard/terminal-bench/2.0 |
| Task Registry | Browse all 89 tasks with descriptions, categories, and difficulty | tbench.ai/benchmarks/terminal-bench-2 |
| Terminal Bench Home | Overview of all benchmark versions and upcoming releases | tbench.ai |
Submission and Evaluation
| Resource | Description | Link |
|---|---|---|
| Submission Instructions | How to submit your agent to the leaderboard via HuggingFace | HF: harborframework/terminal-bench-2-leaderboard |
| Harbor Framework | The evaluation framework used to run Terminal Bench | harborframework.com |
Run the Benchmark
# Install Harbor and run Terminal Bench 2.0
harbor run -d terminal-bench@2.0 -a "your-agent" -m "your-model" -k 5
Understanding the Metrics
% Resolved
The primary metric. Each task is binary — either fully completed (verified by automated checks) or not. The score is the percentage of 89 tasks that the agent resolved successfully.
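Since each task is binary, the metric reduces to a simple fraction. A minimal sketch:

```python
# % Resolved: the fraction of binary pass/fail task outcomes that
# passed, expressed as a percentage.
def percent_resolved(outcomes: list[bool]) -> float:
    """Percentage of tasks resolved, given one pass/fail per task."""
    return 100.0 * sum(outcomes) / len(outcomes)
```

For example, resolving 73 of the 89 tasks yields a score of roughly 82.0.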
Confidence Intervals
Each leaderboard entry includes a confidence interval (± value) reflecting variance across evaluation runs. Smaller intervals indicate more consistent agent performance.
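One common way to compute such an interval is a normal approximation over per-run scores; the leaderboard does not document its exact method, so the sketch below is an assumption, not the official calculation:

```python
# Sketch of a ~95% confidence interval over repeated evaluation runs,
# using a normal approximation on run-level % Resolved scores. This is
# an illustrative assumption, not Terminal Bench's documented method.
import statistics


def mean_with_ci(run_scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (mean, half-width) of an approximate 95% CI."""
    mean = statistics.mean(run_scores)
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, z * sem
```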
Agent vs. Model
Terminal Bench uniquely separates the agent framework (e.g., ForgeCode, Droid, Claude Code) from the underlying model (e.g., Claude Opus 4.6, GPT-5.4). This reveals that:
- The same model performs very differently across agent frameworks
- Specialized agent frameworks consistently outperform general-purpose tools
- The agent scaffolding matters as much as the model capability
graph LR
A["Model Capability<br/>(reasoning, knowledge)"] --> C["Task Resolution"]
B["Agent Framework<br/>(tool use, planning)"] --> C
C --> D["% Resolved<br/>on Terminal Bench"]
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#8e44ad,color:#fff,stroke:#333
Why Terminal Bench Matters
graph LR
A["Code-only<br/>benchmarks"] --> B["Don't test<br/>system mastery"]
B --> C["Terminal Bench<br/>fills the gap"]
C --> D["Measures real<br/>agent autonomy"]
A2["Isolated<br/>task evals"] --> B2["Miss multi-step<br/>complexity"]
B2 --> C
C --> D2["Drives agent<br/>framework innovation"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
- Tests real autonomy — Agents must navigate full environments, not just generate code snippets
- Broad skill coverage — From biology to cryptography to system administration, no single skill suffices
- Separates agent from model — Reveals that scaffolding and tool use matter as much as raw model capability
- Practical relevance — Tasks mirror what engineers actually do in terminals every day
- Anti-contamination — Canary strings and automated verification prevent benchmark gaming
Video: Terminal Bench 2.0 Explained
Conclusion
Terminal Bench 2.0 sets a new standard for evaluating AI agents in real-world terminal environments:
- 89 expert-crafted tasks spanning software engineering, ML, security, data science, scientific computing, and system administration
- Built as a Stanford x Laude collaboration with tasks from leading researchers including Nicholas Carlini
- Evaluated via the Harbor framework — reproducible, containerized, and open for submissions
- The best agents solve ~82% of tasks, but the hardest challenges in cryptography, biology, and complex systems remain unsolved
- The benchmark uniquely separates agent framework from model, revealing that scaffolding matters as much as raw capability
With Terminal Bench 3.0 and Terminal Bench Science already in development, this benchmark family is rapidly evolving to keep pace with agent capabilities — ensuring we have a rigorous measure of what AI agents can truly accomplish when given a terminal and a goal.
References
- Terminal Bench Team. “Terminal-Bench: Benchmarks for AI Agents in Terminal Environments.” tbench.ai
- Terminal Bench Team. “Terminal Bench 2.0 Leaderboard.” tbench.ai/leaderboard/terminal-bench/2.0 (consulted July 2025)
- Terminal Bench Team. “Terminal Bench 2.0 Task Registry.” tbench.ai/benchmarks/terminal-bench-2
- Harbor Framework. “Harbor: Evaluation Framework for AI Agents.” harborframework.com
- Harbor Framework. “Terminal Bench 2 Leaderboard Submissions.” HuggingFace. huggingface.co/datasets/harborframework/terminal-bench-2-leaderboard
Read More
- See how agents tackle real GitHub issues — SWE-bench Verified
- Evaluate factual accuracy in LLMs — SimpleQA
- Test visual grounding on screens — ScreenSpot Pro
- Explore the hardest academic benchmark — Humanity’s Last Exam
- Deploy models for running your own evaluations — Deploying and Serving LLM with vLLM
- Track costs when running evaluations — FinOps Best Practices for LLM Applications
- Terminal Bench Official Website
- Terminal Bench 2.0 Leaderboard
- Harbor Framework