
The creators of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.
The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.
With a harder and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.
Harbor, the accompanying runtime framework, lets developers and researchers scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.
“Harbor is the package we wish we had had while making Terminal-Bench," wrote co-creator Alex Shaw on X. "It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models."
Higher Bar, Cleaner Data
Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating AI-powered agents that operate in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.
However, its broad scope came with inconsistencies. Several tasks were flagged by the community as poorly specified or unstable because of external service changes.
Version 2.0 addresses these issues directly. The updated suite includes 89 tasks, each subjected to multiple hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.
A notable example is the download-youtube task, which was removed or refactored in 2.0 because of its dependence on unstable third-party APIs.
“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is significantly higher in the new benchmark.”
Harbor: Unified Rollouts at Scale
Alongside the benchmark update, the team released Harbor, a new framework for running and evaluating agents in cloud-deployed containers.
Harbor supports large-scale rollout infrastructure, with compatibility for major providers such as Daytona and Modal.
Designed to generalize across agent architectures, Harbor supports:
- Evaluation of any container-installable agent
- Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines
- Custom benchmark creation and deployment (a rough sketch of a task layout follows this list)
- Full integration with Terminal-Bench 2.0
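To make the custom-benchmark point above concrete, a containerized task generally pairs an environment image with a natural-language instruction and an automated check that decides pass or fail. The layout below is only an illustrative sketch under those assumptions; the directory and file names are hypothetical, not Harbor's documented schema, which is described at harborframework.com.
mkdir -p my-benchmark/tasks/fix-build/tests
touch my-benchmark/tasks/fix-build/Dockerfile    # container image the agent starts in
touch my-benchmark/tasks/fix-build/task.yaml     # natural-language instruction given to the agent
touch my-benchmark/tasks/fix-build/tests/test.sh # check script; exit code 0 means the task was solved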
Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing agents and submitting them to the public leaderboard.
Early Results: GPT-5 Leads in Task Success
Preliminary results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command-line interface) agent running GPT-5 in the lead with a 49.6% success rate, the highest among all agents tested so far.
Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.
Top 5 Agent Results (Terminal-Bench 2.0):
- Codex CLI (GPT-5): 49.6%
- Codex CLI (GPT-5-Codex): 44.3%
- OpenHands (GPT-5): 43.8%
- Terminus 2 (GPT-5-Codex): 43.4%
- Terminus 2 (Claude Sonnet 4.5): 42.8%
The close clustering among top models points to active competition across platforms, with no single agent solving more than half the tasks.
Submission and Use
To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with the job directories for validation.
harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
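Filled in with example values, a submission run might look like the following; the model and agent names here are placeholders chosen for illustration, and the exact dataset identifier should be checked against the Harbor documentation.
# five attempts per task, with outputs collected for a leaderboard submission
harbor run -d terminal-bench@2.0 -m "gpt-5" -a "terminus-2" --n-attempts 5 --jobs-dir ./tb2-submission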
Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.
Aiming for Standardization
The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.
These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.