stanford-iris-lab

Meta-Harness: 76.4% on Terminal-Bench 2.0 (Claude Opus 4.6)

Found Apr 05, 2026 at 602 stars
Python
AI Summary

This repository provides an open-source AI agent framework called Meta-Harness, designed to score highly on Terminal-Bench 2.0 by capturing a snapshot of the sandbox environment before each task begins.

How It Works

1
🔍 Discover Meta-Harness

You find this agent scaffold from Stanford researchers while browsing top performers on terminal task benchmarks.

2
📦 Get Ready

You install the Harbor base tool it depends on; setup takes moments.

3
🔗 Link Your AI

You connect your LLM provider (for example, an Anthropic API key) so it can power the agent.

4
🚀 Launch the Challenge

With a single command, you launch the agent on a suite of real-world terminal tasks.

5
👀 Watch It Work

You watch the agent explore, plan, and solve tasks step by step.

6
🎉 Celebrate Results

You see the agent reach 76.4% overall success across easy, medium, and hard task tiers.
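The steps above amount to a short shell session. A minimal sketch, assuming Harbor installs from PyPI and that `harbor run` takes an agent path; the flag names here are assumptions, not confirmed by the repo:

```shell
# Sketch of the setup flow described above; flag names are assumptions.

# Step 2: install the Harbor base tool
pip install harbor

# Step 3: connect your model provider (Anthropic, for Claude Opus 4.6)
export ANTHROPIC_API_KEY="sk-ant-..."   # placeholder key

# Step 4: launch the agent on the Terminal-Bench 2.0 task suite
# (hypothetical flags -- consult the repo README for the real invocation)
harbor run --agent ./meta-harness --benchmark tbench2
```

Consult the repository's README for the exact `harbor run` invocation; only the general pip-install-then-run shape is stated in the text above.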

AI-Generated Review

What is meta-harness-tbench2-artifact?

This Python artifact delivers Meta-Harness, a plug-and-play agent scaffold for Terminal-Bench 2.0 (tbench2), hitting 76.4% accuracy across 89 tasks using Claude Opus 4.6. It bootstraps a sandbox snapshot—pwd, files, tools, packages—into the initial prompt, skipping wasteful early ls/which commands. Install Harbor via pip, set your Anthropic key, and run harbor run with the meta-harness agent path for instant benchmarking.
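The snapshot idea is straightforward to sketch. The following is a hypothetical illustration, not the repo's actual code: gather the working directory, file listing, and available tools once, then prepend them to the agent's first prompt so it never spends turns on `ls` or `which`:

```python
import os
import shutil


def environment_snapshot(tools=("git", "python3", "pip", "curl", "make")):
    """Collect basic sandbox facts to inject into the agent's first prompt.

    Hypothetical helper illustrating the idea; the real snapshot likely
    covers more (installed packages, env vars, etc.).
    """
    cwd = os.getcwd()
    files = sorted(os.listdir(cwd))[:50]  # cap the listing for prompt budget
    available = [t for t in tools if shutil.which(t)]
    return (
        f"Working directory: {cwd}\n"
        f"Files: {', '.join(files) or '(empty)'}\n"
        f"Available tools: {', '.join(available) or '(none found)'}"
    )


# Prepend the snapshot so the model skips exploratory ls/which turns.
initial_prompt = environment_snapshot() + "\n\nTask: fix the failing build"
```

The payoff is exactly what the review describes: the model's first turn can act on the environment instead of probing it.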

Why is it gaining traction?

It crushes tbench2 leaderboards at 76.4% (100% easy, 81.1% medium, 64.7% hard), outpacing base harnesses by injecting env intel upfront, saving 2-5 turns per run. Developers dig the zero-config CLI that swaps models seamlessly via Harbor, plus Anthropic caching for cheaper repeats. With 600 stars, it's the go-to artifact for reproducible agent evals.
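The caching mention refers to Anthropic's prompt caching, where a stable prefix (such as the system instructions plus environment snapshot) is marked with `cache_control` so repeat runs reuse it. A minimal sketch of such a request payload, assuming the Messages API's ephemeral cache type; the model id and prompt text are illustrative placeholders, not taken from the repo:

```python
# Sketch of an Anthropic Messages API payload with prompt caching.
def build_request(system_prefix: str, task: str) -> dict:
    return {
        "model": "claude-opus-4-6",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prefix,
                # Mark the stable prefix cacheable so repeated benchmark
                # runs pay the full token cost only once.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": task}],
    }


request = build_request(
    "You are a terminal agent. Env snapshot: ...", "List the failing tests."
)
```

Keeping the cached prefix byte-identical across runs is what makes repeated benchmark sweeps cheaper.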

Who should use this?

AI researchers benchmarking LLMs on terminal tasks, like CLI automation or sandbox scripting. Terminal agent builders tweaking prompts for Opus or similar. Eval teams at labs reproducing SOTA scores on tbench2 without rebuilding scaffolds.

Verdict

Grab it if you're into agent benchmarking: solid Harbor integration and top scores make it practical despite a 1.0% credibility score and nascent docs. At roughly 600 stars it's mature enough for evals, but watch for updates as details emerge.
