HumphreySun98/repoagentbench

SWE-bench for your codebase -- mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: claude-code, aider (Opus 4.7 / GPT-5.5 / Sonnet 4.6 / Gemini 3.1 Pro).

AI Summary

RepoAgentBench turns merged pull requests from a codebase into reproducible benchmarks to evaluate AI coding agents against the project's own tests and constraints.

How It Works

1
πŸ” Discover a smart way to test AI helpers

You hear about RepoAgentBench, a tool that lets you check which AI coding buddies really fix bugs in your own projects using real past changes.

2
📦 Get it ready in moments

You install the tool locally on your machine so everything is prepared to start testing.
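
Setup is presumably a standard Python install; the package name below is assumed from the repo name, so treat this as a sketch rather than documented steps:

```bash
# Hypothetical install -- the PyPI package name is assumed to match the repo.
pip install repoagentbench

# Sanity-check that the CLI is on your PATH.
repoagentbench --help
```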

3
📋 Pick a real bug fix from your history

You choose a merged bug-fix PR from your project's history and turn it into a task: the pre-fix broken code plus the checks that verify the fix.

4
✨ See the task get built

The tool automatically reconstructs the exact broken starting point and the success tests, ready for an agent to tackle.
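
In CLI terms, mining a task might look like the sketch below. The `infer --from-pr` command is quoted in the review further down this page; the PR number itself is illustrative:

```bash
# Mine one merged PR into a benchmark task (broken checkout + success tests).
# 1234 is a placeholder -- point this at a real merged bug-fix PR.
repoagentbench infer --from-pr 1234
```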

5
Choose your test path
✅
Quick demo fix

A stand-in applies the known correct patch, confirming that the task setup and its tests work end to end.

🚀
Real AI challenge

A real coding agent (aider or claude-code) attempts the fix on its own, starting from the broken code.
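
Both paths presumably go through the `run-one` command quoted later on this page. The `aider` and `claude-code` adapter names are confirmed there; the `oracle` name for the demo stand-in is a guess, and `<TASK_ID>` is a placeholder for a task produced by the previous step:

```bash
# Real agent attempts, one per adapter.
repoagentbench run-one --task <TASK_ID> --agent aider
repoagentbench run-one --task <TASK_ID> --agent claude-code

# Demo path: a stand-in applies the known-good patch to validate the harness.
# "oracle" is an assumed adapter name -- check the project docs for the real one.
repoagentbench run-one --task <TASK_ID> --agent oracle
```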

6
📊 Review the scoreboard

You get a clear pass/fail leaderboard showing which agents passed the tests, with details on what each one changed.
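
The review below confirms a `report` subcommand; since no flags are shown anywhere on this page, the minimal invocation is the safest sketch:

```bash
# Print the pass/fail leaderboard across completed runs.
repoagentbench report
```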

🎉 Know your best AI teammate

You can now see which agent handles your code's real bugs best, and use it with confidence.

AI-Generated Review

What is repoagentbench?

RepoAgentBench turns your merged GitHub PRs into local, contamination-free benchmarks for coding agents -- like SWE-bench, but for your own codebase. Mine a PR with `repoagentbench infer --from-pr <PR_NUMBER>`, then run agents via `repoagentbench run-one --task <TASK_ID> --agent aider` (or claude-code), verifying fixes against the project's real tests in isolated venvs. Python-based, it supports adapters for aider (Opus 4.7, GPT-5.5, Sonnet 4.6) and claude-code, plus Gemini 3.1 Pro, producing leaderboards and diffable artifacts.

Why is it gaining traction?

Unlike public datasets such as SWE-bench Verified, which are prone to training contamination, it mines post-cutoff PRs from your own repo for realistic, reproducible evals with no data leaks. Developers latch onto the structured outputs: pass/fail leaderboards, events.jsonl traces, and side-by-side diffs via `repoagentbench report` or `repoagentbench diff`, which reveal harness effects (e.g., aider vs. native claude-code on the same model). It's local-first, cheap (~$11 for a sweep), and offers per-codebase precision that a generic suite like SWE-bench Lite can't.
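
If you want to poke at those traces directly, the only safe assumption is the JSONL convention itself (one JSON object per line); the runs/ directory layout below is a guess:

```bash
# Inspect a run's event trace. Paths are hypothetical -- use your actual layout.
RUN_DIR=runs/my-run
wc -l "$RUN_DIR/events.jsonl"               # number of recorded events
jq -c . "$RUN_DIR/events.jsonl" | head -n 5 # skim the first few records
```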

Who should use this?

Python maintainers benchmarking coding agents on their own repo's bugs, such as the Click or Flask teams testing aider against claude-code. AI tool builders comparing adapters across Gemini 3.1 Pro, Sonnet 4.6, or Opus 4.7 before committing to an integration. Dev leads who want evidence of agent reliability on their own issues before trusting one in production.

Verdict

Grab it for alpha testing if you're deep into coding-agent evals: the CLI shines and the docs are solid, but 24 stars signal early days, so expect rough edges on non-Python repos. Worth the pip install for contamination-free baselines.


