HumphreySun98/repoagentbench

SWE-bench for your codebase -- mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: claude-code, aider (Opus 4.7 / GPT-5.5 / Sonnet 4.6 / Gemini 3.1 Pro).

AI Summary

RepoAgentBench turns merged pull requests from a codebase into reproducible benchmarks to evaluate AI coding agents against the project's own tests and constraints.

How It Works

1
πŸ” Discover a smart way to test AI helpers

You hear about RepoAgentBench, a tool that lets you check which AI coding buddies really fix bugs in your own projects using real past changes.

2
📦 Get it ready in moments

You install the tool locally on your machine so everything is prepared to start testing.
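
Setup is presumably a standard Python install; the package name below is assumed from the repo name, so treat this as a sketch rather than documented steps:

```bash
# Hypothetical install -- the PyPI package name is assumed to match the repo.
pip install repoagentbench

# Sanity-check that the CLI is on your PATH.
repoagentbench --help
```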

3
📋 Pick a real bug fix from your history

You choose a merged bug-fix PR from your project's history and turn it into a task: the pre-fix broken code plus the checks that verify the fix.

4
✨ See the task get built

The tool automatically reconstructs the exact broken starting point and the success tests, ready for an agent to tackle.
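
In CLI terms, mining a task might look like the sketch below. The `infer --from-pr` command is quoted in the review further down this page; the PR number itself is illustrative:

```bash
# Mine one merged PR into a benchmark task (broken checkout + success tests).
# 1234 is a placeholder -- point this at a real merged bug-fix PR.
repoagentbench infer --from-pr 1234
```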

5
Choose your test path
✅
Quick demo fix

A stand-in applies the known correct patch, confirming that the task setup and its tests work end to end.

🚀
Real AI challenge

A real coding agent (aider or claude-code) attempts the fix on its own, starting from the broken code.
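
Both paths presumably go through the `run-one` command quoted later on this page. The `aider` and `claude-code` adapter names are confirmed there; the `oracle` name for the demo stand-in is a guess, and `<TASK_ID>` is a placeholder for a task produced by the previous step:

```bash
# Real agent attempts, one per adapter.
repoagentbench run-one --task <TASK_ID> --agent aider
repoagentbench run-one --task <TASK_ID> --agent claude-code

# Demo path: a stand-in applies the known-good patch to validate the harness.
# "oracle" is an assumed adapter name -- check the project docs for the real one.
repoagentbench run-one --task <TASK_ID> --agent oracle
```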

6
📊 Review the scoreboard

You get a clear pass/fail leaderboard showing which agents passed the tests, with details on what each one changed.
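
The review below confirms a `report` subcommand; since no flags are shown anywhere on this page, the minimal invocation is the safest sketch:

```bash
# Print the pass/fail leaderboard across completed runs.
repoagentbench report
```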

🎉 Know your best AI teammate

You can now see which agent handles your code's real bugs best, and use it with confidence.

AI-Generated Review

What is repoagentbench?

RepoAgentBench turns your merged GitHub PRs into local, contamination-free benchmarks for coding agents -- like SWE-bench, but for your own codebase. Mine a PR with `repoagentbench infer --from-pr <PR_NUMBER>`, then run agents via `repoagentbench run-one --task <TASK_ID> --agent aider` (or claude-code), verifying fixes against the project's real tests in isolated venvs. Python-based, it supports adapters for aider (Opus 4.7, GPT-5.5, Sonnet 4.6) and claude-code, plus Gemini 3.1 Pro, producing leaderboards and diffable artifacts.

Why is it gaining traction?

Unlike public datasets such as SWE-bench Verified, which are prone to training contamination, it mines post-cutoff PRs from your own repo for realistic, reproducible evals with no data leaks. Developers latch onto the structured outputs: pass/fail leaderboards, events.jsonl traces, and side-by-side diffs via `repoagentbench report` or `repoagentbench diff`, which reveal harness effects (e.g., aider vs. native claude-code on the same model). It's local-first, cheap (~$11 for a sweep), and offers per-codebase precision that a generic suite like SWE-bench Lite can't.
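
If you want to poke at those traces directly, the only safe assumption is the JSONL convention itself (one JSON object per line); the runs/ directory layout below is a guess:

```bash
# Inspect a run's event trace. Paths are hypothetical -- use your actual layout.
RUN_DIR=runs/my-run
wc -l "$RUN_DIR/events.jsonl"               # number of recorded events
jq -c . "$RUN_DIR/events.jsonl" | head -n 5 # skim the first few records
```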

Who should use this?

Python maintainers benchmarking coding agents on their own repo's bugs, such as the Click or Flask teams testing aider against claude-code. AI tool builders comparing adapters across Gemini 3.1 Pro, Sonnet 4.6, or Opus 4.7 before committing to an integration. Dev leads who want evidence of agent reliability on their own issues before trusting one in production.

Verdict

Grab it for alpha testing if you're deep into coding-agent evals: the CLI shines and the docs are solid, but 24 stars signal early days, so expect rough edges on non-Python repos. Worth the pip install for contamination-free baselines.


