SWE-rebench

Tools and prompt templates used to build and evaluate SWE-rebench-v2 tasks for the paper.

Language: Python
AI Summary

A set of tools for preparing prompts, creating isolated test environments, and evaluating code fixes on software engineering benchmark tasks.

How It Works

1. 📚 Discover the toolbox

You come across this collection of tools in a research paper on testing AI against real-world coding challenges.

2. 📋 Gather coding tasks

You assemble a list of software engineering problems, such as bug fixes or feature additions, from the bundled examples or from shared task collections, as in the sketch below.
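
A minimal loading sketch, assuming tasks arrive either as a local JSON file or a Hugging Face dataset; the dataset id and the field names (repo, base_commit, patch, test_command) are illustrative, not necessarily the repo's actual schema:

```python
# Hypothetical task loader -- field names and dataset id are assumptions.
import json

from datasets import load_dataset  # pip install datasets

def load_tasks(source: str) -> list[dict]:
    """Load task records from a local JSON file or a Hugging Face dataset."""
    if source.endswith(".json"):
        with open(source) as f:
            return json.load(f)  # expected: a list of task dicts
    # Substitute the dataset id published with the paper.
    return list(load_dataset(source, split="test"))

tasks = load_tasks("tasks.json")  # e.g. the bundled sample tasks
print(tasks[0]["repo"], tasks[0]["test_command"])
```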

3. 🧠 Create task guides

You render prompt templates that ask a model to describe and annotate each coding challenge, producing clear labels such as PR descriptions and interface notes; a sketch follows.
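
The repo drives this step with Jinja templates; the template text and variable names below are stand-ins for illustration, not the project's actual templates:

```python
# Illustrative annotation prompt -- the real templates live in the repo.
from jinja2 import Template  # pip install jinja2

PR_TEMPLATE = Template(
    "Summarize the change proposed for {{ repo }}.\n"
    "Issue: {{ issue_title }}\n"
    "Diff:\n{{ patch }}\n"
)

prompt = PR_TEMPLATE.render(
    repo="example/project",  # hypothetical task fields
    issue_title="Fix off-by-one in the parser",
    patch="--- a/parser.py\n+++ b/parser.py\n...",
)
print(prompt)
```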

4. 🏗️ Prepare test spaces

You build isolated Docker environments so each challenge runs independently, without interference from other tasks; one plausible shape is sketched below.
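
One plausible shape for the environment step, building a per-task Docker image through the docker CLI; the base image, Dockerfile contents, and tag scheme are assumptions:

```python
# Hypothetical per-task image builder -- Dockerfile details are assumptions.
import subprocess
import tempfile
from pathlib import Path
from textwrap import dedent

def build_task_image(task_id: str, repo: str, base_commit: str) -> str:
    """Build an image with the repo checked out at the task's base commit."""
    dockerfile = dedent(f"""\
        FROM python:3.10-slim
        RUN apt-get update && apt-get install -y git
        RUN git clone https://github.com/{repo} /repo
        WORKDIR /repo
        RUN git checkout {base_commit}
    """)
    tag = f"swe-task:{task_id}"  # assumes task_id is a lowercase slug
    with tempfile.TemporaryDirectory() as ctx:
        Path(ctx, "Dockerfile").write_text(dockerfile)
        subprocess.run(["docker", "build", "-t", tag, ctx], check=True)
    return tag
```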

5. 🔍 Test code fixes

You apply candidate patches inside each environment and run the task's built-in tests to see which changes work, along the lines of the sketch below.
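
A sketch of evaluating one candidate patch in a fresh container; the mount path, the git apply step, and the test-command plumbing are assumptions rather than the repo's exact harness:

```python
# Hypothetical single-task evaluation -- container plumbing is assumed.
import subprocess

def evaluate_patch(image_tag: str, patch_path: str, test_command: str) -> str:
    """Apply a patch in a fresh container, run the tests, return the log."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{patch_path}:/tmp/fix.patch:ro",  # patch_path must be absolute
            image_tag,
            "bash", "-c", f"git apply /tmp/fix.patch && {test_command}",
        ],
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr
```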

6. 📊 Get evaluation report

You receive a summary of which fixes passed all tests, plus insight into the model's performance; a toy parser is sketched below.
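
A toy version of the log-parsing step; real harnesses match the specific output of each test runner, so these pytest-style patterns are assumptions:

```python
# Toy pass/fail extractor for pytest-style summary lines.
import re

def parse_test_log(log: str) -> dict:
    """Extract pass/fail counts from a pytest-style summary line."""
    summary = {"passed": 0, "failed": 0}
    for count, status in re.findall(r"(\d+) (passed|failed)", log):
        summary[status] = int(count)
    summary["resolved"] = summary["failed"] == 0 and summary["passed"] > 0
    return summary

print(parse_test_log("== 12 passed, 1 failed in 3.2s =="))
# -> {'passed': 12, 'failed': 1, 'resolved': False}
```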


AI-Generated Review

What is SWE-rebench-v2?

SWE-rebench-v2 provides Python tools on GitHub for building and evaluating large-scale, language-agnostic software engineering benchmarks. You feed it JSON task lists (repo patches, test commands, and related metadata); it renders Jinja prompt templates for annotations (PR descriptions, meta information, interfaces), generates base and per-task Docker images, and then runs evaluations by applying patches, executing tests in containers, and parsing logs for pass/fail results. This makes SWE-agent testing reproducible, with tasks pulled from Hugging Face datasets or local JSON and optional OpenAI integration for generating prompt responses, sketched below.
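
The optional OpenAI step might look roughly like the following; the model name and message framing are placeholders, not the repo's configuration:

```python
# Hypothetical annotation call -- model choice is a placeholder.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate(prompt: str) -> str:
    """Send a rendered annotation prompt to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```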

Why is it gaining traction?

Unlike generic prompt tools or Copilot extensions, it ties prompt engineering directly to Docker-based evaluations, making benchmarks verifiable across languages without setup hassles. Parallel workers, golden evaluations, and log parsers absorb real-world test noise such as timing-dependent output, which matters to teams scaling benchmarks to thousands of tasks (see the sketch below). The backing arXiv paper adds research credibility, drawing developers who tweak SWE-bench variants.
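
Scaling to thousands of tasks with parallel workers could look like this sketch; the worker count and task fields are assumptions, and `evaluate` stands in for a callable like the evaluate_patch sketch in the walkthrough above:

```python
# Hypothetical parallel driver; `evaluate` is a callable taking
# (image_tag, patch_path, test_command) and returning a log string.
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import Callable

def run_all(tasks: list[dict], evaluate: Callable[[str, str, str], str],
            max_workers: int = 8) -> dict[str, str]:
    """Evaluate every task's candidate patch in parallel and collect logs."""
    results: dict[str, str] = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(
                evaluate, t["image_tag"], t["patch_path"], t["test_command"]
            ): t["id"]
            for t in tasks
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```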

Who should use this?

SWE researchers benchmarking AI coding agents on GitHub repos, platform teams evaluating the output of assistants such as Copilot, and prompt engineering specialists building custom tasks from issues and PRs. It suits anyone who needs reproducible, container-based experiments on real repositories.

Verdict

Grab it if you're replicating the SWE-rebench paper or need Docker-based evaluation pipelines: the docs are thorough, and the CLI scripts run out of the box with Docker and Python 3.10. At 19 stars, it's early-stage research tooling rather than production-ready software, but it's a solid starting point for building prompt-optimization and evaluation workflows.

