GeniusHTX

The official repo of our paper, "SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering?"

AI Summary

SWE-Skills-Bench is a dataset of 49 real-world software engineering tasks paired with skill documents to benchmark whether providing domain-specific knowledge improves AI agent performance.

How It Works

1. 📰 Discover the benchmark

You find a collection of 49 real coding challenges for testing whether skill guides make AI agents better at software tasks.

2. 💻 Get everything ready

Download the starter files and connect your AI provider's API key so the agent can join the tests (a quick readiness check is sketched below).
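
As a minimal sketch of that setup step, the check below verifies an Anthropic key is present before anything runs. ANTHROPIC_API_KEY is the conventional Anthropic environment variable; whether this benchmark's harness reads exactly that name is an assumption.

```python
import os
import sys

# ANTHROPIC_API_KEY is the standard Anthropic variable name; whether the
# benchmark harness reads exactly this name is an assumption.
if not os.environ.get("ANTHROPIC_API_KEY"):
    sys.exit("Set ANTHROPIC_API_KEY before running the benchmark.")
print("API key found -- ready to run.")
```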

3. 📋 Pick your challenges

Browse the list of tasks, from fixing bugs to adding features or improving code, and choose what to try (one way to load and filter them is sketched below).
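
Here is a hedged sketch of loading the 49 tasks from HuggingFace and filtering by type. The dataset ID, split name, and field names are all assumptions; check the repo's README for the real ones.

```python
from datasets import load_dataset  # pip install datasets

# Dataset ID, split, and field names are assumptions -- consult the README.
tasks = load_dataset("GeniusHTX/SWE-Skills-Bench", split="test")
bug_fixes = [t for t in tasks if t.get("task_type") == "bug_fix"]
print(f"{len(bug_fixes)} of {len(tasks)} tasks look like bug fixes")
```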

4. 🏃 Run the performance tests

Launch each task twice, once with the skill guide and once without, to see the real difference in action (the A/B loop is sketched below).
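
A minimal sketch of that paired-run loop, under loud assumptions: run_task is a hypothetical stand-in for the harness's real entry point, its outcome is simulated so the sketch runs end to end, and the task IDs are illustrative, not real entries from the dataset.

```python
import random

def run_task(task_id: str, use_skill: bool) -> bool:
    # Hypothetical stand-in: the real harness would launch the agent on the
    # task and report whether its tests passed. Simulated here so the sketch
    # runs end to end.
    random.seed(f"{task_id}-{use_skill}")
    return random.random() < (0.6 if use_skill else 0.4)

results = {
    task_id: {
        "with_skill": run_task(task_id, use_skill=True),
        "baseline": run_task(task_id, use_skill=False),
    }
    for task_id in ["task-001", "task-002"]  # illustrative IDs only
}
print(results)
```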

5. 📊 Review the clear results

Check simple charts and reports showing pass rates, time used, and exactly how much better the guide made things (computing the headline delta is sketched below).
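
As a sketch of that comparison, assuming each run writes a JSON list of per-task results; the file names and keys below are assumptions, not the benchmark's documented report format.

```python
import json

def pass_rate(path: str) -> float:
    with open(path) as f:
        runs = json.load(f)  # assumed shape: [{"task_id": ..., "passed": bool}]
    return sum(r["passed"] for r in runs) / len(runs)

with_skill = pass_rate("results_with_skill.json")  # hypothetical file names
baseline = pass_rate("results_baseline.json")
print(f"skill: {with_skill:.1%}  baseline: {baseline:.1%}  "
      f"delta: {with_skill - baseline:+.1%}")
```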

🎉 Unlock AI insights

You now understand how skill guides boost your AI's coding powers and can improve your own projects.


AI-Generated Review

What is SWE-Skills-Bench?

SWE-Skills-Bench is a Python benchmark with 49 real-world software engineering tasks from actual repos, testing if feeding AI agents "skill documents" actually improves performance on fixes, features, and refactors. Load the dataset via HuggingFace for quick access, or run the full eval framework in Docker containers using Claude Code CLI and Anthropic API keys. It generates official reports comparing pass rates, token usage, and failed tests between skill-injected and baseline runs.

Why is it gaining traction?

Unlike toy benchmarks, it clones real projects like PyTorch or Metabase, runs builds/tests via pytest, and automates Docker isolation for reproducible agent evals. Batch CLI tools handle all 49 tasks with resume/dry-run flags, plus scripts for pass-rate deltas and token analysis—ideal for quantifying if skills matter. Tasks cover practical areas like GitHub Actions templates, MCP server building, and turborepo setups.
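
To make the isolation pattern concrete, here is a hedged sketch of running a cloned repo's pytest suite inside a throwaway Docker container. The image, mount path, and repo checkout are all assumptions for illustration; the benchmark's own harness automates this setup itself.

```python
import subprocess

# Image name and mount path are assumptions; the benchmark's harness manages
# container setup on its own.
result = subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", "/path/to/cloned/repo:/workspace",  # hypothetical checkout
        "-w", "/workspace",
        "python:3.11",
        "sh", "-c", "pip install -q pytest && python -m pytest -q",
    ],
    capture_output=True,
    text=True,
)
print("tests passed" if result.returncode == 0 else result.stdout[-2000:])
```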

Who should use this?

AI researchers benchmarking coding agents, teams evaluating tools like Claude on SWE tasks, or devs tuning prompts for repos involving Python resilience, Spring Boot TDD, or Istio traffic management. Perfect for anyone needing empirical data on whether skill docs boost unit test pass rates in isolated environments.

Verdict

Grab it if you're deep in agent evals: the 49 tasks and reporting deliver real insights fast, though the repo's low star count signals early-stage maturity. Docs are solid via the README and CLI help; fork and extend for your models.


