datacurve-ai

Measuring frontier coding agents on original, long-horizon engineering tasks

224
6
85% credibility
Found May 27, 2026 at 224 stars.
AI Analysis
Shell
AI Summary

DeepSWE is a benchmark for measuring how well AI coding assistants can solve real software engineering problems. It contains 113 tasks drawn from actual open-source projects (like FastAPI, Helm, Prometheus, and many others), each presenting a genuine feature request or bug fix. The benchmark runs in isolated containers, where an AI assistant receives a task description and must produce a working solution. Automated tests verify whether the solution is correct. Researchers and developers use this to compare different AI assistants and understand their strengths and weaknesses for programming tasks.

How It Works

1. 🔬 You discover a way to test AI coding skills

You hear about DeepSWE, a benchmark that measures how well AI assistants can solve real programming problems from actual open-source projects.

2. 🛠️ You install the testing tool

You install Pier, a companion tool that runs the benchmark in isolated containers so the AI can work on tasks safely.

3. 🤖 You connect your favorite AI assistant

You choose which AI assistant to test (any of the popular ones, such as Claude, GPT, or Gemini) and provide access to it.

4. You pick what to test

🎯 Run a quick test: pick 10 random tasks for a fast evaluation that still gives meaningful insights.

🚀 Run the full benchmark: test all 113 tasks across different programming languages for comprehensive results.

5. Watch the AI work on real problems

The AI tackles each task: adding features, fixing bugs, and following instructions from actual GitHub issues in real repositories.

6. 📊 Get your results

The benchmark automatically verifies each solution and shows you a score revealing how well the AI performed.

🏆 You know which AI is best for coding

You now have clear, objective data showing which AI assistant handles real software engineering tasks better.
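
The quick-versus-full choice in step 4 and the score in step 6 can be sketched in a few lines of Python. This is a minimal illustration, not DeepSWE's or Pier's actual code: the task IDs and the helpers `pick_tasks` and `pass_rate` are hypothetical, and only the counts (10 and 113) come from the text above.

```python
import random

TOTAL_TASKS = 113  # task count stated in the summary above


def pick_tasks(task_ids, k=10, seed=0):
    """Return k distinct task IDs; the same seed always gives the same subset."""
    rng = random.Random(seed)
    return sorted(rng.sample(task_ids, k))


def pass_rate(results):
    """Fraction of tasks whose verifier passed; results maps task ID -> bool."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical IDs for illustration; the real benchmark uses its own task names.
all_tasks = [f"task-{i:03d}" for i in range(1, TOTAL_TASKS + 1)]
quick = pick_tasks(all_tasks, k=10, seed=42)   # quick test: 10 random tasks
full = all_tasks                               # full benchmark: all 113 tasks
```

Fixing the seed matters when comparing assistants on a quick run: both should see the same 10 tasks, otherwise differences in score may only reflect differences in task difficulty.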

AI-Generated Review

What is deep-swe?

DeepSWE is a benchmark for evaluating frontier coding agents on real software engineering tasks. Think of it as a standardized test suite, but instead of grading students, it grades AI coding assistants. The benchmark contains 113 tasks spanning TypeScript, Go, Python, JavaScript, and Rust, pulled from active open-source repositories. Each task presents agents with multi-step feature requests or bug fixes that require sustained reasoning across long codebases. Tasks run in isolated Docker environments with program-based verifiers that check observable behavior, not just code structure. You run it through Pier, a companion tool that handles sandboxed execution and supports agents like Claude Code, Codex, and Gemini CLI.
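
To make "check observable behavior, not just code structure" concrete, here is a minimal sketch of a behavior-based check. It is an assumption about the general idea, not DeepSWE's verifier: the `verify` helper and the one-line stand-in "solution" are hypothetical. The check runs a command and judges only its exit code and output, never the source text.

```python
import subprocess
import sys


def verify(command, expected_stdout):
    """Run a check command and judge it by exit code and printed output only."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return proc.returncode == 0 and proc.stdout.strip() == expected_stdout


# Stand-in for a candidate solution: any program that prints the right answer
# passes, regardless of how it is written internally.
check = [sys.executable, "-c", "print(sum(range(5)))"]
passed = verify(check, "10")
```

Two solutions with completely different code pass the same check as long as they behave the same, which is the property the review attributes to program-based verifiers.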

Why is it gaining traction?

Existing benchmarks like HumanEval focus on short, isolated functions. DeepSWE measures what matters for real engineering work: agents navigating complex multi-file changes over extended time horizons. The benchmark draws from actual GitHub issues and PRs, so tasks reflect genuine complexity rather than synthetic puzzles. Each task has a 90-minute timeout and generous resource limits, giving agents room to explore. The Harbor task format it uses is becoming a standard in the agent evaluation space, which means if you're building or comparing coding agents, this framework slots into existing tooling.
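
The 90-minute per-task limit can be pictured as a wall-clock budget around each run. The sketch below is an illustration under assumptions, not how Pier enforces the limit: `run_with_budget` and its status strings are hypothetical, and only the 90-minute figure comes from the text.

```python
import subprocess
import sys

TASK_TIMEOUT_S = 90 * 60  # 90-minute per-task limit mentioned above


def run_with_budget(command, timeout_s=TASK_TIMEOUT_S):
    """Return 'completed', 'failed', or 'timeout' for a single run."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return "timeout"
    return "completed" if proc.returncode == 0 else "failed"


# Demonstration with a deliberately tiny budget instead of the real 90 minutes.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
status = run_with_budget(slow, timeout_s=0.5)
```

Reporting timeouts separately from failures is a design choice worth keeping: an agent that runs out of time on long tasks has a different weakness from one that finishes quickly with a wrong answer.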

Who should use this?

AI/ML teams building or refining coding agents will find the most value here. If you're benchmarking your agent against competitors or tracking regression across model versions, DeepSWE provides a rigorous, reproducible test environment. Researchers studying AI capabilities in software engineering will appreciate the curated task diversity across languages and domains. Individual developers curious about where frontier agents actually struggle will enjoy browsing the task set to understand agent limitations. It's not a tool for writing code faster; it's a tool for measuring whether your agent is getting better at writing code.
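
Tracking regression across model versions amounts to comparing per-task results from two runs. The sketch below assumes each run is available as a mapping from task ID to pass/fail; that format, the task names, and the `regressions` helper are hypothetical, not something the benchmark is known to emit.

```python
def regressions(old, new):
    """Tasks that passed in the old run but fail, or are missing, in the new run."""
    return sorted(task for task, ok in old.items() if ok and not new.get(task, False))


def pass_rate(results):
    """Fraction of tasks that passed; results maps task ID -> bool."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical results for two versions of the same agent.
run_v1 = {"task-a": True, "task-b": True, "task-c": False, "task-d": False}
run_v2 = {"task-a": True, "task-b": False, "task-c": True, "task-d": False}

lost = regressions(run_v1, run_v2)               # ["task-b"]
delta = pass_rate(run_v2) - pass_rate(run_v1)    # 0.0: same score, different tasks
```

The example shows why a headline score alone can hide a regression: both versions pass half the tasks, yet one previously solved task is now failing.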

Verdict

DeepSWE fills a real gap in agent evaluation by providing long-horizon, multi-file tasks grounded in production codebases. With a credibility score of 85% and 224 stars, it's a credible but still early-stage project from datacurve-ai. The benchmark is well-structured and the Pier integration is straightforward, but documentation is minimal and the community is small. If you're serious about measuring coding agent performance, this is worth exploring. If you just want to ship features, move along.
