datacurve-ai / deep-swe
Measuring frontier coding agents on original, long-horizon engineering tasks
DeepSWE is a benchmark for measuring how well AI coding assistants can solve real software engineering problems. It contains 113 tasks drawn from actual open-source projects (like FastAPI, Helm, Prometheus, and many others), each presenting a genuine feature request or bug fix. The benchmark runs in isolated containers, where an AI assistant receives a task description and must produce a working solution. Automated tests verify whether the solution is correct. Researchers and developers use this to compare different AI assistants and understand their strengths and weaknesses for programming tasks.
How It Works
You hear about DeepSWE, a benchmark that measures how well AI assistants can solve real programming problems from actual open-source projects.
You install Pier, a companion tool that runs the benchmark in isolated containers so the AI can work on tasks safely.
You choose which AI assistant to test (any of the popular ones, such as Claude, GPT, or Gemini) and provide access to it.
You pick the scope of the run: 10 random tasks for a fast evaluation that still gives meaningful insights, or all 113 tasks across different programming languages for comprehensive results.
The AI tackles each task: adding features, fixing bugs, and following instructions from actual GitHub issues in real repositories.
The benchmark automatically verifies each solution and shows you a score revealing how well the AI performed.
You now have clear, objective data showing which AI assistant handles real software engineering tasks better.
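The quick-run and scoring steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Pier's actual interface: the function names `pick_tasks` and `pass_rate` are hypothetical, tasks are identified by a made-up index from 0 to 112, and the score is assumed to be the plain fraction of tasks whose automated tests pass, which the description above does not confirm.

```python
import random

TOTAL_TASKS = 113  # task count stated in the description above


def pick_tasks(sample_size, seed=0):
    """Return a reproducible random subset of task indices.

    The 0..112 numbering is hypothetical; the real benchmark may
    identify tasks differently.
    """
    rng = random.Random(seed)  # fixed seed so the same subset can be rerun
    return sorted(rng.sample(range(TOTAL_TASKS), sample_size))


def pass_rate(results):
    """Fraction of tasks whose tests passed (assumed scoring rule).

    `results` maps a task index to True (tests passed) or False.
    """
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


quick = pick_tasks(10)
# Placeholder outcomes for illustration only; a real run would take
# these verdicts from the benchmark's automated tests.
placeholder_results = {task: (task % 2 == 0) for task in quick}
print(f"{len(quick)} tasks sampled, pass rate {pass_rate(placeholder_results):.0%}")
```

Fixing the seed means two assistants can be compared on the same 10 tasks. With a sample that small, the score from a quick run is likely a rough estimate of the full 113-task result, so small differences between assistants should be read with care.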