chu2bard

Evaluation framework for AI coding agents

Found Feb 11, 2026 at 13 stars.
AI Summary

AgentBench is an evaluation tool for testing AI coding agents on custom tasks, measuring their performance through execution tests or output similarity, and generating easy-to-read reports.

How It Works

1. 🔍 Discover AgentBench

You hear about a handy tool that lets you fairly test different AI helpers on coding challenges.

2. 📥 Set up the tool

You quickly add this testing kit to your setup so it's ready to use.

3. 📝 Build your test challenges

You create a collection of simple coding problems, including hints on what success looks like and ways to check answers.

4. 🤖 Link your AI helper

You connect one of your AI coding assistants, so it can take on the challenges.

5. ▶️ Launch the evaluation

You hit go, and the tool runs your AI through every challenge, tracking timing and token usage.

6. 📊 See the detailed report

A clear summary appears, showing pass rates, average scores, speeds, and breakdowns for each task.

7. 🏆 Pick the winner

You now clearly see which AI helper shines brightest and best fits your needs.
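The steps above can be sketched end to end in plain Python. This is a minimal, self-contained illustration of the idea (define tasks with a test, plug in an agent, run, and score by execution); the task format, `toy_agent`, and `run_benchmark` are hypothetical stand-ins, not AgentBench's actual API:

```python
import time

# Hypothetical task format: a prompt plus test code that defines success.
tasks = [
    {
        "id": "add",
        "prompt": "Write a function add(a, b) that returns a + b.",
        "test": "assert add(2, 3) == 5",
    },
]

def toy_agent(prompt: str) -> str:
    # Stand-in for a real LLM call; returns generated code as text.
    return "def add(a, b):\n    return a + b"

def run_benchmark(agent, tasks):
    results = []
    for task in tasks:
        start = time.perf_counter()
        code = agent(task["prompt"])
        namespace = {}
        try:
            exec(code, namespace)          # run the agent's code
            exec(task["test"], namespace)  # then the task's own test
            passed = True
        except Exception:
            passed = False
        results.append({
            "id": task["id"],
            "passed": passed,
            "seconds": time.perf_counter() - start,
        })
    return results

results = run_benchmark(toy_agent, tasks)
print(results[0]["passed"])  # True for this toy agent
```

From a result list like this, pass rates and average timings are simple aggregations, which is essentially what the report step summarizes.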


AI-Generated Review

What is agentbench?

AgentBench is a Python evaluation framework for testing AI coding agents on custom benchmarks. You define tasks with prompts and test code, plug in your agent's prompt-response function (it works with OpenAI or any async LLM caller), and it runs evals with execution scoring (actually executing agent-generated code against tests) or text similarity. It outputs crisp console tables or JSON reports with pass rates, average scores, timings, and token usage, making it a lightweight way to evaluate LLMs as coding agents.
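The two scoring modes described above (executing generated code against tests, or falling back to text similarity) can be sketched as follows; `execution_score` and `similarity_score` are illustrative names, not AgentBench's real functions:

```python
import difflib

def execution_score(code: str, test: str) -> float:
    """1.0 if the generated code passes the task's test, else 0.0."""
    ns = {}
    try:
        exec(code, ns)   # define whatever the agent wrote
        exec(test, ns)   # run the task's assertions against it
        return 1.0
    except Exception:
        return 0.0

def similarity_score(output: str, reference: str) -> float:
    """Fallback scorer: text similarity in [0, 1] via difflib."""
    return difflib.SequenceMatcher(None, output, reference).ratio()

print(execution_score("def f(x): return x * 2", "assert f(4) == 8"))  # 1.0
print(execution_score("def f(x): return x", "assert f(4) == 8"))      # 0.0
```

Execution scoring is stricter and catches code that merely looks right; similarity scoring is useful when the expected output is prose rather than runnable code.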

Why is it gaining traction?

It stands out with dead-simple setup: pip install, then `run_sync(your_agent, benchmark)` for instant results, no boilerplate. Async concurrency, timeouts, and dual scorers (execution for real code checks, similarity as a fallback) beat clunky alternatives like manual scripting or bloated LLM evals. Developers like the dataset flexibility (load tasks from JSON files, directories, or inline lists) for quick custom experiments in the style of AgentBench: Evaluating LLMs as Agents (ICLR 2024).
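A loader with the flexibility described (inline lists, a single JSON file, or a directory of JSON files) might look like the sketch below; `load_tasks` is a hypothetical helper, not AgentBench's documented API:

```python
import json
from pathlib import Path

def load_tasks(source):
    """Accept an inline list of task dicts, a JSON file path,
    or a directory containing *.json task files."""
    if isinstance(source, list):
        return source
    path = Path(source)
    if path.is_dir():
        tasks = []
        for f in sorted(path.glob("*.json")):
            tasks.extend(json.loads(f.read_text()))
        return tasks
    return json.loads(path.read_text())

inline = [{"id": "t1", "prompt": "Say hi.", "test": "assert True"}]
print(len(load_tasks(inline)))  # 1
```

Normalizing all three sources into one list keeps the runner itself oblivious to where tasks came from, which is what makes quick custom experiments cheap.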

Who should use this?

AI engineers benchmarking code-gen LLMs on proprietary tasks, such as internal code agents or RAG pipelines. Researchers iterating on OS-style agent evals or comparing models via execution tests. Teams that need an evaluation framework for LLMs without the overhead of the full AgentBench HuggingFace stack.

Verdict

Grab it for prototypes: MIT-licensed and installs cleanly on Python 3.10+, but at 12 stars and a 1.0% credibility score it's a raw v0.1.0 (TODOs signal immaturity, sparse docs). A solid starter for evaluating LLMs as coding agents; fork and harden if it sticks.


