What is evalmonkey?
Evalmonkey is a Python framework for benchmarking and chaos-testing AI agents over HTTP endpoints. It lets you prove capabilities on 20 standard datasets such as GSM8K and MMLU, then inject failures like prompt injections or latency spikes. It tackles the non-determinism of agents built with CrewAI, LangChain, OpenAI, Bedrock, or AutoGen through a local CLI harness that requires no code changes, scores runs with an LLM judge, and tracks a production reliability metric over time. Run `evalmonkey init crewai`, tweak a YAML config, then `evalmonkey run-benchmark --scenario mmlu`; the harness starts your agent, evaluates it, and shuts it down automatically.
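To give a feel for the HTTP side of this setup, here is a minimal sketch of the kind of endpoint an agent might expose for a harness like this to call. The route, JSON field names, and port below are assumptions for illustration, not evalmonkey's documented contract; check the project's docs for the real schema.

```python
# Hypothetical agent endpoint for a CLI harness like evalmonkey to hit.
# Route name, JSON fields, and port are assumptions, not the documented contract.
from flask import Flask, request, jsonify

app = Flask(__name__)

def my_agent(question: str) -> str:
    # Placeholder: swap in your real agent (CrewAI crew, LangChain chain, etc.).
    return f"echo: {question}"

@app.post("/invoke")
def invoke():
    payload = request.get_json(force=True)
    question = payload.get("input", "")    # assumed request field name
    answer = my_agent(question)
    return jsonify({"output": answer})     # assumed response field name

if __name__ == "__main__":
    app.run(port=8000)  # assumed port
```

Because the harness only talks HTTP, the agent's internals stay untouched, which is what makes the "no code changes" claim plausible.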
Why is it gaining traction?
Unlike generic LLM eval tools such as promptfoo, it targets agent-specific chaos: client-side Unicode floods, server-side tool failures, adapters for Claude- and GitHub Copilot (VS Code) agent setups, and multi-agent benchmarking of interdependent tasks. Developers can hook it in via an MCP server for Cursor or Claude Desktop to auto-benchmark while coding, and custom CSV/JSON evals can pull from production logs so you benchmark against real traffic. The zero-setup HTTP contract and history CLI make it fast to iterate on agent tool use or memory.
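For the custom CSV/JSON evals, turning production logs into a dataset might look something like the sketch below. The JSONL log format and the column names ("input", "expected") are assumptions; adapt them to whatever schema evalmonkey's custom evals actually expect.

```python
# Sketch: convert production traces (one JSON object per line) into a CSV eval set.
# Field names in the logs and the CSV header are assumptions for illustration.
import csv
import json

def logs_to_csv(log_path: str, out_path: str) -> None:
    with open(log_path) as logs, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["input", "expected"])  # assumed header expected by the eval
        for line in logs:
            record = json.loads(line)
            writer.writerow([record["prompt"], record["response"]])

if __name__ == "__main__":
    logs_to_csv("prod_traces.jsonl", "custom_eval.csv")
```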
Who should use this?
Agent builders deploying CrewAI crews or LangChain chains to production, especially those stress-testing ReAct-style or other LLM agents behind HTTP endpoints. Teams evaluating GitHub Copilot (CLI or IntelliJ) agent integrations, or running custom agent-harness benchmarking studies for safety and resilience. Ideal for research agents handling HotpotQA-style multi-hop tasks or coding agents on HumanEval.
Verdict
Grab it if you're serious about agent reliability: the docs are thorough, the CLI is polished, and it ships under an Apache 2.0 license. With 17 stars and a 1.0% credibility score, though, treat it as alpha and test thoroughly before wiring it into production pipelines. Promising for GitHub-centric agent workflows, but watch stability as it matures.