Siddharth-1001 / agent-eval-harness

An open-source evaluation framework specifically for agentic systems — not just LLM outputs, but full agent behavior.

Found Apr 01, 2026 at 14 stars.
Language: Python
AI Summary

An open-source evaluation tool that traces AI agent runs to measure tool success, latency, cost, and hallucinations across popular frameworks.
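To make those metrics concrete, here is a minimal sketch of the per-run numbers a trace can yield. The event fields (`tool`, `ok`, `latency_ms`, `cost_usd`) are hypothetical placeholders, not the tool's actual trace schema.

```python
# Hypothetical trace events; the real agent-eval-harness schema may differ.
trace = [
    {"tool": "search",     "ok": True,  "latency_ms": 412, "cost_usd": 0.0021},
    {"tool": "search",     "ok": False, "latency_ms": 380, "cost_usd": 0.0019},
    {"tool": "calculator", "ok": True,  "latency_ms": 35,  "cost_usd": 0.0},
]

# The three headline numbers: tool success rate, total latency, total cost.
tool_success_rate = sum(e["ok"] for e in trace) / len(trace)
total_latency_ms = sum(e["latency_ms"] for e in trace)
total_cost_usd = sum(e["cost_usd"] for e in trace)

print(f"tool success: {tool_success_rate:.0%}")  # 67%
print(f"latency:      {total_latency_ms} ms")    # 827 ms
print(f"cost:         ${total_cost_usd:.4f}")    # $0.0040
```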

How It Works

1
🔍 Find the Helper

You discover a free tool that makes it easy to test and track how well your smart AI assistants handle real tasks.

2
📦 Get It Ready

You download the tool and set it up on your computer super quickly, no hassle.

3
🔗 Link to Your AI

You simply wrap your AI assistant with the tracker so it starts recording every action automatically (see the sketch after this list for the general idea).

4
▶️ Test Your AI

You give your AI some jobs to do, and it captures details like what tools it uses and how fast it goes.

5
📋 See the List

You open a simple list of all your test runs, showing success rates, speeds, and costs at a glance.

6
Dive Into Details

📊 Compare Side-by-Side: view tables that highlight differences between tests to spot improvements.

🖥️ Launch Dashboard: open a friendly screen with charts to analyze performance deeply.

🚀 Build Better AI

You now clearly see strengths and weaknesses, so your AI gets smarter, faster, and cheaper to run.
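Step 3's "wrap your AI assistant" boils down to intercepting every tool call so its name, duration, and outcome get recorded. Here is a minimal sketch of that wrapping pattern, using a hypothetical decorator and an in-memory list rather than agent-eval-harness's real API:

```python
import time
from functools import wraps

trace = []  # the real tool persists structured traces; a plain list stands in here

def traced(tool_fn):
    """Hypothetical wrapper: record name, duration, and outcome of each tool call."""
    @wraps(tool_fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        ok = False
        try:
            result = tool_fn(*args, **kwargs)
            ok = True
            return result
        finally:
            trace.append({
                "tool": tool_fn.__name__,
                "ok": ok,
                "latency_ms": round((time.perf_counter() - start) * 1000, 1),
            })
    return wrapper

@traced
def lookup_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real tool call

lookup_weather("Berlin")
print(trace)  # one event per call, ready to roll up into run-level metrics
```

Rolling these events up per run is what produces the success-rate, latency, and cost figures shown in the list view.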

AI-Generated Review

What is agent-eval-harness?

Agent-eval-harness is a Python-based open source evaluation framework for agentic AI systems, tracing full agent behavior like tool calls, turns, latency, costs, and hallucinations—not just LLM outputs. It surfaces subtle production failures, such as hallucinated tool arguments or hidden regressions during model updates, by capturing structured traces you can query via CLI commands like `agent-eval list`, `show`, `compare`, or spin up a local dashboard. Pip-install it, wrap your agents, and get metrics in minutes without vendor platforms.
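For a feel of what a run comparison (the `compare` command mentioned above) surfaces, here is a hand-rolled sketch over two hypothetical run summaries; the field names and values are illustrative only, not the harness's actual output format.

```python
# Hypothetical per-run summaries; field names are illustrative, not the real output.
run_a = {"tool_success": 0.92, "latency_s": 14.1, "cost_usd": 0.031}  # baseline model
run_b = {"tool_success": 0.85, "latency_s": 11.8, "cost_usd": 0.024}  # candidate model

# Print a side-by-side diff, the kind of view a compare command gives you.
print(f"{'metric':<14}{'run A':>10}{'run B':>10}{'delta':>10}")
for metric in run_a:
    delta = run_b[metric] - run_a[metric]
    print(f"{metric:<14}{run_a[metric]:>10.3f}{run_b[metric]:>10.3f}{delta:>+10.3f}")
```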

Why is it gaining traction?

Adapters for LangGraph, CrewAI, Anthropic, OpenAI Agents SDK, and PydanticAI provide zero-friction tracing, standing out from generic open source AI evaluation tools by focusing on agent-specific metrics like tool success rates and LLM-judged hallucinations. The hook: quickstart examples that run without API keys, side-by-side run diffs, and schema-based checks, all of which developers notice immediately in CI or local dev. No lock-in, just local JSON traces and a FastAPI dashboard.
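The "schema-based checks" mentioned above amount to validating the arguments a model passes to a tool against the tool's declared schema, which is how hallucinated arguments get caught. A generic illustration of that idea with the `jsonschema` package, not the harness's own implementation:

```python
from jsonschema import Draft202012Validator  # pip install jsonschema

# Schema for the arguments the (hypothetical) weather tool actually accepts.
weather_args_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city"],
    "additionalProperties": False,
}

# Arguments the model claimed to call the tool with (note the invented field and unit).
model_call_args = {"city": "Berlin", "unit": "kelvin", "zip_code": "10115"}

validator = Draft202012Validator(weather_args_schema)
for err in validator.iter_errors(model_call_args):
    print(f"invalid tool argument: {err.message}")
```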

Who should use this?

AI engineers building tool-using agents in LangGraph or CrewAI who need to debug why Claude calls tools incorrectly or why GPT-4o latency spikes. Teams benchmarking agents across models to catch cost regressions before deployment. Skip it if you're only evaluating plain prompts.

Verdict

Worth adding to your agent eval harness toolkit now—v0.1.0 has crisp docs, CLI polish, and 85% test coverage despite 14 stars and 1.0% credibility. Early but production-ready for experiments; contribute adapters to accelerate it.
