mgechev

Unit tests for your agent skills

TypeScript
AI Summary

Skill Eval is a framework for testing AI agents on tasks by running them in isolated environments, scoring outcomes with checks and reviews, and providing performance metrics across multiple attempts.
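
To make that concrete, here is a minimal sketch of what a task definition could look like. The interface and field names below are hypothetical, invented for illustration; the actual skill-eval schema may differ.

```typescript
// Hypothetical task definition for illustration only --
// these types are NOT the actual skill-eval schema.
interface Grader {
  kind: "shell" | "llm-rubric"; // deterministic check vs. qualitative review
  weight: number;               // contribution to the final score
  spec: string;                 // shell command or rubric text
}

interface TaskDefinition {
  name: string;                 // task identifier, e.g. "superlint"
  instructions: string;         // what the agent is asked to do
  referenceSolution: string;    // path to a known-good solution
  graders: Grader[];            // checks and reviews that score the outcome
}

const superlint: TaskDefinition = {
  name: "superlint",
  instructions: "Fix all lint errors without changing program behavior.",
  referenceSolution: "patches/superlint.diff",
  graders: [
    { kind: "shell", weight: 0.7, spec: "pnpm lint" },
    { kind: "llm-rubric", weight: 0.3, spec: "Were the fixes minimal and idiomatic?" },
  ],
};
```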

How It Works

1. 📖 Discover Skill Eval: You hear about a simple way to test how well AI helpers solve real challenges.

2. 🛠️ Prepare your playground: You get everything ready on your computer so tests can run safely in their own space.

3. Pick your AI helper:
   - 🔮 Quick thinker (Gemini): use the fast AI that shines on everyday puzzles.
   - 🧠 Careful planner (Claude): use the thoughtful AI great at step-by-step plans.

4. 🎯 Select a challenge: You pick a task from the collection, like fixing code or following a workflow.

5. 🚀 Run the tests: You launch multiple tries and watch the AI tackle the challenge in a safe bubble.

6. 📊 Check the results: You see scores, success rates, and details on how well it did each time.

🏆 Know your AI's strengths: You now understand exactly what your AI excels at and where it can improve.

AI-Generated Review

What is skill-eval?

Skill-eval is a TypeScript framework for running unit tests on AI agent skills, treating tasks like self-contained coding challenges in Docker containers or local setups. Developers define tasks with instructions, reference solutions, and graders (mixing shell scripts for deterministic checks and LLM rubrics for qualitative scoring), then evaluate agents like Gemini CLI or Claude via a simple pnpm CLI: `pnpm run eval superlint --trials=10`. It produces test reports with pass rates, pass@k probabilities, normalized gain for measuring a skill's impact, and browser previews of results, solving the pain of unreliable agent benchmarks.
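
For context on those metrics: pass@k is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021), and normalized gain measures how much of the remaining headroom a skill closes relative to a no-skill baseline. A minimal sketch of both, assuming skill-eval follows these standard definitions (the repo may compute them differently):

```typescript
// Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
// where n = total trials and c = passing trials.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // fewer than k failures: every k-sample contains a pass
  let failAllK = 1;
  // C(n - c, k) / C(n, k) as a numerically stable running product.
  for (let i = n - c + 1; i <= n; i++) {
    failAllK *= 1 - k / i;
  }
  return 1 - failAllK;
}

// Normalized gain (Hake-style): fraction of remaining headroom the skill closes.
function normalizedGain(baseline: number, withSkill: number): number {
  if (baseline >= 1) return 0; // baseline already perfect, nothing to gain
  return (withSkill - baseline) / (1 - baseline);
}

// Example: 10 trials with 6 passes.
console.log(passAtK(10, 6, 1));        // 0.6
console.log(passAtK(10, 6, 3));        // ~0.967
console.log(normalizedGain(0.4, 0.8)); // ~0.667: skill closes two-thirds of the gap
```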

Why is it gaining traction?

Unlike ad-hoc prompts or vague evals, it enforces evaluation criteria with weighted graders, auto-injects co-located skills for native discovery, and delivers coverage-style test reports plus analytics on trial flakiness, an approach taken directly from Anthropic's evals playbook. The CLI handles parallel trials, env var redaction for security, and suite workflows, making reproducible agent testing as straightforward as ordinary unit test automation, with no custom scripts required. Low barrier to entry: Docker-ready tasks bootstrap in minutes.
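
To illustrate the weighted-grader idea, here is one plausible aggregation rule: each grader returns a score in [0, 1] and the trial score is the weight-normalized average. This rule is my assumption for illustration, not documented skill-eval behavior.

```typescript
// Hypothetical grader output; skill-eval's real types may differ.
interface GraderResult {
  weight: number; // relative importance of this grader
  score: number;  // grader output, normalized to [0, 1]
}

// Combine grader outputs into a single trial score via a weighted average.
function trialScore(results: GraderResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  return results.reduce((sum, r) => sum + r.weight * r.score, 0) / totalWeight;
}

// Example: a passing shell check (weight 0.7) plus a partial rubric score (weight 0.3).
console.log(trialScore([
  { weight: 0.7, score: 1.0 },
  { weight: 0.3, score: 0.5 },
])); // 0.85
```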

Who should use this?

AI engineers building agentic workflows for devops or compliance tasks, like superlint enforcement. Teams running agent checks in GitHub Actions who need evaluation suites to verify Gemini/Claude reliability before production. Researchers benchmarking agent improvements via normalized gain metrics.

Verdict

Grab it for quick agent smoke tests: excellent docs, CLI ergonomics, and bootstrap validation make it usable now, despite the 22 stars signaling early maturity. Scale up once more tasks land, and pair it with GitHub Actions workflows for CI.
