darkrishabh/agent-skills-eval

A test runner for agentskills.io-style AI agent skills

TypeScript
AI Summary

A testing tool for Agent Skills that runs AI model prompts with and without a skill, uses a judge model to score outputs, and generates benchmark reports.

How It Works

1
📰 Discover the eval tool

While building AI agent skills, you find this tester, which measures whether a skill actually improves model output.

2
📁 Prepare your skill folder

Put your SKILL.md skill guide and an evals.json of test prompts with expected results into a folder (a sketch follows after these steps).

3
🔗 Link your AI service

Pick a target model (such as gpt-4o-mini) and supply API credentials so the tool can run the evals.

4
▶️ Kick off the test

Point the tool at your folder and it runs each eval twice: once with your skill loaded, once without, as a baseline.

5
⏳ Wait for results

Sit back as a judge model reviews each output, grading pass or fail with clear reasons.

6
📊 Celebrate with your report

Open the static HTML report showing pass rates, side-by-side comparisons, and hard evidence of whether your skill helps.
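
Since the evals.json schema isn't documented on this page, here is a minimal sketch of what a skill folder and its eval file might look like; the field names are assumptions, while the CLI command is the one quoted in the review below.

```
skills/
  csv-trends/
    SKILL.md      # the skill guide that gets loaded into context
    evals.json    # test prompts with expected results
```

```json
{
  "evals": [
    {
      "prompt": "Summarize the monthly trend in sales.csv",
      "expected": "Identifies the upward Q3 trend and cites the data"
    }
  ]
}
```

A run over that folder then looks like:

```
npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline
```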

AI-Generated Review

What is agent-skills-eval?

agent-skills-eval is a TypeScript CLI and SDK for running agent skills evals against the agentskills.io standard. Point it at a folder of SKILL.md files with evals.json, fire up `npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline`, and it runs each eval twice (once loading the skill into context, once as a baseline), then uses a judge model to grade outputs against assertions and tool calls. You get static HTML reports, JSONL artifacts, and benchmarks proving whether your skill lifts performance: think Jest, but tuned for AI agent skills evaluation.
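
The SDK's actual API isn't documented on this page, so the import and option names below are hypothetical; treat this as a minimal sketch of how a pipeline hook could look, not the package's real interface.

```ts
// HYPOTHETICAL API: `runEvals` and its options are assumptions,
// not agent-skills-eval's documented interface.
import { runEvals } from "agent-skills-eval";

async function main() {
  const results = await runEvals({
    skillsDir: "./skills",   // folder of SKILL.md files with evals.json
    target: "gpt-4o-mini",   // model under test
    judge: "gpt-4o-mini",    // model grading the outputs
    baseline: true,          // also run each eval without the skill
  });

  // Surface the with/without comparison, e.g. to gate a CI step.
  for (const r of results) {
    console.log(`${r.skill}: ${r.passRate} with skill vs ${r.baselinePassRate} baseline`);
  }
}

main().catch(console.error);
```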

Why is it gaining traction?

It delivers empirical receipts on skill impact via with/without comparisons, pass/fail grading with evidence, and portable workspaces for diffing iterations: no vibes, just data. It is OpenAI-compatible out of the box (Groq, Anthropic, local servers), with custom providers and YAML configs for CI. It feels like testing GitHub workflows locally: the CLI spins up evals quickly, the SDK hooks into pipelines, and reports drop anywhere without extra infrastructure.
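
The review mentions YAML configs for CI and OpenAI-compatible endpoints; the keys below are assumptions about what such a config might contain, sketched only to show the shape, not the tool's documented schema.

```yaml
# Hypothetical config shape: every key name here is an assumption.
target:
  model: gpt-4o-mini
  baseUrl: https://api.groq.com/openai/v1   # any OpenAI-compatible server
judge:
  model: gpt-4o-mini
baseline: true    # run each eval with and without the skill
report:
  format: html    # static report, drops anywhere without infra
```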

Who should use this?

AI agent builders crafting SKILL.md files for tasks like CSV trend analysis or code review, who need agent skills evals before production. Prompt engineers at startups validating domain boosts on gpt-4o-mini. Teams wanting GitHub Actions-style testing for agent skills, with local runs mirroring production judges (a sketch of such a workflow follows).
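
For that GitHub Actions-style setup, a minimal workflow might look like the sketch below; the CLI invocation is the one quoted above, while the workflow file name and secret name are assumptions.

```yaml
# .github/workflows/skill-evals.yml (hypothetical file name)
name: skill-evals
on: [push]
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Command quoted in the review above; the secret name is assumed.
      - run: npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```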

Verdict

Try it for agent skills evaluation if agentskills.io is your stack; the CLI's simplicity and the reports nail the workflow. At 102 stars the project is nascent, but docs, examples, and built-in CI coverage make it production-ready enough for early adopters.
