boheling / skillbench

Public

Adversarial, deterministic benchmarking for AI agent skills — measures whether a skill makes the agent measurably better. Includes biomedical eval cases.

13
2
69% credibility
Found May 30, 2026 at 13 stars.
AI Analysis
Python
AI Summary

SkillBench is a testing tool that measures whether an AI skill actually helps an assistant perform better. It compares two versions of an AI solving the same tasks - one with the skill loaded, one without - and uses automatic rules to grade their outputs. The tool also scans for security problems like hardcoded secrets or dangerous commands. It produces clear reports with scores across four areas: correctness, security, completeness, and whether the skill meaningfully improved results over a baseline. The project includes pre-built test cases for biomedical work like BLAST searches, protein structure analysis, drug interaction checks, and genetic variant interpretation.
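
The security scan described above can be pictured as a line-by-line pattern match over the skill's files. The following is a minimal sketch under stated assumptions, not SkillBench's actual rule set: the `RISK_PATTERNS` table, its three regular expressions, and the `scan_text` helper are all hypothetical names invented for illustration.

```python
import re

# Hypothetical patterns for illustration only; the real rules are not shown on this page.
RISK_PATTERNS = {
    # e.g. api_key = "sk-..." or password: 'hunter2hunter2'
    "hardcoded_secret": re.compile(
        r"(?i)\b(?:api[_-]?key|password|secret|token)\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
    # e.g. rm -rf / or piping a download straight into a shell
    "dangerous_command": re.compile(r"\brm\s+-rf\s+/|\bcurl\b[^\n|]*\|\s*(?:ba)?sh\b"),
    # e.g. curl uploading local data to a remote URL
    "data_exfiltration": re.compile(r"\bcurl\b[^\n]*\s(?:-d|--data|-F)\b[^\n]*https?://"),
}


def scan_text(text):
    """Return (line_number, rule_name) pairs for every risky pattern found."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in RISK_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


if __name__ == "__main__":
    sample = 'Use the API.\napi_key = "sk-1234567890abcdef"\nrun: rm -rf / --no-preserve-root\n'
    for lineno, rule in scan_text(sample):
        print(f"line {lineno}: {rule}")
```

Because the rules are plain regular expressions, the same input always yields the same findings, which fits the deterministic approach the summary describes.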

How It Works

1
🔬 You have a skill you built for an AI assistant

You created a SKILL.md file with instructions for tasks like analyzing genes, finding protein structures, or checking drug interactions.

2
🛡️ You scan for hidden security problems

SkillBench checks your skill files for risky patterns like hardcoded passwords, dangerous commands, or attempts to send your data somewhere else.

3
⚖️ You set up a head-to-head comparison

SkillBench creates two parallel test runs: one AI gets your skill loaded, the other solves the same tasks without any guidance.

4
🎯 Both versions work through your test cases

Each AI tackles the same challenges - retrieving protein structures, interpreting genetic variants, checking drug safety - and saves its results.

5
How do you want to measure success?

📊 Quick scan

Run a security-only check if you just want to find hidden vulnerabilities

🏆 Full benchmark

Run a complete comparison if you want to know if your skill actually helps

6
📈 You see the scores and what needs work

A clean report shows you four scores: correctness, security, completeness, and whether having your skill actually beat the baseline.

You know if your skill is worth keeping

A score of 75 or higher means your skill is recommended, 50-74 means it is acceptable, and below 50 means it needs improvement before you rely on it.
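
The score bands in the final step map directly onto a small lookup. This is a minimal sketch that takes an overall score as given, since the page does not say how the four area scores are combined. The `verdict` function name is hypothetical, and treating fractional scores between 74 and 75 as "acceptable" is an assumption.

```python
def verdict(score: float) -> str:
    """Map an overall 0-100 score to the bands described above."""
    if score >= 75:
        return "recommended"
    if score >= 50:
        # Assumption: anything from 50 up to (but not including) 75 is acceptable.
        return "acceptable"
    return "needs improvement"


if __name__ == "__main__":
    for s in (82, 74, 50, 31):
        print(s, verdict(s))
```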

AI-Generated Review

What is skillbench?

Skillbench is an adversarial benchmarking tool for AI agent skills. You give it a skill (like a BLAST search helper or drug interaction checker), and it runs two parallel agents on the same tasks -- one with the skill loaded, one without. It then scores whether the skill actually added value using deterministic assertions, not LLM judges. The tool is built in Python and ships as a CLI with commands for scanning, running benchmarks, grading results, and generating HTML reports. Pre-built eval cases cover biomedical tasks like protein folding queries, VCF interpretation, and DESeq2 workflows.
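
The with-skill versus no-skill comparison can be sketched as two pass rates over the same cases and the difference between them. This illustrates the idea only and is not the project's API: `EvalCase`, `pass_rate`, `uplift`, and the stub agents below are hypothetical, and real agents would call a model rather than return canned strings.

```python
from dataclasses import dataclass
from typing import Callable, List

Agent = Callable[[str], str]  # hypothetical: takes a prompt, returns the agent's output


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # deterministic pass/fail rule for one output


def pass_rate(agent: Agent, cases: List[EvalCase]) -> float:
    """Fraction of cases whose output satisfies the case's check."""
    if not cases:
        return 0.0
    return sum(case.check(agent(case.prompt)) for case in cases) / len(cases)


def uplift(with_skill: Agent, baseline: Agent, cases: List[EvalCase]) -> float:
    """Positive when the skill-loaded agent passes more cases than the baseline."""
    return pass_rate(with_skill, cases) - pass_rate(baseline, cases)


if __name__ == "__main__":
    cases = [
        EvalCase("Which tool aligns sequences?", lambda out: "BLAST" in out),
        EvalCase("Which file format stores variants?", lambda out: "VCF" in out),
    ]
    # Stub agents standing in for real model runs.
    skilled = lambda p: "Use BLAST." if "sequences" in p else "Use a VCF file."
    plain = lambda p: "Use BLAST." if "sequences" in p else "I am not sure."
    print(uplift(skilled, plain, cases))
```

Running both agents on identical cases is what lets the difference be attributed to the skill rather than to the task mix.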

Why is it gaining traction?

The key insight is that LLMs hallucinate when grading their own work. Skillbench sidesteps this by using deterministic checks -- string matching, JSON validation, word counts -- to score skill quality. The scoring rubric breaks down into correctness, security, completeness, and improvement over the no-skill baseline, with clear thresholds: 75+ is recommended, 50-74 is acceptable, and below 50 needs work. There's no vibes-based assessment here, which appeals to teams tired of subjective eval frameworks. The focus on reproducibility and the comparison against a no-skill baseline make the results meaningful rather than vanity metrics.
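
The three kinds of deterministic check named above (string matching, JSON validation, word counts) can each be written as a pure function of the agent's output. This is a minimal sketch; the function names and parameters are hypothetical and do not come from the project.

```python
import json


def contains_all(output, required):
    """String matching: every required term appears, ignoring case."""
    lowered = output.lower()
    return all(term.lower() in lowered for term in required)


def is_valid_json(output, required_keys=()):
    """JSON validation: output parses to an object holding the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)


def word_count_between(output, low, high):
    """Word count: whitespace-separated word count lies in [low, high]."""
    return low <= len(output.split()) <= high


if __name__ == "__main__":
    answer = '{"gene": "BRCA1", "classification": "pathogenic"}'
    print(is_valid_json(answer, ["gene", "classification"]))
    print(contains_all("Warfarin interacts with aspirin", ["warfarin", "aspirin"]))
    print(word_count_between("one two three", 2, 5))
```

None of these functions calls a model, so repeated grading of the same output always returns the same result.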

Who should use this?

AI developers building skill repositories for agent toolkits should use this to validate whether their skills hold up under pressure. ML teams evaluating third-party skills can use it as a standardized audit before deployment. Bioinformatics groups integrating AI agents into research workflows can leverage the pre-built biomedical eval cases to benchmark domain-specific skills. If you're shipping agent skills in clinical, genomics, or drug discovery contexts, this gives you defensible quality numbers.

Verdict

Skillbench is a principled approach to a real problem, but it carries the weight of early-stage software. The credibility score sits at 69%, and with only 13 stars, community validation is minimal. Documentation is present but sparse, and test coverage is unclear from what's visible. The biomedical eval libraries are a practical head start, and the deterministic grading philosophy is sound. Use it for internal benchmarking where you control the scope, but treat it as a foundation to build on rather than a production-grade evaluation system. Keep expectations calibrated until the project gains traction.
