SeraphimSerapis / tool-eval-bench
Tool-calling quality benchmark for LLM serving stacks. 65+ deterministic scenarios testing multi-turn orchestration, safety boundaries, and structured output. Supports vLLM, LiteLLM, and llama.cpp.
tool-eval-bench measures how reliably a model selects and invokes tools across multi-turn conversations, running the same deterministic scenario suite against different serving backends such as vLLM, LiteLLM, and llama.cpp.
How It Works
The benchmark checks whether a model can pick the right tool (for example a weather lookup or an email sender) and call it correctly.
Install it with a single command on a machine that can reach your serving stack.
Point it at your model by supplying the server's URL so the benchmark can send it requests.
Run a quick smoke test: the harness sends a few scenarios to the model and checks whether it picks the right tools (see the request sketch below).
Review the report: each run is graded (for example excellent or good) with a breakdown of what passed and what failed.
Run the full suite of 65+ deterministic scenarios for a thorough measure of tool-calling quality.
Response latency is measured alongside tool accuracy, so speed and correctness can be weighed together (see the scoring sketch below).
Save results, compare different models, and build a leaderboard to pick the best one (see the comparison sketch below).
The result is a clear picture of how well your model handles tools before it faces real conversations.
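The core check in a tool-calling scenario can be pictured like the sketch below: send the scenario's prompt together with tool schemas to the model's endpoint and compare the tool call it returns against the expected one. This is a minimal illustration, not tool-eval-bench's actual code; the `get_weather` tool, the scenario fields, the model name, and the localhost URL are assumptions, and it presumes the backend exposes an OpenAI-compatible API (vLLM, LiteLLM, and llama.cpp's server all can).

```python
# Minimal sketch of one tool-selection check against an OpenAI-compatible
# endpoint. The tool schema, scenario fields, model name, and URL are
# illustrative assumptions, not tool-eval-bench's real scenario format.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

scenario = {
    "prompt": "What's the weather like in Berlin right now?",
    "expected_tool": "get_weather",
    "expected_args": {"city": "Berlin"},
}

resp = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": scenario["prompt"]}],
    tools=tools,
)

# Inspect the first tool call (if any) and compare it to the expectation.
msg = resp.choices[0].message
call = (msg.tool_calls or [None])[0]
picked = call.function.name if call else None
args = json.loads(call.function.arguments) if call else {}

print("tool ok:", picked == scenario["expected_tool"])
print("args ok:", args == scenario["expected_args"])
```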
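Grading a run amounts to aggregating per-scenario pass/fail results and timing into a few summary numbers. The sketch below shows one plausible way to do that; the field names, thresholds, and grade labels are assumptions, not the benchmark's actual rubric.

```python
# Hypothetical scoring pass: aggregate per-scenario results into an
# accuracy figure, mean latency, and a coarse grade. Field names and
# grade thresholds are assumptions for illustration only.
from dataclasses import dataclass
from statistics import mean


@dataclass
class ScenarioResult:
    scenario_id: str
    tool_correct: bool   # did the model pick the expected tool?
    args_valid: bool     # did the arguments parse and match the schema?
    latency_s: float     # wall-clock time for the model's response


def score_run(results: list[ScenarioResult]) -> dict:
    passed = [r for r in results if r.tool_correct and r.args_valid]
    accuracy = len(passed) / len(results)
    latency = mean(r.latency_s for r in results)
    if accuracy >= 0.9:
        grade = "excellent"
    elif accuracy >= 0.75:
        grade = "good"
    else:
        grade = "needs work"
    return {"accuracy": accuracy, "mean_latency_s": latency, "grade": grade}


print(score_run([
    ScenarioResult("weather-basic", True, True, 0.42),
    ScenarioResult("email-multiturn", True, False, 1.10),
]))
```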
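Once each run's summary is saved to disk, a leaderboard is just a matter of loading the summaries and sorting them. The results/ directory layout and JSON keys below are assumptions; tool-eval-bench may persist results differently.

```python
# Hypothetical leaderboard: load saved run summaries (one JSON file per
# model/backend) and rank them by tool-calling accuracy, using latency
# as a tiebreaker. File layout and keys are illustrative assumptions.
import json
from pathlib import Path

runs = []
for path in Path("results").glob("*.json"):
    summary = json.loads(path.read_text())
    runs.append((summary["model"], summary["accuracy"], summary["mean_latency_s"]))

# Highest accuracy first; lower latency breaks ties.
runs.sort(key=lambda r: (-r[1], r[2]))

print(f"{'model':30} {'accuracy':>8} {'latency':>8}")
for model, acc, lat in runs:
    print(f"{model:30} {acc:8.1%} {lat:7.2f}s")
```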