Bent-Solutions

Local benchmarking UI for LLMs and AI agents

JavaScript · 12 stars · 0 forks · 100% credibility
AI Summary

Hermes Bench is a self-hosted web application for benchmarking local large language models and AI agents with customizable tasks, automated judging, and result comparisons.
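
To make "customizable tasks with automated judging" concrete, a task entry in a harness like this might look roughly like the sketch below. The field names are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical task definition -- hermes-bench's real schema is not shown
# on this page, so every field name here is an assumption.
task = {
    "id": "python-fizzbuzz",
    "category": "coding",
    "prompt": "Write a Python function fizzbuzz(n) that returns the "
              "usual FizzBuzz string for n.",
    # Automated pass/fail check: assumed to be a snippet the harness
    # executes against code extracted from the model's answer.
    "check": "assert fizzbuzz(15) == 'FizzBuzz' and fizzbuzz(7) == '7'",
    # Optional rubric for the LLM-as-judge scoring pass.
    "judge_rubric": "Score 1-10 for correctness, readability, and edge cases.",
}
```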

How It Works

1
🔍 Discover Hermes Bench

You find a tool to easily test how well your local AI models handle real tasks like coding, searching, and reasoning.

2
🚀 Launch with one click

Run a simple starter script and open your web browser to see the friendly dashboard ready to go.

3
🖥️ Spot your models and setup

The app automatically finds your AI models, checks your computer's power, and lets you start test servers if needed.
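
Under the hood, model discovery and hardware checks in tools like this typically amount to globbing for model files and shelling out to nvidia-smi. The sketch below shows that general pattern; the ~/models path and function names are illustrative assumptions, not hermes-bench's actual code.

```python
# Sketch of local model scanning and best-effort GPU detection.
import shutil
import subprocess
from pathlib import Path

def scan_models(models_dir: str = "~/models") -> list[Path]:
    """Find GGUF model files (llama.cpp's format) under a local directory."""
    return sorted(Path(models_dir).expanduser().rglob("*.gguf"))

def detect_gpu() -> str:
    """Query nvidia-smi if present; fall back to a CPU-mode message."""
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        if out.returncode == 0 and out.stdout.strip():
            return out.stdout.strip()
    return "no NVIDIA GPU detected (CPU mode)"

print(detect_gpu())
for model_path in scan_models():
    print(model_path.name)
```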

4
Pick models and run a benchmark

Choose which AI brains to test, select a ready-made challenge set, and hit start to watch them tackle tasks live.
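
For a rough idea of what a single run involves: llama.cpp's built-in server exposes an OpenAI-compatible chat endpoint, so a harness can send a task and time the response. The port, model name, and prompt below are assumptions for illustration.

```python
import time
import requests  # third-party: pip install requests

# llama.cpp's server speaks the OpenAI-compatible chat API; the port and
# model name are assumed here.
LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"

def run_task(prompt: str) -> tuple[str, float]:
    """Send one benchmark prompt and return (answer, latency in seconds)."""
    start = time.perf_counter()
    resp = requests.post(LLAMA_SERVER, json={
        "model": "local-model",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic-ish runs for fairer comparison
    }, timeout=300)
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]
    return answer, time.perf_counter() - start

answer, latency = run_task("Reverse a string in one line of Python.")
print(f"{latency:.2f}s: {answer}")
```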

5
📊 Watch results roll in

See side-by-side scores, pass/fail checks, tool usage logs, and smart judging that tells you who's best.
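
The "smart judging" here is the LLM-as-judge pattern: a second model call grades each answer against a rubric. The endpoint, model name, and JSON verdict shape below are assumed examples, not hermes-bench's implementation.

```python
import json
import requests

JUDGE_URL = "http://localhost:8080/v1/chat/completions"  # assumed judge endpoint

def judge(task_prompt: str, model_answer: str) -> dict:
    """Ask a judge model to grade an answer; expects a small JSON verdict."""
    rubric = (
        "You are grading an AI assistant. Reply with JSON only: "
        '{"score": <1-10>, "pass": <true|false>, "reason": "<one sentence>"}'
    )
    resp = requests.post(JUDGE_URL, json={
        "model": "judge-model",
        "messages": [
            {"role": "system", "content": rubric},
            {"role": "user",
             "content": f"Task:\n{task_prompt}\n\nAnswer:\n{model_answer}"},
        ],
        "temperature": 0.0,
    }, timeout=300)
    resp.raise_for_status()
    # A real harness would guard this parse: judge models sometimes wrap
    # their JSON in extra text.
    return json.loads(resp.json()["choices"][0]["message"]["content"])
```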

6
🏆 Compare and improve

Review detailed reports, export findings, and tweak setups to make your local AIs smarter over time.
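
The review below mentions JSON and Markdown export; a minimal version of that step might look like the following, with placeholder results purely for illustration.

```python
import json

# Placeholder results, for illustration only.
results = [
    {"model": "model-a.gguf", "task": "python-fizzbuzz",
     "passed": True, "score": 9, "latency_s": 4.2},
    {"model": "model-b.gguf", "task": "python-fizzbuzz",
     "passed": False, "score": 4, "latency_s": 7.8},
]

# JSON export.
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

# Markdown side-by-side table.
rows = ["| Model | Task | Pass | Score | Latency |",
        "|---|---|---|---|---|"]
for r in results:
    rows.append(f"| {r['model']} | {r['task']} | "
                f"{'yes' if r['passed'] else 'no'} | "
                f"{r['score']}/10 | {r['latency_s']:.1f}s |")
with open("results.md", "w") as f:
    f.write("\n".join(rows))
```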

AI-Generated Review

What is hermes-bench?

Hermes-bench is a self-hosted web UI for benchmarking local LLMs and AI agents (Hermes profiles, llama.cpp servers), running configurable task suites with real-time progress tracking. Developers get side-by-side comparisons of outputs, timings, tool calls, and LLM-as-judge scoring, all without cloud services, via a React frontend and a FastAPI Python backend. Fire it up with one command, Docker, or a manual setup, and export results as JSON or Markdown.
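
As a rough illustration of the React-plus-FastAPI architecture described above, live progress over a WebSocket can be as simple as the sketch below. The route name and payload shape are assumptions, not hermes-bench's actual API.

```python
# Minimal FastAPI WebSocket progress stream (run with: uvicorn app:app).
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/progress")
async def progress(ws: WebSocket):
    await ws.accept()
    for step in range(1, 6):
        # In a real harness this would be driven by actual task completions.
        await ws.send_json({"task": step, "total": 5, "status": "done"})
        await asyncio.sleep(1)
    await ws.close()
```

On the browser side, a plain `new WebSocket("ws://localhost:8000/ws/progress")` in the React app is enough to receive these updates.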

Why is it gaining traction?

It stands out for fully local benchmarking of Hermes agents and raw llama.cpp endpoints, with built-in GPU detection, model scanning, server spin-up, and custom task creation. The interactive tutorial, session isolation to avoid cross-run contamination, and WebSocket live updates make iterating on local LLMs feel snappy compared with clunky CLI tools or cloud-dependent alternatives. Early adopters praise the Hermes benchmark integration for agentic tasks like tool use and delegation.

Who should use this?

AI researchers tuning Hermes profiles or llama.cpp configs for local deployment. Devs building local LLM apps who need quick Hermes 3 benchmark runs on custom tasks like reasoning or coding. Hardware tinkerers evaluating GPU setups with side-by-side local LLM comparisons.

Verdict

Worth a spin if you're serious about local benchmarking: solid docs, Docker support, and an intuitive UI, though 12 stars and a 1.0% credibility score signal early maturity. Fork it if you need more tasks; production users may want to add tests first.

