tolitius / cupel

Public

separates precious LLMs from base LLMs

17 stars · 0 forks · 100% credibility
Found Apr 09, 2026 at 17 stars
AI Analysis
JavaScript
AI Summary

Cupel is a user-friendly benchmarking tool that lets you test and rank AI language models on custom prompts using a web dashboard with leaderboards, charts, and automatic scoring by a judge model.

How It Works

1
📦 Get cupel running

Install and start the app with a single command; it opens a dashboard in your browser.

2
See sample rankings

Right away, you see a leaderboard built from bundled example tests, comparing models by score alongside speed charts.

3
🔍 Find your AI helpers

The app automatically detects AI servers running on your machine and lists their available models, ready to test.

4
🎯 Choose tests and models

Pick categories of challenges like math or coding, select your AI models, and optionally connect cloud services.

5
▶️ Run the benchmark

Hit go and watch live progress as each model tackles the tests, with timings and updates streaming in real time.

6
⚖️ Scores roll in

A judge model reviews each answer on a 0-3 scale, with reasoning explaining why it did or did not earn full marks.

7
🏆 Discover the winners

Enjoy colorful charts, leaderboards, and breakdowns showing which AI shines brightest for your needs.
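The judging step above can be sketched roughly as follows. This is a minimal illustration of a 0-3 rubric judge, not cupel's actual code: the prompt wording, reply format, and both functions are invented for this example.

```python
# Hypothetical sketch of a 0-3 rubric judge; NOT cupel's actual internals.

def build_judge_prompt(question: str, answer: str) -> str:
    """Ask a judge model to grade an answer on a 0-3 rubric with reasoning."""
    return (
        "Grade the answer on a 0-3 scale (0 = wrong, 3 = fully correct).\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply exactly as: score=<0-3>; reason=<one sentence>"
    )

def parse_judge_reply(reply: str) -> tuple[int, str]:
    """Pull the score and its reasoning back out of the judge's reply."""
    score_part, _, reason = reply.partition("; reason=")
    score = int(score_part.removeprefix("score="))
    if not 0 <= score <= 3:
        raise ValueError(f"score out of rubric range: {score}")
    return score, reason

score, reason = parse_judge_reply("score=3; reason=Answer matches the reference exactly.")
```

The structured `score=...; reason=...` reply format is one common way to make judge output machine-parseable while still keeping the reasoning visible on a leaderboard.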


AI-Generated Review

What is cupel?

cupel is a Python benchmarking tool with a snappy web UI that scores local and cloud LLMs on custom prompt sets, using a configurable judge model to rate responses on a 0-3 rubric with reasoning. Run `pip install cupel; cupel` to launch a dashboard at localhost:8042, auto-discover servers like Ollama or LM Studio, and get instant leaderboards plotting accuracy vs. tokens-per-second. It handles multi-turn chats, tool calls, and thinking blocks, separating precious LLMs from base ones via cupellation-style evals.
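Auto-discovery of local servers could work along these lines. The default ports below are the real ones for Ollama (11434) and LM Studio (1234), but the probing function itself is a sketch under that assumption, not cupel's documented behavior.

```python
import socket

# Default ports of popular local LLM servers (these defaults are real;
# whether cupel probes them this way is an assumption).
KNOWN_SERVERS = {
    "ollama": 11434,
    "lm-studio": 1234,
}

def discover_local_servers(host: str = "127.0.0.1", timeout: float = 0.2) -> list[str]:
    """Return the names of known servers accepting TCP connections locally."""
    found = []
    for name, port in KNOWN_SERVERS.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                found.append(name)
        except OSError:
            pass  # nothing listening on that port
    return found
```

A short connect timeout keeps the scan fast even when no server is running, which matters for a dashboard that probes on startup.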

Why is it gaining traction?

Zero-setup example data populates your first leaderboard, LLM-assisted prompt authoring drafts tests and rubrics in seconds, and a vanilla JavaScript frontend means no build step, just plain browser speed. CLI commands like `cupel run --models Qwen3.5-27B` or `cupel judge *.json` give scripted power, while SSE progress updates keep long runs feeling responsive. Devs dig the YAML config for OpenAI/Anthropic/OpenRouter endpoints with pricing fetches.
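An endpoint config might look something like the fragment below; the key names and layout are invented for illustration and are not cupel's documented schema.

```yaml
# Hypothetical endpoint config; keys are illustrative, NOT cupel's schema.
endpoints:
  openai:
    base_url: https://api.openai.com/v1
    api_key_env: OPENAI_API_KEY
  openrouter:
    base_url: https://openrouter.ai/api/v1
    api_key_env: OPENROUTER_API_KEY
```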

Who should use this?

Quantization tinkerers benchmarking 4-bit vs. 8-bit local models on MLX or SGLang. AI researchers comparing cloud giants like Claude Opus against Ollama runners. Teams needing quick category fingerprints (math, coding, systems) before deploying to production.

Verdict

Grab it for local LLM evals: the docs shine, the UI delivers, and Apache 2.0 is dev-friendly. Still, 17 stars and a 1.0% credibility score signal beta risks like sparse tests. Solid for hobby rigs; scale cautiously.


