stellarlinkco

The LLM Evaluation Framework

32 stars · 6 · 100% credibility
Found Feb 08, 2026 at 22 stars.
AI Summary

AI Eval is a comprehensive prompt evaluation and optimization system for LLM applications featuring multi-provider support, a rich suite of evaluators, benchmark datasets, CLI tools, a web API server, and SQLite storage.

Star Growth

This repo grew from 22 to 32 stars.
AI-Generated Review

What is ai-eval?

ai-eval is an LLM evaluation framework written in Go for testing prompts against custom YAML test cases or standard benchmark datasets such as MMLU, GSM8K, and HumanEval. Evals run via CLI commands like `eval run --prompt example` or `eval benchmark --dataset mmlu`; the tool supports providers such as Claude and OpenAI, and applies metrics ranging from basic regex and exact-match checks to llm_judge scoring, factuality, RAG faithfulness, agent efficiency, and safety checks for toxicity or bias. Results land in SQLite with leaderboards, and a web API and dashboard track evaluation results over time.
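To make the "basic regex/exact matches" end of the metric spectrum concrete, here is a minimal sketch in Go of what such evaluators typically do. The function names and behavior here are illustrative assumptions, not ai-eval's actual API:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// exactMatch reports whether the model output equals the expected answer,
// ignoring case and surrounding whitespace.
func exactMatch(output, expected string) bool {
	return strings.EqualFold(strings.TrimSpace(output), strings.TrimSpace(expected))
}

// regexMatch reports whether the model output matches a test-case pattern.
// An invalid pattern counts as a failed check rather than a crash.
func regexMatch(output, pattern string) bool {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return false
	}
	return re.MatchString(output)
}

func main() {
	fmt.Println(exactMatch("  Paris ", "paris"))           // true
	fmt.Println(regexMatch("The answer is 42.", `\b42\b`)) // true
}
```

Checks like these form the cheap, deterministic baseline; the llm_judge and factuality evaluators mentioned above take over where string comparison is too brittle.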

Why is it gaining traction?

Unlike barebones open-source LLM evaluation tools, ai-eval bundles prompt optimization (`eval optimize`), failure diagnosis (`eval diagnose`), and version comparison (`eval compare --v1 v1 --v2 v2`) into one workflow, saving engineers from scripting evals themselves. The web server's REST endpoints for runs and history make it usable as a team evaluation platform, while benchmark results auto-save to leaderboards for quick model diffs. It's a full evaluation kit for engineers and PMs with no Python dependencies.
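The comparison step boils down to aggregating per-test-case pass/fail results into a score per prompt version. A minimal Go sketch of that aggregate, with made-up data (not ai-eval's internals):

```go
package main

import "fmt"

// passRate returns the fraction of test cases a prompt version passed,
// the kind of aggregate a compare command would report per version.
func passRate(results []bool) float64 {
	if len(results) == 0 {
		return 0
	}
	passed := 0
	for _, ok := range results {
		if ok {
			passed++
		}
	}
	return float64(passed) / float64(len(results))
}

func main() {
	// Hypothetical outcomes for the same test suite under two prompt versions.
	v1 := []bool{true, false, true, true}
	v2 := []bool{true, true, true, true}
	fmt.Printf("v1: %.0f%%  v2: %.0f%%\n", passRate(v1)*100, passRate(v2)*100)
}
```

A diff of these rates per dataset is what the leaderboard view surfaces.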

Who should use this?

LLM engineers iterating on prompts for production apps, PMs benchmarking models on standard metrics, or anyone who needs reproducible evaluation runs. Ideal for safety red-teaming, RAG pipelines, or agent tool-selection evals where custom test suites matter.

Verdict

Grab it if you need a lightweight, well-tested LLM evaluation framework today: the CLI shines for day-to-day evals, and the docs cover evaluators and benchmarks well. At 24 stars it's early, but it's solid Go with high test coverage; fork it, or watch for maturity.


