XChen-Zero / OneEval

OneEval: Open EvalScope evaluation artifacts for LLMs — subset breakdowns, pass@k curves, and reproducible evaluation protocols.

17 stars · 0 forks · Python
AI Summary

OneEval releases detailed, auditable evaluation results and artifacts for open large language models across knowledge, agentic, instruction-following, and reasoning benchmarks via a browsable static website.

How It Works

1
🔍 Discover OneEval

You stumble upon OneEval while looking for trustworthy AI model test results and head to the website.

2
🌐 Explore categories

Pick from Knowledge, Agents, Instructions, or Reasoning sections to focus on what interests you.

3
📊 Scan score tables

Browse tables that rank models by performance in an easy-to-read format.

4
Dive deeper?

📈 View breakdowns

See subset scores or per-task details to understand each model's strengths and weaknesses.

📉 Check curves

Watch pass rates improve as models get multiple attempts at tough problems (see the pass@k sketch after this walkthrough).

5
📋 Review test rules

Glance at summaries of how tests were set up for fairness and repeatability.

Master the results

You walk away with clear, reliable insights into which models shine where.
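
To make "pass rates over multiple tries" concrete: the curves plot pass@k, and the standard unbiased estimator (from the Codex/HumanEval paper) takes only a few lines of Python. A minimal sketch follows; the attempt counts are illustrative numbers, not values taken from OneEval's artifacts.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k: the probability that at least one of k attempts,
        # drawn without replacement from n total attempts, is correct,
        # given that c of the n attempts were correct.
        if n - c < k:
            return 1.0  # every size-k subset must contain a correct attempt
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Illustrative: 64 attempts per problem, 12 correct, at the k milestones OneEval plots.
    for k in (1, 8, 32, 64):
        print(f"pass@{k} = {pass_at_k(64, 12, k):.3f}")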

AI-Generated Review

What is OneEval?

OneEval publishes detailed evaluation artifacts from EvalScope runs on LLMs, focusing on knowledge-intensive reasoning benchmarks such as GPQA Diamond, MMLU-Pro, and AIME math problems. Developers get subset breakdowns, pass@k curves, and reproducible protocols via a static site plus a raw results tree: no composite scores, just auditable slices across knowledge, agentic (BFCL v3), instruction-following (IFEval), and reasoning tasks. Preview the site locally with a simple HTTP server to drill into the interactive curves and QA breakdowns.
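
For example, Python's built-in server is enough for a local preview, assuming the site lives at the repo root; the port is arbitrary.

    # From a clone of the repo: serve the static site, then open http://localhost:8000
    python -m http.server 8000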

Why is it gaining traction?

Unlike opaque leaderboards, OneEval exposes exact sampling knobs, run repetition counts, and runtime setups alongside rich visuals like pass@k milestones (k = 1/8/32/64), making it easy to verify claims on benchmarks like SuperGPQA or ZebraLogicBench. The emphasis on artifacts over rankings appeals to developers tired of underspecified evals, and the repo includes monkey-patched scripts for reproducing Qwen3 and Llama runs with SGLang. It's a breath of fresh air for reproducible LLM evaluation without the hype.
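
If you would rather slice the raw results tree programmatically than browse the site, something like the sketch below works. Note that the directory layout (results/ with one JSON record per run) and the field names "subset" and "score" are assumptions for illustration, not OneEval's documented schema.

    import json
    from collections import defaultdict
    from pathlib import Path

    # Hypothetical layout: results/<model>/<benchmark>/*.json, one record per run.
    by_subset = defaultdict(list)
    for path in Path("results").rglob("*.json"):
        record = json.loads(path.read_text())
        # "subset" and "score" are assumed keys; adjust to the actual artifact schema.
        by_subset[record.get("subset", "overall")].append(record["score"])

    for subset, scores in sorted(by_subset.items()):
        print(f"{subset}: mean={sum(scores) / len(scores):.3f} (n={len(scores)})")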

Who should use this?

LLM researchers auditing Qwen3 or Llama series on agentic tasks via BFCL v3, or math/reasoning evals like AIME24/25. Model teams comparing subset performance on MMLU-Pro domains or GPQA without rerunning everything. Eval framework users extending EvalScope who need protocol templates for their own knowledge QA pipelines.

Verdict

Grab the artifacts if you're deep into Qwen or Llama benchmarking: the reproducible protocols and breakdowns are gold, though at 17 stars this is still an early-stage project. Skip it if you need a full eval suite; it's artifacts-first, not a turnkey tool.
