XChen-Zero / OneEval
OneEval: Open EvalScope evaluation artifacts for LLMs — subset breakdowns, pass@k curves, and reproducible evaluation protocols.
OneEval releases detailed, auditable evaluation results and artifacts for open large language models across knowledge, agentic, instruction-following, and reasoning benchmarks via a browsable static website.
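For context, the published artifacts come from EvalScope runs. Below is a minimal sketch of launching such a run; the model name, dataset list, and sample limit are placeholders rather than OneEval's actual settings, and the import path and TaskConfig fields are assumptions that should be checked against your installed EvalScope version.

```python
# Hypothetical minimal EvalScope run; OneEval's real configurations are not shown here.
# Import path and TaskConfig fields are assumptions -- verify against your EvalScope release.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # placeholder model, not one OneEval necessarily evaluated
    datasets=['gsm8k'],                  # placeholder benchmark
    limit=5,                             # small sample count, just a smoke test
)

run_task(task_cfg=task_cfg)  # writes score reports and per-sample outputs to a results directory
```

Runs like this produce the per-benchmark reports and per-sample records that a static results site can then publish for browsing.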
How It Works
You find OneEval while looking for trustworthy LLM evaluation results and open the website.
Pick the Knowledge, Agents, Instructions, or Reasoning section to focus on what interests you.
Browse leaderboard-style tables that rank models by benchmark performance.
Drill into per-subset scores and task-level details to understand each model's strengths and weaknesses.
Follow the pass@k curves to see how pass rates improve when models get multiple attempts at hard problems (a small pass@k sketch follows this list).
Review summaries of how each evaluation was configured, so you can judge fairness and reproduce the runs.
You walk away with clear, reliable insight into which models excel where.
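For reference, the pass@k numbers behind those curves are usually computed with the standard unbiased estimator from Chen et al. (2021): with n attempts per problem and c of them correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal, self-contained sketch (generic code, not necessarily OneEval's exact scoring script):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n attempts (of which c are correct) solves the problem."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws, so a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a model solved 3 of 10 attempts on one problem.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(n=10, c=3, k=k):.3f}")
```

Averaging this estimate over all problems, for each k, yields the pass@k curve shown per benchmark.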