humanlaya

Official evals for $OneMillion-Bench

AI Summary

An automated evaluation tool that tests AI language models on 400 professional-domain questions using weighted rubrics, judge models, concurrent processing, cost tracking, and Excel/JSON reports.
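
The core mechanic, weighted rubric grading by a judge model, is easy to sketch. Below is a minimal illustration, not the repo's actual code (the class and field names are assumptions): each criterion carries a weight, the judge returns pass/fail per criterion, and the score is the weight-normalized fraction passed.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str      # what the judge checks, e.g. "cites the correct statute"
    weight: float  # relative importance within the rubric

def rubric_score(criteria: list[Criterion], verdicts: list[bool]) -> float:
    """Weighted binary grading: each criterion is pass/fail, and the
    score is the weight-normalized fraction of criteria passed."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c, ok in zip(criteria, verdicts) if ok)
    return earned / total if total else 0.0

# Example: 3 criteria, judge passed the first two.
crits = [Criterion("states correct diagnosis", 2.0),
         Criterion("recommends confirmatory test", 1.0),
         Criterion("flags contraindication", 1.0)]
print(rubric_score(crits, [True, True, False]))  # 0.75
```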

How It Works

1
🔍 Discover the benchmark

You hear about a tool that tests how well AI models handle real-world expert questions in medicine, finance, law, engineering, and science.

2
📥 Get the tool and questions

Download the CLI and the dataset of 400 challenging questions, each with its scoring rubric.

3
🔗 Link your AI services

Connect your AI providers via API keys so their models can answer questions and judge responses.
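
In practice, "linking services" means exporting provider API keys as environment variables before a run. A minimal pre-flight check, assuming hypothetical variable names (check the repo's docs for the real ones):

```python
import os

# Hypothetical variable names -- the repo's docs list the real ones.
REQUIRED_KEYS = ["OPENROUTER_API_KEY", "QWEN_API_KEY", "VOLCENGINE_API_KEY"]

missing = [k for k in REQUIRED_KEYS if not os.environ.get(k)]
if missing:
    raise SystemExit(f"Set these before running evals: {', '.join(missing)}")
```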

4
⚙️ Choose your test AIs

Pick which models to evaluate and which judge models will grade the answers.
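
Conceptually, a run pairs candidate models with one or more judge models. A sketch of what such a configuration might look like; the structure and keys here are assumptions, not the repo's actual schema:

```python
# Hypothetical run configuration; the repo's actual schema may differ.
run_config = {
    "candidates": ["openrouter/deepseek-v3", "qwen-max"],
    "judges": ["gpt-4o"],        # models that grade answers against rubrics
    "samples_per_question": 1,   # raise for repeated-sampling variance
    "max_concurrency": 128,      # upper bound mentioned in the review below
}
```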

5
🚀 Launch the tests

Hit go, and watch as answers are generated, scored automatically, and costs tracked in real time.
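
Under the hood, this step maps to bounded async fan-out: answer and judge calls run concurrently up to a cap, with token costs accumulated as results arrive. A minimal asyncio sketch with placeholder calls and made-up prices:

```python
import asyncio

MAX_CONCURRENCY = 128        # the review below cites up to 128 concurrent runs
PRICE_PER_1K_TOKENS = 0.002  # illustrative rate, not a real provider price

async def answer_and_grade(question: str, sem: asyncio.Semaphore) -> tuple[float, float]:
    async with sem:                # bound concurrent API calls
        await asyncio.sleep(0.01)  # stand-in for the answer + judge round trip
        tokens_used = 500          # placeholder token count
        return 0.8, tokens_used / 1000 * PRICE_PER_1K_TOKENS

async def run(questions: list[str]) -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    results = await asyncio.gather(*(answer_and_grade(q, sem) for q in questions))
    scores, costs = zip(*results)
    print(f"mean score {sum(scores)/len(scores):.2f}, total cost ${sum(costs):.4f}")

asyncio.run(run([f"q{i}" for i in range(400)]))
```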

6
📊 Review colorful reports

Open the styled spreadsheets and summaries showing scores, strengths, and exact costs for each model.

7
🏆 Pick the top expert

See clear rankings to choose the best model for your professional needs, with full details saved.
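
If you want rankings outside the spreadsheet, the JSON report is easy to post-process. A sketch assuming a hypothetical report shape (the actual schema may differ):

```python
import json

# Hypothetical report shape -- the repo's actual schema may differ.
report = json.loads("""
{"models": {"model-a": {"score": 0.81, "cost_usd": 3.20},
            "model-b": {"score": 0.74, "cost_usd": 1.10}}}
""")

ranked = sorted(report["models"].items(),
                key=lambda kv: kv[1]["score"], reverse=True)
for name, stats in ranked:
    print(f"{name:20s} score={stats['score']:.3f}  cost=${stats['cost_usd']:.2f}")
```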

AI-Generated Review

What is OneMillion-Bench?

OneMillion-Bench is the official GitHub repository for $OneMillion-Bench, a Python CLI tool that automates rubric-based evals of language agents on tough professional tasks. It benchmarks 50+ models across six providers, including OpenRouter, Qwen, and VolcEngine, against 400 bilingual questions in healthcare, finance, law, industry, and the sciences. Users get weighted binary grading, async runs with up to 128 concurrent requests, cost tracking, and Gruvbox-themed Excel/JSON reports via simple `omb eval` commands.
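
"Gruvbox-themed Excel reports" most likely means cells styled with the Gruvbox palette's hex codes. A small openpyxl sketch of the idea; the styling approach is an assumption, not the repo's code:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

GRUVBOX_BG, GRUVBOX_YELLOW = "282828", "FABD2F"  # Gruvbox dark bg and yellow

wb = Workbook()
ws = wb.active
ws.append(["model", "score"])
for cell in ws[1]:  # style the header row
    cell.fill = PatternFill("solid", start_color=GRUVBOX_BG)
    cell.font = Font(color=GRUVBOX_YELLOW, bold=True)
for row in [["model-a", 0.81], ["model-b", 0.74]]:
    ws.append(row)
wb.save("scores.xlsx")
```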

Why is it gaining traction?

It stands out with built-in support for Chinese models and providers, repeated sampling for variance estimates, web-search augmentation, and automatic cost breakdowns, features missing from generic harnesses like the Hugging Face Open LLM Leaderboard. Setup via the official GitHub releases and Actions is fast: clone, pip install, download the dataset from Hugging Face, set API keys as environment variables, and run. High concurrency and rich outputs hook teams scaling up model comparisons.
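
"Repeated sampling for variance" means scoring each question several times and reporting the spread, roughly like this (sketch only; the sampler is a placeholder):

```python
import random
import statistics

def sample_score(question: str) -> float:
    """Placeholder for one answer-plus-judge round trip."""
    return random.uniform(0.6, 0.9)

k = 5  # samples per question
scores = [sample_score("example question") for _ in range(k)]
print(f"mean={statistics.mean(scores):.3f}  stdev={statistics.stdev(scores):.3f}")
```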

Who should use this?

AI researchers benchmarking LLMs in expert domains such as clinical oncology or M&A law. Teams at finance firms or pharma companies evaluating agents before deployment. Developers who need standardized, bilingual evals with human-aligned scoring, beyond toy benchmarks.

Verdict

Solid for $OneMillion-Bench evals despite the 25 stars and 1.0% credibility score: the docs are thorough, a Makefile handles the dev workflow, and it ships under the Apache 2.0 license. Maturity shows in the CLI's robustness and the reports, but low adoption means you should watch for edge cases; try it on a single-domain subset first.
