anadim

Cited 83-model x 49-benchmark LLM evaluation matrix with 18 matrix completion methods

23 stars · 100% credibility
Found Feb 26, 2026 at 18 stars
Language: Python

AI Summary

A collection of cited AI model benchmark scores, plus a predictor tool that estimates missing results by blending matrix completion methods.

How It Works

1. 🔍 Discover the AI Score Predictor

You hear about a free tool that gathers real test scores for dozens of AI models on challenges like math, coding, and reasoning.

2. 📋 Explore models and tests

Browse the list of popular AIs like GPT or Claude, and see tests covering knowledge, coding, math, and more.

3. 🎯 Guess a missing score

Pick any model and test, like 'What would the latest GPT score on a coding challenge?', and instantly get a smart prediction.

4. 🚀 Add your own AI

Quick lookup: see predictions for existing top models.

Test your model: input known results and unlock estimates for everything else.

5. 📊 Review your results

Get a complete set of predicted scores, accurate to within a few points on average.

Compare AIs easily

Now you can rank any model confidently, even with incomplete data, saving hours of research.
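The "input known results, unlock estimates" step can be sketched as a per-benchmark least-squares regression over a reference score matrix. This is only an illustration of the idea, not the repo's actual code; the function name `predict_missing` and the toy numbers are assumptions:

```python
import numpy as np

def predict_missing(reference, known_cols, known_scores, target_col):
    """Predict a new model's score on `target_col` from its scores on
    `known_cols`, via a least-squares fit over the reference matrix.

    reference: (models x benchmarks) array of known scores.
    """
    X = reference[:, known_cols]                 # features: known benchmarks
    X1 = np.column_stack([X, np.ones(len(X))])   # add an intercept column
    y = reference[:, target_col]
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    pred = np.array(known_scores) @ coef[:-1] + coef[-1]
    return float(np.clip(pred, 0.0, 100.0))      # clamp to a valid score range

# Toy reference matrix: 4 models x 3 benchmarks.
ref = np.array([[90., 85., 88.],
                [70., 65., 72.],
                [55., 52., 58.],
                [80., 78., 81.]])
# A new model scored 75 and 70 on benchmarks 0 and 1; estimate benchmark 2.
est = predict_missing(ref, [0, 1], [75., 70.], 2)
```

With dense reference columns this gives one estimate per missing benchmark; the real tool layers several such methods and blends them.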


Star Growth

This repo grew from 18 to 23 stars.
AI-Generated Review

What is llm-benchmark-matrix?

This Python project curates a cited 83-model x 49-benchmark matrix of LLM evaluation scores, every entry backed by a source URL, tackling the sparsity problem in LLM leaderboards. It applies 18 matrix completion methods to predict missing cells; BenchPress, a logit-space blend of benchmark regression and SVD, hits 7.2% median error on held-out data. Run the CLI with `python predict.py --model gpt-5.2` for instant scores or JSON output, or add your model via `--add-model my-llm --scores "mmlu=85,gpqa=70"`.
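The repo's actual BenchPress implementation isn't shown here, but blending two predictors in logit space can be sketched as follows. The helper names and the weight `w` are illustrative assumptions; the point is that averaging logits, rather than raw percentages, keeps the blend inside (0, 100):

```python
import math

def to_logit(score, eps=1e-4):
    """Map a 0-100 benchmark score into logit space."""
    p = min(max(score / 100.0, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def from_logit(z):
    """Map a logit back to a 0-100 score (naturally bounded)."""
    return 100.0 / (1.0 + math.exp(-z))

def blend(regression_score, svd_score, w=0.5):
    """Blend two predicted scores in logit space with weight w."""
    z = w * to_logit(regression_score) + (1 - w) * to_logit(svd_score)
    return from_logit(z)

combined = blend(90.0, 70.0)  # a logit-space average of two predictions
```

Blending 90 and 70 this way gives roughly 82, not the arithmetic 80, because logit space stretches the tails; extreme scores pull harder than mid-range ones.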

Why is it gaining traction?

Unlike raw spreadsheets or pure LLM predictors, it blends high-accuracy regression (6.9% error where applicable) with universal SVD coverage (99%), outperforming Claude Sonnet on post-audit holdouts. The CLI lists models and benchmarks and handles cold starts from just a model name or five scores. Phase-transition plots reveal exactly when the algorithms eclipse memorized knowledge.
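The SVD side of the blend can be sketched as iterative low-rank imputation (hard-impute style). This is a minimal sketch under assumed conventions (NaN marks a missing cell; the function name, rank, and toy matrix are hypothetical), not the repo's implementation:

```python
import numpy as np

def svd_impute(M, rank=1, iters=50):
    """Fill missing entries (NaN) of a score matrix by repeatedly
    reconstructing it from its top singular components."""
    mask = ~np.isnan(M)
    X = np.where(mask, M, np.nanmean(M))  # init missing cells with global mean
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(mask, M, low_rank)   # keep observed cells, update missing
    return X

# Toy 4x3 matrix: rows ~ models, columns ~ benchmarks, NaN = unreported score.
M = np.array([[90., 80., 70.],
              [85., np.nan, 65.],
              [60., 50., np.nan],
              [np.nan, 75., 66.]])
filled = svd_impute(M)
```

Because the reconstruction uses every observed cell, this covers any missing entry, which is why an SVD component can reach near-universal coverage even where per-benchmark regression lacks enough overlapping data.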

Who should use this?

LLM eval engineers filling gaps in custom matrices for model comparisons. Researchers prototyping imputation for new 49-benchmark suites. Product leads estimating unreleased model performance from sparse evals like GPQA or SWE-bench.

Verdict

Solid pick for benchmark completion: the CLI shines, predictions clamp to realistic ranges, and cited data builds trust. Its 16 stars and 1.0% credibility score signal an early-stage project, but an excellent README and repro scripts make it dev-ready today.


