ytang928 / BrainBench


BrainBench: A 100-question benchmark exposing commonsense reasoning gaps in LLMs across 20 failure categories. Includes English and Chinese datasets, evaluation code, and results for 8 frontier models.

AI Summary

BrainBench is a dataset of brainteaser questions and evaluation tools to benchmark large language models' commonsense reasoning abilities against human-level performance.

How It Works

1
📚 Discover BrainBench

You stumble upon BrainBench, a collection of tricky brainteasers that test if AI can think like humans on everyday puzzles.

2
🧩 Grab the Puzzles

Download the ready-made set of 100 clever questions and their correct answers to challenge any AI.

3
🔗 Connect Your AIs

Link up your favorite AI chat services, like ChatGPT or Claude, so they can join the puzzle challenge.

4
▶️ Launch the Test

Pick which AIs to test and start running them through the brainteasers, watching progress as they respond. (A rough code sketch of this evaluation loop appears after these steps.)

5
⏳ Wait for Results

Give it a little time while the AIs tackle multiple rounds of each puzzle for fair scoring.

6
📊 See the Rankings

Get colorful charts, accuracy scores, and breakdowns showing which AI solved the most puzzles correctly.

7
🎉 Share Your Insights

Celebrate understanding AI's reasoning strengths and blind spots, ready to discuss or use in your projects.
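For a concrete feel of steps 2 through 6, here is a minimal sketch of what such an evaluation loop could look like in Python. The file name questions.json, its fields, the model name, and the exact-match judging are assumptions for illustration only; the repo's own scripts, prompts, and automatic judge will differ. It uses the OpenAI Python client, one of the providers the benchmark reportedly supports.

```python
# Minimal sketch of a BrainBench-style evaluation loop (illustrative only).
# Assumed: a questions.json file with "question" and "answer" fields, and a
# naive substring-match judge; the real repo's schema and judging differ.
import json

from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the env

client = OpenAI()

with open("questions.json", encoding="utf-8") as f:
    questions = json.load(f)

correct = 0
for item in questions:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # swap in whichever model you want to test
        messages=[{"role": "user", "content": item["question"]}],
    )
    answer = reply.choices[0].message.content or ""
    # Naive judge: count it correct if the reference answer appears verbatim.
    if item["answer"].lower() in answer.lower():
        correct += 1

print(f"Accuracy: {correct / len(questions):.1%} on {len(questions)} questions")
```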

AI-Generated Review

What is BrainBench?

BrainBench runs a 100-question benchmark across 20 commonsense reasoning categories to expose failure modes in LLMs, such as implicit physical constraints or wrong vantage points: traps that humans dodge but models routinely fall into. It provides English and Chinese datasets in JSON, plus Python evaluation code that queries frontier models via the OpenAI, Anthropic, Google, or OpenRouter APIs, judges responses automatically, and aggregates accuracy and reliability scores. Unlike generic evals, it targets brainteasers where LLMs flop despite their scale.
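As a rough illustration of what a bilingual entry in such a JSON dataset might contain, here is a hypothetical item expressed as a Python dict; the field names and category label are invented for illustration, not the repo's actual schema.

```python
# Hypothetical shape of one bilingual BrainBench item (schema assumed, not confirmed).
example_item = {
    "id": 7,
    "category": "implicit physical constraints",  # one of the 20 failure categories
    "question_en": "...",  # English phrasing of the brainteaser
    "question_zh": "...",  # Chinese phrasing of the same brainteaser
    "answer": "...",       # reference answer used by the automatic judge
}
```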

Why is it gaining traction?

It stands out with bilingual support and "thinking" mode tests for models like Claude Opus or GPT-5.4, generating plots on category breakdowns and consistency—ideal for spotting real reasoning gaps beyond leaderboards. The CLI lets you benchmark any model in minutes with resume support, multiple runs, and analysis scripts for publication-ready visuals, hooking devs tired of vague MMLU scores.
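The repo's analysis scripts aren't reproduced here, but the idea of scoring both accuracy and run-to-run consistency can be sketched roughly as follows; the nested list of 0/1 correctness flags per question and this particular consistency definition are assumptions, not the repo's actual metric.

```python
# Rough sketch: aggregate accuracy and a simple consistency score over
# several runs per question. The repo's reliability metric may be defined differently.
from statistics import mean

# runs[q][r] = 1 if the model answered question q correctly on run r, else 0
runs = [
    [1, 1, 1],  # always right
    [1, 0, 1],  # flaky
    [0, 0, 0],  # always wrong
]

accuracy = mean(mean(per_question) for per_question in runs)
# Consistency: fraction of questions answered identically on every run.
consistency = mean(1 if len(set(per_question)) == 1 else 0 for per_question in runs)

print(f"accuracy={accuracy:.2f}, consistency={consistency:.2f}")
```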

Who should use this?

AI researchers benchmarking LLMs for papers on commonsense reasoning. LLM teams at startups vetting deployment candidates across English and Chinese. Model trainers diagnosing category-specific weaknesses before fine-tuning.

Verdict

Grab it if you're evaluating LLMs: solid docs, an MIT license, and a ready-to-run CLI make it useful now, even though its 41 stars signal early maturity. Run your models; the results will reveal more than hype.
