Lomnus-ai

Official implementation of the BrainBench Dataset

AI Summary

BrainBench is a dataset of brainteaser questions and an evaluation toolkit for testing large language models on commonsense reasoning that humans rarely get wrong.

How It Works

1. 🔍 Discover BrainBench

You stumble upon BrainBench, a clever set of brainteasers designed to reveal where smart AI assistants trip up on everyday reasoning that humans handle easily.

2. 📥 Grab the puzzle pack

You download the collection of 100 tricky questions in English or Chinese, complete with answers and category explanations.
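
As a rough sketch of what working with the downloaded puzzle pack might look like, the snippet below loads a question file and prints a few entries. The filename and the field names (`question`, `answer`, `category`) are assumptions for illustration, not the repo's documented schema.

```python
import json

# Assumed filename and schema; check the repo's data folder for the real layout.
with open("brainbench_en.json", encoding="utf-8") as f:
    questions = json.load(f)

for q in questions[:3]:
    # Each entry is assumed to carry the puzzle text, a reference answer,
    # and a trap category such as "hidden constraint".
    print(q["id"], q["category"])
    print(q["question"])
    print("expected:", q["answer"])
```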

3. 🔗 Pick your AI testers

You choose which popular AI models to test, such as Claude or GPT, and connect them simply with your API keys.
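
In practice, connecting the models means pointing the runner at provider API keys. A minimal sketch of that wiring, assuming keys come from environment variables and models are routed by name prefix (the repo's actual configuration format may differ):

```python
import os
from openai import OpenAI
from anthropic import Anthropic

# Model names taken from the review below; the routing-by-prefix scheme is an assumption.
MODELS = ["gpt-5", "claude-opus-4.6"]

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def ask(model: str, prompt: str) -> str:
    """Send one question to the right provider and return the reply text."""
    if model.startswith("claude"):
        msg = anthropic_client.messages.create(
            model=model,
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```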

4. 🚀 Launch the challenge

In one go, you run the full test, letting each AI tackle the puzzles several times while an impartial judge model checks their answers.
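
Under the hood, "several times while a judge checks their answers" likely amounts to repeated trials scored by a second model. A sketch of that loop, reusing the `ask` helper from the previous snippet; the judge prompt, judge model, and trial count are all illustrative assumptions:

```python
def judge(question: str, expected: str, candidate: str) -> bool:
    """Ask a judge model whether the candidate answer matches the reference answer."""
    verdict = ask(
        "gpt-5",  # illustrative judge model; the repo may use a different one
        f"Question: {question}\nReference answer: {expected}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate answer agree with the reference? Reply with exactly YES or NO.",
    )
    return verdict.strip().upper().startswith("YES")

def run_trials(model: str, q: dict, n_trials: int = 3) -> list[bool]:
    """Let one model attempt one puzzle several times and return per-trial verdicts."""
    return [judge(q["question"], q["answer"], ask(model, q["question"]))
            for _ in range(n_trials)]
```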

5. Follow the progress

You watch the tests complete, with results saved incrementally so interrupted runs can resume without overwhelming your setup.
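
One common way to get that resume-friendly, incremental behavior is an append-only results file that a rerun can skip over. The sketch below uses a JSONL file; the path and record shape are assumptions, not the repo's actual output format.

```python
import json
from pathlib import Path

RESULTS = Path("results.jsonl")  # assumed output path

def already_done() -> set[tuple[str, str]]:
    """Return (model, question_id) pairs already recorded, so a rerun can skip them."""
    done = set()
    if RESULTS.exists():
        for line in RESULTS.read_text(encoding="utf-8").splitlines():
            rec = json.loads(line)
            done.add((rec["model"], rec["question_id"]))
    return done

def record(model: str, question_id: str, verdicts: list[bool]) -> None:
    """Append one result immediately, so an interrupted run loses at most one entry."""
    with RESULTS.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"model": model, "question_id": question_id,
                            "verdicts": verdicts}) + "\n")
```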

6. 📊 See the rankings

Beautiful charts and tables appear, ranking AIs by accuracy and reliability, highlighting tough categories like hidden constraints.
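
The repo ships its own plots and analysis scripts; purely to show the general shape of that step, here is a sketch that aggregates the assumed results.jsonl from the earlier snippets into a per-model accuracy chart.

```python
import json
from collections import defaultdict
import matplotlib.pyplot as plt

per_model = defaultdict(lambda: [0, 0])  # model -> [correct trials, total trials]
with open("results.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        per_model[rec["model"]][0] += sum(rec["verdicts"])
        per_model[rec["model"]][1] += len(rec["verdicts"])

models = sorted(per_model, key=lambda m: per_model[m][0] / per_model[m][1], reverse=True)
accuracies = [per_model[m][0] / per_model[m][1] for m in models]

plt.bar(models, accuracies)
plt.ylabel("accuracy over all trials")
plt.title("Accuracy by model (illustrative)")
plt.savefig("leaderboard.png")
```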

🎉 Uncover AI secrets

You now have clear insights into AI reasoning strengths and gaps, perfect for sharing in reports or discussions.

AI-Generated Review

What is BrainBench?

BrainBench is a Python tool and dataset that benchmarks large language models on 100 brainteasers—riddles humans solve instantly but LLMs fumble due to reasoning gaps. You feed it questions via OpenAI, Anthropic, or Google APIs, and it runs multiple trials, auto-judges answers with another LLM, then outputs accuracy, reliability scores, and per-category breakdowns across English and Chinese versions. Unlike generic evals, it targets 20 specific traps like implicit physical constraints or semantic scope tricks.
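
The review does not spell out how accuracy and reliability are computed, so the sketch below shows one plausible reading: accuracy pools all trials, while reliability rewards models that answer the same question consistently across trials. The repo may define both differently.

```python
def accuracy(verdicts_per_question: list[list[bool]]) -> float:
    """Fraction of individual trials judged correct, pooled over every question."""
    trials = [v for q in verdicts_per_question for v in q]
    return sum(trials) / len(trials)

def reliability(verdicts_per_question: list[list[bool]]) -> float:
    """Fraction of questions whose trials all agree; one plausible consistency measure."""
    consistent = sum(1 for q in verdicts_per_question if len(set(q)) == 1)
    return consistent / len(verdicts_per_question)
```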

Why is it gaining traction?

It stands out by quantifying exact failure modes with leaderboards showing Claude Opus topping GPT-5 at 80% vs 74%, plus plots and analysis scripts for quick insights—no setup hassle beyond API keys and a simple CLI like `python run_benchmark.py --model claude-opus-4.6`. Devs dig the resume-friendly runs, concurrent requests, and MIT license for easy forking, beating scattered one-off tests.
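
For the concurrent requests mentioned above, a thread pool over per-question API calls is the obvious shape; the sketch below is illustrative only and reuses the hypothetical `ask` helper, since the repo's actual concurrency mechanism and rate limits aren't shown here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_model_concurrently(model: str, questions: list[dict], workers: int = 8) -> dict:
    """Fan one model's puzzle attempts out across a small worker pool."""
    answers = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(ask, model, q["question"]): q["id"] for q in questions}
        for fut in as_completed(futures):
            answers[futures[fut]] = fut.result()
    return answers
```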

Who should use this?

AI researchers ranking model reasoning before deployment, prompt engineers debugging heuristic biases in production LLMs, or teams comparing OpenAI and Anthropic models on commonsense tasks. This is not a certification-style quiz on Python or C++ fundamentals; the focus is squarely on probing LLM reasoning.

Verdict

Grab it if you're evaluating LLMs: solid docs and a simple CLI make it dead simple, though at 91 stars it's still an early-stage project. Run a quick test today; scale to full benchmarks once you've verified your API costs.
