petergpt/bullshit-benchmark

BullshitBench, created by Peter Gostev, measures whether AI models challenge nonsensical prompts instead of confidently answering them.

AI Summary

BullshitBench is an open-source benchmark evaluating large language models' ability to detect, reject, and avoid engaging with nonsensical or invalid premises.

How It Works

1
🔍 Discover BullshitBench

You stumble upon BullshitBench, a fun tool that tests if AI chatbots can spot total nonsense and call it out.

2
👀 Check the public leaderboard

Visit the online viewer to see how popular AI models score at rejecting silly or broken ideas.

3
🔗 Link your AI account

Connect a service like OpenRouter so the tool can chat with different AI models on your behalf (a minimal connection sketch appears after these steps).

4
▶️ Launch the full test

Run the simple one-click process to test a bunch of AI models against tricky nonsense questions.

5
⚗️ Watch the magic happen

The tool asks each AI nonsense questions and grades whether each one pushes back clearly or falls for it (a rough sketch of this loop appears after these steps).

6
📊 Review your custom results

Open the local viewer or published page to see charts, scores, and which models did best.

7
🎉 Share your insights

You've got fresh data on which models spot nonsense best. Brag about it or use it to pick better chatbots!
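
For step 3, "connecting" typically boils down to exporting an OpenRouter API key and sending OpenAI-style chat requests. Here is a minimal sketch assuming OpenRouter's standard chat-completions endpoint; the environment variable name, model ID, and example prompt are illustrative, not taken from the repo:

```python
import os
import requests

# Assumes the key is exported as OPENROUTER_API_KEY (an illustrative name,
# not necessarily the variable the repo itself reads).
API_KEY = os.environ["OPENROUTER_API_KEY"]

def ask_model(model: str, prompt: str) -> str:
    """Send one prompt to one model via OpenRouter's OpenAI-compatible API."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Hypothetical nonsense prompt, not from the benchmark's actual dataset:
print(ask_model("openai/gpt-4o", "How many sides does a four-sided triangle have?"))
```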
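
For step 5, the core loop is conceptually simple: send each nonsense prompt to each model, then have a judge model sort the reply into one of the three buckets described in the review below (clear pushback, partial challenge, accepted nonsense). A rough sketch under those assumptions, reusing ask_model() from the snippet above; the judge prompt and label strings are illustrative, not the repo's actual rubric:

```python
# Labels mirror the three outcome categories described in the review below,
# not necessarily the repo's literal strings.
LABELS = ["clear_pushback", "partial_challenge", "accepted_nonsense"]

JUDGE_PROMPT = (
    "The question below contains a false or nonsensical premise.\n"
    "Question: {q}\nAnswer: {a}\n"
    "Reply with exactly one label: clear_pushback (rejects the premise), "
    "partial_challenge (flags issues but still engages), or "
    "accepted_nonsense (treats the premise as valid)."
)

def grade(judge_model: str, question: str, answer: str) -> str:
    """Ask the judge model to classify one answer; fall back conservatively."""
    verdict = ask_model(judge_model, JUDGE_PROMPT.format(q=question, a=answer)).strip()
    return verdict if verdict in LABELS else "accepted_nonsense"

def run_benchmark(models, questions, judge_model):
    """Collect (question, verdict) pairs for every model under test."""
    results = {m: [] for m in models}
    for model in models:
        for q in questions:
            answer = ask_model(model, q)
            results[model].append((q, grade(judge_model, q, answer)))
    return results
```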

AI-Generated Review

What is bullshit-benchmark?

BullshitBench, created by Peter Gostev, is a Python benchmark that measures whether AI models challenge nonsensical prompts instead of confidently answering them. It grades responses into three buckets: clear pushback (full rejection of the premise), partial challenge (flagging issues but still engaging), and accepted nonsense (treating the BS as valid). Developers get a full pipeline: set an OpenRouter API key, run the end-to-end script, and view interactive results in a public or local HTML dashboard with leaderboards and charts.
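
To turn those three buckets into a leaderboard, one plausible aggregation (assumed here, not confirmed from the repo) is to map each verdict to a score and average per model, reusing the results shape from the sketches above:

```python
# Hypothetical scoring: full credit for clear pushback, half credit for a
# partial challenge, none for accepting the nonsense. The repo may weight
# these differently.
SCORES = {"clear_pushback": 1.0, "partial_challenge": 0.5, "accepted_nonsense": 0.0}

def leaderboard(results):
    """Average score per model, sorted best-first, as a dashboard might rank them."""
    rows = [
        (model, sum(SCORES[verdict] for _, verdict in graded) / len(graded))
        for model, graded in results.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```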

Why is it gaining traction?

It stands out by targeting a real LLM weakness that broad math and coding benchmarks miss: confidently hallucinating on absurd inputs. The hook is instant results via OpenRouter for dozens of models, plus a polished viewer that ranks them by bullshit resistance, making it dead simple to compare Claude, GPT, or Mistral variants. With 286 stars at the time of this review, devs were already sharing scores on HN-style threads, fueling quick experiments.

Who should use this?

AI engineers tuning prompts for production apps, where nonsense inputs wreck reliability. Researchers evaluating model safety before deployment. Prompt hackers testing whether new releases actually "reason" better on edge cases like impossible physics queries.

Verdict

Grab it if you're benchmarking LLMs: solid docs and one-command runs make it practical despite the modest star count that signals early maturity. Polish the tests for stability, but it's already sharper than ad-hoc evals.
