InteractiveBench

Official Project Page for Interactive Benchmarks

AI Summary

InteractiveBench provides interactive benchmarks to evaluate AI models on math reasoning, situation puzzles, trust games, and poker through simulated competitions.

How It Works

1. 🔍 Discover InteractiveBench

You find this collection of AI challenges on GitHub, built for testing how well different AIs handle puzzles, math, games, and trust dilemmas.

2. 🔗 Connect your AI models

Link up the models you want to test through OpenRouter's OpenAI-compatible API so they can join the challenges (a minimal client sketch follows these steps).

3. 🎯 Pick a challenge

Choose what to test them on: tricky math problems, situation puzzles, poker showdowns, or trust games.

4. 🚀 Launch the showdown

Hit start and watch the AIs compete head-to-head, asking questions, solving puzzles, or bluffing in games.

5. 📊 See the results roll in

Check live updates, scores, graphs, and stats as each AI battles it out round by round.

6. 🏆 Crown the champions

Review who won, what strategies they used, and which AI reasons deepest in interactive scenarios.
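For the connection step above, here is a minimal sketch of wiring one model into a challenge through OpenRouter's OpenAI-compatible API, which the repo is set up against. The model ID and the `ask()` helper are illustrative assumptions, not code from the repository:

```python
# Minimal sketch: talking to a model via OpenRouter's
# OpenAI-compatible endpoint. The model ID and ask() helper are
# hypothetical; InteractiveBench's own wiring may differ.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key="YOUR_OPENROUTER_KEY",
)

def ask(model: str, messages: list[dict]) -> str:
    """Send one chat turn to a model and return its reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

# Example: one turn of a yes/no clarification loop from a situation puzzle.
reply = ask(
    "openai/gpt-4o",  # any OpenRouter model ID would work here
    [{"role": "user", "content": "Is the man in the story alone? Answer yes or no."}],
)
print(reply)
```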

AI-Generated Review

What is InteractiveBench?

InteractiveBench is a Python benchmark suite for evaluating LLMs in interactive scenarios, straight from the official GitHub repository tied to a research paper on interactive benchmarks. It runs math evals comparing direct solving against yes/no Q&A loops with pass@k baselines, situation puzzles for reasoning, trust game tournaments pitting LLMs against baselines, and multi-table Texas Hold'em poker where six models battle in real time. Set up via OpenRouter's OpenAI-compatible API, it produces reproducible results with resume support and detailed logs.
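For context on the pass@k baselines mentioned above: the standard unbiased estimator from the code-generation literature computes pass@k = 1 - C(n-c, k) / C(n, k) for n samples with c correct. A sketch of that conventional formula, not the repo's own scoring code:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021), shown for
# context only; InteractiveBench's actual scoring code is not reproduced here.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples drawn, c = correct samples, k = attempt budget."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 1 - 7/10 = 0.30
```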

Why is it gaining traction?

It ditches static prompts for true multi-turn interaction, letting you measure whether models actually improve when allowed to ask clarification questions instead of guessing in one shot. The poker sim stands out: spin up 10 parallel tables, log NDJSON stats, and analyze with ready-made Python scripts for bar charts and trends. The repository doubles as the official project website, and bilingual READMEs plus shell scripts make batch evals painless.
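As a sketch of the kind of NDJSON post-processing described here, assuming hypothetical "model" and "chips" fields on each log line (the repo's actual schema may differ):

```python
# Illustrative NDJSON aggregation: one JSON object per line, rolled up
# into a per-model bar chart. Field names are assumptions, not the
# repo's documented log schema.
import json
from collections import defaultdict
import matplotlib.pyplot as plt

totals: dict[str, float] = defaultdict(float)
with open("poker_stats.ndjson") as f:
    for line in f:
        record = json.loads(line)              # one hand/event per line
        totals[record["model"]] += record["chips"]

models = sorted(totals, key=totals.get, reverse=True)
plt.bar(models, [totals[m] for m in models])
plt.ylabel("Total chips won")
plt.title("Per-model poker results (illustrative)")
plt.tight_layout()
plt.show()
```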

Who should use this?

LLM researchers benchmarking reasoning agents on non-trivial interactions, AI safety engineers probing trust dynamics in games, or game devs stress-testing poker bots against top models like Grok or Qwen. Ideal for anyone needing quick, comparable metrics beyond leaderboards.

Verdict

Grab it for interactive LLM evals: MIT licensed, solid docs. But at 19 stars it's early-stage, so watch the GitHub releases page for polish. Fork and contribute if benchmarks are your jam.
