gameworld-project

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Python · Found Apr 16, 2026 at 19 stars

AI Summary

GameWorld is a benchmark that tests AI agents on playing 34 browser-based games by analyzing screenshots and using keyboard or mouse controls.
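In practice that loop is screenshot in, action out. Below is a minimal sketch of the idea, assuming Playwright drives the browser; the `ask_model` helper and the game URL are placeholders, not the repo's actual code.

```python
# Sketch of the observe -> decide -> act loop described above.
# `ask_model` is a stand-in for a real vision-language-model call.
import random
from playwright.sync_api import sync_playwright

def ask_model(screenshot: bytes) -> str:
    # A real agent would send the screenshot to Claude/GPT/Gemini/Qwen
    # and parse the chosen key from the reply; here we pick at random.
    return random.choice(["ArrowLeft", "ArrowRight", "ArrowUp", "ArrowDown"])

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://play2048.co")          # placeholder game URL
    for _ in range(50):                       # fixed step budget for the demo
        shot = page.screenshot()              # observation: raw pixels
        page.keyboard.press(ask_model(shot))  # action: a keypress
        page.wait_for_timeout(300)            # let the game animate
    browser.close()
```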

How It Works

1
🔍 Discover GameWorld

You hear about this fun benchmark where smart AIs try to play classic browser games like 2048, Flappy Bird, and Snake.

2
🛠️ Set up your playground

Create a cozy spot on your computer and link it to a few clever AI helpers so they can see and control the games.
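In plain terms, "linking AI helpers" means wiring up provider API keys. A tiny pre-flight check might look like this; the variable names are placeholders, since the keys you need depend on which models you configure.

```python
# Hypothetical pre-flight check: fail fast if provider keys are missing.
import os

REQUIRED = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
missing = [k for k in REQUIRED if not os.environ.get(k)]
if missing:
    raise SystemExit(f"Set these environment variables first: {missing}")
print("Provider keys found -- ready to benchmark.")
```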

3
🎮 Watch AI tackle its first game

Choose a simple game like 2048, pick an AI teammate, and see it slide tiles, merge numbers, and chase high scores right before your eyes.
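Per the quick start quoted in the review below, a single game-plus-model run is one command:

```
python main.py --config 01_2048+gpt-5.2
```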

4
📊 Challenge AIs across many games

Run tests on dozens of games at once, pitting different AIs against puzzles, runners, and platformers to find the champions.
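The repo ships its own parallel-suite runner, but the core idea is simple fan-out over configs. A rough sketch, where only the first config name comes from the repo's docs and the second is made up for illustration:

```python
# Hypothetical fan-out over several game+model configs via subprocesses.
import subprocess
from concurrent.futures import ThreadPoolExecutor

CONFIGS = ["01_2048+gpt-5.2", "02_flappy_bird+gpt-5.2"]  # second is illustrative

def run(config: str) -> int:
    return subprocess.call(["python", "main.py", "--config", config])

with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(zip(CONFIGS, pool.map(run, CONFIGS)))
print(results)
```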

5
📈 Follow the action live

Peek at a dashboard showing real-time scores, funny mistakes, and video replays of every dramatic moment.

6
🏆 Crown the ultimate game AI

Celebrate with charts and videos revealing which AI dominated the leaderboard and conquered the most games.

AI-Generated Review

What is gameworld?

GameWorld is a Python framework for benchmarking multimodal agents (Claude, Gemini, GPT, and Qwen variants) across 34 classic browser games such as 2048, Flappy Bird, and Pac-Man, covering 170 tasks with outcome-based scoring. It launches games via Playwright, feeds screenshots to agents, executes actions like clicks and keypresses, and verifies success through game state, producing video replays and HTML dashboards. Developers get a quick CLI to run single tests (`python main.py --config 01_2048+gpt-5.2`) or parallel suites, pushing towards standardized, verifiable evals for multimodal game agents.
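Because success is checked against game state rather than the agent's self-report, a verifier can read the outcome straight out of the live page. A minimal sketch, assuming a 2048-style game that exposes its score in the DOM; the selector and threshold are illustrative, not the repo's:

```python
# Hypothetical outcome check: trust the live page, not the agent.
def verify_outcome(page, target_score: int = 256) -> bool:
    # The selector is game-specific and invented for this sketch.
    text = page.inner_text(".score-container")
    score = int("".join(c for c in text if c.isdigit()) or 0)
    return score >= target_score
```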

Why is it gaining traction?

Unlike ad-hoc agent demos, GameWorld enforces verifiable metrics on real browser environments, supporting both semantic controls and cutting-edge computer-use APIs for precise interactions. YAML configs for games, tasks, and models make it dead simple to swap agents or scale benchmarks, and there's a Discord for sharing results. Early adopters praise the replay artifacts for debugging agent failures mid-game.
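The exact schema is the repo's own, but a swap-friendly config could plausibly look something like this invented example, loaded here with PyYAML:

```python
# Illustrative only: a made-up config shape showing why declarative
# YAML makes games, tasks, and models swappable independently.
import yaml  # pip install pyyaml

EXAMPLE = """
game: 2048
task: reach_tile_256
model:
  provider: openai
  name: gpt-5.2
  control: semantic      # or: computer_use
budget:
  max_steps: 200
"""

config = yaml.safe_load(EXAMPLE)
print(config["game"], "->", config["model"]["name"])
```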

Who should use this?

AI researchers evaluating vision-language models for UI navigation or game AI, especially those testing Qwen or Claude on multimodal tasks. Agent builders comparing "computer use" previews across providers, or teams needing reproducible baselines for browser-based agents beyond toy envs like 2048.

Verdict

Worth forking for multimodal agent evals: solid docs, a quick start, and arXiv backing make it usable now, even though only 19 stars signals early maturity. Prioritize it if you need verifiable game-agent benchmarking; otherwise, watch for more community runs.
