WildClawBench

An in-the-wild benchmark for AI agents in the OpenClaw Environment.

AI Summary

WildClawBench is a benchmark that evaluates AI agents on 60 practical end-to-end tasks in a real personal assistant setup, covering agency, multimodality, coding, safety, and more.

How It Works

1. 🔍 Discover WildClawBench

You stumble upon this tough challenge that tests if AI helpers can handle real-life jobs like summarizing papers or clipping video highlights.

2. 📥 Grab the test kit

Download the ready-made playground image and all the task data from HuggingFace.
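For illustration, here is a minimal Python sketch of pulling the task data with huggingface_hub; the repo id and local directory are placeholders, not names confirmed by the project.

```python
# Hypothetical sketch: download the benchmark task data from HuggingFace.
# The repo_id below is a placeholder -- substitute the dataset named in the docs.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="openclaw/wildclawbench-tasks",  # placeholder id, not confirmed
    repo_type="dataset",
    local_dir="./wildclawbench_data",
)
print(f"Task data downloaded to {local_dir}")
```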

3. 🎥 Ready the examples

Run the prep scripts to fetch videos, papers, and puzzles so everything is in place for the challenges.
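A quick sketch of what this prep step might look like when driven from Python, assuming the kit ships shell scripts that fetch the assets; the script naming pattern here is a guess.

```python
# Hypothetical sketch: run the data-prep scripts that fetch videos, papers,
# and puzzle assets. The directory and "prepare_*.sh" pattern are placeholders,
# not the benchmark's actual file names.
import subprocess
from pathlib import Path

prep_scripts = sorted(Path("./wildclawbench_data/scripts").glob("prepare_*.sh"))
for script in prep_scripts:
    print(f"Running {script} ...")
    subprocess.run(["bash", str(script)], check=True)  # fail fast on errors
```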

4. 🔗 Link an AI brain

Connect a model through OpenRouter, such as Claude or GPT, so it can tackle the tasks.
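Since the review notes that evals run against OpenRouter models, here is a generic connectivity check using OpenRouter's OpenAI-compatible API; the model slug is just an example, and the benchmark's own wiring may differ.

```python
# Generic sketch of an OpenRouter connection check (OpenAI-compatible API).
# The model slug is only an example; the benchmark's own config may differ.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4",  # example slug; any OpenRouter model works
    messages=[{"role": "user", "content": "Reply with 'ready' if you can hear me."}],
)
print(resp.choices[0].message.content)
```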

5. ▶️ Run the challenges

Pick a group of tasks or run them all to see your AI in action on real work.
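The exact CLI isn't documented on this page, so the command name and flags below are purely hypothetical; treat it as a sketch of launching a task subset, not the real invocation.

```python
# Hypothetical sketch: kick off a subset of tasks from Python by shelling out
# to the benchmark's CLI. Command name and flags are placeholders; check the
# repo's README for the real invocation.
import subprocess

cmd = [
    "wildclawbench", "run",        # placeholder CLI entry point
    "--tasks", "coding,safety",    # placeholder task-group filter
    "--model", "anthropic/claude-sonnet-4",
    "--output", "./results",
]
subprocess.run(cmd, check=True)
```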

6. 📊 Check the scores

Watch as it automatically grades each task and tallies up the results.
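A small sketch of tallying results afterwards, assuming the harness writes a per-task JSON file; the file name and schema are illustrative guesses, not the project's actual output format.

```python
# Hypothetical sketch: tally per-task results from a JSON file the harness
# might write. The file name and schema are assumptions for illustration.
import json
from pathlib import Path

results = json.loads(Path("./results/results.json").read_text())
passed = sum(1 for task in results["tasks"] if task.get("score", 0) >= 1.0)
total = len(results["tasks"])
cost = sum(task.get("cost_usd", 0.0) for task in results["tasks"])
print(f"Passed {passed}/{total} tasks, total spend ${cost:.2f}")
```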

🏆 Join the leaderboard

See your AI's ranking against top models and, if you've customized your own lobster, share its performance.


AI-Generated Review

What is WildClawBench?

WildClawBench is an in-the-wild benchmark that evaluates AI agents on 60 hand-crafted, end-to-end tasks inside a live OpenClaw personal assistant environment. It tests real-world skills such as multi-step video clipping from football matches, negotiating over email, writing inference code for undocumented repos, and detecting safety leaks, all using actual tools (browser, bash, email) inside Docker containers for reproducibility. The harness is built in Python: you download the image and task data from HuggingFace, prep the assets with bash scripts, then run evals from the CLI against OpenRouter models and get scores, costs, and logs automatically.
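To make the reproducibility angle concrete, here is a hedged sketch of pulling the Docker image and running the eval inside it; the image tag, mount point, and entrypoint are assumptions, not the project's published values.

```python
# Hypothetical sketch: pull the benchmark's Docker image and run the eval
# inside it for a reproducible setup. Image tag and paths are placeholders.
import subprocess
from pathlib import Path

image = "openclaw/wildclawbench:latest"       # placeholder tag
data_dir = Path("./wildclawbench_data").resolve()

subprocess.run(["docker", "pull", image], check=True)
subprocess.run([
    "docker", "run", "--rm",
    "-v", f"{data_dir}:/data",      # mount the prepared task assets
    "-e", "OPENROUTER_API_KEY",     # forward the model API key from the host
    image,
], check=True)
```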

Why is it gaining traction?

Unlike toy benchmarks with mocked APIs, WildClawBench runs agents in a genuine OpenClaw setup with no data leakage, stressing long-horizon planning, multimodal synthesis, and error recovery; even top models like Claude Opus score under 51%. The interactive leaderboard and "personal OpenClaw" mode let you submit tuned agents for bragging rights, while Docker isolation ensures bit-for-bit reproducibility across machines. It's a no-fluff, in-the-wild benchmark that exposes real agent gaps under distribution shift.

Who should use this?

AI researchers benchmarking LLMs for agentic apps, OpenClaw users iterating on custom skills and personalities, and teams evaluating models for production workflows like code intelligence or safety alignment. Ideal for studying in-the-wild multimodal tasks or safety in real environments, without setup headaches.

Verdict

Grab it if you're serious about agent evals: the docs are thorough, the CLI is straightforward, and results are trustworthy despite the repo's low maturity (19 stars, 1.0% credibility). It's early days, but it scales well for comparing custom lobsters.
