reacher-z

Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, 15 categories.

AI Summary

ClawBench is a benchmark that tests AI agents on 153 everyday web tasks across 144 live sites in 15 life categories, recording multi-layer sessions in isolated environments for evaluation.

How It Works

1
🔍 Discover ClawBench

You find ClawBench, a fun way to test if AI helpers can handle everyday online chores like ordering food or applying for jobs.

2
⚙️ Get ready

You prepare by connecting your preferred AI model providers and a simple email service so tests can use realistic-looking info.

3
📋 Pick a challenge

You choose from 153 real-life tasks, like booking a trip or writing a review, and select which AI to try.

4
🚀 Watch it go

With one click, your AI jumps into a safe, private browser to tackle the task just like you would.

5
📹 See the full story

You get videos, screenshots, and logs showing every click, form fill, and what happened step by step.

6

🏆 Check the score

You learn whether the AI succeeded, compare scores on the leaderboard, and share results to help improve everyone's AI helpers; a rough sketch of what a task and its score could look like follows this walkthrough.
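To make the walkthrough concrete, here is a minimal Python sketch of what a task definition and a batch score could look like. The names (Task, RunResult, success_rate) and the example values are hypothetical illustrations under assumed structure, not ClawBench's actual schema or scoring code.

# Illustrative only: hypothetical task schema and scoring, not ClawBench's real code.
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    category: str       # one of the 15 everyday categories
    website: str        # the live site the agent must operate on
    instruction: str    # what the agent is asked to accomplish

@dataclass
class RunResult:
    task_id: str
    model: str
    success: bool

def success_rate(results: list[RunResult]) -> float:
    """Fraction of attempted tasks the agent completed successfully."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)

example_task = Task(
    task_id="order-food-001",
    category="Food Delivery",
    website="ubereats.com",
    instruction="Order a meal for pickup from the nearest location.",
)

# Example: 2 of 3 attempts succeed, so this mini-batch scores about 0.67.
demo = [
    RunResult("order-food-001", "claude-sonnet", True),
    RunResult("book-trip-014", "claude-sonnet", False),
    RunResult("write-review-032", "claude-sonnet", True),
]
print(f"success rate: {success_rate(demo):.2f}")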

AI-Generated Review

What is ClawBench?

ClawBench benchmarks AI agents on 153 real-world online tasks across 144 live websites in 15 everyday categories, like ordering food on Uber Eats, booking an Airbnb, or applying to jobs on LinkedIn. It spins up isolated Docker containers with Chromium, lets agents such as Claude Code or GitHub Copilot drive the browser, intercepts final actions to avoid real side effects, and records five synchronized layers: video, screenshots, HTTP traffic, DOM events, and agent transcripts. Developers get CLI/TUI tools to run single tasks, batches, or human baselines, plus evals comparing agent and human performance.
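The five recording layers are easier to picture with a short example. The sketch below shows how an isolated Chromium session driven through Playwright could capture three of them (video, screenshots, and an HTTP request log). This is an illustration of the idea, not ClawBench's actual recording code; the URL and artifact paths are placeholders.

# Illustrative sketch of multi-layer session recording; not ClawBench's real code.
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

http_log = []  # one entry per outgoing request (layer: HTTP traffic)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # record_video_dir makes Playwright save a video of the session (layer: video)
    context = browser.new_context(record_video_dir="artifacts/video")
    page = context.new_page()
    page.on("request", lambda req: http_log.append((req.method, req.url)))

    page.goto("https://example.com")               # placeholder for a live task site
    page.screenshot(path="artifacts/step_01.png")  # layer: screenshots

    # ... the agent's clicks and form fills would happen here ...

    context.close()   # flushes the video file to artifacts/video/
    browser.close()

print(f"captured {len(http_log)} HTTP requests")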

Why is it gaining traction?

Unlike synthetic benchmarks, ClawBench hits live sites with bot detection, CAPTCHAs, and dynamic UIs, revealing that frontier models top out at 33% success (Claude Sonnet 4.6 leads). The ClawBench-Lite subset (20 curated tasks on household names like DoorDash, GitHub, and LeetCode) delivers a quick signal at low cost. A public leaderboard, an arXiv paper, and ready-made agent prompts make it dead simple to test your own models.

Who should use this?

AI researchers tuning agents for web automation, teams evaluating LLMs like Claude or Copilot for production browser tasks, and developers prototyping their own agent frameworks. It suits anyone benchmarking real-world reliability on tasks from job applications to booking pet sitting on Rover.

Verdict

Grab it if you're serious about agent evals: the docs shine, the TUI/CLI just work, and the Python + Docker setup is solid, even if 16 stars signals early days. Run Lite first to validate your setup; the full 153 tasks expose real gaps worth fixing.


