WildClawBench is a benchmark that evaluates AI agents on 60 practical end-to-end tasks in a real personal assistant setup, covering agency, multimodality, coding, safety, and more.
How It Works
WildClawBench tests whether AI agents can handle real-world jobs such as summarizing papers or clipping video highlights.
Download the prebuilt environment and all the task examples from the file-sharing site the project uses.
Fetch the videos, papers, and puzzles that the tasks depend on.
Connect an AI model service such as Claude or GPT to act as the agent that tackles the tasks.
Pick a group of tasks or run them all to see your AI in action on real work.
The benchmark automatically grades each task and tallies the results.
See how your AI ranks against top models, and share your lobster's performance if you ran a customized agent.
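The select, grade, and tally steps above can be pictured with a minimal Python sketch. It is not the benchmark's own code: `TaskResult`, `select_tasks`, and `tally` are hypothetical names, and the 0.0 to 1.0 score scale and per-category averaging are assumptions made only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One graded task (hypothetical structure, not the benchmark's format)."""
    task_id: str
    category: str   # e.g. "coding" or "safety"
    score: float    # assumed to lie between 0.0 and 1.0


def select_tasks(results, category=None):
    """Return every result, or only those in one category."""
    if category is None:
        return list(results)
    return [r for r in results if r.category == category]


def tally(results):
    """Average the scores per category and add an overall mean."""
    by_category = defaultdict(list)
    for r in results:
        by_category[r.category].append(r.score)
    summary = {c: sum(s) / len(s) for c, s in by_category.items()}
    # An empty run gets an overall score of 0.0 rather than a division error.
    summary["overall"] = (
        sum(r.score for r in results) / len(results) if results else 0.0
    )
    return summary


if __name__ == "__main__":
    graded = [
        TaskResult("task-01", "coding", 1.0),
        TaskResult("task-02", "coding", 0.5),
        TaskResult("task-03", "safety", 0.0),
    ]
    print(tally(select_tasks(graded)))            # all tasks
    print(tally(select_tasks(graded, "coding")))  # one group only
```

Averaging per category keeps a weak area, such as safety, visible even when the overall mean looks acceptable.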