allenai / vla-evaluation-harness
One framework to evaluate any VLA model on any robot simulation benchmark.
A framework for testing vision-language-action AI models on robot simulation benchmarks like LIBERO and CALVIN, with Docker reproducibility, parallel evaluation, and a public leaderboard.
How It Works
You find a tool that lets anyone test how well robot-control models handle everyday tasks like stacking blocks or opening drawers.
Download and set everything up in a few simple steps; Docker keeps the environment reproducible.
Pick a vision-language-action (VLA) model: one that takes in images and text instructions and outputs actions for a robot arm.
Choose benchmark tasks, such as picking up objects or navigating kitchens, to probe specific skills.
Run your chosen model on those tasks and watch virtual robots attempt them in simulation.
Get a report on how the robot did, with a success rate for each task.
Compare your results with other models on the public leaderboard.
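The per-task success rates mentioned in the steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the harness's own code: the function name `success_rates`, the `(task, succeeded)` record format, and the task names are all hypothetical, and the episode outcomes are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(episodes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful episodes for each task name."""
    totals = defaultdict(int)  # episodes attempted per task
    wins = defaultdict(int)    # episodes that succeeded per task
    for task, succeeded in episodes:
        totals[task] += 1
        if succeeded:
            wins[task] += 1
    return {task: wins[task] / totals[task] for task in totals}


# Made-up episode outcomes, for illustration only (not real benchmark results).
episodes = [
    ("open_drawer", True),
    ("open_drawer", False),
    ("stack_blocks", True),
    ("stack_blocks", True),
]
rates = success_rates(episodes)
for task, rate in sorted(rates.items()):
    print(f"{task}: {rate:.0%}")
```

Each task's rate is simply successes divided by attempts, so a task with more evaluation episodes gives a steadier estimate than one with only a handful.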