allenai/vla-evaluation-harness

One framework to evaluate any VLA model on any robot simulation benchmark.

AI Summary

A framework for testing vision-language-action AI models on robot simulation benchmarks like LIBERO and CALVIN, with Docker reproducibility, parallel evaluation, and a public leaderboard.

How It Works

1. 🔍 Discover Robot AI Tests

Discover a tool that lets anyone test how well robot "brains" (vision-language-action models) handle everyday tasks like stacking blocks or opening drawers.

2. 📦 Easy Setup

Install the framework and pull prebuilt Docker images in a few simple steps, with no dependency headaches.

3. 🤖 Choose Your Robot Brain

Pick a VLA model, such as OpenVLA, that reads images and language instructions to guide robot arms.

4. 🏭 Select a Challenge

Choose a benchmark such as LIBERO, CALVIN, or ManiSkill2, with tasks like picking objects or navigating kitchens.

5. 🚀 Run the Tests

Launch the model on the benchmark and watch virtual robots attempt each task in simulation (see the CLI sketch after this list).

6. 📊 See Your Scores

Get clear reports on how the model did, with success rates for each task.

🏆 Join the Leaderboard

Compare your results against top models worldwide on the public leaderboard.
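
Concretely, steps 2 through 5 map onto a short CLI workflow. This is a minimal sketch using the two commands quoted in the review below; the pip package name is an assumption, so check the repo README for the exact name:

```
# Step 2: install the framework (package name assumed)
pip install vla-evaluation-harness

# Step 3: serve a model's actions over WebSocket
vla-eval serve --config openvla.yaml

# Steps 4-5: run a benchmark suite against the running model server
vla-eval run --config libero_spatial.yaml
```

Per-task success rates (step 6) are reported after the run and can be submitted to the leaderboard.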

AI-Generated Review

What is vla-evaluation-harness?

This Python framework lets you evaluate any Vision-Language-Action (VLA) model on robot simulation benchmarks like LIBERO, CALVIN, or ManiSkill2 with one command. Benchmarks run in Docker containers for zero dependency conflicts, while models serve actions over WebSocket from simple uv scripts—no more private eval forks per benchmark. Install via pip, spin up a model server with `vla-eval serve --config openvla.yaml`, then benchmark with `vla-eval run --config libero_spatial.yaml`.
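
To make the model-serving side concrete, here is a minimal sketch of an action server speaking JSON over WebSocket, in the spirit of the architecture described above. The message schema, the port, and the `websockets` dependency are assumptions for illustration, not the harness's documented protocol:

```python
# Minimal sketch of a VLA action server over WebSocket (assumed protocol).
# pip install websockets
import asyncio
import json

import websockets


async def handle(ws, path=None):
    # Each incoming message is assumed to be a JSON observation,
    # e.g. {"image": ..., "instruction": "stack the blocks"}.
    async for message in ws:
        obs = json.loads(message)
        # A real server would run VLA inference on `obs` here;
        # we return a placeholder 7-DoF zero action instead.
        action = [0.0] * 7
        await ws.send(json.dumps({"action": action}))


async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())
```

Presumably `vla-eval serve --config openvla.yaml` wraps this kind of loop, so model authors only supply the inference step.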

Why is it gaining traction?

It delivers 47x faster evals via episode sharding and batched GPU inference: 2,000 LIBERO episodes in 18 minutes on one H100. Docker images and one-click model configs eliminate setup pain, plus a live leaderboard aggregates 500+ models across 17 benchmarks. As a single framework for evaluating VLA models across benchmarks, it fills the gap for unified, reproducible comparisons without custom glue code.
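
As a rough picture of what episode sharding buys, the sketch below splits a benchmark's episodes across parallel worker processes; every name in it (the `shard` helper, the fake rollout in `run_shard`) is illustrative rather than the harness's actual API:

```python
# Illustrative sketch of episode sharding across worker processes.
# None of these names come from vla-evaluation-harness itself.
from concurrent.futures import ProcessPoolExecutor


def shard(episodes, n_shards):
    """Round-robin episodes into n_shards roughly equal groups."""
    return [episodes[i::n_shards] for i in range(n_shards)]


def run_shard(episode_ids):
    # A real worker would open its own simulator and stream observations
    # to the model server, batching them for GPU inference.
    return [(ep, True) for ep in episode_ids]  # fake: every episode succeeds


if __name__ == "__main__":
    episodes = list(range(2000))  # e.g. 2,000 LIBERO episodes
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = [r for part in pool.map(run_shard, shard(episodes, 8))
                   for r in part]
    success_rate = sum(ok for _, ok in results) / len(results)
    print(f"{len(results)} episodes, success rate {success_rate:.1%}")
```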

Who should use this?

Robotics researchers fine-tuning VLAs like OpenVLA or GR00T need it for cross-benchmark testing without rebuilding envs per paper. Teams chasing leaderboard spots on LIBERO or RoboCasa will love the parallel runs and reproduction reports. Any developer evaluating VLA models on robot simulators gets instant reproducibility.

Verdict

Grab it if you're doing VLA evals: docs, Docker images, and a CLI make it production-ready despite only 45 stars. Early but battle-tested by AllenAI; contribute benchmarks to speed community traction.
