meituan-longcat

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

100% credibility
Found May 26, 2026 at 46 stars.
AI Analysis
Python
AI Summary

WBench is a comprehensive evaluation framework created by Meituan that tests AI video generation models across 22 different metrics organized into 5 dimensions: video quality, scene consistency, instruction following, physical realism, and setting accuracy. It evaluates how well models handle interactive, multi-step video generation — like responding to camera movements, object manipulations, and perspective changes. The project includes a leaderboard ranking 20 different video models and provides detailed diagnostic reports showing exactly where each model excels or struggles. It's designed for researchers and developers who want to understand and compare video world models.

How It Works

1. 🔍 You discover a way to test AI video models

You hear about WBench, a benchmark that evaluates how well AI video generators respond to interactive instructions, like moving through a scene or changing objects.

2. 📦 You download and set up the project

You get a copy of WBench on your computer. The setup checks that all the tools are working correctly so everything runs smoothly.

3. 🎬 You generate videos with your AI model

Your video model creates short clips based on interactive instructions, like a character walking forward, then picking up an object, then the camera switching perspective.

4. Everything gets analyzed automatically

WBench runs 22 different tests on your videos, checking if the video looks good, if objects stay consistent, if physics makes sense, and if your instructions were followed.

5. 📊 You receive detailed scores across 5 areas

Your results are broken into five categories: video quality, scene consistency, how well instructions were followed, physical realism, and setting accuracy. Each has its own score.

6. You see how your model performs

🏆 Your model ranks highly: your model performs well. You see strong scores across most dimensions and can share these results.

📈 You identify areas to improve: your model has weaknesses. Maybe physics isn't realistic or objects change unexpectedly. You know exactly what to work on.

You have a complete picture of your model

You now understand your video model's strengths and weaknesses across every dimension, with clear scores and comparisons to other models.
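To make the "scores per area, then find what to improve" step concrete, here is a minimal Python sketch. It is not WBench's code: the metric-to-dimension mapping, the function names, and the unweighted averaging are all assumptions made for illustration, and the mapping covers only a few of the 22 metrics.

```python
from statistics import mean

# Hypothetical, incomplete mapping from metric names to dimensions.
# WBench's real grouping of its 22 metrics is not given in this summary.
DIMENSIONS = {
    "video_quality": ["aesthetic_quality", "temporal_smoothness"],
    "scene_consistency": ["subject_consistency"],
    "instruction_following": ["navigation_accuracy"],
    "physical_realism": ["physics_plausibility"],
}


def dimension_scores(metric_scores):
    """Average the available metric scores in each dimension (assumed unweighted)."""
    result = {}
    for dimension, metrics in DIMENSIONS.items():
        present = [metric_scores[m] for m in metrics if m in metric_scores]
        if present:  # skip dimensions with no measured metrics
            result[dimension] = mean(present)
    return result


def weakest_dimension(scores):
    """Return the name of the dimension with the lowest score."""
    return min(scores, key=scores.get)


# Made-up example values, not real benchmark results.
example_metrics = {
    "aesthetic_quality": 0.8,
    "temporal_smoothness": 0.6,
    "subject_consistency": 0.9,
    "navigation_accuracy": 0.5,
    "physics_plausibility": 0.4,
}
example_dimensions = dimension_scores(example_metrics)
example_weakest = weakest_dimension(example_dimensions)
```

With these made-up numbers, `example_weakest` is `"physical_realism"`, which is the kind of signal the diagnostic step is meant to surface.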

AI-Generated Review

What is WBench?

WBench is a Python evaluation framework for interactive video world models -- systems that generate video in response to user actions such as moving through a scene or changing perspective. It tests whether these models maintain visual coherence across multiple turns of interaction, measuring everything from video quality and subject consistency to physics plausibility and navigation accuracy. The pipeline runs in three phases: it precomputes masks and depth maps, runs GPU-based metrics such as aesthetic quality and temporal smoothness, and then calls a VLM API for semantic evaluation. You point it at generated videos, give it case definitions, and get back scores across 22 metrics organized into 5 dimensions. Built-in support exists for Wan, Kling, and Seedance models, with a generic interface for adding others.
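The three-phase flow described above can be sketched roughly as follows. This is an illustrative outline only, not WBench's actual API: `run_pipeline`, the three callables, the stub functions, the file names, and the score values are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Artifacts = Dict[str, str]   # e.g. paths to precomputed masks and depth maps
Scores = Dict[str, float]    # metric name -> score


def run_pipeline(
    case_ids: List[str],
    precompute: Callable[[str], Artifacts],
    gpu_metrics: Callable[[str, Artifacts], Scores],
    vlm_metrics: Callable[[str], Scores],
) -> Dict[str, Scores]:
    """Run the three phases in order over all cases and merge the scores."""
    # Phase 1: precompute masks and depth maps for every case.
    artifacts = {cid: precompute(cid) for cid in case_ids}
    # Phase 2: GPU-based metrics that consume the precomputed artifacts.
    scores = {cid: dict(gpu_metrics(cid, artifacts[cid])) for cid in case_ids}
    # Phase 3: semantic metrics from a VLM call, merged into the same record.
    for cid in case_ids:
        scores[cid].update(vlm_metrics(cid))
    return scores


# Stub phases so the sketch runs without GPUs, model weights, or API keys.
def fake_precompute(cid: str) -> Artifacts:
    return {"mask": f"{cid}_mask.png", "depth": f"{cid}_depth.npy"}


def fake_gpu_metrics(cid: str, artifacts: Artifacts) -> Scores:
    return {"aesthetic_quality": 0.8, "temporal_smoothness": 0.9}


def fake_vlm_metrics(cid: str) -> Scores:
    return {"instruction_following": 0.7}


pipeline_scores = run_pipeline(
    ["case_001"], fake_precompute, fake_gpu_metrics, fake_vlm_metrics
)
```

Passing the phases in as callables keeps the sketch independent of any specific model or VLM provider. The real tool may organize this quite differently.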

Why is it gaining traction?

The benchmark fills a gap: most video generation tools score well on single clips but fall apart under multi-turn control. WBench exposes this by testing navigation trajectories, perspective switches, and event edits -- interactions that reveal consistency failures invisible in static evaluation. The 20-model leaderboard provides immediate context for where your model stands. The metrics are validated against human judgments, which matters when you're using automated scores as a proxy for quality. The setup is heavy (multiple conda environments, external weights, API keys) but the CLI is straightforward once dependencies are resolved.

Who should use this?

World model researchers comparing video generation approaches. Teams building interactive video applications who need to audit consistency failures before deployment. Anyone benchmarking video models against a standardized multi-turn protocol rather than cherry-picked single prompts. Not useful for simple video quality benchmarking -- use VBench or similar for that.

Verdict

A well-scoped research benchmark with a credible methodology and useful leaderboard, but the 1.0% credibility score and 46 stars reflect its early-stage status. The multi-environment setup and API dependency for VLM metrics add friction. Worth evaluating if you're working in interactive video world models; treat as a research tool rather than production infrastructure.
