chandar-lab

repository for training action-conditioned latent diffusion world models for robot video generation

100% credibility
Found May 22, 2026 at 42 stars.
AI Analysis
Python
AI Summary

This is an academic research project that trains world models so robots can imagine what they will see next as they act. The toolkit lets researchers compare two approaches: one where the model learns to reconstruct what it sees pixel by pixel, and another where it learns to capture the meaning of what it sees. The project includes ready-made tools for training robot 'imagination engines' on real robot video data, testing how well they predict future frames, and checking whether they can tell when a robot task will succeed or fail. Multiple vision encoders are supported, from standard image compressors to advanced AI vision models, so researchers can compare which approach helps robots plan better for real-world tasks.

How It Works

1
📚 You discover the research

You hear about a new study comparing different ways robots learn to imagine the results of their actions, and you're curious to try it yourself.

2
📥 You download the toolkit

You grab the open-source code from the project page and set it up on your computer.

3
🎬 You gather robot videos

You collect footage of a robot performing tasks—like the Bridge V2 dataset with real robot arm movements and the actions that went with them.

4
🔧 You train the feature compressor

For smarter vision encoders, you first train a small adapter that shrinks the rich visual features down to a compact size the robot brain can work with.

5
🧠 You teach the robot to imagine

You train the world model—a kind of robot imagination engine—using the video clips and robot actions so it learns what comes next.

6
You test how well it works

📊 Visual quality checks

Compare generated videos against real ones using image-quality metrics

🎯 Action control tests

Check whether the world model responds correctly to different robot commands

Task success prediction

See if the model can predict whether a robot will succeed at a task

🏆 You discover what works best

You find that semantic encoders—ones that understand what objects are—generally help robots plan better than ones that just copy pixels, even when the pictures look less perfect.
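
Step 5 above, teaching the world model to imagine, can be pictured as building denoising training examples: a clean future latent is corrupted with Gaussian noise, and the model is asked to recover that noise given the earlier latents and the robot actions. The sketch below assumes a standard DDPM-style noise-prediction objective, which this page does not confirm for the project. The names `add_noise`, `make_training_example`, `noise_prediction_loss`, and `alpha_bar`, and the use of plain Python lists for latents, are hypothetical choices for illustration.

```python
import math
import random


def add_noise(clean_latent, alpha_bar, rng):
    """Corrupt a latent; alpha_bar in [0, 1] is the fraction of signal kept."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean_latent]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noisy = [signal_scale * x + noise_scale * n
             for x, n in zip(clean_latent, noise)]
    return noisy, noise


def make_training_example(context_latents, actions, future_latent,
                          alpha_bar, rng):
    """Bundle what a (hypothetical) denoiser would see and its target."""
    noisy_future, target_noise = add_noise(future_latent, alpha_bar, rng)
    return {
        "context": context_latents,    # latents of the starting frames
        "actions": actions,            # robot commands used as conditioning
        "noisy_future": noisy_future,  # corrupted latent the model denoises
        "alpha_bar": alpha_bar,        # noise level given to the model
        "target_noise": target_noise,  # what the model should predict
    }


def noise_prediction_loss(predicted_noise, target_noise):
    """Mean squared error between predicted and true noise."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_noise, target_noise)]
    return sum(squared) / len(squared)


rng = random.Random(0)
example = make_training_example(
    context_latents=[[0.2, -0.1, 0.4]],  # one toy context frame
    actions=[[0.05, 0.0, -0.02]],        # one toy action vector
    future_latent=[0.3, -0.2, 0.5],
    alpha_bar=0.5,
    rng=rng,
)
```

A real model would take `context`, `actions`, `noisy_future`, and `alpha_bar` as input and be trained to minimize `noise_prediction_loss` against `target_noise`. Conditioning on `actions` is what makes the world model action-controlled.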

AI-Generated Review

What is semantic-wm?

This is a Python research project that trains diffusion-based world models for robot video generation. The core idea: given a few starting frames and robot actions, predict what the robot will see next. It wraps multiple pretrained vision encoders (DINOv2, SigLIP2, V-JEPA 2.1, Cosmos, Qwen2.5-VL) into a unified pipeline with adapters that compress high-dimensional features down to a compact latent space the diffusion transformer can work with efficiently.
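
The adapter idea in the paragraph above can be illustrated with a toy linear autoencoder: one matrix projects encoder features down to a small latent, and a second matrix is trained to reconstruct the original features from it. This is only a sketch in plain Python under assumed dimensions (6 features compressed to 2), not the repository's adapter. The class `Adapter`, its methods, and the synthetic data from `make_features` are all hypothetical.

```python
import random


class Adapter:
    """Toy linear adapter: down-project features, then reconstruct them."""

    def __init__(self, feature_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        # Small random initial weights (an arbitrary choice for the sketch).
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(feature_dim)]
                     for _ in range(latent_dim)]
        self.up = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)]
                   for _ in range(feature_dim)]

    def compress(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.down]

    def reconstruct(self, z):
        return [sum(w * v for w, v in zip(row, z)) for row in self.up]

    def train_step(self, x, lr=0.05):
        """One SGD step on the squared reconstruction error for sample x."""
        z = self.compress(x)
        x_hat = self.reconstruct(z)
        grad_out = [2.0 * (p - t) for p, t in zip(x_hat, x)]
        # Back-propagate to the latent before the weights are updated.
        grad_z = [sum(self.up[i][j] * grad_out[i] for i in range(len(x)))
                  for j in range(len(z))]
        for i, g in enumerate(grad_out):
            for j, zj in enumerate(z):
                self.up[i][j] -= lr * g * zj
        for j, gz in enumerate(grad_z):
            for i, xi in enumerate(x):
                self.down[j][i] -= lr * gz * xi


def mean_error(adapter, data):
    """Average squared reconstruction error over a dataset."""
    total = 0.0
    for x in data:
        x_hat = adapter.reconstruct(adapter.compress(x))
        total += sum((p - t) ** 2 for p, t in zip(x_hat, x))
    return total / len(data)


def make_features(n, feature_dim, intrinsic_dim, seed=1):
    """Synthetic 'encoder features' that lie in a low-dimensional subspace."""
    rng = random.Random(seed)
    basis = [[rng.uniform(-1, 1) for _ in range(intrinsic_dim)]
             for _ in range(feature_dim)]
    data = []
    for _ in range(n):
        s = [rng.uniform(-1, 1) for _ in range(intrinsic_dim)]
        data.append([sum(b * v for b, v in zip(row, s)) for row in basis])
    return data


features = make_features(n=32, feature_dim=6, intrinsic_dim=2)
adapter = Adapter(feature_dim=6, latent_dim=2)
error_before = mean_error(adapter, features)
for _ in range(200):
    for x in features:
        adapter.train_step(x)
error_after = mean_error(adapter, features)
```

In this toy setting the diffusion model would operate on the 2-dimensional output of `compress` rather than on the 6-dimensional features. Comparing `error_before` with `error_after` shows how much of the original feature content survives the compression.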

Why is it gaining traction?

The project directly tackles a concrete question: should robot world models use reconstruction-style latents (pixel-perfect but semantic-blind) or semantic latents (preserving task-relevant information even when visuals look worse)? The paper provides experimental evidence that semantic encoders preserve action information and downstream policy quality better across model scales. For developers, this means fewer hours debugging why a world model looks great but plans poorly. The repository also supports multi-view camera setups and includes a trained pixel decoder alongside the adapter, so you get usable RGB outputs without heavy decoding overhead.

Who should use this?

Robotics researchers comparing representation strategies for manipulation tasks will find the benchmark suite most useful. It includes controllability metrics (testing how well the model responds to action inputs), PCK for spatial structure, and probe accuracy for semantic fidelity. ML engineers building offline RL pipelines or video prediction systems might use the pretrained adapters and evaluation code as a starting point. Pure application developers looking for plug-and-play video generation should look elsewhere; this is firmly a research tool with dataset download scripts, training configs, and metric comparisons as the primary interfaces.
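
Of the metrics named above, PCK (percentage of correct keypoints) is the simplest to show concretely: a predicted keypoint counts as correct when it lies within a fraction `alpha` of some reference size from the ground-truth location. The sketch below is a generic definition, not the repository's evaluation code. The function name `pck`, the choice of reference size, and the sample coordinates are assumptions for illustration.

```python
import math


def pck(predicted, ground_truth, reference_size, alpha=0.1):
    """Fraction of keypoints within alpha * reference_size of the truth.

    predicted and ground_truth are equal-length lists of (x, y) pairs.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    if not predicted:
        raise ValueError("at least one keypoint is required")
    threshold = alpha * reference_size
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(predicted)


# Made-up keypoints on a 100-pixel-wide image, so the threshold is 10 pixels.
truth = [(10, 10), (50, 50), (90, 20), (30, 80)]
guess = [(11, 10), (50, 58), (90, 21), (45, 80)]
score = pck(guess, truth, reference_size=100, alpha=0.1)  # 3 of 4 within 10 px
```

Here the first three predictions are off by 1, 8, and 1 pixels and count as correct, while the last is off by 15 pixels and does not, so `score` is 0.75.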

Verdict

The 42-star count signals early-stage research code rather than production-ready infrastructure. Documentation is present via a solid README and paper link, but expect to read source code to customize beyond the provided presets. If you are benchmarking latent representations for robot learning, the multi-encoder comparison and standardized metrics save meaningful setup time. If you need a world model you can train on custom data tomorrow with no research overhead, this is not yet that.
