Mondo-Robotics

This is the official code repo for DiT4DiT, a Vision-Action-Model (VAM) framework that combines a video generation model with flow-matching-based action prediction for generalizable robotic manipulation.

AI Summary

DiT4DiT is an open-source framework for training vision-action models (VAMs) that enable robots to perform generalizable manipulation tasks from video observations and language instructions.
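To make the idea concrete, here is a minimal, hypothetical sketch of the VAM pattern the summary describes: a transformer backbone encodes video (and instruction) tokens, and a flow-matching head predicts the velocity that moves a noisy action chunk toward the demonstrated one. The module, shapes, and sizes are illustrative, not DiT4DiT's actual architecture.

```python
# Toy VAM sketch (hypothetical; not DiT4DiT's implementation).
import torch
import torch.nn as nn

class ToyVAM(nn.Module):
    def __init__(self, token_dim=512, action_dim=7, horizon=16):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Flow-matching head: (noisy action chunk, time, context) -> velocity.
        self.head = nn.Sequential(
            nn.Linear(horizon * action_dim + 1 + token_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, video_tokens, noisy_actions, t):
        # video_tokens: (B, N, token_dim); noisy_actions: (B, horizon, action_dim); t: (B,)
        ctx = self.backbone(video_tokens).mean(dim=1)   # pooled context, (B, token_dim)
        flat = noisy_actions.flatten(1)                 # (B, horizon * action_dim)
        inp = torch.cat([flat, t.unsqueeze(1), ctx], dim=1)
        return self.head(inp).view(-1, self.horizon, self.action_dim)
```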

How It Works

1. πŸ” Discover robot learning magic

You stumble upon DiT4DiT, a clever tool that teaches robots to handle everyday tasks by watching videos and understanding simple instructions.

2. πŸ› οΈ Set up your robot workshop

Follow the setup steps to prepare your machine, installing the dependencies and pretrained checkpoints needed to start experimenting.
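Before grabbing checkpoints, it's worth confirming the basics are in place. A generic sanity check (not a DiT4DiT script) might look like:

```python
# Generic environment sanity check before training; not part of DiT4DiT.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPUs:", torch.cuda.device_count(), torch.cuda.get_device_name(0))
```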

3. πŸ“š Gather robot lesson videos

Collect short clips of robots picking, stacking, or organizing objects; real examples become the teacher's guidebook.
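Since the review below mentions LeRobot data, demonstrations might be loaded roughly like this, assuming a recent lerobot release; the import path, dataset id, and feature keys are illustrative and may differ from DiT4DiT's actual pipeline.

```python
# Hypothetical data loading via Hugging Face's LeRobot library.
# Import path and dataset id are assumptions based on public lerobot releases.
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/aloha_sim_insertion_human")  # example public dataset
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)

batch = next(iter(loader))
print(batch.keys())  # typically camera frames, robot state, and action tensors
```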

4. πŸŽ“ Train the robot's brain

Hit start and watch it learn to predict smooth actions from sights and words, getting smarter with each lesson.
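Under the hood, "predicting smooth actions" with flow matching usually means regressing a velocity field along a straight-line path from noise to the demonstrated action chunk. A minimal sketch of that training step (the general technique, not DiT4DiT's exact loss):

```python
# Minimal flow-matching training step. With x0 ~ N(0, I) and x1 the
# demonstrated action chunk, the path x_t = (1-t) x0 + t x1 has constant
# velocity x1 - x0, which the model is regressed onto.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, video_tokens, actions):
    # actions: (B, horizon, action_dim) ground-truth action chunk
    x0 = torch.randn_like(actions)                        # noise sample
    t = torch.rand(actions.shape[0], device=actions.device)
    xt = (1 - t.view(-1, 1, 1)) * x0 + t.view(-1, 1, 1) * actions
    target_velocity = actions - x0
    pred_velocity = model(video_tokens, xt, t)            # e.g. the ToyVAM above
    return F.mse_loss(pred_velocity, target_velocity)
```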

5. πŸ§ͺ Test in a safe playground

Run trials in a virtual world to see your robot smoothly grab, move, and arrange things just right.
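At test time, an action chunk is drawn by integrating the learned flow from noise with a few Euler steps, then stepped through the simulator. This sketch assumes the toy model interface from above and is not RoboCasa's API:

```python
# Euler-step sampling from a learned flow; assumes the toy interface
# model(video_tokens, x, t) -> velocity sketched earlier.
import torch

@torch.no_grad()
def sample_actions(model, video_tokens, horizon=16, action_dim=7, steps=10):
    x = torch.randn(video_tokens.shape[0], horizon, action_dim)
    for i in range(steps):
        t = torch.full((x.shape[0],), i / steps)
        x = x + model(video_tokens, x, t) / steps  # x_{t+dt} = x_t + v * dt
    return x  # predicted action chunk, ready to execute step by step in sim
```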

6. Bring to life

πŸ–₯️ Simulation practice: perfect skills in a digital space before the real thing.

πŸ”Œ Real robot connection: link to your hardware and see physical movements happen (a client sketch follows this list).

✨ Robot handles real tasks: celebrate as your robot masters picking, stacking, and organizing with confidence and grace.
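The review below mentions a WebSocket model server for deployment. A hypothetical client for it (the URI and message schema are assumptions, not DiT4DiT's documented protocol) could look like:

```python
# Hypothetical client for a WebSocket policy server: send an observation,
# receive a predicted action chunk. URI and schema are assumptions.
import asyncio
import json
import websockets

async def query_policy(observation: dict) -> list:
    async with websockets.connect("ws://localhost:8765") as ws:
        await ws.send(json.dumps(observation))
        reply = await ws.recv()
        return json.loads(reply)["actions"]

if __name__ == "__main__":
    obs = {"image": "<base64 frame>", "instruction": "pick the cup"}
    actions = asyncio.run(query_policy(obs))
    print(len(actions), "actions received")
```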

AI-Generated Review

What is DiT4DiT?

DiT4DiT is a Python framework that jointly models video dynamics and actions for generalizable robot control. It combines video generation transformers with flow-matching-based action prediction into a Vision-Action-Model that handles tabletop manipulation and real-time whole-body humanoid control. Developers get training scripts, pretrained checkpoints via Hugging Face, and a WebSocket model server for deploying policies on simulation benchmarks like RoboCasa or on real hardware.

Why is it gaining traction?

It stands out by delivering the first efficient real-time VAM for humanoids, beating baselines with 56% average success on RoboCasa-GR1 pick-place benchmarks. The repository's releases page offers easy access to code, configs, and models, and DeepSpeed integration enables multi-GPU training on LeRobot data. Users report smooth sim-to-real transfer without heavy per-robot finetuning.
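For the DeepSpeed setup mentioned above, the training step might be wrapped roughly as follows, reusing the toy model and loss sketched earlier; the config values and batch keys are illustrative, not DiT4DiT's actual configs.

```python
# Illustrative DeepSpeed wiring for multi-GPU training; config values and
# batch keys are assumptions, reusing ToyVAM / flow_matching_loss above.
import deepspeed

ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = ToyVAM()
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for batch in loader:  # e.g. the LeRobot loader above
    loss = flow_matching_loss(model_engine, batch["video_tokens"], batch["action"])
    model_engine.backward(loss)  # DeepSpeed handles loss scaling and allreduce
    model_engine.step()
```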

Who should use this?

Robotics engineers building manipulation policies for Unitree G1 humanoids or Franka arms on LeRobot/RoboCasa datasets. It also suits researchers tackling generalizable control in shelf organization, drawer interactions, or novel pick-place tasks, where video-conditioned actions must run at real-time (1x) speed in deployment.

Verdict

Grab it if humanoid manipulation is your focus: the pretrained models and model server make testing fast, despite the project's early maturity. It is still alpha, with real-robot teleop releases pending, so expect some setup tweaks.


