Mondo-Robotics / DiT4DiT
This is the official code repo for DiT4DiT, a Vision-Action-Model (VAM) framework that combines a video generation model with flow-matching-based action prediction for generalizable robotic manipulation.
DiT4DiT is an open-source framework for training vision-language-action models that enable robots to perform generalizable manipulation tasks from video observations and instructions.
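The repo does not spell out its training objective here, but "flow-matching-based action prediction" typically means regressing the velocity of a linear noise-to-action interpolant. A minimal sketch of one such training step, using a toy stand-in model (all names and shapes are illustrative, not DiT4DiT's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, actions, cond):
    """One conditional flow-matching training step for action vectors.

    actions: (B, D) ground-truth actions
    cond:    (B, C) conditioning features (e.g. video/instruction embedding)
    """
    noise = rng.standard_normal(actions.shape)   # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))  # t ~ U(0, 1)
    x_t = (1.0 - t) * noise + t * actions        # linear interpolant
    target_v = actions - noise                   # constant target velocity
    pred_v = model(x_t, t, cond)                 # network predicts velocity
    return float(np.mean((pred_v - target_v) ** 2))

# Toy "model" standing in for the real network: always predicts zero velocity.
def zero_model(x_t, t, cond):
    return np.zeros_like(x_t)

loss = flow_matching_loss(zero_model,
                          rng.standard_normal((4, 7)),
                          rng.standard_normal((4, 16)))
```

In a real setup the model would be a transformer conditioned on video features and the instruction, and the loss would be minimized by gradient descent; the sketch only shows the shape of the objective.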
How It Works
DiT4DiT teaches robots to perform everyday manipulation tasks by learning from videos and simple language instructions.
Set up your environment and download the pretrained components needed to start experimenting.
Collect short clips of robots picking, stacking, or organizing objects; these real demonstrations become the training data.
Start training and the model learns to predict smooth action sequences from visual observations and language, improving with each iteration.
Run trials in simulation to check that the robot can grasp, move, and arrange objects correctly.
Refine the policy in this digital space before moving to the real thing.
Connect your hardware to execute the learned movements on a physical robot.
The result is a robot that can pick, stack, and organize objects reliably.
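At deployment time, a flow-matching policy turns noise into an action by integrating the learned velocity field. A minimal sketch of that sampling loop with Euler steps and a toy velocity field (function names, step count, and the 7-dimensional action are assumptions for illustration, not DiT4DiT's actual interface):

```python
import numpy as np

def sample_actions(model, cond, dim=7, steps=10, seed=0):
    """Integrate a learned velocity field from Gaussian noise to an action.

    model: callable (x, t, cond) -> velocity, same shape as x
    cond:  conditioning (e.g. observation/instruction features)
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((1, dim))  # start from noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((1, 1), i * dt)
        x = x + dt * model(x, t, cond)  # Euler update along predicted velocity
    return x

# Toy velocity field that pulls samples toward a fixed target action,
# standing in for the trained network.
target = np.linspace(-1.0, 1.0, 7)[None, :]
toy_model = lambda x, t, cond: target - x

actions = sample_actions(toy_model, cond=None)
```

With more integration steps (or a higher-order solver) the sample tracks the flow more closely; real systems would feed the resulting action chunk to the robot controller.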