amazon-far / deltatok

Public

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens (CVPR 2026 Highlight)

46
1
100% credibility
Found Apr 09, 2026 at 46 stars.
AI Analysis
Python
AI Summary

This repository implements DeltaTok for compressing video frame differences into single tokens and DeltaWorld for autoregressively predicting future frames, with code for training on action videos and evaluating on perception tasks like segmentation and depth estimation.

How It Works

1
🔍 Discover DeltaTok

You stumble upon this exciting project that helps computers understand videos by capturing tiny changes between frames and predicting what happens next.

2
💻 Set up your workspace

You install simple tools on your computer to get everything ready for working with videos.

3
📥 Gather video collections

You download sets of real-world videos like action clips and street scenes to teach the system.

4
🚀 Train the frame compressor

You launch the training so it learns to squeeze each video change into one smart token, watching progress as it improves.

5
🔮 Build the future predictor

Using the compressor, you train a companion that dreams up possible next moments in videos.

6
Test on new videos

You run it on fresh clips to see predictions for tasks like spotting objects or measuring distances.

🎉 Unlock video magic

Now you have a powerful tool that efficiently models and forecasts video worlds, ready for your experiments.
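
Steps 5 and 6 amount to a loop: predict the next delta token, apply it to the current frame features, and repeat. The sketch below is a toy stand-in, not the DeltaWorld model. `predict_next_token` is a hypothetical predictor that damps the last token and adds noise, `rollout` is an illustrative helper, and frame features are plain NumPy vectors rather than real encoder outputs.

```python
import numpy as np

FEAT_DIM = 8  # hypothetical feature width; token width matches for simplicity


def predict_next_token(history, rng):
    """Hypothetical predictor: damp the last token and add sampling noise."""
    return 0.9 * history[-1] + 0.05 * rng.normal(size=history[-1].shape)


def rollout(start_feats, first_token, steps, seed):
    """Autoregressively extend a token history and accumulate the deltas."""
    rng = np.random.default_rng(seed)
    history = [first_token]
    feats = [start_feats + first_token]  # each token acts as a frame-to-frame change
    for _ in range(steps - 1):
        token = predict_next_token(history, rng)
        history.append(token)            # feed the prediction back in
        feats.append(feats[-1] + token)
    return np.stack(feats)               # shape: (steps, FEAT_DIM)


init_rng = np.random.default_rng(0)
start = init_rng.normal(size=FEAT_DIM)
first = 0.1 * init_rng.normal(size=FEAT_DIM)

# Different seeds share the first step but drift apart afterwards.
future_a = rollout(start, first, steps=5, seed=1)
future_b = rollout(start, first, steps=5, seed=2)
print(future_a.shape)
```

Sampling with different seeds is what stands in here for the "possible next moments": each seed yields a different plausible continuation from the same starting frame.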

AI-Generated Review

What is deltatok?

DeltaTok compresses the change between consecutive video frames into a single token using frozen DINOv3 features, which cuts sequence length for world modeling: a frame is worth one token. DeltaWorld autoregressively generates diverse future tokens from these, supporting downstream tasks such as segmentation, depth estimation, and RGB reconstruction. The Python repo uses the PyTorch Lightning CLI for training on Kinetics-700 and evaluation on VSPW, Cityscapes, and KITTI; pretrained models are available from Hugging Face.
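
To make the idea concrete, here is a toy sketch of compressing a frame-to-frame feature change into one vector. It is not the repository's tokenizer: random arrays stand in for frozen DINOv3 patch features, and the mean-pool-plus-linear encoder, the broadcast decoder, and every name (`encode_delta`, `decode_delta`, `W_enc`, `W_dec`) and size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_PATCHES = 16  # hypothetical patches per frame
FEAT_DIM = 8      # hypothetical patch feature width
TOKEN_DIM = 8     # hypothetical delta-token width

# Hypothetical linear maps; a real tokenizer would learn its parameters.
W_enc = rng.normal(size=(FEAT_DIM, TOKEN_DIM)) / np.sqrt(FEAT_DIM)
W_dec = rng.normal(size=(TOKEN_DIM, FEAT_DIM)) / np.sqrt(TOKEN_DIM)


def encode_delta(prev_feats, next_feats):
    """Pool the per-patch feature change and project it to one token."""
    delta = next_feats - prev_feats  # (NUM_PATCHES, FEAT_DIM)
    pooled = delta.mean(axis=0)      # (FEAT_DIM,)
    return pooled @ W_enc            # (TOKEN_DIM,)


def decode_delta(prev_feats, token):
    """Apply the decoded change to every patch of the previous frame."""
    change = token @ W_dec           # (FEAT_DIM,)
    return prev_feats + change       # broadcasts over patches


prev_feats = rng.normal(size=(NUM_PATCHES, FEAT_DIM))
next_feats = prev_feats + 0.1 * rng.normal(size=(NUM_PATCHES, FEAT_DIM))

token = encode_delta(prev_feats, next_feats)
recon = decode_delta(prev_feats, token)
print(token.shape, recon.shape)  # (8,) (16, 8)
```

The point of the sketch is the shapes: a whole grid of patch features per frame goes in, and a single vector per frame transition comes out.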

Why is it gaining traction?

One-token-per-frame beats patch-heavy alternatives like DINOWorld, enabling longer prediction horizons with less compute. CLI-driven workflows (python main.py fit/validate) and multi-GPU scaling make reproduction straightforward, and the CVPR 2026 Highlight designation draws attention. Prebuilt task heads deliver mIoU 58-70 on segmentation and RMSE 2.8 on depth out of the box.
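
The compute argument is simple arithmetic. The sketch below compares sequence lengths for a patch-token model and a one-token-per-frame model, then the ratio of pairwise attention interactions, assuming self-attention cost grows with the square of the sequence length. The 256 patch tokens per frame and the 16-frame horizon are made-up values, not figures from the paper or the repo.

```python
def sequence_length(num_frames: int, tokens_per_frame: int) -> int:
    """Total tokens a model must attend over for a clip."""
    return num_frames * tokens_per_frame


def attention_pairs(seq_len: int) -> int:
    """Pairwise interactions in full self-attention (quadratic in length)."""
    return seq_len * seq_len


NUM_FRAMES = 16     # hypothetical prediction horizon
PATCH_TOKENS = 256  # hypothetical patch tokens per frame

patch_len = sequence_length(NUM_FRAMES, PATCH_TOKENS)
delta_len = sequence_length(NUM_FRAMES, 1)  # one delta token per frame
ratio = attention_pairs(patch_len) // attention_pairs(delta_len)

print(patch_len, delta_len, ratio)  # 4096 16 65536
```

Under these assumptions the sequence is 256 times shorter and full attention touches 65,536 times fewer token pairs, which is why longer horizons become affordable.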

Who should use this?

Video generation researchers building world models for robotics or simulation. Autonomous driving engineers who need future-frame predictions on KITTI or Cityscapes. ML developers prototyping efficient tokenizers beyond standard ViTs.

Verdict

Early but promising: 46 stars and 1.0% credibility reflect academic freshness, yet the detailed README, Hugging Face models, and solid docs lower the barrier to entry. Try it if token efficiency matters; skip it if you need production stability.
