H-EmbodVis / NUMINA

[CVPR 2026] When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Found Apr 13, 2026 at 45 stars.
AI Summary

Language: Python

NUMINA is a training-free add-on for the Wan2.1 text-to-video model that corrects mismatches between specified object counts in prompts and the actual numbers generated in videos.

How It Works

1
📰 Discover perfect videos

You learn about a clever fix that makes AI video creators show exactly the number of objects you describe, like three cats playing instead of two or four.

2
📥 Get the video maker

You download Wan2.1, the free text-to-video model that turns written descriptions into smooth animations.

3
🔧 Add the counting magic

You copy NUMINA's files into Wan2.1; the add-on then enforces the counts you ask for by adjusting how the model's attention is distributed.

4
✏️ Describe your scene

You write a simple description of your video, like 'two kittens with two yarn balls', and note the exact counts you want.

5
▶️ Make the video

You tell the program to create it; it runs a quick preview pass to map where the objects land, then regenerates the video so the counts match.

6
🎉 Watch exact results

You enjoy your video with the precise number of objects, ready to share.
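The steps above reduce to a single command line. A minimal sketch of assembling that call in Python, assuming a `--prompt` flag alongside the `--numina` and `--numina_noun_counts` flags quoted in the review (the `--prompt` flag name is an assumption, not confirmed by the repo):

```python
import json
import shlex


def build_numina_command(prompt: str, noun_counts: dict[str, int]) -> str:
    """Assemble a generate.py invocation for a prompt with exact object counts.

    The JSON payload maps each counted noun to its target instance count,
    matching the '{"cats": 3}' example from the repo's usage.
    """
    counts_json = json.dumps(noun_counts)
    parts = [
        "python", "generate.py",
        "--prompt", shlex.quote(prompt),   # --prompt is an assumed flag name
        "--numina",
        "--numina_noun_counts", shlex.quote(counts_json),
    ]
    return " ".join(parts)


cmd = build_numina_command("two kittens with two yarn balls",
                           {"kittens": 2, "yarn balls": 2})
print(cmd)
```

`shlex.quote` keeps the JSON payload intact when the string is pasted into a shell.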

AI-Generated Review

What is NUMINA?

NUMINA is a Python framework that fixes numerical misalignment in text-to-video diffusion models such as Wan2.1, where prompts like "three cats" often yield the wrong count. It runs a two-phase, training-free process: it first analyzes attention maps during a partial denoising run to build a spatial layout of object instances, then modulates attention during a full regeneration to enforce the exact counts. Users integrate it by copying a few files into Wan2.1, then run CLI commands like `python generate.py --numina --numina_noun_counts '{"cats": 3}'` to get count-accurate video output.
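The two-phase idea can be illustrated in miniature. This is not the repo's code, only a hedged sketch: threshold a noun token's cross-attention map, count connected high-attention regions as candidate instances, and derive a gain to amplify or damp that token's attention in the regeneration pass. The function names and the sign-based gain rule are illustrative assumptions:

```python
import numpy as np


def count_attention_regions(attn: np.ndarray, thresh: float = 0.5) -> int:
    """Count connected high-attention regions in a 2D cross-attention map.

    Stand-in for the layout-analysis phase: threshold the map for one noun
    token and count 4-connected components, i.e. candidate object instances.
    """
    mask = attn >= thresh * attn.max()
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                count += 1
                stack = [(i, j)]          # flood-fill one component
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
    return count


def modulation_gain(found: int, target: int, strength: float = 0.3) -> float:
    """Scale factor for the noun token's attention in the regeneration pass:
    >1 when too few instances were found, <1 when too many."""
    return 1.0 + strength * float(np.sign(target - found))
```

In the real method this modulation happens inside the diffusion model's attention layers; the sketch only shows the counting-and-steering control loop around it.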

Why is it gaining traction?

Unlike brute-force seed search or LLM prompt rewriting, NUMINA intervenes directly at the attention level, giving interpretable, principled control and improving counting accuracy by up to 7.4% on benchmarks such as CountBench. It pairs with inference accelerators like EasyCache for faster runs and ships GroundingDINO-based evaluation scripts to verify generated counts. For developers tracking CVPR 2026 papers on GitHub, that makes it an easy drop-in for T2V experiments.
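The GroundingDINO-based evaluation amounts to comparing detected instance counts against the prompt's target counts. A hedged sketch of that counting logic, with the detector output mocked as `(label, score)` pairs; the function name and confidence threshold are assumptions, and no model is called:

```python
from collections import Counter


def count_accuracy(detections: list[tuple[str, float]],
                   targets: dict[str, int],
                   conf_thresh: float = 0.35) -> float:
    """Fraction of nouns whose detected instance count matches the target.

    `detections` mimics an open-vocabulary detector's per-box (label, score)
    output after NMS, e.g. from GroundingDINO; low-confidence boxes are
    dropped before counting.
    """
    counts = Counter(label for label, score in detections
                     if score >= conf_thresh)
    hits = sum(1 for noun, n in targets.items() if counts[noun] == n)
    return hits / len(targets)


dets = [("cat", 0.9), ("cat", 0.8), ("cat", 0.6), ("dog", 0.9), ("dog", 0.2)]
print(count_accuracy(dets, {"cat": 3, "dog": 1}))
```

Averaging this metric over a benchmark's prompts gives the kind of counting-accuracy number the review cites.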

Who should use this?

Video AI researchers prototyping precise generations, like animators needing exact object multiples in scenes. T2V pipeline hackers extending Wan2.1 for apps with numbered elements, such as data viz videos or game cutscenes. Folks tracking CVPR 2026 accepted papers on GitHub for SOTA diffusion hacks.

Verdict

A solid pick for Wan2.1 users chasing numerical fidelity: a strong README with demos and evaluation tools makes it simple to test. At 45 stars it's early, but mature enough for experiments; watch for CVPR 2026 reviews as it matures.
