Official Code for See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding (CVPR 2026)
SWIM (See What I Mean) is a research project by university and industry researchers that aims to help AI understand specific objects in videos from natural language descriptions. The accompanying paper was peer-reviewed and published at CVPR 2026, but the code, trained models, and dataset are still under internal review and have not yet been released. Users can follow the repository to be notified when they become available. The core innovation is teaching AI to focus on the correct visual regions when generating descriptions of objects referred to in natural language.
How It Works
You want AI to understand specific objects in videos, but it keeps getting confused or making things up.
A research paper catches your eye: it's about teaching AI to focus on exactly the right object using simple descriptions.
The researchers explain how their method helps AI pay attention to the correct visual areas when describing objects.
You look for the code and models, but find they're still being reviewed before release.
Bookmark it on GitHub to easily find it later and show your interest.
Turn on alerts to get an email the moment everything becomes available.
When the code, models, and dataset are released, you'll be the first to know and can start experimenting.