This is the official code repository for academic research that tests whether video-capable AI models truly understand audio or rely on visual shortcuts. The project provides tools for training multimodal AI assistants to verify audio information more reliably, along with evaluation scripts that measure how well these models understand audio-video relationships. The benchmarks covered include video question answering, audio-visual synchronization detection, and long video comprehension.
How It Works
Learn how the project tests whether AI models truly understand sound in videos or just guess from the pictures.
Read about the diagnostic tests that check whether an AI assistant pays attention to audio or takes shortcuts using visual cues alone.
Fine-tune a multimodal AI model on training data designed to teach it to verify audio information rather than rely on visuals alone.
Run tests on video question answering benchmarks to see how well the assistant understands video content.
Check whether the assistant can detect if sound and picture are properly aligned or out of sync.
Run the full set of evaluations, including long video comprehension, to get a complete picture of the assistant's capabilities.
Receive score breakdowns showing performance across question types, video categories, and difficulty levels.
Use these metrics to judge whether your AI assistant genuinely listens to audio or relies on visual shortcuts.
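The score breakdown described in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual evaluation script: the record layout, the field names (question_type, difficulty, prediction, answer), and the helper accuracy_breakdown are all hypothetical and chosen only for the example.

```python
from collections import defaultdict


def accuracy_breakdown(records, key):
    """Group evaluation records by `key` and return accuracy per group.

    Each record is assumed (hypothetically) to be a dict holding the model's
    `prediction`, the ground-truth `answer`, and the grouping field `key`.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["prediction"] == record["answer"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical evaluation records; real benchmark output will differ.
records = [
    {"question_type": "counting", "difficulty": "easy", "prediction": "A", "answer": "A"},
    {"question_type": "counting", "difficulty": "hard", "prediction": "B", "answer": "C"},
    {"question_type": "sound_source", "difficulty": "easy", "prediction": "D", "answer": "D"},
    {"question_type": "sound_source", "difficulty": "hard", "prediction": "A", "answer": "A"},
]

by_type = accuracy_breakdown(records, "question_type")
by_difficulty = accuracy_breakdown(records, "difficulty")

for name, scores in (("question_type", by_type), ("difficulty", by_difficulty)):
    for group, acc in sorted(scores.items()):
        print(f"{name}={group}: {acc:.2f}")
```

Grouping by a single field at a time keeps the helper reusable: the same function can produce the per-question-type, per-category, or per-difficulty view by changing only the key argument.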