demtmeder / audio-vis-align
Training and evaluation toolkit for audio-visual contrastive representation alignment (CLIP-style, but for audio + video).
Audio-Vis-Align is a research toolkit that trains AI models to understand the relationship between sound and video. It uses two neural networks, one that learns from audio and one that learns from video, trained together so that matching sound-video pairs end up close to each other in a shared embedding space. This allows the model to find which video goes with which sound, or to classify content in either modality. The toolkit includes everything needed to prepare data, train models on large video datasets, and evaluate how well the trained model can match sounds to their corresponding videos. It was developed by researchers at Shanghai Jiao Tong University and is published as a 2025 academic paper.
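The idea of pulling matching pairs together in a shared space can be illustrated with a symmetric contrastive loss of the kind used in CLIP-style training. The following is a minimal, dependency-free sketch and not the toolkit's own code: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions chosen for illustration. A real training run would compute the same quantity with a tensor library on batches of encoder outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(audio_emb, video_emb):
    """Cosine similarity between every audio embedding (rows) and video embedding (columns)."""
    a = [l2_normalize(v) for v in audio_emb]
    b = [l2_normalize(v) for v in video_emb]
    return [[sum(x * y for x, y in zip(ai, bj)) for bj in b] for ai in a]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Average of the audio-to-video and video-to-audio losses (temperature is an assumed value)."""
    sim = similarity_matrix(audio_emb, video_emb)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the video-to-audio direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: pair i of audio belongs with pair i of video.
audio = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[0.9, 0.1], [0.1, 0.9]]  # matching pairs point the same way
video_swapped = [[0.1, 0.9], [0.9, 0.1]]  # matching pairs point different ways

loss_aligned = symmetric_contrastive_loss(audio, video_aligned)
loss_swapped = symmetric_contrastive_loss(audio, video_swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each audio clip is most similar to its own video and high when it is more similar to another clip's video, which is the signal that drives the two encoders toward a shared space.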
How It Works
You have a collection of video clips and want an AI that understands both the audio and visual content together.
You download and install Audio-Vis-Align, which gives you everything needed to teach an AI to understand sound-video pairs.
The toolkit teaches two neural networks—one for sounds, one for video—to place matching pairs close together in a shared understanding space.
Training runs on your graphics cards, gradually improving as it sees thousands of examples of which sounds go with which videos.
You play a sound and ask the AI to find the matching video clip.
You show a video and ask the AI to find the matching sound.
The toolkit shows you accuracy scores revealing how well your AI learned to connect sounds with visuals—higher numbers mean better understanding.
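The retrieval and scoring steps above can be sketched as follows. The source does not say which accuracy metric the toolkit reports, so this example assumes recall@k, a common choice for cross-modal retrieval; the function names and the small similarity matrix are hypothetical. Rows are sounds, columns are videos, and entry (i, i) is the score of a true pair.

```python
def rank_of_match(scores, target):
    """0-based rank of the target item when scores are sorted from highest to lowest."""
    target_score = scores[target]
    return sum(1 for s in scores if s > target_score)

def recall_at_k(sim, k):
    """Fraction of queries whose true match appears among the top k results."""
    hits = sum(1 for i, row in enumerate(sim) if rank_of_match(row, i) < k)
    return hits / len(sim)

# Hypothetical similarity scores between 3 sounds (rows) and 3 videos (columns).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.7, 0.1, 0.5],  # sound 2 scores higher against video 0 than its own video
]

# Sound-to-video retrieval uses the rows; video-to-sound uses the columns.
audio_to_video_r1 = recall_at_k(sim, 1)
video_to_audio_r1 = recall_at_k([list(col) for col in zip(*sim)], 1)
audio_to_video_r2 = recall_at_k(sim, 2)

print(audio_to_video_r1, video_to_audio_r1, audio_to_video_r2)
```

Reporting both directions separately is useful because, as in this toy matrix, a model can retrieve well from video to sound while still confusing some sounds with the wrong video.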