xiaomi-research

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Found Apr 22, 2026 at 19 stars.
AI Analysis
Python
AI Summary

ControlFoley generates synchronized audio for videos using text descriptions, reference sounds, or video motion alone.

How It Works

1
🔍 Discover ControlFoley

You find this fun tool that adds matching sounds to your videos using simple descriptions.

2
🌐 Try the online demo

Upload a video clip and type the sounds you want, like 'wheels grinding on pavement', then hear them sync with the footage.

3
💻 Set up on your computer

Clone the repository and install its Python dependencies so you can run it locally.

4
📥 Download the sound models

Download the pretrained model weights from Hugging Face.

5
📹 Pick your video and ideas

Choose a video, add a text description or sample sound to guide the audio.

6
🎛️ Generate the sounds

Run inference and watch realistic, time-aligned audio appear for your video.

🎉 Your video sings!

Enjoy your video with perfect, lifelike sounds that match every moment.
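The walkthrough above can be sketched as a small helper that assembles an inference command. The script name and every flag below are assumptions for illustration, not ControlFoley's actual CLI; check the repository's README for the real entry point.

```python
# Hypothetical sketch of assembling a video-to-audio inference command.
# "inference.py" and all flags here are assumed names, not the repo's real CLI.
def build_inference_cmd(video, prompt=None, ref_audio=None, out="out.wav"):
    """Build a command list for a ControlFoley-style run.

    video     : path to the input clip
    prompt    : optional text description, e.g. "wheels grinding on pavement"
    ref_audio : optional reference sound to guide timbre
    """
    cmd = ["python", "inference.py", "--video", video, "--output", out]
    if prompt:
        cmd += ["--text", prompt]          # text-guided mode
    if ref_audio:
        cmd += ["--ref-audio", ref_audio]  # audio-controlled mode
    return cmd
```

A call like `build_inference_cmd("clip.mp4", prompt="wheels grinding on pavement")` yields an argument list you could hand to `subprocess.run`.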

AI-Generated Review

What is controlfoley?

ControlFoley is a Python framework for unified, controllable video-to-audio generation that handles cross-modal conflicts, like when text prompts clash with video content. Feed it a video clip plus optional text descriptions or reference audio, and it outputs synchronized sound effects, dubbing, or full audio tracks, up to 8 seconds long at 44 kHz. Developers get pretrained models from Hugging Face and a CLI for quick inference across modes like text-video, audio-controlled, or pure text-to-audio.
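The stated 8-second, 44 kHz output budget translates to a fixed sample count; a minimal sketch, assuming the page's "44 kHz" means the common 44,100 Hz rate (the helper name is mine, not the repo's API):

```python
SAMPLE_RATE = 44100  # assumed 44.1 kHz behind the page's "44 kHz"
MAX_SECONDS = 8      # stated generation cap

def clip_to_budget(samples):
    """Trim a generated waveform to the 8-second cap:
    44100 samples/s * 8 s = 352,800 samples."""
    return samples[:SAMPLE_RATE * MAX_SECONDS]
```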

Why is it gaining traction?

It stands out by robustly managing conflicting inputs through joint visual encoding and timbre control, delivering better sync and quality than single-modality tools on benchmarks like VGGSound. The online demo and project page let users test controllable generation instantly, while SOTA results on new conflict-handling metrics draw ML folks experimenting with multimodal audio. Low barrier: install deps, download weights, run inference.
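The idea of letting one modality dominate when inputs conflict can be illustrated with a toy weighted blend of conditioning vectors. This is not ControlFoley's published mechanism (the project describes joint visual encoding and timbre control), just a sketch of the concept:

```python
def blend_conditions(text_emb, video_emb, text_weight=0.7):
    """Toy illustration: weighted blend of text and video conditioning vectors.
    A higher text_weight lets the text prompt override conflicting video cues.
    Not ControlFoley's actual method, purely a conceptual sketch."""
    w = text_weight
    return [w * t + (1.0 - w) * v for t, v in zip(text_emb, video_emb)]
```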

Who should use this?

Audio ML researchers benchmarking video-synchronized generation or fine-tuning on custom datasets. Video editors scripting foley effects for short clips, like social media dubs where text overrides visuals. Devs prototyping apps needing text-guided sound matching, such as AR filters or game asset tools.

Verdict

Promising research code for controllable video-to-audio with solid benchmarks, but at 19 stars and 1.0% credibility it's early-stage; expect tweaks before production use. Grab it if you're in multimodal gen; the Apache-licensed code ships fast, though non-commercial models limit deployment.


