
Omni2Sound — Your Multimodal Audio Generation Codebase (CVPR 2026 Highlight)

Found Apr 27, 2026 at 19 stars.
AI Summary

Omni2Sound is a unified open-source tool for generating temporally aligned audio from video inputs, text descriptions, or both, achieving top performance on audio synthesis benchmarks.

How It Works

1. 🔍 Discover Omni2Sound

You find this free tool while searching for ways to add realistic sounds to videos or create audio from simple descriptions.

2. 📥 Get the tool

Download the ready-to-use package and launch it on your machine.

3. Pick your input

📹 Video only: upload a video and let it create matching sounds like footsteps or music.

💬 Text only: type a prompt like 'rain on window' to hear lifelike audio.

🎞️ Video + text: upload a video and add text hints for even better-matched sounds.

4. Generate magic

Hit the button and watch as it creates synchronized, high-quality audio in seconds.

5. 🎉 Enjoy your audio

Play back the perfectly timed sounds that bring your video or idea to life.
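The input-picking step above amounts to a simple dispatch over which inputs are present. A minimal sketch of that logic (the function name and the mode labels VT2A/V2A/T2A follow the task names used below; this is illustrative, not Omni2Sound's actual API):

```python
from typing import Optional

def pick_mode(video_path: Optional[str] = None,
              text_prompt: Optional[str] = None) -> str:
    """Choose the generation task from whichever inputs are present:
    video + text -> VT2A, video only -> V2A, text only -> T2A."""
    if video_path and text_prompt:
        return "VT2A"
    if video_path:
        return "V2A"
    if text_prompt:
        return "T2A"
    raise ValueError("Provide a video, a text prompt, or both.")

print(pick_mode(text_prompt="rain on window"))  # T2A
```

The same single-entry-point shape is what lets one model cover all three tasks instead of shipping three separate tools.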


AI-Generated Review

What is Omni2Sound?

Omni2Sound is a Python codebase for unified multimodal audio generation, handling video+text-to-audio (VT2A), video-to-audio (V2A), and text-to-audio (T2A) in one model. It produces temporally synced, realistic soundtracks from raw MP4s or text prompts, removing the need to maintain a separate model for each input type. Download pretrained weights from Hugging Face, run inference via simple shell scripts for WAV/MP4 output, or finetune on custom datasets with the provided training pipelines.
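A shell-script-driven inference run like the one described typically boils down to building one command line per task. A hedged sketch of that pattern (the script name `inference.py` and every flag here are hypothetical, not the repo's actual CLI):

```python
def build_inference_cmd(video=None, prompt=None, out="out.wav"):
    """Assemble an argument list for a V2A/T2A/VT2A inference run.
    Script name and flags are illustrative placeholders."""
    if not (video or prompt):
        raise ValueError("Need a video, a text prompt, or both.")
    cmd = ["python", "inference.py", "--output", out]
    if video:
        cmd += ["--video", video]   # V2A or VT2A
    if prompt:
        cmd += ["--prompt", prompt]  # T2A or VT2A
    return cmd

print(build_inference_cmd(video="clip.mp4", prompt="footsteps on gravel"))
```

The returned list could be passed to `subprocess.run` as-is; the point is that one entry point covers all three tasks by toggling inputs.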

Why is it gaining traction?

As a CVPR 2026 highlight, it delivers SOTA results on unified benchmarks like VGGSound-Omni via smart data (SoundAtlas) rather than complex architectures, extending stable-audio-tools for plug-and-play use. Developers dig the online feature extraction—no preprocessing needed—and robustness to off-screen sounds or partial inputs. Open models and benchmarks on HF make experimentation fast.
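Temporal alignment ultimately means the generated waveform must span exactly the video's duration. A minimal arithmetic sketch of that constraint (the 44.1 kHz sample rate is an assumption for illustration; Omni2Sound's actual rate may differ):

```python
def aligned_samples(num_frames: int, fps: float, sample_rate: int = 44100) -> int:
    """Length in samples of an audio track that exactly matches the
    duration of a clip with `num_frames` frames at `fps` frames/sec."""
    duration_s = num_frames / fps
    return round(duration_s * sample_rate)

# A 250-frame clip at 25 fps lasts 10 s -> 441000 samples at 44.1 kHz.
print(aligned_samples(250, 25.0))
```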

Who should use this?

Audio ML researchers replicating CVPR audio generation papers, video AI engineers syncing sound to clips in editing pipelines, or indie devs prototyping multimodal apps like auto-dubbed reels. Ideal if you're finetuning on proprietary video-text-audio pairs without rebuilding from scratch.

Verdict

Grab it for research or prototypes: solid docs, runnable scripts, and academic cred make it worth a look despite only 19 stars. Still early; expect tweaks for production scale.


