OpenMOSS / MOSS-Audio-Tokenizer

MOSS-Audio-Tokenizer is a Causal Transformer-based audio tokenizer built on the CAT architecture. Trained on 3M hours of diverse audio, it supports streaming and variable bitrates, delivering SOTA reconstruction and strong performance in generation and understanding—serving as a unified interface for next-generation native audio language models.

AI Summary

This repository is the official code for MOSS-Audio-Tokenizer, a high-fidelity neural audio codec that compresses raw audio waveforms into discrete tokens and reconstructs them with near-lossless fidelity.

How It Works

1. 🔍 Discover the audio squeezer

You stumble upon MOSS Audio Tokenizer, a clever tool that packs full sound clips into tiny codes while keeping nearly every detail intact.

2. 📥 Set up your playground

Grab the ready-to-use files and prepare a simple spot on your computer to play with sounds.

3. 🎵 Choose a sound clip

Pick any audio file from your collection, like a voice recording or favorite tune.

4. 🗜️ Squeeze it into codes

Hit go and see your sound magically shrink into a handful of compact codes that capture nearly everything (a code sketch of the full round trip follows these steps).

5. 🔄 Rebuild from codes

Feed those codes back in and watch a brand new audio file come to life.

6. 🎉 Hear the magic

Play the rebuilt sound and smile: it sounds almost identical to the original, ready for your next audio adventure!
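
The steps above boil down to a short encode/decode round trip. Here is a minimal sketch of it in Python; the Hugging Face hub ID, the exact encode/decode method names, and the output tensor shape are assumptions inferred from this page's description, not a verified API.

```python
# Minimal round-trip sketch. The hub ID, method names, and shapes
# below are assumptions from the project description, not verified.
import torch
import torchaudio
from transformers import AutoModel

MODEL_ID = "OpenMOSS/MOSS-Audio-Tokenizer"  # hypothetical hub ID

model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

# Load a clip and resample to the tokenizer's 24 kHz input rate.
wav, sr = torchaudio.load("my_clip.wav")
wav = torchaudio.functional.resample(wav, sr, 24_000)

with torch.no_grad():
    codes = model.encode(wav)      # waveform -> compact discrete tokens
    rebuilt = model.decode(codes)  # tokens   -> reconstructed waveform

# Flatten to (1, T) mono for saving; real shape handling depends on the API.
torchaudio.save("rebuilt.wav", rebuilt.reshape(1, -1).cpu(), 24_000)
```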


AI-Generated Review

What is MOSS-Audio-Tokenizer?

MOSS-Audio-Tokenizer turns raw 24 kHz audio into compact discrete tokens and reconstructs it with near-lossless quality, using a causal Transformer (CAT) architecture trained on 3 million hours of diverse audio. Developers load it via Hugging Face Transformers in Python, calling simple encode/decode methods that handle variable bitrates from 0.125 kbps to 4 kbps. It acts as a unified interface for audio language models, enabling streaming inference for real-time apps.
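
As a sketch of what that variable-bitrate usage might look like (the `bitrate` keyword here is an illustration, not a confirmed signature; the real interface may expose a codebook count or quality level instead):

```python
# Hypothetical variable-bitrate usage; the `bitrate` keyword is an
# assumption for illustration and may differ from the real signature.
import torch
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "OpenMOSS/MOSS-Audio-Tokenizer",  # hypothetical hub ID
    trust_remote_code=True,
).eval()
wav, sr = torchaudio.load("my_clip.wav")
wav = torchaudio.functional.resample(wav, sr, 24_000)

with torch.no_grad():
    codes_lo = model.encode(wav, bitrate=0.125)  # kbps: smallest footprint
    codes_hi = model.encode(wav, bitrate=4.0)    # kbps: highest fidelity
    audio_lo = model.decode(codes_lo)            # lossy but compact
    audio_hi = model.decode(codes_hi)            # near-lossless
```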

Why is it gaining traction?

It delivers state-of-the-art reconstruction across speech, music, and sound effects without relying on pretrained encoders like Whisper, and its semantically rich tokens boost downstream generation and understanding tasks. Streaming via a chunk_duration option and adjustable bitrates make it practical for low-latency production, setting it apart from more rigid alternatives. Early benchmarks show superior metrics on LibriSpeech and AudioSet at ultra-low bitrates.
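
For a rough sense of how causal, chunked inference could be driven: the review only names chunk_duration, so the loop below fakes a live stream by slicing the input, and a real streaming API would presumably carry encoder state across chunks internally.

```python
# Streaming sketch: because the tokenizer is causal, each chunk can be
# tokenized without seeing future audio. This loop only simulates a
# stream; the library's own chunk_duration option (mentioned in the
# review) presumably manages cross-chunk state for you.
import torch
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "OpenMOSS/MOSS-Audio-Tokenizer",  # hypothetical hub ID
    trust_remote_code=True,
).eval()

wav, sr = torchaudio.load("live_input.wav")
wav = torchaudio.functional.resample(wav, sr, 24_000)

chunk = int(0.5 * 24_000)  # 0.5 s of samples per step
pieces = []
with torch.no_grad():
    for start in range(0, wav.shape[-1], chunk):
        codes = model.encode(wav[..., start:start + chunk])
        pieces.append(model.decode(codes))

out = torch.cat(pieces, dim=-1)  # stitched low-latency reconstruction
```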

Who should use this?

Audio ML engineers building native audio foundation models or TTS pipelines needing scalable tokenization. Teams in voice agents or music generation wanting CNN-free, end-to-end codecs with streaming. Devs prototyping real-time ASR who prioritize fidelity over complexity.

Verdict

Grab it if you're experimenting with audio language models: solid docs, a demo script, and the Apache 2.0 license make setup fast. But the project is still young and lightly tested in the wild, so evaluate reconstruction quality thoroughly before committing.


