hustvl / MoDA
A hardware-aware, efficient implementation of "Mixture-of-Depths Attention".

91
2
100% credibility
Found Mar 18, 2026 at 91 stars.
AI Analysis
AI Summary

MoDA is a research project presenting Mixture-of-Depths Attention, a technique to enhance large language models by enabling attention heads to access key-value pairs from prior layers, with planned code releases for kernels and training recipes.

How It Works

1
🔍 Discover MoDA

You stumble upon this new research project on GitHub that promises to make AI models smarter by better using their deeper layers.

2
📖 Read the story

You explore the paper and overview to understand how MoDA lets AI pay attention to important info from earlier steps without losing it.

3
📊 See the wins

You get excited looking at charts and tables showing MoDA improves performance on language tasks and runs almost as fast as top methods.

4
🛠️ Get your tools ready

You gather the everyday building blocks needed for AI experiments, like fresh math libraries and data handlers.

5
📦 Add MoDA to your kit

You slip the MoDA pieces into your setup from the project's special folder to unlock its powers.

6
▶️ Give it a spin

You launch a quick test to watch MoDA blend attention from current and past layers in action.

MoDA shines

Your AI experiments now benefit from deeper smarts with better results and smooth speed, ready for bigger creations.
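The quick test in step 6 can be sketched as a toy script. This is a minimal NumPy illustration, not the project's actual API: the function name `moda_attention` and the layout below are hypothetical, standing in for the real Triton kernels. The idea it demonstrates is the one the steps describe: a query attends over key-value pairs pooled from the current layer and a prior layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, kv_current, kv_prior):
    """Toy mixture-of-depths attention: queries attend over
    KV pairs from the current layer AND a prior layer."""
    k_cur, v_cur = kv_current
    k_pri, v_pri = kv_prior
    # Pool the two depths' KV pairs along the sequence axis.
    k = np.concatenate([k_cur, k_pri], axis=0)   # (2T, d)
    v = np.concatenate([v_cur, v_pri], axis=0)   # (2T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (T, 2T)
    return softmax(scores) @ v                   # (T, d)

rng = np.random.default_rng(0)
T, d = 4, 8
q = rng.standard_normal((T, d))
kv_cur = (rng.standard_normal((T, d)), rng.standard_normal((T, d)))
kv_pri = (rng.standard_normal((T, d)), rng.standard_normal((T, d)))
out = moda_attention(q, kv_cur, kv_pri)
print(out.shape)  # (4, 8)
```

Each output token is a weighted average over twice as many value vectors as plain attention, which is where the small FLOPs overhead the review cites would come from.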


AI-Generated Review

What is MoDA?

MoDA delivers a hardware-aware, efficient implementation of Mixture-of-Depths Attention, letting attention heads in deep LLMs pull KV pairs from both the current layer and prior depths to fight signal degradation as models scale deeper. Built on PyTorch and Triton kernels, it matches 97.3% of FlashAttention-2's speed at 64K sequence length on A100s; users install a local package to swap it into transformer stacks for training or inference. Developers get plug-and-play attention that improves perplexity by 0.14 on average and downstream scores by 2.11% at just 3.7% extra FLOPs.

Why is it gaining traction?

Unlike plain depth residuals or dense projections, MoDA mixes sequence and depth attention data-dependently, delivering real gains in validation benchmarks without bloating memory or compute much. Hardware-aware transformers like this stand out for efficient natural language processing, especially chunked layouts that slash access overhead in group-query attention setups. The hook? Proven kernel benchmarks scaling to long contexts and deep stacks, plus easy testing via a single Python script.
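The "data-dependent" mixing mentioned above can be pictured as a per-token gate rather than a fixed residual weight. The sketch below is an assumption about the general shape of such a mechanism, not MoDA's actual formulation: a sigmoid gate computed from each query blends attention over the current layer's KV with attention over a prior layer's KV.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Standard scaled dot-product attention.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def depth_mixed_attention(q, kv_seq, kv_depth, w_gate):
    """Data-dependent mix: a per-token sigmoid gate (computed from
    the query) blends sequence attention with depth attention.
    `w_gate` is a hypothetical learned gate projection."""
    out_seq = attend(q, *kv_seq)     # attention over current-layer KV
    out_dep = attend(q, *kv_depth)   # attention over prior-layer KV
    g = sigmoid(q @ w_gate)          # (T, 1), one gate per token
    return g * out_seq + (1.0 - g) * out_dep

rng = np.random.default_rng(1)
T, d = 5, 8
q = rng.standard_normal((T, d))
kv_seq = (rng.standard_normal((T, d)), rng.standard_normal((T, d)))
kv_depth = (rng.standard_normal((T, d)), rng.standard_normal((T, d)))
w_gate = rng.standard_normal((d, 1))
out = depth_mixed_attention(q, kv_seq, kv_depth, w_gate)
print(out.shape)  # (5, 8)
```

Because the gate depends on the token itself, each position can lean on deep-layer context only when it helps, which is the contrast the review draws against plain (input-independent) depth residuals.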

Who should use this?

LLM engineers training 1B+ models on long sequences, seeking hardware-aware algorithms for efficient machine learning without refactoring from scratch. Fine-tuners optimizing deep transformers for NLP tasks like classification or generation on A100/H100 clusters. Researchers prototyping depth-scaling experiments beyond standard OLMo baselines.

Verdict

Promising for efficient attention in deep LLMs, but at 91 stars and 1.0% credibility it's still pre-release: kernels and full training recipes are TODOs, so hold off for production use. Watch this repo if you're into hardware-aware implementations; star it and revisit after the post-paper code drop.


