Dao-AILab

Fast Polar Decomposition for Muon

86 stars · 100% credibility
Found Apr 01, 2026 at 86 stars.
AI Analysis · Python

AI Summary

A library offering a hardware-optimized polar decomposition routine for the matrix orthogonalization step at the heart of AI training optimizers like Muon.

How It Works

1. 🔍 Discover Faster AI Training

You hear about a clever math trick that makes training big AI models up to twice as fast on powerful computers.

2. 💻 Get the Tool Ready

You download the free tool and set it up on your high-end computer with the right graphics card.

3. 🧪 Try the Sample

You run a quick example script to see the speedy math magic in action right away.

4. 🚀 Power Up Your AI Trainer

You connect the tool to your AI model's learning setup, choosing options for even better speed and stability.

5. 📊 Run Benchmarks

You test different versions to confirm it's faster than the usual methods.

6. 📈 Train Smarter and Faster

Your AI model trains quicker with the same quality, saving time and energy.

🎉 Enjoy the Speed Boost

Now your projects finish sooner, letting you experiment more and build better AI.
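The benchmark step above can be sketched in plain PyTorch: a hypothetical harness that times an approximate Newton-Schulz orthogonalization against an exact SVD-based polar decomposition on the same batch. The shapes, step count, and coefficients below are illustrative assumptions, not the repo's actual benchmark settings.

```python
import time
import torch

def ns_orthogonalize(G, steps=5, eps=1e-7):
    # Approximate polar factor via an odd quintic Newton-Schulz iteration;
    # (a, b, c) are the coefficients commonly used with Muon (an assumption
    # here, not necessarily this repo's defaults).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm(dim=(-2, -1), keepdim=True) + eps)  # spectral norm <= 1
    for _ in range(steps):
        A = X @ X.mT                                    # batched Gram matrix
        X = a * X + (b * A + c * A @ A) @ X
    return X

def svd_polar(G):
    # Exact polar factor via SVD: the slow baseline to compare against.
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh

batch = torch.randn(8, 256, 256)
for name, fn in [("newton-schulz", ns_orthogonalize), ("svd", svd_polar)]:
    t0 = time.perf_counter()
    out = fn(batch)
    dt = time.perf_counter() - t0
    print(f"{name}: {dt * 1e3:.1f} ms, output shape {tuple(out.shape)}")
```

On CPU the gap is modest; the 2x claims in the summary apply to the repo's tuned GPU kernels on Hopper/Blackwell hardware.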

AI-Generated Review

What is gram-newton-schulz?

A Python library for fast polar decomposition of an arbitrary matrix, using a Gram Newton-Schulz iteration that serves as a drop-in replacement for standard Newton-Schulz in PyTorch code. It normalizes and orthogonalizes batches of matrices up to 2x faster on NVIDIA Hopper (H100) or Blackwell (B200/B300) GPUs, powering the Muon optimizer to update transformer weights like QKV projections without hurting accuracy. Install via pip and call it directly on tensors, or wrap it in Muon for full optimizer use.
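The repo's exact API isn't reproduced here, but the technique it accelerates (Newton-Schulz orthogonalization of a matrix toward its polar factor, as used in Muon) can be sketched in plain PyTorch. The coefficients below are the widely used Muon defaults; treat this as a slow reference, not the library's implementation:

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration approximating the polar factor
    # (orthogonalization) of G. Coefficients follow the Muon defaults
    # (a, b, c) = (3.4445, -4.7750, 2.0315), an assumption here.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + eps)       # Frobenius normalization: singular values <= 1
    transposed = X.shape[-2] > X.shape[-1]
    if transposed:
        X = X.mT                   # iterate on the side with the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.mT               # Gram matrix (the symmetric GEMM the repo optimizes)
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.mT
    return X
```

A few iterations push all singular values toward 1 without computing an SVD, which is why the inner loop reduces to a handful of GEMMs that map well onto tensor cores.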

Why is it gaining traction?

It delivers measurable speedups via symmetric GEMM kernels tuned for Hopper/Blackwell, plus autotuning of restart points for stability across coefficient choices. It supports custom splits for QKV/SwiGLU weights, scalar optimizers for norms, and coefficient presets such as YOU or Polar Express. Benchmarks show clear wins over PyTorch baselines for large matrix batches in training loops.
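The custom-split support mentioned above addresses a real pitfall: a fused QKV weight holds three logically separate projections, and orthogonalizing it as one matrix mixes them. A minimal sketch of the idea, using an exact SVD polar factor as a stand-in for the fast Gram Newton-Schulz kernel (shapes and the fused layout are hypothetical):

```python
import torch

def orthogonalize(W):
    # Exact polar factor via SVD; stands in for the fast approximate kernel.
    U, _, Vh = torch.linalg.svd(W, full_matrices=False)
    return U @ Vh

d = 64
qkv = torch.randn(3 * d, d)        # hypothetical fused QKV update, [3*d, d]
q, k, v = qkv.chunk(3, dim=0)      # split into the three projections first
qkv_orth = torch.cat([orthogonalize(m) for m in (q, k, v)], dim=0)
```

Orthogonalizing each chunk independently keeps the per-projection spectra well conditioned, which is what a fused-weight split option in the optimizer is for.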

Who should use this?

ML engineers training LLMs on H100/B200 clusters with Muon-style orthogonal optimizers, especially when orthogonalizing QKV or FFN weights in transformers. It also fits researchers tweaking custom Newton-Schulz coefficients or needing a fast polar decomposition implementation for decomposition-heavy schedules.

Verdict

Grab it if you're on target hardware and chasing faster decomposition in optimizers; the included examples and benchmarks make trials easy. At 86 stars, the project is still young and lacks broad test coverage, so validate stability before relying on it in production.
