cloudflareresearch

Lossless compression of BF16 MLP weights for LLM inference on NVIDIA Hopper GPUs

Found Apr 20, 2026 at 36 stars.
AI Analysis
CUDA
AI Summary

This project provides CUDA kernels that losslessly compress the BF16 MLP weights of large language models, cutting memory traffic and speeding up inference on NVIDIA Hopper GPUs.

How It Works

1
🔍 Discover unweight-kernels

You learn about a technique for shrinking BF16 model weights without losing any information, making LLMs run faster on high-end GPUs.

2
📖 Explore the details

Read the README and accompanying report to see how compressing just the most compressible parts of the weights yields real speed gains.

3
💻 Get your computer ready

Make sure a Hopper-class NVIDIA GPU such as an H100 is available, with a recent CUDA toolkit installed.

4
🛠️ Put the pieces together

Follow the build steps (a simple make) to compile the compression library on your machine.

5
🔗 Connect to your AI setup

Integrate the resulting library into the stack that serves your large language models.

6
⚙️ Fine-tune for best results

Pick the encoding and per-layer scheduling strategy that best fits your model's size and needs.

🎉 Speed up your AI

Watch your model use less memory and respond faster, with no loss in accuracy.

AI-Generated Review

What is unweight-kernels?

Unweight-kernels delivers CUDA kernels for lossless compression of BF16 MLP weights in LLMs, slashing model size by about 20% overall through techniques like Huffman-coding exponents while keeping full numerical fidelity. It provides encoding, decoding, transcoding, and reconstructive matmul pipelines optimized for NVIDIA Hopper GPUs like H100 or H200, integrating with ThunderKittens for fast inference without extra HBM trips. Build it with a simple make command to get a static library ready for your LLM serving stack.
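The repo's actual encoder is a CUDA kernel; as a rough illustration of why Huffman-coding the exponent field works, here is a hypothetical Python sketch (my own, not the repo's code) that splits simulated BF16 weights into sign, exponent, and mantissa, Huffman-codes only the heavily skewed exponent byte, and keeps the other bits verbatim:

```python
import heapq
import random
import struct

def bf16_word(x: float) -> int:
    # Truncate an IEEE float32 to BF16 by keeping the top 16 bits.
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def huffman_code_lengths(symbols):
    # Return a code length per symbol from a Huffman tree over frequencies.
    freq = {}
    for s in symbols:
        freq[s] = freq.get(s, 0) + 1
    if len(freq) == 1:
        return {s: 1 for s in freq}
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: l + 1 for s, l in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical demo on Gaussian "weights" (stand-ins for real MLP weights).
rng = random.Random(0)
words = [bf16_word(rng.gauss(0.0, 0.02)) for _ in range(1 << 14)]
exponents = [(w >> 7) & 0xFF for w in words]  # BF16: 1 sign, 8 exp, 7 mantissa

lengths = huffman_code_lengths(exponents)
coded_exp_bits = sum(lengths[e] for e in exponents)
raw_bits = len(words) * 16
packed_bits = len(words) * 8 + coded_exp_bits  # sign + mantissa stored raw
ratio = packed_bits / raw_bits
print(f"compressed/original = {ratio:.2f}")
```

Because typical weight magnitudes cluster in a narrow range, the exponent byte carries far fewer than 8 bits of entropy, which is where the lossless savings come from; the real kernels additionally fuse decoding into the matmul so decompression costs no extra HBM round trips.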

Why is it gaining traction?

Unlike generic lossless compression algorithms that bloat decode times, unweight-kernels targets BF16 entropy patterns in LLM weights for 30% MLP compression with seamless Hopper WGMMA integration and autotuned pipelines per layer. Developers notice end-to-end throughput gains on models like Llama 3.1 8B, plus hard/easy layer scheduling for better preprocess overlap: no accuracy loss, just leaner memory and faster loads. It stands out in the GitHub lossless-compression space by focusing purely on inference acceleration.
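To see where a figure like 30% could come from, here is a back-of-envelope sketch (my own illustration, not the repo's analysis) that estimates the Shannon entropy of the BF16 exponent field for Gaussian-distributed weights and derives the resulting size bound when only that field is entropy-coded:

```python
import math
import random
import struct

def bf16_exponent(x: float) -> int:
    # The float32 exponent bits equal the BF16 exponent field after truncation.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF

def entropy_bits(symbols):
    # Shannon entropy in bits per symbol of the empirical distribution.
    counts = {}
    for s in symbols:
        counts[s] = counts.get(s, 0) + 1
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rng = random.Random(1)
exps = [bf16_exponent(rng.gauss(0.0, 0.02)) for _ in range(1 << 15)]
h = entropy_bits(exps)
# Size bound if only the exponent is entropy-coded (sign + mantissa raw):
bound = (1 + 7 + h) / 16
print(f"exponent entropy ~ {h:.2f} bits; bound ~ {bound:.0%} of original size")
```

The standard deviation 0.02 is an arbitrary stand-in for real layer statistics; actual savings depend on each layer's exponent distribution, which is presumably why the project autotunes per layer.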

Who should use this?

LLM inference engineers deploying 7B+ models on H100/H200 clusters, especially those hitting memory walls with BF16 weights. Teams optimizing Cloudflare-style serving or custom vLLM forks will value the reconstructive matmul for persistent kernels. Hopper-only shops experimenting with weight compression techniques before scaling to production.

Verdict

Promising for Hopper-bound LLM inference but too early at 36 stars and 1.0% credibility—docs are thin, no benchmarks beyond the report, and zero tests mean you'll audit it yourself. Integrate if you're on compatible hardware and need BF16 compression now; otherwise, watch for maturity.
