mudler/apex-quant

MoE-aware mixed-precision quantization

19 stars · 1 fork · Python
AI Summary

APEX provides optimized compression techniques for large Mixture-of-Experts AI models to drastically reduce file sizes and boost speed while preserving quality, with ready-to-use profiles and benchmarks.

How It Works

1. 🔍 Discover APEX

You learn about APEX, a smart way to shrink huge AI models so they run faster on everyday computers without losing their thinking power.

2. 📥 Pick Your AI Model

Download a large AI model file that you want to make smaller and quicker to use.
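
If the model lives on Hugging Face, a generic download might look like the sketch below; the repo ID and file name are placeholders, not names the project ships:

```
# Placeholder repo ID and file name -- substitute the MoE model you want to shrink.
huggingface-cli download some-org/SomeMoE-GGUF model-f16.gguf --local-dir .
```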

3. Choose Your Size
🏆 High Quality: keeps the sharpest thinking for tough tasks.
⚖️ Balanced: a great mix of speed and smarts for daily use.
📱 Compact: tiny size for computers with less memory.

4. Compress the Model

Use the simple tool to squeeze your model down – it gets much smaller and zippier while staying just as clever.
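
The review below quotes the project's one-liner for this step. With a local f16 GGUF on disk it looks like this (only the --quality tier flag is confirmed in the review; the other tiers presumably have flags of their own):

```
./scripts/quantize.sh --quality model-f16.gguf output.gguf
```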

5. 📊 Check the Magic

See the test results showing it's faster, smaller, and scores high on smarts checks.
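
If you want to reproduce a perplexity check yourself, stock llama.cpp ships a standard tool for it; the file names here are illustrative:

```
# Compare the compressed model's perplexity against the original on a test text file.
./llama-perplexity -m output.gguf -f wikitext-2-test.raw
```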

6. 🚀 Run It Locally

Load your new slim model into a local AI app and start chatting or creating.
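
For example, with stock llama.cpp (which the project targets), the slimmed-down file loads like any other GGUF; the prompt is illustrative:

```
./llama-cli -m output.gguf -p "Hello! What can you do?"
```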

🎉 AI Supercharged

Enjoy lightning-fast AI responses on your own hardware, saving time and space every day.

AI-Generated Review

What is apex-quant?

Apex-quant is a Python toolkit for MoE-aware mixed-precision quantization of Mixture-of-Experts models using llama.cpp. It generates GGUF files in five tiers—from 21GB Quality to 12GB Mini—that deliver Q8_0-level perplexity and accuracy at half the size, with faster inference on stock llama.cpp. Run `./scripts/quantize.sh --quality model-f16.gguf output.gguf` on a local file, or point it at a HuggingFace model ID, for instant APEX quantization results with no code changes needed.

Why is it gaining traction?

It crushes Unsloth dynamic quants and bartowski IQ formats on perplexity, HellaSwag, MMLU, and speed while using 2x less VRAM—APEX Mini beats IQ2_M across all metrics at similar size. Developers dig the layer-wise precision gradients exploiting MoE sparsity, diverse imatrix for real-world tasks, and full benchmark suite with plots. Pairs seamlessly with LocalAI for local API serving.
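
The "layer-wise precision gradients" idea can be sketched in a few lines of Python. This is a conceptual illustration, not code from the repo; the tensor-name patterns and quant-type assignments are assumptions:

```python
# Conceptual sketch of MoE-aware mixed precision: keep always-active shared
# tensors (attention, router) at higher precision, and push the sparsely
# activated expert weights to lower precision. Patterns/types are illustrative.
def pick_quant_type(tensor_name: str) -> str:
    if "attn" in tensor_name or "ffn_gate_inp" in tensor_name:
        return "Q8_0"   # shared, hot-path tensors: spend the bits here
    if "exps" in tensor_name:
        return "Q4_K"   # expert FFNs: each token activates only a few experts
    return "Q6_K"       # everything else gets a middle ground

for name in ["blk.0.attn_q.weight", "blk.0.ffn_gate_inp.weight",
             "blk.0.ffn_up_exps.weight", "output.weight"]:
    print(f"{name} -> {pick_quant_type(name)}")
```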

Who should use this?

LocalAI users deploying 30-35B MoE models like Qwen3.5 on 16-24GB consumer GPUs. Quantization enthusiasts optimizing for llama.cpp inference without quality loss. Teams needing compact GGUF files for edge hardware or long-context runs.

Verdict

Grab it if you're quantizing MoE models: the benchmarks prove it works, and the scripts make it dead simple, even though 19 stars and a 1.0% credibility score signal an early-stage project. Polish the tests and add more model support to hit escape velocity.


