caiovicentino/eoq-quantization

EOQ: Entropy-Optimal Quantization for LLMs. 11-41% smaller than GGUF Q4_K_M with near-FP16 perplexity.

14 stars · 2 forks
Found Mar 30, 2026 at 14 stars.
AI Analysis (Python)

AI Summary

Implements EOQ, an entropy-optimal quantization method using absmax quantization and rANS entropy coding to compress large language model weights with minimal quality loss.
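The absmax step mentioned above can be sketched in a few lines of numpy. This is an illustrative reconstruction under assumed parameters (4-bit codes, 64-element blocks), not the repo's actual implementation; the rANS coding stage is omitted here.

```python
import numpy as np

def absmax_quantize(w, bits=4, block=64):
    """Block absmax quantization: each block of weights is scaled by its
    own max absolute value, then rounded to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit codes
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                  # guard all-zero blocks
    q = np.clip(np.round(w / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def absmax_dequantize(q, scales):
    """Reconstruct approximate FP weights from codes and per-block scales."""
    return (q * scales).astype(np.float32).ravel()

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight tensor
q, s = absmax_quantize(w)
w_hat = absmax_dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

Rounding bounds the per-element error by half a block scale, which is why the dequantized tensor stays close to the original.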

How It Works

1
🔍 Discover EOQ

You hear about a clever way to shrink huge AI models so they fit on everyday computers without losing smarts.

2
📖 Explore the project

Visit the page to see impressive charts showing models 3x smaller but just as capable, with real chat examples.

3
🚀 Try the chat demo

Run a quick chat with a tiny AI brain that feels full-sized, and be amazed at how fast and smart it responds.

4
📥 Grab a ready model

Download one of the pre-shrunk AI models from the links and load it up in seconds.

5
💬 Chat away

Start conversations with your slimmed-down AI helper, enjoying the speed and low memory use.

6
🎉 Shrink your own AI

Compress your favorite model to pocket size and share it with friends, bringing powerful AI everywhere.

AI-Generated Review

What is eoq-quantization?

eoq-quantization compresses LLM weights using entropy-optimal quantization in Python, delivering 11-41% smaller files than GGUF Q4_K_M while keeping near-FP16 perplexity. It applies block absmax quantization followed by rANS entropy coding to create .eoq files that load as standard FP16 safetensors via transformers—no custom runtime needed. Users get 2-3.5x smaller downloads and GPU-accelerated dequantization for 5x faster loading.
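The size advantage over plain 4-bit packing comes from a simple observation: after absmax quantization, the integer codes are far from uniformly distributed (weights cluster near zero), so their Shannon entropy sits well below the nominal bit width, and an entropy coder such as rANS can exploit that. A hedged numpy sketch of the measurement (a generic absmax round, not the repo's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1 << 16)      # toy Gaussian-ish weight tensor

# Generic 4-bit absmax quantization (illustrative, not the repo's code).
qmax = 7
scale = np.abs(w).max() / qmax
q = np.round(w / scale).astype(np.int8)

# Shannon entropy of the code distribution, in bits per weight.
_, counts = np.unique(q, return_counts=True)
p = counts / counts.sum()
entropy = -(p * np.log2(p)).sum()
print(f"{entropy:.2f} bits/weight vs 4.00 for plain 4-bit packing")
```

An ideal entropy coder approaches this entropy, so bell-shaped weight distributions compress noticeably below 4 bits per weight, consistent with the 11-41% savings claimed over Q4_K_M.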

Why is it gaining traction?

It matches complex K-quants in quality-per-byte with simpler code and zero inference slowdown, plus dynamic mixed-bit widths (3-6 bits per tensor) and PolarQuant+AWQ for practically lossless results. Published Qwen models on HuggingFace show 3.58x compression at a +0.06 perplexity delta, and torchao integration hits 95% of FP16 speed with 65% less VRAM. The first things developers notice are the halved download times and the drop-in compatibility.
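The "dynamic mixed-bit widths" idea can be sketched as choosing, per tensor, the smallest bit width whose reconstruction error stays under a budget. This is a hypothetical illustration; the repo's actual selection criterion is not shown here, and `rel_mse_budget` is an assumed knob.

```python
import numpy as np

def absmax_roundtrip(w, bits):
    """Quantize-dequantize round trip at a given signed bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def pick_bits(w, rel_mse_budget, choices=(3, 4, 5, 6)):
    """Smallest bit width whose relative MSE fits the budget."""
    var = np.var(w)
    for bits in choices:
        rel_mse = np.mean((w - absmax_roundtrip(w, bits)) ** 2) / var
        if rel_mse <= rel_mse_budget:
            return bits
    return choices[-1]            # fall back to the widest option

rng = np.random.default_rng(0)
tensor = rng.normal(size=4096)
print("chosen width:", pick_bits(tensor, rel_mse_budget=1e-2))
```

A tighter budget can only push the choice toward wider codes, which is what lets well-behaved tensors drop to 3 bits while outlier-heavy ones keep 5-6.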

Who should use this?

LLM inference engineers deploying Qwen or GLM models on consumer GPUs like RTX 6000, where download size and VRAM matter. Local AI tinkerers quantizing 7-70B models for faster prototyping, or edge deployers needing sub-5GB files without quality hits. Skip if you rely on llama.cpp exclusively.

Verdict

Promising for quantization-heavy workflows; try the published models today. A low star count (14) and a 1.0% credibility score signal early-stage code. The README documentation is solid, but expect some rough edges until adoption grows.


