AmesianX / TurboQuant

Public

TurboQuant KV Cache Compression for llama.cpp — 5.2x memory reduction with near-lossless quality | Implementation of Google DeepMind's TurboQuant (ICLR 2026)

19 stars · 3 forks · 100% credibility
Found Apr 03, 2026 at 19 stars.
C++
AI Summary

TurboQuant implements Google DeepMind's KV cache compression technique in llama.cpp, reducing memory usage by up to 5.2x while preserving FP16-level quality.

How It Works

1
💡 Discover local AI magic

You hear about running powerful AI chats on your own computer without cloud fees, but big models eat too much memory.

2
📥 Grab the AI chat software

Download the free llama.cpp program that lets everyday folks run AI models at home.

3
🔧 Add the memory supercharger

Install TurboQuant, the clever upgrade that squeezes AI memory use by up to 5 times without losing smarts.

4
🧠 Load your favorite AI brain

Pick a smart AI model file and connect it to your upgraded chat software.

5
⚙️ Flip on memory saver mode

Pick cache-type settings like 'tbqp3' to switch on the compression with a single flag.

🎉 Chat smarter, use less memory!

Your AI responds just as well, but now fits on everyday computers with way less RAM – victory!
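The five steps above boil down to one command line. A sketch of the invocation, assuming the cache-type names quoted on this page ('tbqp3' for keys, 'tbq3' for values) and a placeholder model path:

```shell
./llama-cli -m ./models/your-model.gguf -c 32768 \
    --cache-type-k tbqp3 --cache-type-v tbq3 \
    -p "Hello"
```

`--cache-type-k` and `--cache-type-v` are standard llama.cpp options; the tbqp3/tbq3 values would only be accepted by this repo's patched build.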

AI-Generated Review

What is TurboQuant?

TurboQuant brings Google DeepMind's TurboQuant algorithm (ICLR 2026) to llama.cpp, compressing the KV cache to 3-4 bits for up to 5.2x memory reduction while keeping near-lossless FP16 quality. Written in C++, it slashes VRAM needs for long-context inference on models like Qwen3.5, letting you run bigger batches or extend contexts without OOM errors. Just add flags like `--cache-type-k tbqp3 --cache-type-v tbq3` to your llama.cpp CLI.
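As a sanity check, the headline 5.2x figure is consistent with simple bit accounting. A back-of-envelope sketch (the per-block FP16 scale and the 256-value block size are my assumptions, not details stated by the repo):

```cpp
#include <cassert>

// Bits per cached value under FP16 versus a low-bit code plus one
// FP16 scale amortized across a block of `block_size` values.
// Assumption: per-block FP16 scales; the real on-disk format may differ.
double compression_ratio(double code_bits, double block_size) {
    const double fp16_bits = 16.0;
    return fp16_bits / (code_bits + fp16_bits / block_size);
}
```

A 3-bit code with 256-value blocks gives 16 / 3.0625 ≈ 5.22x, in line with the claimed 5.2x; a raw 3-bit code with zero scale overhead would cap out at 16/3 ≈ 5.33x.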

Why is it gaining traction?

Unlike basic q8_0 or q4_0 cache quantization, TurboQuant uses online vector quantization with near-optimal distortion rates, often matching or beating FP16 perplexity in benchmarks, and even speeding up generation by 12%. Auto head_dim detection handles tricky dims like 576 (GLM-4.7-Flash) or 64 with smart fallbacks, plus fixes for corruption bugs. Developers dig the drop-in compatibility with llama-bench and Ollama-style TurboQuant models.
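To make "cache quant" concrete, here is a minimal per-block scalar-quantization sketch in the spirit of llama.cpp's existing cache types: one FP32 scale per block plus a 3-bit code per value. This is an illustration only; TurboQuant's actual method is online *vector* quantization, which this sketch does not implement, and `QBlock3`/`quantize3` are hypothetical names:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical 3-bit block format: a per-block absmax scale plus one
// 3-bit signed code per value (levels -3..3, stored as 0..6).
struct QBlock3 {
    float scale;                 // per-block scale = absmax / 3
    std::vector<uint8_t> codes;  // one 3-bit code per value
};

QBlock3 quantize3(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    QBlock3 b;
    b.scale = amax > 0.0f ? amax / 3.0f : 1.0f;
    b.codes.reserve(x.size());
    for (float v : x) {
        int q = (int)std::lround(v / b.scale);  // nearest level in [-3, 3]
        q = std::clamp(q, -3, 3);
        b.codes.push_back((uint8_t)(q + 3));    // shift to unsigned storage
    }
    return b;
}

std::vector<float> dequantize3(const QBlock3& b) {
    std::vector<float> out;
    out.reserve(b.codes.size());
    for (uint8_t c : b.codes) out.push_back(((int)c - 3) * b.scale);
    return out;
}
```

Round-trip error is bounded by half a quantization step (scale/2), which is why per-block absmax scaling keeps a single outlier from wrecking the rest of the block; vector quantization improves on this by coding several values jointly against a shared codebook.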

Who should use this?

Llama.cpp users maxing out GPU memory on 30B+ models during long chats or RAG pipelines. Ideal for edge deployers squeezing 122B params onto consumer cards, or anyone benchmarking turbo quant AI for production inference. Skip if you're on CPU-only or non-llama.cpp stacks.

Verdict

Grab it if VRAM is your bottleneck: benchmarks deliver real 5.2x wins with minimal quality hit. At 19 stars and 100% credibility, it's early (v1.4.0 just fixed key bugs), so test thoroughly on your models before prod, but the DeepMind roots make it worth watching.
