AmesianX / TurboQuant
TurboQuant KV Cache Compression for llama.cpp: up to 5.2x memory reduction with near-lossless quality. Implementation of Google DeepMind's TurboQuant (ICLR 2026).
TurboQuant implements Google DeepMind's KV cache compression technique in llama.cpp, reducing memory usage by up to 5.2x while preserving FP16-level quality.
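To give a rough sense of what a 5.2x KV cache reduction means in practice, here is a hedged back-of-envelope sketch. The model shape used (32 layers, 32 KV heads, head dimension 128, 4096-token context) is an illustrative assumption, not taken from this repo; only the 5.2x figure comes from the description above.

```python
# Hedged sketch: back-of-envelope KV cache sizing for a Llama-7B-like
# shape. The shape parameters are illustrative assumptions; the 5.2x
# reduction factor is the figure claimed in the repo description.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # K and V each store n_layers * n_kv_heads * head_dim values per token,
    # hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

fp16 = kv_cache_bytes(32, 32, 128, 4096, 2)  # FP16: 2 bytes per value
compressed = fp16 / 5.2                      # claimed 5.2x reduction

print(f"FP16 KV cache:      {fp16 / 2**20:.0f} MiB")
print(f"Compressed (~5.2x): {compressed / 2**20:.0f} MiB")
```

At this assumed shape, a 2 GiB FP16 cache shrinks to roughly 394 MiB, which is the difference between fitting and not fitting on a consumer GPU or laptop.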
How It Works
Large language models are attractive to run locally, without cloud fees, but their memory demands quickly exceed what consumer hardware offers; the KV cache in particular grows with context length.
llama.cpp is a free, open-source runtime that lets anyone run LLMs on an everyday computer.
TurboQuant extends llama.cpp with KV cache compression, cutting cache memory use by up to 5.2x with near-lossless quality.
Download a model file and load it with the TurboQuant-enabled llama.cpp build.
Select a TurboQuant cache type such as 'tbqp3' to enable compression.
The model responds with near-FP16 quality while its KV cache fits in a fraction of the RAM.
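The steps above might look like the following shell sketch. The --cache-type-k and --cache-type-v flags exist in mainline llama.cpp for quantized KV caches; whether TurboQuant exposes 'tbqp3' through those exact flags is an assumption based on the setting name mentioned above, and model.gguf is a placeholder path.

```shell
# Hedged sketch: running llama.cpp with a compressed KV cache.
# "tbqp3" as a cache type is an assumption based on this repo's
# description; mainline llama.cpp ships types like q8_0 and q4_0.

# Build llama.cpp (with the TurboQuant patches applied).
cmake -B build && cmake --build build -j

# Run a prompt with the compressed KV cache enabled for both K and V.
./build/bin/llama-cli -m model.gguf \
  --cache-type-k tbqp3 \
  --cache-type-v tbqp3 \
  -p "Hello"
```

The cache type applies to K and V independently, so mixed precision (for example, a more aggressive type for V than for K) is possible where quality allows.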