Alberto-Codes / turboquant-vllm
TurboQuant KV cache compression for consumer GPUs: 3.76x compression validated on Molmo2 + RTX 4090
TurboQuant-vLLM is a drop-in plugin for vLLM that compresses the key-value (KV) cache by up to 3.76x, cutting GPU memory use during inference while maintaining near-identical output quality.
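To make the ratio concrete, here is a back-of-envelope sizing sketch; every model shape number in it is an illustrative assumption, not a measurement from this repo:

```python
# Back-of-envelope KV-cache sizing. All shape numbers below are
# illustrative assumptions, not measurements from this repo.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem  # 2x: keys + values

fp16 = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                      seq_len=32_768, batch=1, bytes_per_elem=2)
compressed = fp16 / 3.76  # the compression ratio reported in this README
print(f"fp16 KV cache: {fp16 / 2**30:.2f} GiB -> compressed: {compressed / 2**30:.2f} GiB")
# fp16 KV cache: 4.00 GiB -> compressed: 1.06 GiB
```

At long context lengths it is usually the KV cache, not the model weights, that overflows a 24 GB consumer card, so compressing it directly extends the usable context.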
How It Works
You're running an AI assistant for chat or video analysis, but long conversations or videos crash it because they overflow your GPU's memory.
You find turboquant-vllm, a simple add-on that shrinks that memory use by up to 3.76x without dulling the model's answers.
Install it with one quick command; there are no complicated steps.
Then add a single flag when starting your AI server, and it automatically uses less memory (see the sketch below).
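Concretely, the install-and-flag workflow might look like the sketch below. The pip package name is assumed from the repo name, and the plugin's actual option is not confirmed here; `kv_cache_dtype` is vLLM's stock KV-cache quantization argument, shown as the shape such a single flag takes.

```python
# Sketch of the two steps above. Assumptions (not confirmed by this repo):
# the pip package name, and that the plugin hooks into vLLM once installed.
#
#   pip install turboquant-vllm   # assumed package name
#
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",  # stand-in model; use your own chat or vision model
    kv_cache_dtype="fp8",       # vLLM's built-in KV-cache quantization flag, shown
                                # as the kind of single switch the plugin adds
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

On the server side the equivalent pattern is a single switch to `vllm serve`, e.g. vLLM's own `--kv-cache-dtype fp8`.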
Now you can process longer videos and chats that used to crash, and the responses stay just as good.
Your assistant runs smoother on everyday hardware, handling more at once with no noticeable difference in results.