tonbistudio / turboquant-pytorch
From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for LLM KV cache compression. 5x compression at 3-bit with 99.5% attention fidelity.
A ready-to-test toolkit that compresses the KV cache (a language model's temporary working memory) to support much longer inputs with little quality loss.
How It Works
The core trick is to quantize the model's KV cache, the working memory that grows with context length, so the same hardware can handle much longer inputs without more memory.
Start by installing Python and the project's dependencies (PyTorch and a few helper packages).
Run the included tests on synthetic data to check the compression right away. They show the KV cache shrinking 3x to 7x while attention outputs stay nearly identical to the full-precision baseline.
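The synthetic test above can be sketched roughly as follows. This is a minimal illustration using plain per-row uniform quantization, not TurboQuant's actual algorithm; all function names here are hypothetical and only the headline numbers (3-bit storage, roughly 5x compression versus fp16) come from the repo description.

```python
import numpy as np

def quantize_uniform(x, bits=3):
    """Per-row uniform quantization to 2**bits levels.
    A plain baseline for illustration, NOT the TurboQuant scheme."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((x - lo) / scale).astype(np.uint8)  # 3-bit codes
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 64)).astype(np.float32)  # synthetic KV rows

codes, scale, lo = quantize_uniform(keys, bits=3)
recon = dequantize(codes, scale, lo)

# Compression vs fp16: 16 bits -> 3 bits per value (ignoring the small
# per-row scale/offset overhead), roughly 5.3x.
ratio = 16 / 3
# How faithful is the reconstruction? Cosine similarity per row.
cos = np.sum(keys * recon, axis=-1) / (
    np.linalg.norm(keys, axis=-1) * np.linalg.norm(recon, axis=-1))
print(f"compression ~{ratio:.1f}x, mean cosine similarity {cos.mean():.4f}")
```

Even this naive quantizer keeps the reconstructed vectors closely aligned with the originals, which is why aggressive KV cache compression is feasible at all.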
Then load a small language model, feed it a long passage containing a hidden fact, and compress its KV cache. The model still attends to the right tokens almost perfectly (99.5% attention fidelity at 3-bit), confirming that much longer contexts are practical.
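The attention-fidelity check can be sketched like this: compute attention weights over a cache of full-precision keys, quantize the keys, recompute the weights, and compare the two distributions. Again a simple uniform quantizer stands in for the repo's actual method, and the fidelity metric shown (cosine similarity of attention weights) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
query = rng.normal(size=(d,)).astype(np.float32)      # one query vector
K = rng.normal(size=(256, d)).astype(np.float32)      # cached keys

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Attention weights with the full-precision key cache.
w_full = softmax(K @ query / np.sqrt(d))

# 3-bit per-row uniform quantization of the keys (a stand-in for the
# repo's quantizer, not the actual TurboQuant scheme).
lo = K.min(axis=-1, keepdims=True)
scale = (K.max(axis=-1, keepdims=True) - lo) / 7.0
K_q = np.round((K - lo) / scale) * scale + lo

# Attention weights recomputed from the compressed cache.
w_quant = softmax(K_q @ query / np.sqrt(d))

# "Attention fidelity": how closely the compressed cache reproduces
# the original attention distribution.
fidelity = np.sum(w_full * w_quant) / (
    np.linalg.norm(w_full) * np.linalg.norm(w_quant))
print(f"attention cosine fidelity: {fidelity:.4f}")
```

If the quantized cache still points the model at the same tokens, long-context tasks like hidden-fact retrieval keep working after compression.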
The result: the same model handles far longer inputs within the same memory budget, opening the door to longer conversations and documents.