scrya-com/rotorquant

RotorQuant: Clifford algebra vector quantization for LLM KV cache compression. 10-19x faster than TurboQuant, 44x fewer parameters.

Python · 36 stars · 100% credibility · Found Mar 27, 2026
AI Summary

Implements Google's TurboQuant baseline and an improved RotorQuant method that dramatically compress the key-value (KV) caches large language models build during long-context inference, enabling longer contexts with minimal accuracy loss.
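To make the idea concrete, here is a minimal sketch of quantizing a KV cache, using simple per-vector 4-bit scalar quantization as a stand-in for the repo's rotated vector quantization; none of these names come from RotorQuant's actual API.

```python
# Minimal sketch of KV-cache quantization using per-vector 4-bit scalar
# quantization as a stand-in for the repo's rotated vector quantization.
import torch

def quantize_4bit(x: torch.Tensor):
    """Map each cached vector to integer codes in [0, 15] plus scale/offset."""
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / 15.0        # 16 quantization levels
    codes = ((x - lo) / scale).round().clamp(0, 15).to(torch.uint8)
    return codes, scale, lo

def dequantize_4bit(codes, scale, lo):
    return codes.float() * scale + lo               # reconstruct at attention time

kv = torch.randn(4096, 128)        # 4096 cached tokens, head_dim 128
codes, scale, lo = quantize_4bit(kv)
approx = dequantize_4bit(codes, scale, lo)
print((kv - approx).abs().mean())  # small reconstruction error
# fp16 cache: 4096 * 128 * 2 bytes; 4-bit codes: a quarter of that once packed.
```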

How It Works

1. 🔍 Discover memory saver

You learn about a technique that lets AI chatbots remember very long conversations without exhausting your machine's memory.

2. 📥 Get the tool

Download this free library and add it to your AI project to shrink memory use.

3. 🔗 Hook it up

Attach it to your language model, and it quietly compresses the model's working memory (the KV cache) to save space; a rough sketch of this step follows the list.

4. 🧪 Test it out

Run the built-in checks to confirm it works smoothly on sample data.

5. 📈 See huge wins

Watch benchmarks show your AI handling massive texts with a cache roughly 5x smaller and quantization up to 19x faster.

6. 🤖 Power real AI

Drop it into your chatbot or text generator for very long conversations without slowdowns.

🎉 AI remembers forever

Celebrate as your AI now manages book-length chats on regular hardware, staying sharp and speedy.
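As a rough illustration of steps 3 and 4 above, the sketch below wraps a toy codec around KV-cache writes and round-trips it on sample data; QuantizedKVCache and the encode/decode hooks are hypothetical, not RotorQuant's confirmed interface.

```python
# Hypothetical integration sketch for steps 3-4; QuantizedKVCache and the
# encode/decode hooks are illustrative, not RotorQuant's actual API.
import torch

class QuantizedKVCache:
    """Store compressed codes instead of full-precision KV tensors."""
    def __init__(self, encode, decode):
        self.encode, self.decode = encode, decode
        self.k_codes, self.v_codes = [], []

    def append(self, k: torch.Tensor, v: torch.Tensor):
        self.k_codes.append(self.encode(k))    # compress on write
        self.v_codes.append(self.encode(v))

    def read(self):
        # Dequantize on read so attention sees ordinary float tensors.
        keys = torch.cat([self.decode(c) for c in self.k_codes])
        values = torch.cat([self.decode(c) for c in self.v_codes])
        return keys, values

# Quick check in the spirit of step 4: the round trip preserves shapes.
encode = lambda t: (t * 16).round().clamp(-127, 127).to(torch.int8)  # toy codec
decode = lambda c: c.float() / 16
cache = QuantizedKVCache(encode, decode)
cache.append(torch.randn(1, 64), torch.randn(1, 64))
keys, values = cache.read()
print(keys.shape, values.shape)   # torch.Size([1, 64]) torch.Size([1, 64])
```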


AI-Generated Review

What is RotorQuant?

RotorQuant is a Python library for compressing LLM KV caches via vector quantization, reimagining Google's TurboQuant with Clifford algebra for dramatic gains: 10-19x faster quantization and 44x fewer parameters while matching attention fidelity on real models like Qwen2.5-3B. It delivers drop-in compressors for keys and values, plus portable kernels (Triton for NVIDIA/AMD GPUs, Metal for Apple Silicon) that slash cache sizes from GBs to MBs for longer contexts. Benchmark and validation scripts are included for testing on your own models.
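A plausible usage flow might look like the sketch below; the import path, the RotorQuantizer class, and every method name are assumptions for illustration, not taken from the repo.

```python
# Hypothetical usage sketch -- the import path, RotorQuantizer, and all method
# names below are assumptions for illustration, not RotorQuant's verified API.
import torch
from rotorquant import RotorQuantizer   # assumed import path

keys = torch.randn(8192, 128)            # cached attention keys, head_dim 128

q = RotorQuantizer(dim=128, bits=4)      # assumed constructor
q.fit(keys[:1024])                       # calibrate rotor/codebook on a sample
codes = q.encode(keys)                   # compact integer codes (MBs, not GBs)
approx = q.decode(codes)                 # dequantize for attention

cos = torch.nn.functional.cosine_similarity(keys, approx, dim=-1)
print(cos.mean())                        # the review cites cosine sim > 0.99
```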

Why is it gaining traction?

It beats TurboQuant on speed (the fully fused kernels run 100-650x faster than pure PyTorch paths) and on parameter efficiency, using compact rotors instead of full rotation matrices, without sacrificing unbiased inner products or needle-in-a-haystack retrieval. Developers like the one-time setup (pip install plus optional Triton) that gives portable acceleration across hardware, and the real-model validation showing cosine similarity above 0.99 at 3-5x compression.
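To see why rotors are parameter-lean: a rotor rotates vectors within a plane, so a single angle can stand in for an entire d x d matrix while leaving inner products (and thus attention scores) intact. A minimal, matrix-free sketch, not the repo's actual construction:

```python
# Illustration of rotor parameter efficiency (not RotorQuant's construction):
# a single-plane rotation needs one angle, versus d*d entries for a matrix.
import numpy as np

d, theta, i, j = 128, 0.3, 0, 1      # rotate in the plane of axes e_i, e_j

def rotor_apply(x, i, j, theta):
    """Matrix-free plane rotation -- the action of a simple rotor R x R~."""
    y = x.copy()
    y[..., i] = np.cos(theta) * x[..., i] - np.sin(theta) * x[..., j]
    y[..., j] = np.sin(theta) * x[..., i] + np.cos(theta) * x[..., j]
    return y

x = np.random.randn(4, d)
y = rotor_apply(x, i, j, theta)
# Rotations preserve inner products, so attention scores stay unbiased:
print(np.allclose(x @ x.T, y @ y.T))            # True
print(d * d, "matrix entries vs 1 angle per plane rotation")
```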

Who should use this?

LLM serving engineers pushing 128K+ contexts on 24GB GPUs, where the KV cache is the bottleneck; fine-tuners optimizing inference throughput for Qwen/Llama models; and mobile/edge deployers needing tiny codebooks. Skip it if you're CPU-only or working with sub-8K contexts.

Verdict

Early alpha (36 stars, 100% credibility) with solid docs, tests, and validation scripts, though you'll need to build the CUDA/Triton extensions yourself. Grab it for experiments if TurboQuant is your baseline; the benchmarks alone justify a test.
