FutureMLS-Lab / OSCAR
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
OSCAR is a memory-compression technique for AI language models. It analyzes how a model processes sample conversations offline, then derives model-specific rotation settings that let the model store its key-value (KV) cache, the intermediate state it keeps for every token of a conversation, in 2 bits per value, using up to 7 times less memory than normal. This makes it possible to run powerful AI assistants on hardware with limited memory while maintaining nearly the same answer quality. The project integrates with the open-source SGLang serving framework and supports several popular models.
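To make the memory figure concrete, here is a minimal arithmetic sketch. The source does not say how the "7 times" number is derived, so everything below is an assumption for illustration: a 16-bit baseline cache, 2-bit codes, and a hypothetical per-group overhead of 32 bits (for example one 16-bit scale plus one 16-bit zero point) shared by 128 values. The function name kv_compression_ratio is also made up for this example.

```python
def kv_compression_ratio(base_bits=16, quant_bits=2, group_size=128, metadata_bits=32):
    """Ratio of baseline storage to quantized storage per cached value.

    All defaults are illustrative assumptions, not OSCAR's documented settings.
    metadata_bits is the per-group overhead, amortized over group_size values.
    """
    effective_bits = quant_bits + metadata_bits / group_size
    return base_bits / effective_bits


ratio = kv_compression_ratio()
ideal = kv_compression_ratio(metadata_bits=0)
print(f"with metadata: {ratio:.2f}x, without metadata: {ideal:.2f}x")
# prints "with metadata: 7.11x, without metadata: 8.00x"
```

Under these assumed numbers, going from 16 bits to 2 bits gives 8x before overhead, and the shared scale and offset bring it down to roughly 7x. Different group sizes or metadata formats would shift the result.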
How It Works
You have an AI assistant you want to use for long conversations, but you're concerned your GPU doesn't have enough memory to hold the state the model accumulates as the conversation grows.
You find a tool that claims to shrink how much memory AI models need by up to 7 times, without making the answers worse.
The system runs a sample set of your prompts through the model offline and studies the statistics of the values it caches, creating a compression plan tailored to that model.
The tool creates special rotation settings unique to your model and use case, like a custom tuning profile for your hardware.
Your AI model starts up using the optimized settings, now storing its KV cache in a compact 2-bit format that fits in less memory.
The model works nearly as well as before but uses far less memory, letting you have extended conversations without running out of GPU memory.
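The steps above can be sketched in a few lines of NumPy. This is not OSCAR's actual algorithm, which the source does not describe. It is a generic illustration of the idea the project name suggests: fit an orthogonal rotation offline from the covariance of calibration data, rotate cached vectors, and store them as 2-bit codes. The function names (fit_rotation, quantize_2bit, dequantize), the PCA-style choice of rotation, and the per-row scaling are all assumptions, and the synthetic data stands in for real key vectors.

```python
import numpy as np


def fit_rotation(calib):
    """Offline step (assumed): eigenvectors of the calibration covariance."""
    centered = calib - calib.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(calib)
    _, eigvecs = np.linalg.eigh(cov)  # symmetric input gives orthonormal columns
    return eigvecs


def quantize_2bit(x):
    """Per-row asymmetric quantization to 4 levels (codes 0..3).

    Codes are kept one per uint8 for clarity; packing four codes
    into each byte is omitted here.
    """
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map 2-bit codes back to approximate real values."""
    return codes * scale + lo


rng = np.random.default_rng(0)
head_dim = 64
mixing = rng.normal(size=(head_dim, head_dim))          # synthetic correlations
calib = rng.normal(size=(2048, head_dim)) @ mixing      # stand-in calibration set
rotation = fit_rotation(calib)                          # computed once, offline

keys = rng.normal(size=(16, head_dim)) @ mixing         # stand-in cached vectors
rotated = keys @ rotation
codes, scale, lo = quantize_2bit(rotated)               # what would be stored
restored = dequantize(codes, scale, lo) @ rotation.T    # undo rotation on read
```

Because the rotation is orthogonal, it can be undone exactly with its transpose, so the only loss comes from rounding to four levels. How much a given rotation reduces that loss depends on the data and on how the rotation is chosen, which is the part a method like OSCAR would tune.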