FutureMLS-Lab / OSCAR

Public

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

48
6
100% credibility
Found May 23, 2026 at 48 stars.
AI Analysis
Python
AI Summary

OSCAR is a memory-compression technique for AI language models. It analyzes how a model processes a sample of prompts, then derives model-specific rotations that let the model store its intermediate state (the KV cache) in roughly 7 times less memory than normal. This makes it possible to run powerful AI assistants on hardware with limited memory while keeping nearly the same answer quality. The project integrates with the open-source SGLang serving framework and supports several popular models.

How It Works

1. 🤝 You want to run a big AI model but worry about memory

You have an AI assistant you want to use for long conversations, but you're concerned your GPU doesn't have enough memory to keep up with all the thinking.

2. 📚 You discover OSCAR can help

You find a tool that claims to shrink how much memory AI models need by up to 7 times, without making the answers worse.

3. 📧 You let the model learn your conversation style

The system processes a sample set of your prompts and studies how your AI thinks about answers, creating a personalized compression plan.

4. 🔄 You apply the smart compression

The tool creates special rotation settings unique to your model and use case, like a custom tuning profile for your hardware.

5. 🚀 You launch your AI with the compression

Your AI model starts up using the optimized settings, now storing its thinking in a compact format that fits in less memory.

Your AI handles long conversations smoothly

The model works just as well as before but uses far less memory, letting you have extended conversations without running out of GPU memory.

AI-Generated Review

What is OSCAR?

OSCAR is a Python-based KV cache quantization technique that shrinks the KV cache used during large language model inference by roughly 7x using 2-bit precision. Instead of storing keys and values in full 16-bit format, it applies calibrated rotations to the activation space before quantizing, then keeps a small BF16 "sink" for recent tokens. The whole pipeline runs inside SGLang: you dump Q/K/V activations on a calibration set, compute the rotation matrices, and serve with an INT2 KV cache. Pre-computed rotations are also available on HuggingFace if you do not want to run calibration yourself.
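The rotate-then-quantize idea can be illustrated with a small, self-contained sketch. It is not OSCAR's code: it uses a fixed Walsh-Hadamard rotation as a stand-in for the calibrated, covariance-aware rotations described above, a simple min-max 2-bit quantizer, and a made-up input vector. All function names here are hypothetical. The point is only that spreading outlier channels across dimensions before quantizing lowers the 2-bit reconstruction error.

```python
import math


def hadamard_rotate(x):
    """Orthonormal Walsh-Hadamard transform; len(x) must be a power of two.

    The normalized transform is its own inverse, so calling it twice
    returns the input (up to floating-point error).
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in y]


def quantize_2bit(x):
    """Min-max quantization of one group to four levels (codes 0..3)."""
    lo, hi = min(x), max(x)
    scale = (hi - lo) / 3 if hi > lo else 1.0
    codes = [min(3, max(0, round((v - lo) / scale))) for v in x]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map 2-bit codes back to approximate real values."""
    return [lo + c * scale for c in codes]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Hypothetical key vector: small values plus two large outlier channels.
x = [math.sin(i) for i in range(64)]
x[5], x[40] = 20.0, -20.0

# Baseline: quantize directly. The outliers stretch the range, so every
# small value lands on a coarse level far from its true value.
plain_err = mse(x, dequantize(*quantize_2bit(x)))

# Rotate, quantize, dequantize, then rotate back to the original basis.
rotated = hadamard_rotate(x)
codes, scale, lo = quantize_2bit(rotated)
rot_err = mse(x, hadamard_rotate(dequantize(codes, scale, lo)))

print(f"plain 2-bit MSE:   {plain_err:.3f}")
print(f"rotated 2-bit MSE: {rot_err:.3f}")
```

In an attention layer the inverse rotation would not normally be applied explicitly: rotating queries and keys by the same orthogonal matrix leaves their dot products unchanged, so the cache can stay in the rotated basis.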

Why is it gaining traction?

The hook is simple: OSCAR is presented as the only 2-bit KV quantization method that preserves accuracy on reasoning and coding tasks. QuaRot and naive INT2 approaches collapse on GPQA and AIME benchmarks, while OSCAR stays within a few percentage points of full BF16. The trade-offs are a dependence on high-end hardware (H100s), careful calibration, and tight integration with SGLang, but the memory savings are significant enough to matter at scale. The paper includes multi-seed evaluations across five reasoning benchmarks, which gives the numbers more credibility than single-run comparisons.

Who should use this?

Teams running SGLang-based inference serving who want to cut KV cache memory substantially (roughly 7x, by the project's own figures). This is most relevant for production deployments handling long contexts or high-throughput workloads where memory bandwidth is the bottleneck. Researchers exploring KV cache compression will also find the rotation methodology interesting, though they should budget for the H100 compute required for calibration.
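A back-of-the-envelope calculation shows why 2-bit storage lands near 7x rather than a full 8x against a 16-bit baseline. The sketch below assumes a hypothetical layout (groups of 128 elements, each carrying 32 bits of scale and offset metadata) and an optional number of tokens kept at full precision. These parameters are illustrative assumptions, not values taken from the OSCAR repository.

```python
def effective_bits(code_bits=2, group_size=128, metadata_bits=32):
    """Average stored bits per element, including per-group scale/offset."""
    return code_bits + metadata_bits / group_size


def compression_ratio(seq_len, full_precision_tokens=0, baseline_bits=16, **quant):
    """KV cache size at baseline precision divided by the quantized size."""
    kept = min(full_precision_tokens, seq_len)
    quantized = seq_len - kept
    avg_bits = (kept * baseline_bits + quantized * effective_bits(**quant)) / seq_len
    return baseline_bits / avg_bits


print(effective_bits())                                               # 2.25
print(round(compression_ratio(32768), 2))                             # 7.11
print(round(compression_ratio(32768, full_precision_tokens=128), 2))  # 6.95
```

Under these assumptions the metadata overhead and the full-precision tokens together account for the gap between the ideal 8x and the roughly 7x quoted above. The saving also shrinks for short sequences, where the full-precision tokens make up a larger share of the cache.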

Verdict

OSCAR delivers a real memory reduction with minimal accuracy cost, but at 48 stars it is still a young, research-grade project. The documentation is solid for a paper release, but test coverage and production hardening are not yet proven. Worth evaluating for memory-constrained serving scenarios, but treat it as an experimental optimization rather than a drop-in production tool.
