THUDM / IndexCache

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Found Mar 16, 2026 at 16 stars.
AI Analysis
AI Summary

IndexCache provides patches for SGLang and vLLM to speed up AI model inference using sparse attention by reusing selected token indices across layers.
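To make the idea concrete, here is a toy sketch (plain Python, not the actual patch code) of cross-layer index reuse: one layer runs the top-k token "indexer", and the following layers reuse its cached indices instead of recomputing them.

```python
import heapq

def topk_indices(scores, k):
    """Toy 'indexer': pick the k token positions with the highest scores."""
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))

def run_layers(per_layer_scores, k, reuse_freq):
    """Recompute indices only on every `reuse_freq`-th layer; reuse otherwise."""
    cached = None
    indexer_calls = 0
    selected = []
    for layer, scores in enumerate(per_layer_scores):
        if layer % reuse_freq == 0:      # these layers run the indexer
            cached = topk_indices(scores, k)
            indexer_calls += 1
        selected.append(cached)          # all other layers reuse the cache
    return selected, indexer_calls

scores = [[0.1, 0.9, 0.3, 0.7], [0.8, 0.2, 0.6, 0.4],
          [0.5, 0.5, 0.1, 0.9], [0.3, 0.8, 0.2, 0.6]]
sel, calls = run_layers(scores, k=2, reuse_freq=4)
# With 4 layers and reuse_freq=4, the indexer runs once instead of 4 times.
```

The real patches do this inside the attention kernels of SGLang/vLLM, but the accounting is the same: fewer indexer invocations, identical sparse-attention plumbing everywhere else.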

How It Works

1
🔍 Discover IndexCache

You hear about IndexCache, a technique that lets large models handle long contexts much faster by reusing the token indices selected by sparse attention across layers instead of recomputing them at every layer.

2
🚀 Pick Your AI Runner

Go with SGLang

Select SGLang if you want fast, high-throughput serving.

Choose vLLM

Pick vLLM for flexible and powerful inference.

3
Add the Speed Boost

Apply the IndexCache patch to your chosen engine so it skips redundant index computation across layers.

4
⚙️ Tune the Speedup Style

Choose a simple repeating pattern (recompute indices every few layers) or a custom per-layer setup to keep just the right balance of speed and quality.

5
▶️ Launch Your Faster AI

Fire up your patched server and watch it process long prompts noticeably faster.

🎉 Enjoy Blazing Speeds

Your AI now thinks up to 1.8 times faster on long texts with almost no loss in quality, making chats and tasks fly.
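The choice in step 4 between a repeating pattern and a custom setup boils down to deciding which layers rerun the indexer. A hypothetical sketch (the real config keys in the patches may differ; semantics assumed here are "recompute every `freq`-th layer" for the simple pattern):

```python
def recompute_layers(num_layers, freq=None, custom=None):
    """Layers that run the top-k indexer; every other layer reuses the cache.

    freq:   simple repeating pattern (recompute every `freq`-th layer).
    custom: explicit list of layers to recompute (custom setup).
    Semantics assumed for illustration; check the patch docs for real options.
    """
    if custom is not None:
        return sorted(set(custom))
    return [layer for layer in range(num_layers) if layer % freq == 0]

# Repeating pattern: on a 64-layer model, freq=4 keeps 16 indexer layers.
pattern = recompute_layers(64, freq=4)
# Custom setup: recompute densely in early layers, sparsely later.
custom = recompute_layers(64, custom=[0, 1, 2, 4, 8, 16, 32, 48])
```

A denser pattern recomputes more often (safer for quality, less speedup); a sparser one reuses more aggressively, which is where the largest gains come from.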

AI-Generated Review

What is IndexCache?

IndexCache delivers patches for SGLang and vLLM, accelerating sparse attention via cross-layer index reuse in DeepSeek Sparse Attention (DSA) models like DeepSeek-V3.2 and GLM-5. It caches and shares top-k token indices across layers, cutting indexer compute by up to 75% at long contexts where it dominates runtime. Users launch servers with simple flags like `--json-model-override-args '{"index_topk_freq": 4}'` for instant speedups, with no extra GPU memory needed.
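The "up to 75%" figure follows directly from the quoted frequency: with indices recomputed only every fourth layer, the per-layer O(L²) indexer runs on one layer in four. A quick back-of-the-envelope check (assuming that is what the `index_topk_freq` value controls):

```python
def indexer_compute_fraction(freq):
    """Fraction of baseline indexer compute remaining when indices are
    recomputed only every `freq`-th layer (other layers reuse the cache)."""
    return 1 / freq

freq = 4  # the value from the quoted --json-model-override-args example
saved = 1 - indexer_compute_fraction(freq)
print(f"indexer compute cut: {saved:.0%}")  # 75% for freq=4
```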

Why is it gaining traction?

It crushes the O(L²) indexing bottleneck in DSA, with 1.82x prefill and 1.48x decode speedups on H100s verified on 744B GLM-5 production workloads, and negligible quality loss via training-free or distillation modes. The drop-in patches apply cleanly to specific engine commits and are configurable via a reuse frequency or custom per-layer patterns. Tsinghua/Z.ai backing plus an arXiv paper add research credibility over generic sparse-attention tweaks.

Who should use this?

Inference teams running DSA models on SGLang or vLLM for long-context apps like 200K-token RAG or reasoning benchmarks. Suited for H100-scale deployments where prefill stalls kill throughput, especially GLM-5 or DeepSeek-V3.2 users chasing E2E gains without model retraining.

Verdict

Worth patching in for DSA inference if you hit indexing walls: the benchmarks hold up and the docs shine with quick starts. But 16 stars signal early days; verify the patch against your engine commit and calibrate reuse patterns before production.

