shisa-ai / FastDMS


Production-speed compact Dynamic Memory Sparsification (DMS) for KV cache compression

AI Summary

FastDMS is a Python library for running supported AI language models much faster and with far less memory, by compressing the model's working memory (the KV cache).

How It Works

1
🔍 Discover FastDMS

You hear about a simple way to make AI chatbots run much faster on your own computer using less memory.

2
📦 Install easily

With one quick command, you add it to your Python setup, no complicated steps needed.

3
🤖 Grab a speedy model

Download a ready-to-use, DMS-trained model from Hugging Face.

4
💬 Start chatting

Write a few lines of code (see the sketch after this list), ask a question, and watch it respond lightning-fast.

5
📈 Feel the difference

Notice how it responds more quickly and uses far less memory than before.

🚀 Supercharged AI

Now you enjoy blazing responses for stories, questions, or ideas anytime.
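
In code, the steps above boil down to something like the sketch below. The package name, the FastDMS import, and the generate() call are assumptions made for illustration, not the library's documented API; the checkpoint name is the one suggested in the review further down. Check the repo's README for the real quickstart.

```python
# Hypothetical quickstart -- import path, class name, and generate() signature
# are assumptions for illustration; consult the FastDMS README for the real API.
#
#   pip install fastdms        # step 2: one quick install command (assumed package name)

from fastdms import FastDMS    # assumed entry point

# Step 3: grab a DMS-trained checkpoint from Hugging Face.
engine = FastDMS.from_pretrained("shisa-ai/Llama-3.2-1B-DMS-8x")

# Step 4: ask a question and print the answer.
reply = engine.generate(
    "Explain KV cache compression in one paragraph.",
    max_new_tokens=256,
)
print(reply)
```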

AI-Generated Review

What is FastDMS?

FastDMS is a Python serving engine that delivers production-speed compact Dynamic Memory Sparsification (DMS) for KV cache compression in LLMs. It slashes memory use by 4-8x compared to dense BF16 or FP8 caches while decoding 1.5-2x faster than vLLM on models like Llama-3.2-1B and Qwen3-8B. Users pip install it, load a DMS-trained checkpoint from Hugging Face, and call a simple generate API for batched inference with long contexts.

Why is it gaining traction?

It outperforms vLLM baselines in real workloads (exact token pools, no over-provisioning) with FP8 compact KV and allocator-visible savings, while keeping quality on par with the dense baseline (low KL divergence, high token-match rate). The standalone design avoids vLLM plugin complexity, and it ships a quick training recipe for custom DMS eviction heads. Benchmarks on WikiText-2 and max-context tests make the claims verifiable.
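
As a rough illustration of how the "low KLD, high token match" claim can be checked, the helper below compares per-position next-token distributions from a dense-cache run and a compressed-cache run over the same prompts. It is not taken from the repo's own evaluation code; it only assumes you already have the two logit tensors in hand.

```python
import torch
import torch.nn.functional as F

def cache_quality(dense_logits: torch.Tensor, compressed_logits: torch.Tensor):
    """Compare next-token predictions of a dense-cache run vs. a compressed-cache run.

    Both inputs are [batch, seq_len, vocab] logits from the same prompts.
    Returns (mean per-token KL divergence, greedy token-match rate).
    """
    log_p = F.log_softmax(dense_logits, dim=-1)        # reference distribution
    log_q = F.log_softmax(compressed_logits, dim=-1)   # compressed-cache distribution
    # KL(p || q), summed over the vocabulary, averaged over all positions.
    kld = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1).mean()
    # Fraction of positions where greedy decoding would pick the same token.
    match = (dense_logits.argmax(-1) == compressed_logits.argmax(-1)).float().mean()
    return kld.item(), match.item()
```

On WikiText-2-style text, a near-zero KL divergence and a token-match rate close to 1.0 would support the quality-parity claim.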

Who should use this?

LLM serving engineers handling long-context inference on workstation GPUs like the RTX 6000, where the KV cache balloons at 128k tokens. Ideal for multi-tenant apps needing 5-6x memory compression without quality loss, or for researchers prototyping DMS on Llama/Qwen before scaling up.
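
To see why 128k tokens is the pain point, here is a back-of-the-envelope sizing of the dense KV cache. The layer and head counts below are the commonly published Llama-3.2-1B configuration, used only for illustration; check the model's config.json before leaning on the exact numbers.

```python
# Rough KV cache sizing for a single 128k-token sequence with a dense BF16 cache.
# Assumed Llama-3.2-1B-style dims: 16 layers, 8 KV heads, head_dim 64.
layers, kv_heads, head_dim = 16, 8, 64
seq_len, bytes_per_elem = 128_000, 2              # 128k tokens, 2 bytes per BF16 value

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # keys + values
dense_gib = per_token * seq_len / 1024**3

print(f"dense BF16 cache: {dense_gib:.1f} GiB per sequence")     # ~3.9 GiB
print(f"at ~6x DMS compression: {dense_gib / 6:.2f} GiB")        # ~0.65 GiB
```

Multiply by the number of concurrent tenants and the dense cache quickly dominates a 24-48 GB card, which is where a 5-6x compression ratio changes what fits.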

Verdict

A promising reference implementation for compact DMS: try the quickstart with shisa-ai/Llama-3.2-1B-DMS-8x. But at 11 stars and 1.0% credibility it is early, and it lacks broad model support and production hardening. Use it for experiments, not yet for critical deployments.
