arozanov

TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.

100% credibility
Found Mar 29, 2026 at 25 stars by GitGems.
AI Analysis · Python
AI Summary

This project compresses the key-value (KV) cache of transformer language models on Apple Silicon, cutting working memory so longer contexts and bigger models fit without slowing generation.

How It Works

1
🔍 Discover the speed booster

You learn about turboquant-mlx, a library that shrinks the KV cache so MLX language models run faster and use less memory on Mac computers.

2
📥 Add it to your setup

You install the package alongside mlx-lm, the runner that loads and serves your model.

3
🤖 Load your favorite AI

You load a language model through mlx-lm just as you normally would.

4
🚀 Switch to memory saver

You patch the model for fused attention and swap the standard cache for the compressed one, which quantizes stored keys and values without hurting the model's output quality.

5
💬 Ask away

You type a question or prompt, and the AI starts generating answers just like before.

Feel the difference

Responses arrive at near-FP16 speed while the cache uses up to 4.6x less memory, so long conversations run smoothly on-device.
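The "memory saver" step boils down to quantizing cached keys and values into a few bits each. Here is an illustrative NumPy sketch of that idea, not the project's actual code: group-wise 3-bit quantization with bit-packed storage, the same technique the fused Metal kernels run on-GPU. Every name and the group size below are assumptions of this sketch.

```python
import numpy as np

GROUP = 32  # values sharing one scale/offset (an assumed group size)

def quantize_3bit(x):
    """Group-wise 3-bit quantization with bit-packed storage (illustrative)."""
    g = x.astype(np.float32).reshape(-1, GROUP)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / 7.0 + 1e-8  # 8 levels: 0..7
    codes = np.clip(np.round((g - lo) / scale), 0, 7).astype(np.uint8)
    # Keep only the 3 low bits of each code, then pack 8 bits per byte.
    bits = ((codes.ravel()[:, None] >> np.array([2, 1, 0])) & 1).astype(np.uint8)
    return np.packbits(bits.ravel()), scale.astype(np.float16), lo.astype(np.float16)

def dequantize_3bit(packed, scale, lo, n):
    bits = np.unpackbits(packed)[: 3 * n].reshape(-1, 3)
    codes = bits.astype(np.float32) @ np.array([4.0, 2.0, 1.0])
    return (codes.reshape(-1, GROUP) * scale.astype(np.float32)
            + lo.astype(np.float32)).ravel()

kv = np.random.randn(4096 * GROUP).astype(np.float16)   # toy stand-in KV tensor
packed, scale, lo = quantize_3bit(kv)
stored = packed.nbytes + scale.nbytes + lo.nbytes       # codes + per-group metadata
ratio = kv.nbytes / stored
err = np.abs(dequantize_3bit(packed, scale, lo, kv.size) - kv.astype(np.float32)).max()
print(f"compression {ratio:.1f}x, max abs error {err:.3f}")
```

With 3-bit codes plus FP16 scale/offset per group of 32, storage works out to about 4 bits per value, a 4x reduction over FP16 for this tensor; the real kernels do the packing and unpacking inside attention so the cache never materializes at full precision.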


AI-Generated Review

What is turboquant-mlx?

Turboquant-mlx brings TurboQuant KV cache compression to MLX on Apple Silicon, slashing transformer KV cache memory use by up to 4.6x while hitting 98% of FP16 speed via fused Metal kernels. Written in Python, it acts as a drop-in replacement for mlx-lm's standard cache, letting you run bigger models or longer contexts without swapping to disk. Just load your mlx-lm model, apply a patch for fused attention, create a compressed cache, and generate as usual.

Why is it gaining traction?

It crushes KV cache bloat—critical for autoregressive generation—delivering real compression (like 2.4x at 3-bit) with layer-adaptive modes that keep sensitive layers in FP16 for quality. Fused Metal kernels ensure decode speeds stay close to baseline, unlike slower scalar quantizers, and bit-packed storage maximizes gains. Developers dig the zero-config integration with mlx-lm and end-to-end demos benchmarking Qwen models.

Who should use this?

MLX users on M-series Macs pushing 7B+ LLMs for local inference, especially those hitting memory walls at long contexts (4K+ tokens). Ideal for AI researchers prototyping on laptops or indie devs building chat apps without cloud costs. Skip if you're on non-Metal hardware or need sub-1B models.
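To see where that memory wall comes from, consider the FP16 KV cache of a hypothetical model with 32 layers, 8 KV heads, and head dimension 128 (plausible grouped-query-attention figures, not taken from any specific model). The cache grows linearly with context length:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    """Keys and values: two [kv_heads, seq_len, head_dim] tensors per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

fp16 = kv_cache_bytes(32, 8, 128, 4096)   # hypothetical model at a 4K context
print(f"FP16 cache: {fp16 / 2**20:.0f} MiB; at 4.6x: {fp16 / 4.6 / 2**20:.0f} MiB")
```

At 4K tokens that is half a gigabyte for the cache alone before model weights, which is exactly the headroom a 4.6x compression buys back on a memory-constrained laptop.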

Verdict

Promising early experiment (25 stars, 100% credibility) with a solid README, tests, and an Apache license, so it's worth a trial on your Qwen workflow if you're memory-constrained. Still immature; watch for an upstream mlx-lm merge before relying on it in production.


