onur-gokyildiz-bhi

Pure Rust implementation of Google's TurboQuant (ICLR 2026): KV cache compression for LLMs

Found Mar 31, 2026 at 11 stars
AI Summary

tq-kv is a Rust library for TurboQuant KV cache compression that enables efficient local inference of large language models by drastically reducing memory usage while preserving quality.

How It Works

1. 🔍 Discover Memory Magic

You're excited to run big AI chatbots at home but frustrated by high memory use; then you find tq-kv promising huge savings.

2. 📉 See the Wow Factor

Check charts showing your model's memory footprint shrink 7-15x while replies stay sharp and fast.

3. 📥 Grab the Tools

Download the library and the patches tailored for popular runtimes like llama.cpp.

4. 🔧 Plug It In

Follow the setup steps, choosing the CPU or GPU path for your hardware.

5. Pick Your Speed

- 💬 Chat Right Away: start talking to models immediately with less memory.
- 🚀 Tune for Power: adjust settings for top speed on your machine.

6. ▶️ Hit Go

Launch your model and watch it handle longer conversations without slowing down.

🎉 AI Dreams Come True: enjoy chatting with large models on everyday amounts of memory, faster and smoother than ever.


AI-Generated Review

What is tq-kv?

tq-kv delivers Google's TurboQuant KV cache compression in pure Rust, shrinking LLM attention keys to 2-4 bits for up to 14x memory savings on long contexts like 4096 tokens. It powers a CLI inference engine handling GGUF models with CUDA GPU support, perplexity eval (`tq-engine --perplexity`), and OpenAI-compatible HTTP serving (`tq-engine --serve`). Drop-in C FFI integrates with llama.cpp for instant KV compression via `--cache-type-k tq4`.
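To put the memory claim in perspective, a back-of-envelope sizing calculation helps. The model shape below (32 layers, 8 KV heads with GQA, head_dim 128, roughly Llama-3-8B-like) and the 64-value scale blocks are illustrative assumptions, not tq-kv internals:

```rust
// Back-of-envelope KV-cache sizing; all model dimensions are assumptions.

fn kv_bytes(seq_len: u64, bits_per_value: f64, scale_bits: f64, block: f64) -> f64 {
    // K and V, for 32 layers x 8 KV heads x head_dim 128.
    let values = 2.0 * 32.0 * 8.0 * 128.0 * seq_len as f64;
    values * (bits_per_value + scale_bits / block) / 8.0
}

fn main() {
    let fp16 = kv_bytes(4096, 16.0, 0.0, 1.0);
    // 2-bit codes plus one fp16 scale per 64-value block.
    let q2 = kv_bytes(4096, 2.0, 16.0, 64.0);
    println!("fp16:  {:.0} MiB", fp16 / (1024.0 * 1024.0)); // 512 MiB
    println!("2-bit: {:.0} MiB", q2 / (1024.0 * 1024.0));   // 72 MiB
    println!("ratio: {:.1}x", fp16 / q2);                   // 7.1x
}
```

Two-bit codes with per-block scales land near 7x versus fp16; ratios toward 14x would need more aggressive compression on top, which is presumably where the paper's additional tricks come in.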

Why is it gaining traction?

Unlike basic TurboQuant ports, tq-kv fuses attention scores directly from compressed indices for 8.9x AVX2 speedups, adds temporal decay for 30% extra savings, and passes NIAH retrieval at 90% depth with +0.32% PPL on wikitext-2. Its pure-Rust design keeps the dependency stack minimal, and presets like `extreme()` target edge deployments ahead of ICLR 2026.
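"Fusing attention scores directly from compressed indices" presumably means the query-key dot product runs over the packed codes, with the scale multiply deferred to the end instead of materializing dequantized keys. A scalar toy version (the real kernel is vectorized AVX2; everything here is an assumption):

```rust
// Toy fused dot product over packed 4-bit key codes; not tq-kv's kernel.

fn fused_dot(q: &[f32], packed_k: &[u8], scale: f32) -> f32 {
    let mut acc = 0.0f32;
    let mut i = 0;
    for byte in packed_k {
        for nibble in [byte & 0x0f, byte >> 4] {
            if i < q.len() {
                // Dequantize inline: center the code, apply the scale once at the end.
                acc += q[i] * (nibble as i8 - 7) as f32;
                i += 1;
            }
        }
    }
    acc * scale
}

fn main() {
    // Codes 11 and 0 decode to +4*scale and -7*scale.
    let q = [1.0f32, 2.0];
    let packed = [11u8]; // low nibble 11, high nibble 0
    // 1.0 * 4 + 2.0 * (-7) = -10, times scale 0.5 = -5.0
    assert_eq!(fused_dot(&q, &packed, 0.5), -5.0);
}
```

Deferring the scale keeps the inner loop in cheap integer-to-float arithmetic, which is what makes SIMD batching of many keys attractive.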

Who should use this?

LLM server operators maxing consumer GPUs (RTX 3080+), llama.cpp users extending context on Q4 GGUF quants, or Rust inference builders swapping KV caches in candle-based pipelines. Ideal for Qwen 72B or Llama-3 8B deploys where VRAM caps at 10GB.

Verdict

Grab it for proven 595MB VRAM cuts on 72B models (99% tests pass, crates.io v0.5); it beats alternatives on GGUF + CUDA completeness. At 11 stars and 1.0% credibility, it's raw but benchmark-backed; prototype in non-prod until 1.0 stabilizes.
