onur-gokyildiz-bhi

Pure Rust implementation of Google's TurboQuant (ICLR 2026): KV cache compression for LLMs

Found Mar 31, 2026 at 11 stars
AI Summary

tq-kv is a Rust library for TurboQuant KV cache compression that enables efficient local inference of large language models by drastically reducing memory usage while preserving quality.

How It Works

1. 🔍 Discover Memory Magic

You're excited to run big AI chatbots at home but frustrated by high memory use; then you find tq-kv promising huge savings.

2. 📉 See the Wow Factor

Check charts showing your model's memory footprint shrink 7-15x while replies stay sharp and fast.

3. 📥 Grab the Tools

Download the library and the patches tailored for popular runtimes like llama.cpp.

4. 🔧 Plug It In

Follow the setup steps, choosing the CPU or GPU path for your hardware.

5. Pick Your Speed

- 💬 Chat Right Away: start talking to models immediately with less memory.
- 🚀 Tune for Power: adjust settings for top speed on your machine.

6. ▶️ Hit Go

Launch your model and watch it handle longer conversations without slowing down.

🎉 AI Dreams Come True: enjoy chatting with large models on everyday amounts of memory, faster and smoother than ever.


AI-Generated Review

What is tq-kv?

tq-kv delivers Google's TurboQuant KV cache compression in pure Rust, shrinking LLM attention keys to 2-4 bits for up to 14x memory savings on long contexts like 4096 tokens. It powers a CLI inference engine handling GGUF models with CUDA GPU support, perplexity eval (`tq-engine --perplexity`), and OpenAI-compatible HTTP serving (`tq-engine --serve`). Drop-in C FFI integrates with llama.cpp for instant KV compression via `--cache-type-k tq4`.
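To put the memory claim in perspective, a back-of-envelope sizing calculation helps. The model shape below (32 layers, 8 KV heads with GQA, head_dim 128, roughly Llama-3-8B-like) and the 64-value scale blocks are illustrative assumptions, not tq-kv internals:

```rust
// Back-of-envelope KV-cache sizing; all model dimensions are assumptions.

fn kv_bytes(seq_len: u64, bits_per_value: f64, scale_bits: f64, block: f64) -> f64 {
    // K and V, for 32 layers x 8 KV heads x head_dim 128.
    let values = 2.0 * 32.0 * 8.0 * 128.0 * seq_len as f64;
    values * (bits_per_value + scale_bits / block) / 8.0
}

fn main() {
    let fp16 = kv_bytes(4096, 16.0, 0.0, 1.0);
    // 2-bit codes plus one fp16 scale per 64-value block.
    let q2 = kv_bytes(4096, 2.0, 16.0, 64.0);
    println!("fp16:  {:.0} MiB", fp16 / (1024.0 * 1024.0)); // 512 MiB
    println!("2-bit: {:.0} MiB", q2 / (1024.0 * 1024.0));   // 72 MiB
    println!("ratio: {:.1}x", fp16 / q2);                   // 7.1x
}
```

Two-bit codes with per-block scales land near 7x versus fp16; ratios toward 14x would need more aggressive compression on top, which is presumably where the paper's additional tricks come in.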

Why is it gaining traction?

Unlike basic TurboQuant ports, tq-kv fuses attention scores directly from compressed indices for 8.9x AVX2 speedups, adds temporal decay for 30% extra savings, and passes NIAH retrieval at 90% depth with +0.32% PPL on wikitext-2. Its pure-Rust design keeps the dependency stack minimal, and presets like `extreme()` target edge deployments ahead of ICLR 2026.
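"Fusing attention scores directly from compressed indices" presumably means the query-key dot product runs over the packed codes, with the scale multiply deferred to the end instead of materializing dequantized keys. A scalar toy version (the real kernel is vectorized AVX2; everything here is an assumption):

```rust
// Toy fused dot product over packed 4-bit key codes; not tq-kv's kernel.

fn fused_dot(q: &[f32], packed_k: &[u8], scale: f32) -> f32 {
    let mut acc = 0.0f32;
    let mut i = 0;
    for byte in packed_k {
        for nibble in [byte & 0x0f, byte >> 4] {
            if i < q.len() {
                // Dequantize inline: center the code, apply the scale once at the end.
                acc += q[i] * (nibble as i8 - 7) as f32;
                i += 1;
            }
        }
    }
    acc * scale
}

fn main() {
    // Codes 11 and 0 decode to +4*scale and -7*scale.
    let q = [1.0f32, 2.0];
    let packed = [11u8]; // low nibble 11, high nibble 0
    // 1.0 * 4 + 2.0 * (-7) = -10, times scale 0.5 = -5.0
    assert_eq!(fused_dot(&q, &packed, 0.5), -5.0);
}
```

Deferring the scale keeps the inner loop in cheap integer-to-float arithmetic, which is what makes SIMD batching of many keys attractive.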

Who should use this?

LLM server operators maxing consumer GPUs (RTX 3080+), llama.cpp users extending context on Q4 GGUF quants, or Rust inference builders swapping KV caches in candle-based pipelines. Ideal for Qwen 72B or Llama-3 8B deploys where VRAM caps at 10GB.

Verdict

Grab it for proven 595MB VRAM cuts on 72B models (99% tests pass, crates.io v0.5); it beats alternatives on GGUF + CUDA completeness. At 11 stars and 1.0% credibility, it's raw but benchmark-backed; prototype in non-prod until 1.0 stabilizes.
