Indras-Mirror

Fused TBQ4 Flash Attention + MTP + Shared Tensors for llama.cpp — 82+ tok/s with lossless 4.25 bpv KV cache at 200K context on RTX 4090

44
3
100% credibility
Found May 15, 2026 at 44 stars.
AI Analysis
C++
AI Summary

Performance-optimized fork of llama.cpp with fused quantized flash attention, multi-token prediction speculative decoding, tensor sharing, and advanced KV cache compression enabling high token rates on large contexts with GPUs like RTX 4090.
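To give a sense of why KV cache compression matters at long context, the arithmetic can be sketched as below. This is a minimal illustration, not code from the fork. It assumes "bpv" means bits per cached value, and the function name `kv_cache_bytes` and every model dimension passed to it are hypothetical placeholders rather than the parameters of any model the project targets.

```cpp
#include <cstdint>

// Rough size of a KV cache in bytes.
// Assumption: one K and one V value are stored per layer, per KV head,
// per head dimension, per context position, each at bits_per_value bits.
// This helper is hypothetical and is not part of llama.cpp or the fork.
double kv_cache_bytes(std::int64_t n_layers,
                      std::int64_t n_kv_heads,
                      std::int64_t head_dim,
                      std::int64_t n_ctx,
                      double bits_per_value) {
    // Factor 2 accounts for the separate K and V tensors.
    const double n_values = 2.0 * static_cast<double>(n_layers)
                                * static_cast<double>(n_kv_heads)
                                * static_cast<double>(head_dim)
                                * static_cast<double>(n_ctx);
    return n_values * bits_per_value / 8.0;  // bits to bytes
}
```

For an illustrative model with 32 layers, 8 KV heads, and a head dimension of 128, `kv_cache_bytes(32, 8, 128, 200000, 4.25)` comes to about 6.96 GB at 200K context, against about 26.2 GB for the same cache at 16 bits per value. Real figures depend on the actual model architecture and on any per-block overhead in the quantization format.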

How It Works

1. 🔍 Discover turbocharged AI

You hear about a fast way to run large AI chat models on your home computer, handling very long conversations without slowing down.

2. 📥 Grab the software

Download this llama.cpp fork, which is tuned to get more speed out of your computer's graphics card.

3. 🧠 Pick your smart brain

Choose and download a ready-to-use AI model file that is compatible with the software.

4. 🚀 Launch your speed demon

Start it with a single command, and the model loads ready for long-context chats at high speed.

5. 💬 Start chatting endlessly

Ask anything, share long stories, and get quick replies even with up to 200K tokens of context.

🎉 Lightning-fast results

Your AI responds at 80+ tokens per second (the project reports 82+ tok/s on an RTX 4090), so huge conversations stay responsive.

AI-Generated Review

What is llama.cpp-turboq-mtp?

This is llama.cpp's automatic chat template parser -- a system that figures out how a given LLM formats its output without you having to manually specify the structure. When you're running models with different chat templates (think Qwen, Llama, DeepSeek-style formats), the parser detects whether the model uses reasoning tags like
