Indras-Mirror / llama.cpp-turboq-mtp
Fused TBQ4 Flash Attention + MTP + Shared Tensors for llama.cpp — 82+ tok/s with lossless 4.25 bpv KV cache at 200K context on RTX 4090
A performance-optimized fork of llama.cpp with fused quantized flash attention, multi-token prediction (MTP) speculative decoding, tensor sharing, and aggressive KV cache compression, enabling high token rates at very large context lengths on consumer GPUs such as the RTX 4090.
How It Works
1. You hear about a fast way to run powerful AI chatbots on your home computer, handling huge conversations without slowing down.
2. Download the program, which squeezes every bit of speed out of your graphics card.
3. Pick a ready-to-use AI model file (in GGUF format) that works with the software.
4. Launch it with a single command, and your AI loads, ready for long-context chats at high speed.
5. Ask anything, share long stories, and get fast replies even after hundreds of thousands of words of conversation.
6. Watch your AI think and respond at 80+ tokens per second, making huge conversations feel effortless.
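The launch step above might look like the following. This is a minimal sketch assuming the fork keeps llama.cpp's standard `llama-server` flags; the model filename is hypothetical, and fork-specific features (MTP speculative decoding, the TBQ4 KV cache) may require their own options not shown here.

```shell
# Minimal sketch, assuming standard llama.cpp llama-server flags.
#   -m    path to a GGUF model file (filename here is hypothetical)
#   -c    context window size in tokens (200K, matching the headline claim)
#   -ngl  number of layers to offload to the GPU (99 = effectively all)
#   -fa   enable flash attention
./llama-server -m model.gguf -c 200000 -ngl 99 -fa --host 127.0.0.1 --port 8080
```

Once the server is up, you can chat through its built-in web UI at http://127.0.0.1:8080 or via its OpenAI-compatible HTTP API.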