Indras-Mirror / llama.cpp-turboq-mtp
Fused TBQ4 Flash Attention + MTP + Shared Tensors for llama.cpp — 82+ tok/s with lossless 4.25 bpv KV cache at 200K context on RTX 4090
A performance-optimized fork of llama.cpp with fused quantized flash attention, multi-token prediction (MTP) speculative decoding, tensor sharing, and aggressive KV cache compression, enabling high token rates at very large context lengths on consumer GPUs such as the RTX 4090.
How It Works
1. You hear about a fast way to run powerful AI chatbots on your home computer, handling huge conversations without slowing down.
2. Download the program, which squeezes every bit of speed out of your graphics card.
3. Pick a ready-to-use AI model file (in GGUF format) that works with the software.
4. Launch it with a single command, and your AI loads, ready for long-context chats at high speed.
5. Ask anything, share long stories, and get fast replies even after hundreds of thousands of words of conversation.
6. Watch your AI think and respond at 80+ tokens per second, making huge conversations feel effortless.
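The launch step above might look like the following. This is a minimal sketch assuming the fork keeps llama.cpp's standard `llama-server` flags; the model filename is hypothetical, and fork-specific features (MTP speculative decoding, the TBQ4 KV cache) may require their own options not shown here.

```shell
# Minimal sketch, assuming standard llama.cpp llama-server flags.
#   -m    path to a GGUF model file (filename here is hypothetical)
#   -c    context window size in tokens (200K, matching the headline claim)
#   -ngl  number of layers to offload to the GPU (99 = effectively all)
#   -fa   enable flash attention
./llama-server -m model.gguf -c 200000 -ngl 99 -fa --host 127.0.0.1 --port 8080
```

Once the server is up, you can chat through its built-in web UI at http://127.0.0.1:8080 or via its OpenAI-compatible HTTP API.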