MoonshotAI / FlashKDA

FlashKDA: high-performance Kimi Delta Attention kernels

Language: CUDA
AI Summary

FlashKDA provides optimized building blocks for Kimi Delta Attention, a specialized attention mechanism, designed to run faster on Hopper-class NVIDIA GPUs (H100/H20) within PyTorch projects.

How It Works

1
🔍 Discover FlashKDA

FlashKDA supplies CUDA kernels that accelerate Kimi Delta Attention in PyTorch models.

2
Prepare your computer

Confirm you have a supported Hopper-class GPU (H100 or H20, SM90), CUDA 12.9+, and PyTorch 2.4+.

3
📥 Add the tool

Install the package so the extension is available to your PyTorch environment.

4
🔌 Link to your AI

Integrate it with your attention code, either by calling `flash_kda.fwd` directly or via auto-dispatch from flash-linear-attention's `chunk_kda`.

5
📊 Test the speed

Benchmark with your own sequence lengths and batch shapes to measure the speedup over the Triton backend.

🚀 Faster AI achieved!

On supported hardware, the CUDA kernels deliver substantial forward-pass speedups on long sequences.
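The prerequisite check in step 2 can be sketched as a small helper; this is a minimal illustration, not part of FlashKDA, with the CUDA 12.9+ and PyTorch 2.4+ thresholds taken from the review on this page:

```python
def meets_requirements(cuda_version, torch_version,
                       min_cuda=(12, 9), min_torch=(2, 4)):
    """Compare "major.minor[.patch]" version strings against the
    thresholds the review states (CUDA 12.9+, PyTorch 2.4+).
    Only major and minor components are compared."""
    def parse(v):
        major, minor = v.split(".")[:2]
        return (int(major), int(minor))
    return parse(cuda_version) >= min_cuda and parse(torch_version) >= min_torch
```

For example, `meets_requirements("12.9", "2.4")` passes, while an older CUDA toolkit such as `"12.4"` does not.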

AI-Generated Review

What is FlashKDA?

FlashKDA delivers high-performance CUDA kernels for Kimi Delta Attention, a recurrent attention mechanism that computes gated delta rules with query-key normalization and state carryover. Developers get a drop-in PyTorch extension that accelerates the forward pass on SM90 GPUs like H100 or H20, handling fixed or variable-length sequences via a simple `flash_kda.fwd` call or auto-dispatch in flash-linear-attention's `chunk_kda`. It requires CUDA 12.9+ and PyTorch 2.4+, outputting attention results plus optional final states.
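As a rough illustration of the mechanism being accelerated, here is a naive, non-optimized sketch of a gated delta-rule recurrence with query-key normalization and state carryover. FlashKDA's actual formulation, tensor shapes, and gating details are not shown in this review, so every detail below (single head, per-step scalar gate `g`, write strength `beta`) is an assumption:

```python
import numpy as np

def gated_delta_attention(q, k, v, g, beta):
    """Naive reference sketch of a gated delta-rule recurrence.

    q, k, v: (T, d) query/key/value sequences.
    g:       (T,) per-step decay gate in (0, 1].
    beta:    (T,) per-step write strength.
    Returns per-step outputs plus the final state (for carryover).
    """
    T, d = q.shape
    # L2-normalize queries and keys (the QK-norm the review mentions)
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-6)
    S = np.zeros((d, d))   # recurrent state matrix, carried across steps
    out = np.empty((T, d))
    for t in range(T):
        S = g[t] * S       # decay the state by the gate
        # delta rule: correct the memory addressed by k_t toward v_t
        S = S + beta[t] * np.outer(v[t] - S @ k[t], k[t])
        out[t] = S @ q[t]  # read out with the normalized query
    return out, S
```

A CUDA kernel like FlashKDA's would chunk and tile this sequential recurrence for the GPU; the point here is only the shape of the computation, not its optimized form.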

Why is it gaining traction?

It crushes Triton backends in benchmarks, delivering 2-4x speedups on fp32 states for long sequences up to 8k tokens across 96 heads. The seamless integration with flash-linear-attention means you enable it via an env var and see gains without code changes, plus solid dispatch logging for debugging misses. Hopper-specific tiling and workspace management keep memory tight.
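The opt-in dispatch pattern the review describes, an env var enabling the CUDA path with logging on fallbacks, can be sketched as below. The env var name `FLASH_KDA_ENABLE` and the dispatch conditions are illustrative assumptions, not FlashKDA's real interface:

```python
import os
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("kda.dispatch")

def select_kda_backend(cuda_cc):
    """Hypothetical sketch of env-var-gated backend dispatch: prefer
    the CUDA kernel when opted in and the hardware qualifies, else
    fall back to Triton, logging the reason for any miss."""
    if os.environ.get("FLASH_KDA_ENABLE", "0") != "1":
        log.info("flash_kda not enabled via env var; using triton backend")
        return "triton"
    if cuda_cc < (9, 0):  # the review says SM90 (Hopper: H100/H20) is required
        log.info("compute capability %s below SM90; using triton backend", cuda_cc)
        return "triton"
    log.info("dispatching to flash_kda CUDA kernel")
    return "cuda"
```

Logging each fallback reason is what makes "debugging misses" practical: a silent fallback to Triton would look like a mysterious slowdown.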

Who should use this?

ML engineers fine-tuning or running inference on Kimi-style models with delta attention on H100/H20 clusters. Ideal for teams scaling recurrent linear attention in long-context LLMs where Triton lags, especially for variable-length batches in training or serving pipelines.

Verdict

Grab it if you're on supported Hopper hardware and using flash-linear-attention: benchmarks and tests show real wins, with clear docs and easy install. At 372 stars it's early, but it comes from MoonshotAI; test thoroughly before prod.
