QwenLM / FlashQLA

high-performance linear attention kernel library built on TileLang

317 stars · 21 forks · 100% credibility · Found Apr 30, 2026
Language: Python
AI Summary

FlashQLA is a linear attention kernel library, built on TileLang, that accelerates core attention computations in large AI models for faster training and inference on high-end NVIDIA GPUs.

How It Works

1
📰 Discover FlashQLA

You come across FlashQLA on a blog or GitHub: a kernel library that makes large AI models' attention run faster.

2
📥 Get it ready

You install it into your Python environment in one quick step, and everything is ready to go.

3
🚀 Plug into your AI

You swap FlashQLA in where your model computes attention, the part that relates tokens to one another, making that step far more efficient (see the sketch after these steps).

4
⚡ Run your project

You kick off your training or inference job, and it finishes noticeably faster right away.

5
📊 Measure the speed

You run the bundled benchmarks for clear, measured proof of the time savings.

🎉 AI supercharged

Your AI model now trains and responds blazing fast, saving hours of waiting.
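
To make step 3 concrete, here is a minimal sketch of the swap. The package name `flash_qla`, the import path, the tensor shapes, and the keyword arguments are all assumptions modeled on the FLA-style `chunk_gated_delta_rule` API named in the review below, not confirmed FlashQLA interfaces; consult the repo's own examples for the real signature.

```python
import torch

# Hypothetical import: the `flash_qla` package/module name is an assumption,
# as is the FLA-style signature of `chunk_gated_delta_rule`.
from flash_qla import chunk_gated_delta_rule

B, T, H, D = 1, 4096, 8, 128  # batch, sequence length, heads, head dim

q = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
# Gated delta rule extras: g is a per-token, per-head log decay gate,
# beta is the delta-rule write strength.
g = torch.randn(B, T, H, device="cuda", dtype=torch.float32).sigmoid().log()
beta = torch.rand(B, T, H, device="cuda", dtype=torch.bfloat16)

# Drop-in replacement for the linear attention call in the model's forward pass.
out, final_state = chunk_gated_delta_rule(q, k, v, g, beta, output_final_state=True)
print(out.shape)  # expected: (B, T, H, D)
```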


AI-Generated Review

What is FlashQLA?

FlashQLA provides high-performance linear attention kernels for PyTorch, accelerating chunked prefill in models like Qwen on NVIDIA Hopper GPUs. It delivers 2-3x forward and 2x backward speedups over Triton kernels via fused operations and automatic context parallelism. Developers get high-level APIs like `chunk_gated_delta_rule` for Q/K/V/g/β inputs, supporting variable-length sequences and initial states.
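
The mention of variable-length sequences and initial states suggests a packed, FlashAttention-style `cu_seqlens` convention; the sketch below assumes that convention. The `cu_seqlens` and `initial_state` keyword names and the `(num_seqs, H, D, D)` state shape are guesses, not confirmed FlashQLA API.

```python
import torch
from flash_qla import chunk_gated_delta_rule  # assumed import path

H, D = 8, 128
total_T = 4096  # two sequences of 3000 and 1096 tokens packed end to end

# FlashAttention-style cumulative sequence boundaries (assumed convention).
cu_seqlens = torch.tensor([0, 3000, 4096], dtype=torch.int32, device="cuda")

q = torch.randn(1, total_T, H, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, total_T, H, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, total_T, H, D, device="cuda", dtype=torch.bfloat16)
g = torch.randn(1, total_T, H, device="cuda", dtype=torch.float32).sigmoid().log()
beta = torch.rand(1, total_T, H, device="cuda", dtype=torch.bfloat16)

# Recurrent state carried over from a previous chunk, e.g. during chunked
# prefill; the (num_seqs, H, D, D) shape is an assumption.
prev_state = torch.zeros(2, H, D, D, device="cuda", dtype=torch.float32)

out, final_state = chunk_gated_delta_rule(
    q, k, v, g, beta,
    initial_state=prev_state,
    output_final_state=True,
    cu_seqlens=cu_seqlens,
)
```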

Why is it gaining traction?

It outperforms FLA Triton and FlashInfer baselines in Qwen head configs (8-64 dims), especially for TP=1-8 pretraining with long sequences up to 32k tokens. Gate-driven intra-card parallelism auto-activates for small heads, maximizing SM utilization without manual tuning. Benchmarks on H200 highlight the gains in these linear attention workloads; a generic harness for independently verifying them is sketched below.
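
The repo ships its own benchmark scripts; for an independent sanity check, a generic CUDA-event timing harness like the one below works for any kernel. The timing code is standard PyTorch; only the commented `chunk_gated_delta_rule` call is an assumed API.

```python
import torch

def bench_ms(fn, warmup=10, iters=100):
    """Average wall-clock time of a CUDA function in milliseconds."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for all timed kernels to finish
    return start.elapsed_time(end) / iters

# Usage, with tensors as in the earlier sketch (the kernel call is an assumed API):
# ms = bench_ms(lambda: chunk_gated_delta_rule(q, k, v, g, beta))
# print(f"{ms:.3f} ms per forward")
```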

Who should use this?

ML engineers training large LLMs on Hopper (SM90+) who need fast GDN prefill for agentic inference or long-context pretraining. Teams scaling Qwen3.x models with tensor parallelism, replacing slower linear attention in custom Triton/FlashInfer stacks.

Verdict

Worth benchmarking if you're on Hopper and chasing high-performance kernels: the 2x+ speedups are reproducible via the included scripts. But a 1.0% credibility score and 317 stars reflect early maturity; docs are blog-linked and tests are basic. The MIT license invites experimentation.
