QwenLM / FlashQLA

high-performance linear attention kernel library built on TileLang

317 stars · 21 forks · 100% credibility · Found Apr 30, 2026
Language: Python
AI Summary

FlashQLA is a linear attention kernel library, built on TileLang, that accelerates core attention computations in large AI models for faster training and inference on high-end NVIDIA GPUs.

How It Works

1
📰 Discover FlashQLA

You come across FlashQLA on a blog or GitHub: a kernel library that makes large AI models' attention run faster.

2
📥 Get it ready

You install it into your Python environment in one quick step, and everything is ready to go.

3
🚀 Plug into your AI

You swap FlashQLA in where your model computes attention, the part that relates tokens to one another, making that step far more efficient (see the sketch after these steps).

4
⚡ Run your project

You kick off your training or inference job, and it finishes noticeably faster right away.

5
📊 Measure the speed

You run the bundled benchmarks for clear, measured proof of the time savings.

🎉 AI supercharged

Your AI model now trains and responds blazing fast, saving hours of waiting.
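
To make step 3 concrete, here is a minimal sketch of the swap. The package name `flash_qla`, the import path, the tensor shapes, and the keyword arguments are all assumptions modeled on the FLA-style `chunk_gated_delta_rule` API named in the review below, not confirmed FlashQLA interfaces; consult the repo's own examples for the real signature.

```python
import torch

# Hypothetical import: the `flash_qla` package/module name is an assumption,
# as is the FLA-style signature of `chunk_gated_delta_rule`.
from flash_qla import chunk_gated_delta_rule

B, T, H, D = 1, 4096, 8, 128  # batch, sequence length, heads, head dim

q = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
# Gated delta rule extras: g is a per-token, per-head log decay gate,
# beta is the delta-rule write strength.
g = torch.randn(B, T, H, device="cuda", dtype=torch.float32).sigmoid().log()
beta = torch.rand(B, T, H, device="cuda", dtype=torch.bfloat16)

# Drop-in replacement for the linear attention call in the model's forward pass.
out, final_state = chunk_gated_delta_rule(q, k, v, g, beta, output_final_state=True)
print(out.shape)  # expected: (B, T, H, D)
```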


AI-Generated Review

What is FlashQLA?

FlashQLA provides high-performance linear attention kernels for PyTorch, accelerating chunked prefill in models like Qwen on NVIDIA Hopper GPUs. It delivers 2-3x forward and 2x backward speedups over Triton kernels via fused operations and automatic context parallelism. Developers get high-level APIs like `chunk_gated_delta_rule` for Q/K/V/g/β inputs, supporting variable-length sequences and initial states.
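
The mention of variable-length sequences and initial states suggests a packed, FlashAttention-style `cu_seqlens` convention; the sketch below assumes that convention. The `cu_seqlens` and `initial_state` keyword names and the `(num_seqs, H, D, D)` state shape are guesses, not confirmed FlashQLA API.

```python
import torch
from flash_qla import chunk_gated_delta_rule  # assumed import path

H, D = 8, 128
total_T = 4096  # two sequences of 3000 and 1096 tokens packed end to end

# FlashAttention-style cumulative sequence boundaries (assumed convention).
cu_seqlens = torch.tensor([0, 3000, 4096], dtype=torch.int32, device="cuda")

q = torch.randn(1, total_T, H, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, total_T, H, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, total_T, H, D, device="cuda", dtype=torch.bfloat16)
g = torch.randn(1, total_T, H, device="cuda", dtype=torch.float32).sigmoid().log()
beta = torch.rand(1, total_T, H, device="cuda", dtype=torch.bfloat16)

# Recurrent state carried over from a previous chunk, e.g. during chunked
# prefill; the (num_seqs, H, D, D) shape is an assumption.
prev_state = torch.zeros(2, H, D, D, device="cuda", dtype=torch.float32)

out, final_state = chunk_gated_delta_rule(
    q, k, v, g, beta,
    initial_state=prev_state,
    output_final_state=True,
    cu_seqlens=cu_seqlens,
)
```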

Why is it gaining traction?

It outperforms FLA Triton and FlashInfer baselines in Qwen head configs (8-64 dims), especially for TP=1-8 pretraining with long sequences up to 32k tokens. Gate-driven intra-card parallelism auto-activates for small heads, maximizing SM utilization without manual tuning. Benchmarks on H200 highlight the gains in these linear attention workloads; a generic harness for independently verifying them is sketched below.
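
The repo ships its own benchmark scripts; for an independent sanity check, a generic CUDA-event timing harness like the one below works for any kernel. The timing code is standard PyTorch; only the commented `chunk_gated_delta_rule` call is an assumed API.

```python
import torch

def bench_ms(fn, warmup=10, iters=100):
    """Average wall-clock time of a CUDA function in milliseconds."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait for all timed kernels to finish
    return start.elapsed_time(end) / iters

# Usage, with tensors as in the earlier sketch (the kernel call is an assumed API):
# ms = bench_ms(lambda: chunk_gated_delta_rule(q, k, v, g, beta))
# print(f"{ms:.3f} ms per forward")
```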

Who should use this?

ML engineers training large LLMs on Hopper (SM90+) who need fast GDN prefill for agentic inference or long-context pretraining. Teams scaling Qwen3.x models with tensor parallelism, replacing slower linear attention in custom Triton/FlashInfer stacks.

Verdict

Worth benchmarking if you're on Hopper and chasing high-performance kernels: the 2x+ speedups are reproducible via the included scripts. But a 1.0% credibility score and 317 stars reflect early maturity; docs are blog-linked and tests are basic. The MIT license invites experimentation.
