inclusionAI / cuLA

CUDA kernels for linear attention variants, written in CuTe DSL and CUTLASS C++.

Found Apr 02, 2026 at 58 stars.
Python
AI Summary

cuLA is a Python library offering high-performance CUDA kernels for linear attention variants, designed as a drop-in replacement for flash-linear-attention to accelerate long-context language model workloads on NVIDIA Hopper and Blackwell GPUs.

How It Works

1. 🔍 Discover cuLA

You hear about cuLA, a tool that speeds up AI models handling long conversations by making attention calculations faster on powerful NVIDIA GPUs.

2. 📥 Grab and set up

Download the free library and install it alongside your existing AI tools with a few simple commands.

3. Swap one line

Change just one line in your code to use cuLA's faster building blocks instead of the old ones.

4. ▶️ Run your model

Hit run on your AI project, and watch it process longer texts much quicker.

5. 📊 Check the speedup

Run quick tests to confirm your model is now blazing fast with real numbers.

🚀 Enjoy faster AI

Your long-context AI models train and run smoother and quicker, ready for bigger challenges.
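The swap-and-verify pattern in steps 3 and 5 can be sketched generically. This is a NumPy harness with stand-in functions, not cuLA's actual API: prove the replacement matches the baseline first, then measure the speedup.

```python
import time
import numpy as np

# Generic swap-and-verify harness (NumPy stand-ins, not cuLA's API):
# check that the new kernel matches the old one, then time both.

def bench(fn, *args, reps=20):
    fn(*args)                              # warm-up run
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - t0) / reps

x = np.random.default_rng(0).standard_normal((256, 256))

def baseline(x):                           # stand-in for the old building block
    return x @ x

def candidate(x):                          # stand-in for the swapped-in replacement
    return x @ x

assert np.allclose(baseline(x), candidate(x))   # correctness before speed
speedup = bench(baseline, x) / bench(candidate, x)
print(f"speedup: {speedup:.2f}x")
```

Checking correctness before timing matters: a fast kernel that silently diverges from the baseline is worse than a slow one.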

AI-Generated Review

What is cuLA?

cuLA delivers high-performance CUDA kernels for linear attention variants like KDA and Lightning Attention, optimized for NVIDIA Hopper and Blackwell GPUs. Written as a Python package with C++ extensions using CuTe DSL and CUTLASS, it tackles the quadratic cost of standard attention in long-context LLMs by enabling linear-time state updates. Developers get a one-line import swap into flash-linear-attention setups for immediate speedups on workloads up to 32k tokens.
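The linear-time state update mentioned above can be illustrated in plain NumPy (a conceptual sketch of linear attention, not cuLA's kernels): without softmax, causal attention folds into a running key-value outer-product state, replacing the N x N score matrix with one small state update per token.

```python
import numpy as np

# Why linear attention is O(N): the key-value products fold into a
# d_k x d_v state that is updated once per token. Illustrative NumPy only.

rng = np.random.default_rng(0)
N, d_k, d_v = 8, 4, 4
Q, K, V = rng.standard_normal((3, N, d_k))  # d_v == d_k in this sketch

# Quadratic form: full N x N causal score matrix.
scores = np.tril(Q @ K.T)           # zero out future positions
out_quadratic = scores @ V          # O(N^2) work

# Linear form: one state update per token.
S = np.zeros((d_k, d_v))
out_linear = np.empty((N, d_v))
for t in range(N):
    S += np.outer(K[t], V[t])       # state update: S_t = S_{t-1} + k_t v_t^T
    out_linear[t] = Q[t] @ S        # o_t = q_t S_t, O(d_k * d_v) per token

assert np.allclose(out_quadratic, out_linear)
```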

Why is it gaining traction?

Benchmarks show 1.3-1.9x speedups over Triton baselines on GB200/H200 for fixed- and variable-length sequences, with kernels authored in Python that fit existing PyTorch workflows. It supports fused forward passes, chunked processing, and safe gating for production stability, plus GitHub Actions for easy builds, making it a useful reference for writing CUDA kernels. Early community contributions target further tuning via agentic optimization.
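The chunked processing idea can be sketched in the same NumPy style (illustrative only, not the library's implementation): inter-chunk contributions flow through a carried state, while intra-chunk work is a small C x C causal matmul, which is what makes the scheme tensor-core friendly.

```python
import numpy as np

rng = np.random.default_rng(1)
N, C, d = 16, 4, 4   # sequence length, chunk size, head dim
Q, K, V = rng.standard_normal((3, N, d))

# Reference: token-by-token linear-attention recurrence.
S = np.zeros((d, d))
ref = np.empty((N, d))
for t in range(N):
    S += np.outer(K[t], V[t])
    ref[t] = Q[t] @ S

# Chunked: inter-chunk contribution via the carried state,
# intra-chunk contribution via a small C x C causal matmul.
S = np.zeros((d, d))
out = np.empty((N, d))
for i in range(0, N, C):
    q, k, v = Q[i:i+C], K[i:i+C], V[i:i+C]
    inter = q @ S                    # state from all previous chunks
    intra = np.tril(q @ k.T) @ v     # causal within the chunk
    out[i:i+C] = inter + intra
    S += k.T @ v                     # fold this chunk into the state

assert np.allclose(ref, out)
```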

Who should use this?

LLM engineers training long-context models on Hopper/Blackwell hardware, especially those already extending flash-linear-attention. Fine-tuners handling variable batch sizes, and serving teams using cuLA's decode kernels, will see quick wins. PyTorch developers can prototype custom CUDA kernel integrations without rewriting Triton code.
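Variable-length batches in flash-style kernels are conventionally passed in the cu_seqlens layout: ragged sequences concatenated into one flat buffer plus cumulative offsets. A minimal NumPy sketch of that layout (illustrative of the convention, not cuLA's specific signature):

```python
import numpy as np

# cu_seqlens convention used by flash-style variable-length kernels:
# ragged sequences are packed into one flat buffer, and cumulative
# offsets mark each sequence's boundaries.

seqs = [np.arange(n, dtype=np.float32) for n in (3, 5, 2)]
packed = np.concatenate(seqs)
cu_seqlens = np.cumsum([0] + [len(s) for s in seqs])  # [0, 3, 8, 10]

# Sequence i is recovered as packed[cu_seqlens[i]:cu_seqlens[i+1]].
for i in range(len(seqs)):
    assert np.array_equal(packed[cu_seqlens[i]:cu_seqlens[i+1]], seqs[i])
print(cu_seqlens.tolist())  # [0, 3, 8, 10]
```

Packing this way avoids padding tokens entirely, which is why variable-length kernels can beat padded batches on throughput.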

Verdict

Grab it for bleeding-edge performance on modern NVIDIA GPUs, but expect API evolution: at 58 stars, this is an early-stage project. Benchmarks are reproducible via the GitHub samples; pair it with your own tests before production use.
