deepseek-ai

A kernel library written in tilelang

AI Summary

TileKernels is a collection of high-performance GPU operations optimized for large language model training and inference using the TileLang domain-specific language.

How It Works

1
🔍 Discover faster kernels

TileKernels packages fused GPU kernels that speed up LLM training and inference on modern NVIDIA hardware.

2
📥 Install the package

pip-install it into your environment with a single command; no CUDA C++ toolchain is required.

3
🔧 Wire it into your model

Swap the library's torch.autograd-compatible ops into your existing PyTorch project in place of the stock implementations.

4
See the speed boost

Run your training or inference workload and benchmark it against the unfused PyTorch baseline.

5
Test and tweak

Exercise features such as top-k MoE routing and low-precision quantization casting, validating numerics as you go (a usage sketch follows this list).

🚀 Supercharged LLM stack

Your model now spends less time in memory-bound ops, freeing compute and power budget for larger runs.
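
To make the steps concrete, here is a minimal sketch of the install-and-swap workflow. The package and function names (tilekernels, topk_routing) are illustrative assumptions, not the repo's confirmed API; check the README for the real entry points.

# Hypothetical workflow; names are illustrative, not confirmed API.
#   pip install tilekernels   (hypothetical distribution name)

import torch
import tilekernels  # hypothetical import path

logits = torch.randn(4096, 64, device="cuda", dtype=torch.bfloat16)  # tokens x experts
# Drop-in fused routing call replacing a softmax + topk composition.
weights, experts = tilekernels.topk_routing(logits, k=8)  # hypothetical fused op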

AI-Generated Review

What is TileKernels?

TileKernels is a library of fused, high-performance GPU kernels for LLM operations such as top-k MoE routing, FP8/FP4/E5M6 quantization casting, batched transposes, and gating mechanisms including Engram and Manifold HyperConnection. Written in Python using the TileLang DSL on top of PyTorch, it provides torch.autograd-compatible layers that saturate Hopper/Blackwell memory bandwidth and compute for training and inference. Users can pip-install it and drop in custom kernels to raise LLM throughput without writing CUDA C++.
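
"torch.autograd-compatible layers" means the fused forward kernels also register matching backward passes. Below is a minimal sketch of the standard PyTorch pattern such a library would use; the toy op here is a stand-in, not TileKernels' actual API.

import torch

class FusedSquare(torch.autograd.Function):
    """Toy stand-in for a fused op wrapped so autograd can train through it."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x  # a real library would invoke its fused GPU kernel here

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out  # gradient from a matching backward kernel

y = FusedSquare.apply(torch.randn(8, requires_grad=True))
y.sum().backward()  # gradients flow through the custom op as usual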

Why is it gaining traction?

It hits hardware performance ceilings on memory-bound LLM workloads (faster MoE dispatch, fused quantization and SwiGLU) while letting Python developers author kernels with automatic optimizations, with no need to patch CUDA kernel sources or set up vendor math libraries. Its 553 stars reflect its appeal to developers tired of slow PyTorch fallbacks, and internal use at DeepSeek demonstrates real-world speedups.
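
For a feel of what "authoring kernels in Python" looks like, here is a tiled GEMM sketch in the style of TileLang's published matmul example; the decorator, type names, and signatures follow the upstream README and may differ across versions, so treat this as an approximation rather than TileKernels' own code.

import tilelang
import tilelang.language as T

@tilelang.jit
def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
           dtype="float16", accum_dtype="float"):

    @T.prim_func
    def kernel(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype),
    ):
        # One thread block per (block_M x block_N) output tile.
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            # Software-pipelined loop over K tiles: stage to shared memory, then MMA.
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return kernel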

Who should use this?

LLM engineers optimizing MoE models with top-k gating or per-token quantization on SM90/SM100 GPUs. It is a good fit for DeepSeek-style training runs that need fused MoE dispatch, or for inference stacks that fuse RMSNorm and gating without latency spikes.
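
For orientation, this is an unfused PyTorch reference for top-k MoE gating, the softmax-plus-topk composition that a fused routing kernel replaces; the function name is ours, not the library's.

import torch

def topk_gate_reference(logits: torch.Tensor, k: int = 2):
    """Unfused reference: softmax over experts, keep top-k weights, renormalize."""
    probs = logits.softmax(dim=-1)             # (tokens, experts)
    weights, experts = probs.topk(k, dim=-1)   # top-k expert weights per token
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, experts                    # dispatch each token to `experts`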

Verdict

Grab it for bleeding-edge LLM performance on supported hardware, but its alpha status means rough edges: basic docs and ongoing code-quality fixes. With only 553 stars and a short public track record, test rigorously before production use; it is a promising kernel library as it matures.
