Mog9 / kernel-fusion
A CUDA Python experiment demonstrating kernel fusion by combining ReLU and LayerNorm into a single GPU pass and comparing it against the unfused multi-kernel pipeline.
The project runs the standard two-kernel pipeline (ReLU followed by LayerNorm) side by side with a single fused kernel, demonstrating the speed and memory-efficiency gains fusion delivers on NVIDIA GPUs.
How It Works
The project shows two ways to run the same ReLU + LayerNorm computation: a slow unfused pipeline and a fast fused one.
Clone or download the repository to a folder on your machine.
Install the project's free Python math dependencies so everything runs on your setup.
Run the main script; it benchmarks both methods across thousands of calculations.
A printed table confirms both paths produce the same results, with the fused version over 5x faster.
A chart is displayed showing lower memory use and higher speed for the fused approach.
The takeaway: fusing steps cuts round trips to GPU memory, and fewer memory trips means faster AI workloads.
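The correctness check in the steps above can be sketched on the CPU with plain NumPy (function names and shapes here are illustrative, not taken from the repo): the unfused path materializes the ReLU output as a separate array that LayerNorm then reads back, while the fused path mimics a single kernel that handles each row in one pass.

```python
import numpy as np

def unfused(x, gamma, beta, eps=1e-5):
    # Kernel 1: ReLU writes a full intermediate tensor to memory.
    relu_out = np.maximum(x, 0.0)
    # Kernel 2: LayerNorm reads that tensor back and normalizes each row.
    mu = relu_out.mean(axis=-1, keepdims=True)
    var = relu_out.var(axis=-1, keepdims=True)
    return gamma * (relu_out - mu) / np.sqrt(var + eps) + beta

def fused(x, gamma, beta, eps=1e-5):
    # One pass per row, mimicking a single kernel where the ReLU result
    # stays in registers/shared memory instead of global memory.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        row = np.maximum(x[i], 0.0)  # ReLU applied on the fly
        mu = row.mean()
        var = row.var()
        out[i] = gamma * (row - mu) / np.sqrt(var + eps) + beta
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
gamma = np.ones(16, dtype=np.float32)
beta = np.zeros(16, dtype=np.float32)
assert np.allclose(unfused(x, gamma, beta), fused(x, gamma, beta), atol=1e-5)
```

On a GPU the two functions differ in memory traffic, not math: the unfused version writes and re-reads the intermediate `relu_out` tensor through global memory, which is exactly the cost fusion eliminates.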