Bruce-Lee-LY / cuda_auto_tune
NCU-driven iterative optimization workflow for CUDA/CUTLASS/Triton/CuTe DSL kernels.
This repository provides high-performance implementations of RMSNorm for AI workloads on GPUs, including benchmarks and profiling tools to iteratively optimize kernel speed using various DSLs.
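For readers unfamiliar with the operation, RMSNorm divides each element of a vector by the vector's root mean square and then multiplies by a per-element weight. The following is a minimal pure-Python reference sketch, not the repository's implementation: the function name `rms_norm`, the `eps` default, and the choice to add `eps` inside the square root are assumptions made for illustration. A reference like this is typically only used to check the output of a GPU kernel.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over one row (illustrative, hypothetical helper).

    Computes y_i = x_i / sqrt(mean(x^2) + eps) * weight_i.
    """
    if not x:
        raise ValueError("x must not be empty")
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    # Mean of squares over the normalized dimension.
    mean_sq = sum(v * v for v in x) / len(x)
    # Reciprocal RMS; eps guards against division by zero for all-zero input.
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

With unit weights and `eps=0.0`, the mean of squares of the output is 1, which is a quick sanity check for any optimized variant.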
How It Works
This repository is a collection of tools for making compute-heavy operations in AI models, such as RMSNorm, run faster on GPUs.
Download the example kernels and benchmarks to get started.
Run the built-in benchmarks to compare the timings of the different kernel implementations.
Profile the kernels with NCU (NVIDIA Nsight Compute) to see exactly where time is being spent.
Follow the guides to adjust your kernels based on the profiler's findings.
Re-run the profiler to confirm the improvement, and repeat as needed.
The result is a set of kernels tuned through repeated profiling, saving time on every run.
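The benchmark-and-verify part of the loop above can be sketched as follows. This is a CPU-side illustration under stated assumptions, not the repository's tooling: the names `benchmark`, `compare`, `scale_loop`, and `scale_comprehension` are hypothetical, and the toy functions stand in for real kernel variants. Timing actual GPU kernels would additionally require device synchronization, which this sketch does not cover.

```python
import time


def benchmark(fn, args, warmup=3, iters=20):
    """Return the best wall-clock time in seconds over `iters` timed calls."""
    for _ in range(warmup):  # warm-up runs are not timed
        fn(*args)
    best = float("inf")
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def compare(variants, reference, args, tol=1e-6):
    """Check each variant against the reference output, then time the correct ones."""
    expected = reference(*args)
    results = {}
    for name, fn in variants.items():
        got = fn(*args)
        ok = len(got) == len(expected) and all(
            abs(g - e) <= tol for g, e in zip(got, expected)
        )
        # Incorrect variants are not timed: a fast wrong answer is not a win.
        results[name] = {"correct": ok, "seconds": benchmark(fn, args) if ok else None}
    return results


# Toy stand-ins for two implementations of the same operation (hypothetical).
def scale_loop(xs, k):
    out = []
    for x in xs:
        out.append(x * k)
    return out


def scale_comprehension(xs, k):
    return [x * k for x in xs]
```

Calling `compare({"loop": scale_loop, "comp": scale_comprehension}, scale_loop, ([1.0, 2.0, 3.0], 2.0))` returns a dictionary with a correctness flag and a best-case time per variant. Checking correctness before timing keeps an optimization that changes the result from being mistaken for a speedup.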