KernelFlow-ops / cuda-optimized-skill
A CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It helps improve custom GPU operators with reproducible workflows and evidence-based performance comparison.
A set of skills and scripts for AI agents to iteratively optimize GPU kernels through correctness validation, benchmarking, profiling, and strategy-guided improvements across multiple backends.
How It Works
You learn about a tool that speeds up your GPU math routines through systematic testing and tuning.
You pick an existing routine that runs slower than you would like.
You ask your AI assistant to apply this skill to the routine for a set number of improvement rounds.
The AI runs tests to confirm that the routine produces correct results and measures its baseline speed.
The AI profiles the routine, proposes changes, updates the code, and repeats the tests over several rounds to find the fastest version.
You receive the fastest version found, along with evidence of the speedup and details of what changed.
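The loop described above (validate, measure a baseline, propose a change, re-test, keep the fastest) can be sketched in a few lines of plain Python. This is only an illustration of the control flow, not the toolkit's actual code: the names `is_correct`, `benchmark`, `optimize`, and `propose` are hypothetical, the "routine" is an ordinary Python function rather than a CUDA kernel, and wall-clock timing stands in for real GPU benchmarking and Nsight Compute profiling.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a GPU kernel: a function from inputs to outputs.
Routine = Callable[[Sequence[float]], List[float]]


def is_correct(candidate: Routine, reference: Routine,
               inputs: Sequence[float], tol: float = 1e-6) -> bool:
    """Compare the candidate's output with the reference, element by element."""
    expected = reference(inputs)
    actual = candidate(inputs)
    if len(expected) != len(actual):
        return False
    return all(abs(a - e) <= tol for a, e in zip(actual, expected))


def benchmark(routine: Routine, inputs: Sequence[float], repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        routine(inputs)
        best = min(best, time.perf_counter() - start)
    return best


def optimize(baseline: Routine,
             propose: Callable[[int, Routine], Routine],
             inputs: Sequence[float],
             rounds: int) -> Tuple[Routine, float, List[Tuple[str, Optional[float]]]]:
    """Run a fixed number of rounds, keeping the fastest correct routine."""
    best, best_time = baseline, benchmark(baseline, inputs)
    history: List[Tuple[str, Optional[float]]] = [("baseline", best_time)]
    for round_index in range(rounds):
        # `propose` is an assumed callback that returns a new variant to try.
        candidate = propose(round_index, best)
        label = f"round {round_index + 1}"
        # Validate against the original baseline before measuring speed.
        if not is_correct(candidate, baseline, inputs):
            history.append((label + " (rejected: wrong output)", None))
            continue
        elapsed = benchmark(candidate, inputs)
        history.append((label, elapsed))
        if elapsed < best_time:
            best, best_time = candidate, elapsed
    return best, best_time, history
```

The returned `history` plays the role of the "evidence" in the last step: it records the baseline time and the outcome of every round, so a reported speedup can be traced back to a specific measured candidate. Checking correctness before timing means a fast but wrong variant never replaces the current best.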