UT-InfraAI / cuco

An agent for CUDA compute-communication kernel co-design

Found Mar 07, 2026 at 14 stars.
AI Summary

CUCo is an AI-powered framework that automatically evolves high-performance GPU kernels by jointly optimizing computation and communication patterns.

How It Works

1
💡 Discover CUCo

You find a tool that automatically speeds up your GPU programs by co-designing how they compute and communicate.

2
🔧 Set it up quickly

Follow the install instructions to set it up on your machine.

3
📝 Share your starting code

Pick your existing GPU program and let the tool understand it.

4
🚀 Launch the agents

Hit start and let the AI agents analyze, transform, and evolve faster versions of your code over generations.

5
📊 Watch progress unfold

Check colorful charts and trees showing how each new version improves speed and efficiency.

6
Pick the winners

Browse the best evolved programs that beat your original.

7

Run super-fast GPU code

Swap in the optimized version and celebrate up to 1.57x faster performance on your real workloads.
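The evolve-and-select loop in steps 4 through 7 can be sketched in plain Python. This is an illustrative toy, not CUCo's actual API: `mutate` stands in for an LLM-proposed kernel rewrite, and `benchmark` stands in for compiling and timing the kernel on real GPUs.

```python
import random

def mutate(kernel_src, rng):
    # Stand-in for an LLM-proposed transformation; a real agent would
    # rewrite the CUDA source (e.g. fuse compute with communication).
    return kernel_src + f"\n// variant-{rng.randint(0, 9999)}"

def benchmark(kernel_src):
    # Stand-in for compiling and timing on real hardware; here, more
    # evolved sources simply score as faster so the loop has a signal.
    return 1.0 / (1.0 + 0.01 * kernel_src.count("variant"))

def evolve(seed_kernel, generations=5, population=4, seed=0):
    # Keep the fastest variant found so far, like steps 4-6 above.
    rng = random.Random(seed)
    best_time, best_src = benchmark(seed_kernel), seed_kernel
    for _ in range(generations):
        # Each generation proposes several candidates from the current best.
        for cand in [mutate(best_src, rng) for _ in range(population)]:
            t = benchmark(cand)
            if t < best_time:  # lower runtime wins
                best_time, best_src = t, cand
    return best_time, best_src

runtime, kernel = evolve("__global__ void kernel() {}")
```

In the real system the selection signal comes from measured end-to-end latency on your hardware, and the population is kept as an evolution tree rather than a single best candidate.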

AI-Generated Review

What is cuco?

CUCo is a Python framework that uses AI agents to automatically transform and optimize CUDA kernels for multi-GPU setups, shifting host-driven NCCL communication to device-initiated primitives like GIN or LSA. It solves the pain of manually co-designing compute and communication, which often leaves latency on the table in workloads like MoE dispatch or KV cache transfers. Run it via CLI on your host-driven kernels to get evolved versions with up to 1.57x faster end-to-end performance, plus a web UI for visualizing evolution trees and Pareto fronts.
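The Pareto fronts the web UI visualizes can be computed in a few lines. A minimal sketch (not CUCo's implementation), assuming each evolved kernel is scored on two objectives to minimize, such as compute latency and communication time:

```python
def pareto_front(points):
    # A point is dominated if another point is at least as good on
    # both objectives and differs from it (hence strictly better on one).
    def dominated(p):
        return any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
    return [p for p in points if not dominated(p)]

# (compute_ms, comm_ms) for four hypothetical evolved kernels
variants = [(1.0, 5.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)]
front = pareto_front(variants)
```

Here the last variant is dominated by (2.0, 2.0) and drops off the front; the surviving points are the trade-off curve a user would browse when picking winners.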

Why is it gaining traction?

Unlike static autotuners, this CUDA agent leverages LLMs (Anthropic Claude, OpenAI, Gemini) for evolutionary search without training data, discovering tricky overlaps and fusions humans miss. Developers like the fast-path auto-conversion plus slow-path evolution that runs locally or on Slurm, with easy LLM backend swaps and workload extensibility. It pairs Copilot-style agent ergonomics with serious HPC gains on real benchmarks like Flash Attention.

Who should use this?

ML systems engineers tuning multi-node GPU communication in training and inference pipelines, especially for MoE models or attention layers. CUDA devs on infra teams building custom collectives or communication-heavy pipelines who are frustrated by NCCL bottlenecks. Experimenters prototyping AI-agent-driven CUDA optimizations before scaling.

Verdict

Worth forking for GPU kernel hackers: a strong arXiv paper, thorough docs, and examples make it playable despite 14 stars and a 1.0% credibility score signaling early-alpha status. Test it on your own workloads, and pair it with robust evals to validate speedups.
