Mog9

A GPT-2 inference engine written from scratch in CUDA and C++. Implements custom CUDA kernels for tiled matrix multiplication, LayerNorm, fused attention, transformer blocks, KV cache management, autoregressive token generation, and end-to-end GPT-2 inference with profiling and benchmarking.

89% credibility
Found May 19, 2026 at 31 stars.
AI Analysis
Cuda
AI Summary

This project is an educational implementation of a complete AI text generator built from scratch using GPU programming. It recreates GPT-2 (a well-known language model) piece by piece, including how the model represents words as numbers, pays attention to context, and generates new text one token at a time. The system runs on NVIDIA graphics cards and can produce around 190 tokens per second on an RTX 3050 laptop GPU. It is designed to teach how language models work under the hood, with every component written from basic principles rather than taken from pre-made libraries.

How It Works

1. 💡 Curiosity strikes

You discover a project that builds an AI text generator from the ground up, piece by piece.

2. 📚 Learning the building blocks

You explore how each piece works: embeddings turn words into numbers, and attention helps the model focus on the context that matters.

3. 🔧 Watching the magic happen

You see how the model processes your words step by step through stacked layers of math, with a cache that avoids repeating earlier work.

4. 🚀 Bringing it to life

You load real pretrained GPT-2 weights (the model's learned parameters) and watch the system generate text one token at a time.

✨ Your AI is talking

The system produces text at roughly 190 tokens per second, and you now understand how AI text generation works behind the scenes.

AI-Generated Review

What is gpt2-inference?

A CUDA inference engine that reimplements GPT-2 entirely from scratch in C++ and CUDA, targeting developers who want to understand or optimize transformer execution at the hardware level. It builds a complete end-to-end pipeline from raw CUDA kernels, including custom tiled matrix multiplication, fused attention, KV cache management, and autoregressive generation. The project delivers a working GPT-2 Small implementation that hits ~190 tokens/sec on an RTX 3050 laptop GPU.
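To make the KV cache idea concrete, here is a minimal CPU sketch in C++ of a single-head cache used during autoregressive decoding. It is not taken from the repository: the names `KVCache`, `append`, and `attend` are hypothetical, and the layout (one row-major buffer each for keys and values) is an assumption chosen for readability. The point it illustrates is that each generation step stores only the new token's key and value, then attends over everything cached so far instead of recomputing earlier positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head KV cache; rows are stored as [max_len x head_dim].
struct KVCache {
    std::size_t head_dim;
    std::size_t max_len;
    std::size_t len = 0;          // number of positions cached so far
    std::vector<float> keys;
    std::vector<float> values;

    KVCache(std::size_t dim, std::size_t capacity)
        : head_dim(dim), max_len(capacity),
          keys(dim * capacity, 0.0f), values(dim * capacity, 0.0f) {}

    // Store the key/value of the newest token at position `len`.
    void append(const std::vector<float>& k, const std::vector<float>& v) {
        assert(len < max_len && k.size() == head_dim && v.size() == head_dim);
        std::copy(k.begin(), k.end(), keys.begin() + len * head_dim);
        std::copy(v.begin(), v.end(), values.begin() + len * head_dim);
        ++len;
    }
};

// One decode step: attend from the newest query over all cached positions.
// No causal mask is needed because the cache only holds past tokens.
std::vector<float> attend(const KVCache& c, const std::vector<float>& q) {
    assert(c.len > 0 && q.size() == c.head_dim);
    const float scale = 1.0f / std::sqrt(static_cast<float>(c.head_dim));

    std::vector<float> score(c.len);
    float max_score = -INFINITY;
    for (std::size_t t = 0; t < c.len; ++t) {
        float dot = 0.0f;
        for (std::size_t k = 0; k < c.head_dim; ++k)
            dot += q[k] * c.keys[t * c.head_dim + k];
        score[t] = dot * scale;
        max_score = std::max(max_score, score[t]);
    }

    // Softmax with the max subtracted for numerical stability.
    float sum = 0.0f;
    for (std::size_t t = 0; t < c.len; ++t) {
        score[t] = std::exp(score[t] - max_score);
        sum += score[t];
    }

    // Weighted sum of cached value rows.
    std::vector<float> out(c.head_dim, 0.0f);
    for (std::size_t t = 0; t < c.len; ++t) {
        const float w = score[t] / sum;
        for (std::size_t k = 0; k < c.head_dim; ++k)
            out[k] += w * c.values[t * c.head_dim + k];
    }
    return out;
}
```

In a generation loop under these assumptions, each step would call `append` once with the new token's key and value and then `attend` once with its query, so per-step work grows with the cached length rather than requiring a full pass over the sequence's projections.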

Why is it gaining traction?

It reads like Andrej Karpathy's style of GPT-2 implementation, but as runnable CUDA and C++ code. Rather than wrapping PyTorch, every operation (GEMM, LayerNorm, GELU, softmax) is implemented as a custom CUDA kernel. The attention pipeline (QK^T -> scale -> causal mask -> softmax) runs as one fused step, which eliminates intermediate memory copies inside the attention hot path. For developers curious about the gap between high-level ML frameworks and actual hardware behavior, this is a rare hands-on example.
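The QK^T -> scale -> causal mask -> softmax sequence can be sketched on the CPU as a single function that never materializes the intermediate score matrix. This is a reference illustration in C++, not the project's kernel: the function name `causal_attention_row` and its argument layout are assumptions. It computes the attention weights for one query position, which is roughly the unit of work a fused kernel might assign to a thread block or warp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Attention weights for one query row, with all four steps in one pass
// structure. `keys` is row-major [n_keys x d]; `q_pos` is the query's
// position, so keys at positions > q_pos are masked out (weight 0).
std::vector<float> causal_attention_row(const std::vector<float>& q,
                                        const std::vector<float>& keys,
                                        std::size_t n_keys,
                                        std::size_t q_pos) {
    const std::size_t d = q.size();
    assert(d > 0 && keys.size() == n_keys * d && q_pos < n_keys);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    std::vector<float> w(n_keys, 0.0f);  // masked entries stay at zero
    float max_score = -INFINITY;

    // QK^T and scale, restricted to visible keys (the causal mask).
    for (std::size_t j = 0; j <= q_pos; ++j) {
        float dot = 0.0f;
        for (std::size_t k = 0; k < d; ++k) dot += q[k] * keys[j * d + k];
        w[j] = dot * scale;
        max_score = std::max(max_score, w[j]);
    }

    // Softmax over the visible keys, subtracting the max for stability.
    float sum = 0.0f;
    for (std::size_t j = 0; j <= q_pos; ++j) {
        w[j] = std::exp(w[j] - max_score);
        sum += w[j];
    }
    for (std::size_t j = 0; j <= q_pos; ++j) w[j] /= sum;
    return w;
}
```

Skipping masked keys entirely, rather than writing a large negative value and exponentiating it, is one common way to apply a causal mask; whether the repository does it this way is not stated in the source.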

Who should use this?

ML engineers learning CUDA optimization will find concrete, working examples of shared-memory GEMM tiling, warp-level reductions, and fused kernels. Systems programmers evaluating transformer inference tradeoffs can inspect the actual kernel implementations rather than guessing at library internals. This is not production-ready infrastructure; it is a learning and experimentation platform.
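The loop structure behind GEMM tiling can be shown without a GPU. The sketch below is a CPU analogue in C++, not the repository's kernel: `gemm_tiled` and its `tile` parameter are hypothetical names, and the blocking here stands in for what a CUDA kernel would do by staging tiles of A and B in shared memory. The idea in both cases is to reuse a small block of each input many times before moving on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C (M x N) = A (M x K) * B (K x N), all row-major.
// The three outer loops walk over tiles; the three inner loops do the
// multiply-accumulate within one tile, so each tile of A and B is reused
// across a whole tile of C before the next tile is touched.
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t K,
                std::size_t N, std::size_t tile = 16) {
    assert(tile > 0 && A.size() == M * K && B.size() == K * N);
    C.assign(M * N, 0.0f);

    for (std::size_t i0 = 0; i0 < M; i0 += tile) {
        for (std::size_t j0 = 0; j0 < N; j0 += tile) {
            for (std::size_t k0 = 0; k0 < K; k0 += tile) {
                // Clamp tile bounds so sizes need not be multiples of `tile`.
                const std::size_t i1 = std::min(i0 + tile, M);
                const std::size_t j1 = std::min(j0 + tile, N);
                const std::size_t k1 = std::min(k0 + tile, K);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j1; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
                }
            }
        }
    }
}
```

On a GPU, the tile size is typically chosen to match the thread-block shape and the available shared memory; the default of 16 here is only a placeholder.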

Verdict?

At 22 stars with minimal documentation, this is early-stage code best suited for learning and exploration rather than production deployment. The implementation is solid and well-structured, but the community footprint is too small to validate long-term reliability. Use it to understand transformer CUDA internals, then decide if your production needs warrant building on top of it. The 89% credibility score reflects a nascent project with potential, not a battle-tested solution.
