Mog9 / gpt2-inference
A GPT-2 inference engine written from scratch in CUDA and C++. Implements custom CUDA kernels for tiled matrix multiplication, LayerNorm, fused attention, transformer blocks, KV cache management, autoregressive token generation, and end-to-end GPT-2 inference with profiling and benchmarking.
This project is an educational implementation of a complete AI text generator, built from scratch with GPU programming. It recreates GPT-2 (a well-known language model) piece by piece, covering how the model represents words as numbers, pays attention to context, and generates new text one token at a time. The system runs on NVIDIA graphics cards and can produce around 190 words per second on a laptop GPU. It is designed to teach how language models work under the hood, with every component written from basic principles rather than taken from pre-made libraries.
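To make one of the listed components concrete, here is a minimal CPU sketch of LayerNorm in plain C++. It is not the repository's CUDA kernel: the function name `layer_norm` and its signature are assumptions made for illustration, and the code only shows the arithmetic a GPU kernel of this kind would parallelize. The default `eps` of 1e-5 matches the value GPT-2 uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for LayerNorm over one token's feature vector (hypothetical helper):
//   y[i] = gamma[i] * (x[i] - mean) / sqrt(var + eps) + beta[i]
std::vector<float> layer_norm(const std::vector<float>& x,
                              const std::vector<float>& gamma,
                              const std::vector<float>& beta,
                              float eps = 1e-5f) {
    assert(!x.empty());
    assert(gamma.size() == x.size() && beta.size() == x.size());

    const float n = static_cast<float>(x.size());

    // Mean of the features.
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= n;

    // Biased variance (divide by n, not n - 1).
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= n;

    // Normalize, then apply the learned scale (gamma) and shift (beta).
    const float inv_std = 1.0f / std::sqrt(var + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = gamma[i] * (x[i] - mean) * inv_std + beta[i];
    }
    return y;
}
```

A GPU version would typically assign one thread block per token and compute the mean and variance with a parallel reduction; the per-element formula stays the same.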
How It Works
You discover a project that builds an AI text generator from the ground up, piece by piece.
You explore how each piece works: embeddings turn words into numbers, and attention helps the AI focus on what matters.
You see how the AI processes your words step-by-step through layers of math and memory tricks.
You load real trained model parameters (called weights) and watch the system generate words one at a time.
The system produces text at around 190 words per second on a laptop GPU, and you now understand how AI text generation works behind the scenes.
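The attention and "memory tricks" steps above can be sketched as a single decode step with a KV cache. This is a simplified single-head CPU version in C++, not the project's fused CUDA kernel; `KVCache` and `attend_step` are hypothetical names introduced here. The idea is that each new token appends its key and value to the cache, so earlier tokens are never recomputed, and then attends over everything stored so far.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head KV cache: one key and one value vector per token seen so far.
struct KVCache {
    std::vector<std::vector<float>> keys;
    std::vector<std::vector<float>> values;
};

// One decode step: store the new token's key/value, then attend over the whole cache.
// Causality holds by construction, since the cache only holds past and current tokens.
std::vector<float> attend_step(KVCache& cache,
                               const std::vector<float>& q,
                               const std::vector<float>& k,
                               const std::vector<float>& v) {
    assert(!q.empty() && q.size() == k.size());
    cache.keys.push_back(k);
    cache.values.push_back(v);

    const std::size_t t = cache.keys.size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(q.size()));

    // Scaled dot-product score of the query against every cached key.
    std::vector<float> scores(t);
    for (std::size_t j = 0; j < t; ++j) {
        float dot = 0.0f;
        for (std::size_t d = 0; d < q.size(); ++d) dot += q[d] * cache.keys[j][d];
        scores[j] = dot * scale;
    }

    // Numerically stable softmax: subtract the maximum before exponentiating.
    const float max_score = *std::max_element(scores.begin(), scores.end());
    float denom = 0.0f;
    for (float& s : scores) {
        s = std::exp(s - max_score);
        denom += s;
    }

    // Output is the attention-weighted sum of the cached values.
    std::vector<float> out(v.size(), 0.0f);
    for (std::size_t j = 0; j < t; ++j) {
        const float w = scores[j] / denom;
        for (std::size_t d = 0; d < v.size(); ++d) out[d] += w * cache.values[j][d];
    }
    return out;
}
```

Calling `attend_step` once per generated token keeps the per-step cost proportional to the number of cached tokens, rather than recomputing attention for the full sequence each time. A real GPT-2 layer runs several such heads in parallel and stores the cache in preallocated GPU memory.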