HJCheng0602

A from-scratch Prefill/Decode disaggregation inference engine for LLMs

40 stars · 4 forks · 100% credibility
Found Apr 11, 2026 at 40 stars
AI Analysis
Python
AI Summary

nanoPD is a specialized inference engine that speeds up LLM responses by splitting prompt processing (prefill) and token generation (decode) across multiple GPUs.
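The prefill/decode split works like a two-stage pipeline: one GPU digests the whole prompt in a single compute-heavy pass and hands the resulting KV cache to a second GPU, which then emits tokens one step at a time. A toy Python sketch of that handoff, with hypothetical worker names and string stand-ins for tensors (this is not nanoPD's actual code):

```python
import queue
import threading

def prefill_worker(prompts, handoff):
    """Compute-bound stage: one big pass over all prompt tokens."""
    for req_id, prompt in prompts:
        kv_cache = [f"kv({tok})" for tok in prompt.split()]
        handoff.put((req_id, kv_cache))
    handoff.put(None)  # sentinel: no more requests

def decode_worker(handoff, results, max_new_tokens=3):
    """Memory-bound stage: each step reads the cache, appends one entry."""
    while (item := handoff.get()) is not None:
        req_id, kv_cache = item
        out = []
        for step in range(max_new_tokens):
            out.append(f"tok{step}")
            kv_cache.append(f"kv(tok{step})")
        results[req_id] = out

handoff, results = queue.Queue(), {}
prompts = [(0, "hello world"), (1, "prefill then decode")]
t1 = threading.Thread(target=prefill_worker, args=(prompts, handoff))
t2 = threading.Thread(target=decode_worker, args=(handoff, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(results[0])  # ['tok0', 'tok1', 'tok2']
```

In the real engine the handoff is a KV-cache tensor transfer between GPUs rather than a Python queue, but the pipeline shape is the same.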

How It Works

1
🔍 Discover nanoPD

You find nanoPD on GitHub: a from-scratch inference engine that makes LLM chatbots respond much faster by splitting work across multiple GPUs.

2
💻 Prepare your setup

Clone the repository onto a machine with one or more GPUs and install its dependencies in a few simple steps.

3
🧠 Load an AI model

Point the engine at a supported model such as Qwen so it has weights to run.

4
Try the quick demo

Run the single-GPU demo and watch tokens stream out right away.

5
Pick your power mode
🟢
Simple (collocated) mode

Keep prefill and decode on one card for easy, reliable chatting.

🔴
Full power (disaggregated) mode

Spread prefill and decode across cards for much higher throughput.

6
📊 Measure the magic

Run the included benchmarks to see how much faster your AI responds across modes and workloads.

🚀 AI supercharged!

Celebrate as your chatbot generates text lightning-fast, handling tons of requests without slowing down.
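The speed tests in step 6 boil down to a tokens-per-second measurement. A toy sketch of that measurement shape, using a fake generator with made-up timings (real nanoPD benchmarks run against actual GPUs):

```python
import time

def fake_generate(n_tokens, per_token_s=0.001):
    """Stand-in for a decode loop: sleep once per generated token."""
    for _ in range(n_tokens):
        time.sleep(per_token_s)
        yield "tok"

def tokens_per_second(n_tokens=200):
    """Time the full generation and report throughput."""
    start = time.perf_counter()
    count = sum(1 for _ in fake_generate(n_tokens))
    elapsed = time.perf_counter() - start
    return count / elapsed

print(f"{tokens_per_second():.0f} tok/s")
```

Real harnesses additionally vary request arrival patterns (e.g. Poisson loads, as in the benchmarks below) and report per-request latency alongside aggregate throughput.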


AI-Generated Review

What is nanoPD?

nanoPD is a Python-based inference engine for LLMs that splits the compute-heavy prefill phase (prompt processing) from the memory-bound decode phase (token generation) across dedicated GPUs, dodging the interference that kills throughput in standard collocated setups. Built from scratch, it delivers a full serving stack with a paged KV cache, chunked prefill, and adaptive routing on top of PyTorch and custom CUDA kernels. Users get a single-GPU collocated mode or multi-GPU disaggregation, benchmarked on Qwen models up to 8B.
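A paged KV cache, one of the pieces mentioned above, maps logical token positions through a block table to fixed-size physical blocks, so a sequence can grow without large contiguous allocations. A minimal illustrative sketch (class and field names are invented here, not nanoPD's implementation):

```python
BLOCK_SIZE = 4  # tokens per physical block (tiny for illustration)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}               # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored so far
        self.storage = {}                    # (block, offset) -> kv entry

    def append(self, seq_id, kv):
        """Store one token's KV entry, grabbing a new block when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:              # current block is full (or first)
            table.append(self.free.pop())
        block, offset = table[n // BLOCK_SIZE], n % BLOCK_SIZE
        self.storage[(block, offset)] = kv
        self.lengths[seq_id] = n + 1

cache = PagedKVCache(num_blocks=8)
for i in range(6):                           # 6 tokens -> spans 2 blocks
    cache.append("req0", f"kv{i}")
print(len(cache.block_tables["req0"]))  # 2
```

The indirection is also what makes disaggregation cheap to reason about: transferring a sequence's cache means shipping exactly the blocks its table points at.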

Why is it gaining traction?

It stands out with an analytical cost model that routes requests dynamically—no training needed—picking collocated or disaggregated paths based on live hardware metrics like inter-GPU bandwidth. Benchmarks show adaptive mode hitting 240 tok/s on 8x RTX 4090s under Poisson loads, beating pure collocated at scale. Demos like `demo_multiGPU.py` let you profile, route, and serve in minutes.
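The analytical-routing idea can be sketched as: estimate each path's cost from a few hardware parameters, then take the cheaper one. In this hedged sketch every formula and constant is an illustrative assumption, not nanoPD's actual cost model:

```python
def route(prompt_tokens, new_tokens,
          prefill_tok_s=8000,        # assumed prefill throughput
          decode_tok_s=60,           # assumed decode throughput
          stall_factor=0.8,          # assumed decode stall under collocated prefill
          kv_bytes_per_tok=256_000,  # assumed KV-cache footprint per token
          link_gbps=25,              # assumed inter-GPU link bandwidth
          setup_s=0.02):             # assumed one-off handoff overhead
    prefill_s = prompt_tokens / prefill_tok_s
    decode_s = new_tokens / decode_tok_s
    # Collocated: prefill chunks preempt decode steps on the same GPU.
    collocated = prefill_s + decode_s + stall_factor * prefill_s
    # Disaggregated: pay a one-off handoff plus the KV-cache transfer.
    transfer_s = prompt_tokens * kv_bytes_per_tok / (link_gbps * 1e9 / 8)
    disaggregated = prefill_s + decode_s + setup_s + transfer_s
    return "disaggregated" if disaggregated < collocated else "collocated"

print(route(prompt_tokens=4000, new_tokens=200))  # disaggregated
print(route(prompt_tokens=64, new_tokens=32))     # collocated
```

Under these made-up numbers, long prompts route to the disaggregated path (interference cost outgrows the transfer cost), while short prompts stay collocated; with NVLink-class bandwidth the transfer term shrinks and disaggregation wins even earlier.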

Who should use this?

Multi-GPU owners tuning LLM serving for production throughput, like AI infra engineers handling variable prompt lengths. Ideal for teams benchmarking Qwen/Llama on consumer hardware (RTX 40-series) or datacenter (H20), especially with NVLink for near-zero KV transfer costs.

Verdict

Worth forking for disaggregated-serving experiments: the docs cover every module in English and Chinese, and the benchmarks are reproducible. But 40 stars signals early days, so expect rough edges in scaling. Tinker with it if you're optimizing from-scratch inference engines.


