0xSero

How many experts do we need to serve a model?

Found Mar 21, 2026 at 47 stars.
Language: Python
AI Summary

REAP-swap is a vLLM server extension that uses REAP observation data to dynamically optimize GPU-resident experts for Mixture-of-Experts models, reducing CPU-GPU transfers and improving inference speed on memory-constrained hardware.

How It Works

1. 🔍 Discover REAP-swap

You hear about this tool while looking for ways to make huge AI models run faster on your home computer without buying fancy new hardware.

2. 💬 Gather your chat history

You pull together your past conversations with AI assistants to capture the kinds of questions you usually ask.

3. 📊 Analyze usage patterns

You run a quick check on those chats to spot which parts of the AI get used the most in your daily life.
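At its core, the usage analysis amounts to counting how often the router selects each expert while replaying your chat history. A minimal sketch, assuming REAP-style observations are available as (layer, expert_id) activation records (the log format here is hypothetical, not the tool's actual schema):

```python
from collections import Counter

def rank_experts(activations):
    """Rank experts by how often the router selected them.

    `activations` is a list of (layer, expert_id) pairs, one per
    routing decision observed while replaying past conversations.
    """
    counts = Counter(activations)
    # Most frequently activated experts first
    return [expert for expert, _ in counts.most_common()]

# Toy observation log: layer 0 routes mostly to expert 3
log = [(0, 3), (0, 3), (0, 7), (1, 2), (0, 3)]
print(rank_experts(log))  # → [(0, 3), (0, 7), (1, 2)]
```

The top of this ranking is what you would want pinned in VRAM; the long tail can stay on the CPU side.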

4. 🧠 Create your smart plan

This generates a personalized guide telling the AI exactly which helpful pieces to keep ready in fast memory for you.
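The review below mentions loading a "plan JSON", but its exact schema isn't documented on this page, so the shape used here is purely illustrative: a per-layer list of expert IDs to keep GPU-resident.

```python
import json

# Hypothetical plan layout: for each MoE layer, which experts
# should stay pinned in VRAM for this workload.
plan = {
    "model": "Qwen2.5-30B-A3B",
    "gpu_resident": {
        "0": [3, 7, 12],   # layer index → hot expert IDs
        "1": [2, 5],
    },
}

with open("plan.json", "w") as f:
    json.dump(plan, f, indent=2)
```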

5. 🚀 Start your AI helper

You launch the AI on your computer using your custom plan, and it's ready to chat over the web.

6. 🔄 Prepare for each chat

Before asking a question, you give a quick hint about the topic so it loads the best parts upfront.
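The per-chat hint in step 6 corresponds to the server's POST /swap_active_set endpoint named in the review; the route is from this page, but the JSON body shape below is an assumption, not documented API.

```python
import json
import urllib.request

def swap_hint(base_url, expert_ids):
    """Build a request asking the server to preload a set of experts.

    The /swap_active_set route is real per the review; the payload
    shape ({"expert_ids": [...]}) is a guess for illustration.
    """
    body = json.dumps({"expert_ids": expert_ids}).encode()
    return urllib.request.Request(
        f"{base_url}/swap_active_set",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = swap_hint("http://localhost:8000", [[0, 3], [0, 7]])
print(req.full_url)  # → http://localhost:8000/swap_active_set
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would happen just before the chat completion call, so the hot experts are already resident when generation starts.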

Lightning-fast answers

Your AI responds much quicker with no delays, making chatting feel smooth and natural every time.


AI-Generated Review

What is reap-expert-swap?

REAP-swap runs a vLLM server that smartly offloads Mixture-of-Experts (MoE) model parameters to CPU while keeping workload-hot experts in GPU VRAM. For huge models like Qwen2.5-30B-A3B that overflow consumer GPUs, it uses calibration data from your chats to preload the right experts, slashing CPU-GPU transfers. Built in Python, it extends the OpenAI API with endpoints like POST /swap_active_set for per-request expert swaps.
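The core idea, keeping a workload-hot subset of experts in VRAM and paying a CPU-to-GPU transfer only on a miss, can be sketched as a simple resident-set lookup. This is illustrative only; the repo's actual implementation patches vLLM internals rather than using a class like this.

```python
class ExpertPool:
    """Toy model of GPU-resident vs CPU-offloaded experts."""

    def __init__(self, hot_experts):
        self.gpu = set(hot_experts)   # pinned in VRAM per the plan
        self.misses = 0               # forced CPU→GPU transfers

    def fetch(self, expert_id):
        if expert_id in self.gpu:
            return "gpu"              # fast path: weights already resident
        self.misses += 1              # slow path: copy weights over PCIe
        return "cpu"

pool = ExpertPool(hot_experts={3, 7})
print([pool.fetch(e) for e in (3, 7, 9)])  # → ['gpu', 'gpu', 'cpu']
print(pool.misses)                         # → 1
```

The better the calibration plan matches the live workload, the fewer misses, which is exactly what the benchmark gains below measure.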

Why is it gaining traction?

It beats vLLM's default tail-layer offload strategy with 44% faster time-to-first-token and 14% quicker prefill on 8x RTX 3090s, while preserving full model quality: no pruning needed. The workflow hook: extract your AI chat history, run REAP observations over it, load the resulting plan JSON, and serve a deployment tuned to your prompts. Router misses are tracked via GET /router_misses so you can verify expert coverage.
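The coverage check via GET /router_misses boils down to a hit rate: the fraction of routing decisions served by GPU-resident experts. The endpoint is from this page, but its response format isn't shown, so the arithmetic below is a sketch of the metric itself.

```python
def coverage(hits, misses):
    """Fraction of routing decisions served from GPU-resident experts."""
    total = hits + misses
    return hits / total if total else 1.0

# e.g. 950 router hits and 50 misses → 95% of activations stayed on-GPU
print(f"{coverage(950, 50):.0%}")  # → 95%
```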

Who should use this?

AI infrastructure engineers serving 30B+ MoE models on VRAM-constrained clusters, especially for coding or chat workloads where expert activation is skewed. It is a good fit for teams building Copilot-style agents on self-hosted models. Skip it if you aren't willing to patch vLLM and run a calibration pass over your own data.

Verdict

Promising prototype for expert offloading (47 stars, 1.0% credibility), with solid benchmarks and docs, but it requires vLLM patches and external REAP tooling; its early maturity means you should test thoroughly. Fork it if MoE serving bottlenecks your setup; otherwise, watch and wait for stability.


