tonbistudio

Running a 32 GB AI model on 28 GB of memory — MoE expert streaming from NVMe SSD on Windows

Found Apr 08, 2026 at 17 stars
AI Analysis
AI Summary

This repository provides instructions, benchmarks, and tools for running large Mixture-of-Experts AI models on Windows PCs with limited memory by streaming expert weights from an NVMe SSD.

How It Works

1
🔍 Discover the idea

You learn about a clever way to run huge AI chatbots on your regular gaming PC, even if it doesn't have tons of memory.

2
💻 Check your gear

Look at your Windows PC to confirm it has an NVIDIA graphics card, at least 16GB memory, and a speedy solid-state drive.
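For the drive, a rough sequential-read check gives a quick read on whether your SSD is in the right ballpark. This is just a stand-in for the repo's own C benchmarks, runnable from Git Bash or WSL; the file name and size are arbitrary:

```shell
# Create a 256 MiB scratch file, then time a sequential read of it.
# Caution: the read may be served from the OS page cache, so treat the
# reported rate as an upper bound; the repo's C benchmarks are more careful.
dd if=/dev/zero of=bench.bin bs=8M count=32 2>/dev/null
RESULT=$(dd if=bench.bin of=/dev/null bs=8M 2>&1 | tail -n 1)
echo "$RESULT"   # GNU dd prints bytes copied, elapsed seconds, and MB/s or GB/s
rm -f bench.bin
```

If the number comfortably exceeds the ~1.5 GB/s the review below cites for cold expert loads, the drive should not be the bottleneck.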

3
📥 Download essentials

Get the free AI runner program (llama.cpp) and a large AI model file, saving them in a simple folder on your drive.

4
🚀 Launch with smart streaming

Fire up the AI using a quick command that keeps the main parts in your graphics memory and pulls extra pieces from your fast drive on the fly.
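The "quick command" here refers to llama.cpp's tensor-override flag. A sketch of such a launch, where the model path, context size, and exact tensor regex are assumptions to adapt rather than the repo's verbatim command:

```shell
# Keep attention and shared weights on the GPU (-ngl 99), but route the MoE
# expert tensors (names like blk.12.ffn_up_exps.weight) to CPU memory with -ot.
# Because llama.cpp memory-maps the GGUF file by default, experts that don't
# fit in RAM are demand-paged from the NVMe SSD through the OS page cache.
llama-cli -m C:/models/Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -ot "blk\..*\.ffn_.*_exps\.=CPU" \
  -c 8192
```

The `-ot` (`--override-tensor`) flag takes `<name-regex>=<buffer-type>` pairs; the expert-FFN pattern above is the common community idiom for MoE offload, not something specific to this repo.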

5
💬 Chat away

Start asking questions and see the AI respond with smart text at a snappy pace, like 2-4 words per second.

🎉 Huge AI unlocked!

Celebrate running a massive 32GB AI smoothly on your budget setup, stretching your PC's power to the max.


AI-Generated Review

What is moe-ssd-streaming-windows?

This project lets you run massive 32GB Mixture-of-Experts (MoE) AI models on Windows PCs with just 28GB of total memory (12GB VRAM + 16GB RAM) by streaming expert weights from an NVMe SSD in real time. Using llama.cpp and simple CLI flags like `-ot` to offload experts to CPU, it leverages the OS page cache to keep hot experts in RAM and cold-loads the rest from the SSD at about 1.5GB/s. Written in C with benchmarks and monitoring scripts, it delivers 2.5-4.3 tokens/second on budget gaming hardware like an RTX 3060.
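A back-of-envelope check using only the numbers above shows why the page cache matters: if every token forced a cold read of the full 32 GB of weights at 1.5 GB/s, throughput would cap out far below the reported rate, so most expert reads must be RAM page-cache hits.

```shell
# Worst-case ceiling if every token re-read all 32 GB from SSD at 1.5 GB/s.
CEILING=$(awk 'BEGIN { printf "%.3f", 1.5 / 32 }')
echo "cold-read ceiling: $CEILING tok/s"   # prints 0.047, vs the reported 2.5-4.3
```

In practice an A3B-style MoE touches only its active experts per token, which is what makes the cache hit rate, and hence the observed speed, achievable at all.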

Why is it gaining traction?

It stands out by turning consumer NVMe SSDs into virtual RAM for local LLM inference, letting you run large models locally without pricey hardware upgrades. Developers like the practical proof: C benchmarks testing SSD throughput for expert loads, plus ready-made scripts for Qwen3-30B-A3B GGUF files. The hook is real results on PCIe-limited setups, with tips for checking your NVMe link width and speed.

Who should use this?

AI hobbyists and indie devs on Windows with NVIDIA GPUs (6GB+ VRAM) and NVMe SSDs who want to experiment with large MoE models via llama.cpp. Ideal for anyone pushing local-inference limits on a gaming rig where the full model doesn't fit in RAM. Skip it if you're on macOS/Linux or lack CUDA.

Verdict

Worth a spin for Windows local-LLM tinkerers: the solid README, scripts, and C benchmarks make setup straightforward, even if 17 stars and a 1.0% credibility score signal early-stage maturity. Test your hardware first; performance scales with faster SSDs and more RAM.

