pmerolla / fomoe (Public)

Fast Opportunistic Mixture-Of-Experts. From-scratch C/HIP MoE inference with multi-tier caching and cache-aware routing. First ever example of running Qwen3.5-397B at 5–9 tok/s on a $2,100 desktop.

10 stars · 1 fork · 100% credibility
Found Mar 24, 2026 at 10 stars.
AI Analysis
Language: C
AI Summary

FOMOE is a high-performance inference engine for running massive Mixture-of-Experts language models like Qwen3.5-397B locally on affordable consumer desktops using clever caching and dual-GPU techniques.

How It Works

1. 🔍 Discover fast local AI

You hear about FOMOE, a way to run giant AI models with a 400-billion-parameter brain on a simple desktop computer for just $2,100.

2. 🛒 Get your setup ready

Pick up the recommended computer parts or build your own: two graphics cards, fast storage, and everyday components that fit your budget.

3. 📥 Grab the AI files

Download the model weights, which include smart shortcuts for the most-used parts to make everything speedy from the start.

4. 🚀 Launch with one click

Run a simple command to unpack the experts and start your AI, and watch it warm up the fast memory caches automatically.

5. 💬 Chat away

Type messages in the interactive chat and get thoughtful replies at 5–9 tokens per second, feeling the power of huge AI right on your machine.

🎉 AI superpowers unlocked

Enjoy lightning-fast, private conversations with world-class intelligence, all without expensive servers or waiting for the cloud.


AI-Generated Review

What is fomoe?

FOMOE delivers blazing-fast inference for massive Mixture-of-Experts models like Qwen3.5-397B, hitting 5-9 tokens/second on a $2,100 AMD desktop rig with dual RX 9060 XT GPUs. Built from scratch in C with HIP for ROCm, it tackles the nightmare of loading 218GB of sparse expert weights from NVMe by using multi-tier caching across VRAM, DRAM, and opportunistic substitutions. Users get interactive chat, generation, perplexity eval, and frequency profiling via simple CLI commands like `./qwen-moe chat` or `./qwen-moe ppl`.
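The multi-tier lookup described above can be sketched roughly as follows. This is a minimal illustration only; the struct layout, slot counts, and function names are hypothetical assumptions, not FOMOE's actual internals:

```c
#include <stdio.h>

/* Tiers ordered fastest to slowest, mirroring the VRAM -> DRAM -> NVMe
 * hierarchy the review describes. TIER_NVME means a cold load. */
enum tier { TIER_VRAM, TIER_DRAM, TIER_NVME };

#define VRAM_SLOTS 8
#define DRAM_SLOTS 32
#define EMPTY_SLOT (-1)

typedef struct {
    int vram[VRAM_SLOTS]; /* expert ids resident in GPU memory */
    int dram[DRAM_SLOTS]; /* expert ids staged in host memory  */
} expert_cache;

/* Return the fastest tier that already holds the expert; fall back
 * to NVMe when it is cached nowhere. */
enum tier locate_expert(const expert_cache *c, int expert_id)
{
    for (int i = 0; i < VRAM_SLOTS; i++)
        if (c->vram[i] == expert_id) return TIER_VRAM;
    for (int i = 0; i < DRAM_SLOTS; i++)
        if (c->dram[i] == expert_id) return TIER_DRAM;
    return TIER_NVME;
}
```

Each routed expert would be resolved through a lookup like this before dispatch, with NVMe misses triggering the background prefetch the review mentions.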

Why is it gaining traction?

It crushes alternatives by enabling desktop-speed runs of 397B models without enterprise hardware, using cache-aware routing to swap uncached experts for close matches (with tunable quality tradeoffs) and ping-pong GPU alternation for doubled cache capacity. Background NVMe prefetch and warmup seeding make cold starts snappy, while full-fidelity mode ensures no compromises when needed. Devs dig the fast GitHub downloads for GGUF models and seamless AMD support—no CUDA lock-in.
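The opportunistic substitution with a tunable quality tradeoff might look roughly like this. The function name, gate-score comparison, and tolerance parameter are illustrative assumptions, not FOMOE's actual code:

```c
/* Hypothetical sketch of cache-aware routing: if the router's top
 * choice is not cached, fall back to the best cached expert whose
 * gate score is within `tolerance` of the top score. cached[e] is
 * nonzero when expert e is resident in VRAM or DRAM. */
int route_expert(const float *scores, int n_experts,
                 const int *cached, float tolerance)
{
    int best = 0;
    for (int e = 1; e < n_experts; e++)
        if (scores[e] > scores[best]) best = e;
    if (cached[best]) return best;   /* fast path: top expert is cached */

    int sub = -1;                    /* best-scoring cached substitute */
    for (int e = 0; e < n_experts; e++)
        if (cached[e] && (sub < 0 || scores[e] > scores[sub])) sub = e;

    /* Substitute only when the quality loss stays within tolerance;
     * otherwise pay the NVMe load for full fidelity. */
    if (sub >= 0 && scores[best] - scores[sub] <= tolerance) return sub;
    return best;
}
```

Setting the tolerance to zero would recover the full-fidelity mode the review mentions, since no substitute could ever qualify.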

Who should use this?

AMD GPU owners benchmarking huge MoE LLMs locally, AI hobbyists chasing high tok/s on budgets under $2,100, or researchers evaluating Qwen variants via perplexity on WikiText. Perfect for solo devs ditching cloud costs for offline chat or generation workflows.

Verdict

Grab it if you've got ROCm-ready AMD GPUs; the demo speeds are legitimate for early access. With 10 stars and 100% credibility, it's raw but MIT-licensed with a solid README; test on small prompts first, as maturity lags.

