
tayoun / flash-moe

Public

Run a 35B MoE model at 10+ tok/s on a $600 Mac mini. Pure C/Metal inference engine streaming experts from SSD on Apple Silicon

10 stars · 0 forks
100% credibility
Found Mar 26, 2026 at 10 stars -- GitGems finds repos before they trend.
AI Analysis
Objective-C
AI Summary

An optimized inference engine for running large Mixture-of-Experts AI language models on Apple Silicon Macs; it streams the model's specialized expert weights from disk on demand, enabling high performance on low-memory hardware.

How It Works

1
🔍 Discover Fast AI on Your Mac

You hear about a way to run powerful AI chatbots super fast on everyday Apple computers like a Mac mini, even with limited memory.

2
📥 Grab the AI Brain Files

You download the free AI model files from a shared online library to your computer.

3
🛠️ Prepare the Model Pieces

You use simple tools to organize and ready the model pieces for quick loading from your hard drive.

4
⚙️ Build Your AI Runner

With one easy build step, you create the special runner that makes everything work smoothly on your Mac's chip.

5
🚀 Launch the Chat Server

You start the server, and your AI is ready to chat over the web right from your machine.

6
💬 Chat with Super Speed

You send messages and get smart, helpful replies in seconds, feeling the thrill of 11 tokens per second on affordable hardware.

AI Magic at Home

Now you have a blazing-fast personal AI assistant running production-quality chats on your Mac, saving time and money.


AI-Generated Review

What is flash-moe?

flash-moe is a pure C/Metal inference engine that runs 35B Mixture-of-Experts models like Qwen3.5 at 11+ tokens/second on a $600 M4 Mac mini with 16GB RAM, streaming routed experts from SSD to sidestep memory limits. It delivers production-quality output with tool calling via an OpenAI-compatible server on port 8000. See the flash-moe paper for the optimizations behind this SSD-aware MoE setup.
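To make the OpenAI-compatible interface concrete, here is a minimal sketch of a chat-completions request body. Only the endpoint path (/v1/chat/completions) and port 8000 come from the review above; the model name and field values are illustrative assumptions, not taken from the flash-moe README.

```python
import json

def build_chat_request(prompt, model="qwen3.5-moe", max_tokens=256):
    """Build a request body in the standard OpenAI chat-completions schema.

    The model name is a placeholder; flash-moe's actual model identifier
    may differ.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = build_chat_request("Summarize Mixture-of-Experts in one sentence.")
print(json.dumps(body, indent=2))

# This JSON would be POSTed to the local server, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d '<the JSON above>'
```

Because the server speaks the OpenAI wire format, existing OpenAI client libraries should work against it by pointing their base URL at localhost:8000.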

Why is it gaining traction?

It crushes baselines with 2.6x speedups on cheaper hardware, offers configurable active experts via --k, and has dead-simple setup: build artifacts from Hugging Face snapshots, then curl the /v1/chat/completions endpoint. Devs love the sustained 11.5 tok/s and 2.5s TTFT without GPU farms, plus a chat TUI for quick tests.
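To see what a flag like --k controls: in an MoE layer, a router scores every expert for each token, but only the top k experts actually run, so only their weights need to be streamed from SSD. The real engine is C/Metal; this is a minimal Python sketch of top-k gating with invented names, not flash-moe's actual routing code.

```python
import math

def top_k_experts(router_logits, k):
    """Pick the k highest-scoring experts and renormalized gate weights.

    Illustrative sketch: only the returned experts' weights would need
    to be loaded from disk for this token.
    """
    # Softmax over router logits (shifted by the max for stability).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k most probable experts.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]

    # Renormalize gate weights over the chosen experts so they sum to 1.
    denom = sum(probs[i] for i in chosen)
    gate = {i: probs[i] / denom for i in chosen}
    return chosen, gate

chosen, gate = top_k_experts([0.1, 2.0, -1.0, 1.5, 0.3], k=2)
print(chosen)  # -> [1, 3]
print(round(sum(gate.values()), 6))  # -> 1.0
```

With 5 experts and k=2, only 40% of the expert weights are touched per token; that per-token sparsity is what makes streaming experts from SSD feasible at all.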

Who should use this?

Apple Silicon owners prototyping local MoE inference on Mac minis, especially backend devs needing fast responses for tools or agents. A good fit for local agent workflows or copilot-style completions without cloud latency.

Verdict

Grab it if you're on M4/M3 and want these speeds locally -- bench.sh proves real-world gains. But 10 stars and 1.0% credibility mean it's raw: solid README/quickstart, but sparse tests. Fork and contribute to mature it.


