Anemll

Flash-MoE sidecar slot-bank runtime for large GGUF MoE models on Apple Silicon — llama.cpp fork

Found Apr 08, 2026 at 20 stars
C++
AI Summary

A fork of llama.cpp optimized for running large Mixture-of-Experts AI models on Apple Silicon by streaming experts from disk.

How It Works

1. 🔍 Discover huge AI for your Mac

You learn about a free tool that lets everyday Macs run massive models without endless memory by loading only the parts they need.

2. 📥 Download the tool and a model

Grab the program and pick a big AI model file that suits your computer's power.

3. 🔧 Prepare the model's helpers

Run a simple setup that splits the model's experts into quick-access pieces stored on your drive.

4. 🚀 Start chatting with the giant AI

Launch the tool with settings matched to your Mac and watch it respond quickly, pulling just what it needs from storage.

5. 💬 Ask questions and get answers

Type your questions and get detailed, speedy responses from models too big for normal memory.

🎉 Unlock powerful AI magic

Your Mac now handles enormous AIs, giving you expert help anytime without slowdowns.


AI-Generated Review

What is anemll-flash-llama.cpp?

This C++ fork of llama.cpp brings Flash-MoE inference to large GGUF Mixture-of-Experts (MoE) models on Apple Silicon. It solves the RAM bottleneck for models like Qwen3.5-397B by keeping the dense weights in the main GGUF file and streaming the routed experts from an SSD-resident sidecar. A slot-bank runtime caches recently used experts in host memory while the dense parts are offloaded to the GPU, enabled with flags like --moe-sidecar and --moe-slot-bank, so M-series Macs can run these models locally without loading the full model.

Why is it gaining traction?

It stands out by hitting 53 tokens/s decode on Qwen3.5-35B with 85% cache hit rates via temporal prefetch, far beyond stock llama.cpp for massive MoE models on consumer hardware. Developers dig the sidecar extraction tools, trace capture for oracle replay, and bank-sizing tables tailored to Mac RAM configs, making Flash-MoE workflows practical without custom backends.

Who should use this?

Apple Silicon owners running inference on large GGUF MoE models like Kimi-K2.5 or Qwen3.5 whose experts can't fit in RAM; ML engineers tuning Flash-MoE pipelines on 8-128 GB M1-M5 Macs; and researchers benchmarking slot-bank vs. resident modes with --moe-trace.

Verdict

Grab it if you're on Apple hardware pushing MoE boundaries (benchmarks show real speedups), but with only 20 stars and early docs, expect some tuning, such as -ub 1, for stability. Test your workflow before relying on it in production.

