Aryagm / dflash-mlx

Public

Exact speculative decoding on Apple Silicon, powered by MLX.

328 stars · 30 forks · 100% credibility
Found Apr 20, 2026 at 328 stars
AI Analysis
Python
AI Summary

This project speeds up text generation from supported AI language models on Apple Silicon Macs using DFlash, an exact speculative decoding technique, built on the MLX framework.

How It Works

1. 🔍 Discover fast AI on your Mac

You hear about a handy tool that makes small AI helpers generate text super quickly right on your Apple computer.

2. 📥 Grab and set up easily

Follow simple steps to download and prepare it, like syncing a few files on your Mac.

3. 🧠 Pick your AI brain

Choose a ready-made AI model or stick with the default one, and it grabs everything needed automatically.

4. Ask and get speedy answers

Type in a question or story starter, hit go, and see the AI respond lightning-fast with smart text.

5. 💬 Chat like a friend

Switch to chat mode to have a back-and-forth conversation that feels natural and quick.

6. 📊 Check the speed boost

Run a quick test to see just how much faster your AI is now compared to before.

🎉 Supercharged AI ready

Now you have a personal AI sidekick that thinks and replies way faster on your Mac!


AI-Generated Review

What is dflash-mlx?

dflash-mlx brings DFlash speculative decoding to Apple Silicon, implemented in Python on top of the MLX framework. It uses a small draft model to propose token blocks conditioned on the target model's hidden states, then verifies the longest exactly matching prefix in a single pass, giving bit-for-bit identical output at higher throughput. Run it from the CLI with `uv run dflash-mlx` for generation or `dflash-mlx-chat` for interactive sessions, or use the Python API with `DFlashGenerator` for custom inference.
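The verify-the-longest-exact-prefix idea can be sketched with toy stand-in models. Everything below is illustrative: `target_next` and `draft_block` are hypothetical deterministic functions, not the repo's API, and the real library conditions a trained draft model on the target's hidden states.

```python
def target_next(token_ids):
    """Stand-in for the large target model: deterministic next token."""
    return (sum(token_ids) * 31 + 7) % 50

def draft_block(token_ids, block_size=4):
    """Stand-in draft model: proposes a block of tokens, with one
    deliberate mistake injected to exercise the verification path."""
    ctx = list(token_ids)
    block = []
    for i in range(block_size):
        guess = target_next(ctx)
        if i == 2:                      # inject a mismatch at position 2
            guess = (guess + 1) % 50
        block.append(guess)
        ctx.append(guess)
    return block

def verify_longest_prefix(token_ids, proposed):
    """Accept the longest prefix of the draft block that matches the
    target model exactly; on the first mismatch, emit the target's own
    token instead, so the output stream is bit-for-bit identical to
    running the target alone."""
    accepted = []
    ctx = list(token_ids)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # correction token from the target
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted

prompt = [1, 2, 3]
accepted = verify_longest_prefix(prompt, draft_block(prompt))
print(len(accepted))  # → 3: two accepted draft tokens + one correction
```

The key property is that multiple tokens are emitted per target-model pass, yet every emitted token is exactly what the target would have produced on its own.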

Why is it gaining traction?

It delivers the fastest exact speculative decoding on Apple Silicon, with benchmarks showing 4x+ speedup over plain MLX-LM on Qwen3-4B tasks like GSM8K. Developers get plug-and-play support for Hugging Face draft models, optional quantization, and detailed metrics like acceptance rates without CUDA dependencies. The history tracking and benchmark CLIs make it dead simple to compare against llama.cpp or MLX baselines.
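The 4x+ figure is consistent with a simple expected-throughput model: with block size B and a per-token acceptance probability p, each target-model pass emits a geometric run of accepted draft tokens plus one verification token. This is an illustrative back-of-envelope calculation with assumed numbers, not the repo's benchmark methodology:

```python
def tokens_per_target_pass(block_size, acceptance_rate):
    """Expected tokens emitted per target-model forward pass, assuming
    each draft token is accepted independently with probability p."""
    expected = 0.0
    p = 1.0
    for _ in range(block_size):
        p *= acceptance_rate
        expected += p                   # probability the i-th token survives
    return expected + 1                 # +1: the verification/correction token

# Assumed values: a block of 8 draft tokens at 80% acceptance.
print(round(tokens_per_target_pass(8, 0.8), 2))  # → 4.33
```

Under those assumed numbers, one target pass yields roughly 4.3 tokens instead of 1, which is the right order of magnitude for the quoted speedup (ignoring the draft model's own, much smaller, cost).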

Who should use this?

ML engineers optimizing local LLM inference on M-series Macs for chat apps or code generation. Apple Silicon users building RAG pipelines that need precise, fast local decoding. Devs prototyping speculative-decoding setups with Qwen models before scaling.

Verdict

Grab it if you're on Apple Silicon chasing peak tok/s with exact output; 328 stars and a solid README make it worth a benchmark run. Alpha status means expect rough edges; stick to supported models until more adapters land.


