TheTom/vllm-swift

vLLM Metal plugin powered by mlx-swift — high-performance LLM inference on Apple Silicon

AI Summary

A native backend for serving large language models at high speed on Apple Silicon Macs, exposed to standard AI tooling through an OpenAI-compatible API.

How It Works

1. 🔍 Discover fast AI for Mac

You find a tool that lets you run smart AI models super quickly on your Apple computer without slowing down.

2. 📦 Easy setup

Use a simple command to add and install everything you need on your Mac.
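
A minimal sketch of the Homebrew route (the exact tap name isn't given on this page, so `thetom/vllm-swift` below is an assumption):

```sh
# Hypothetical tap name -- check the repo README for the published tap
brew tap thetom/vllm-swift
brew install vllm-swift
```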

3. ⬇️ Download a model

Pick a smart AI model and let the tool download it into a folder on your Mac for you.
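
One way to fetch a model into that folder is the stock huggingface-cli; the mlx-community 4-bit repo id below is an assumption, not something this page names:

```sh
# Download a 4-bit Qwen3 build into ~/models (repo id assumed)
huggingface-cli download mlx-community/Qwen3-4B-4bit \
  --local-dir ~/models/Qwen3-4B-4bit
```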

4. 🚀 Launch your AI helper

Start the server with one command, and your local AI comes alive, ready for action.
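
This is the serve command quoted in the review below:

```sh
# Starts an OpenAI-compatible server at http://localhost:8000
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096
```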

5. 💬 Chat and create

Connect your favorite apps or tools, ask questions, and get speedy smart replies.
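
A minimal request sketch, assuming the server exposes the standard OpenAI /v1/chat/completions route this page describes:

```sh
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-4B-4bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```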

🎉 Blazing AI on your Mac

Enjoy a powerful personal AI assistant running smoothly at home, faster than ever.

AI-Generated Review

What is vllm-swift?

vllm-swift is a vLLM plugin that enables high-performance LLM serving on Apple Silicon Macs via a native Swift and Metal backend powered by mlx-swift. It keeps Python out of the inference hot path while staying drop-in compatible with vLLM's OpenAI API, letting you serve models like Qwen3-4B-4bit with commands like `vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096`. Install it via a Homebrew tap or from source, let it auto-download models from Hugging Face, and get streaming chat completions at http://localhost:8000.
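
Since streaming chat completions are called out, here is a hedged streaming variant, assuming the standard OpenAI `stream` parameter:

```sh
# -N turns off curl's buffering so tokens print as they stream in
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-4B-4bit", "stream": true,
       "messages": [{"role": "user", "content": "Count to five."}]}'
```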

Why is it gaining traction?

It crushes Metal serving benchmarks on an M5 Max, hitting 340 tok/s on single requests with Qwen3-0.6B versus 142 tok/s for Python/MLX alternatives, and scaling to 3.4k tok/s at 64 concurrent requests. TurboQuant+ KV-cache compression squeezes out 3-5x longer contexts with minimal perplexity hit, and the native Metal path avoids the macOS issues that pile up in the upstream vLLM tracker. Native Swift deployment beats the usual Docker workarounds, with tool calling for Hermes agents and reasoning parsers out of the box.
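
A rough sketch for approximating the concurrency figure, not the project's actual benchmark harness; the model name and prompt are placeholders:

```sh
# Fire 64 parallel requests and wait for all of them
for i in $(seq 64); do
  curl -s -o /dev/null http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen3-0.6B", "messages": [{"role": "user", "content": "ping"}]}' &
done
wait
```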

Who should use this?

Apple Silicon Mac devs running local Qwen or Qwen3.5 inference for agent testing with Hermes or OpenCode. It's ideal for prototypes where Qwen serving speed matters more than broad model support. Skip it if you need LoRA or non-Qwen architectures; it's tuned for Swift-native inference on unified memory.

Verdict

A promising Metal-native take on vLLM for Mac throughput wins, but 35 stars signal early days. Docs are solid and tests hit 97% coverage, yet limitations like missing LoRA support persist. Try it from the GitHub releases for your own deployment; production use should wait on maturity.


