AEON-7 / vllm-dflash

DFlash vLLM for DGX Spark — Plug & Play Block-Diffusion Speculative Decoding

AI Summary

This repository offers a ready-to-run container for serving a fast, uncensored 27B model with image understanding on NVIDIA DGX Spark hardware, using block-diffusion speculative decoding to accelerate inference.

How It Works

1
🔍 Discover fast AI for DGX Spark

DFlash is a drop-in way to make your DGX Spark give lightning-quick answers to text questions and descriptions of images.

2
📥 Download the AI brain

Grab the pre-optimized, uncensored model weights from the hosting site and save them to local storage.

3
📝 Jot down your settings

Write a short config file with the model's local path and a private API key so the server stays secure (a sketch of this file, plus the launch command, follows this list).

4
🚀 Launch your AI helper

Run the single start command and the container brings your personal OpenAI-compatible AI server to life.

5
💬 Chat and share images

Send questions or pictures to the chat endpoint and get smart replies right away (a request sketch follows this list).

6

🎉 Enjoy the speedup

Responses arrive 2-3 times faster than baseline, turning slow waits into smooth, fluid conversations.
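
Steps 3 and 4 amount to a tiny settings file and one command. A minimal sketch, assuming the container reads the model location and key from a .env file: MODEL_PATH is an illustrative name (check the repo's README for the real one), while VLLM_API_KEY is vLLM's standard variable for requiring an API key.

  # .env -- illustrative variable names, not confirmed against the repo
  MODEL_PATH=/models/qwen3.5-27b-nvfp4   # local path to the downloaded weights
  VLLM_API_KEY=change-me                 # private key clients must present

  # Step 4: bring the server up (standard docker compose usage)
  docker compose up -d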
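For step 5, requests use the standard OpenAI chat-completions format that vLLM serves; the model name, key, and image URL below are placeholders.

  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $VLLM_API_KEY" \
    -d '{
      "model": "qwen3.5-27b",
      "messages": [{"role": "user", "content": [
        {"type": "text", "text": "Describe this picture."},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}}
      ]}]
    }'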

AI-Generated Review

What is vllm-dflash?

vllm-dflash is a plug-and-play Docker setup that runs vLLM with DFlash block-diffusion speculative decoding on DGX Spark to turbocharge inference. Written in shell with docker-compose files, it deploys OpenAI-compatible vLLM servers for models like Qwen3.5-27B in NVFP4 or BF16, handling text and vision prompts out of the box. You pull the image, mount your model, set env vars like DFLASH_DRAFTER, and curl localhost:8000/v1/chat/completions for 2-2.7x throughput gains over baseline.
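
The review names one environment variable, DFLASH_DRAFTER. A sketch of the flow it describes, where the drafter value and mount path are assumptions and only the endpoint routes are standard vLLM:

  export DFLASH_DRAFTER=auto                # named in the repo; "auto" is a guessed value
  docker compose up -d                      # serves an OpenAI-compatible API on port 8000
  curl -s http://localhost:8000/v1/models   # standard vLLM route; confirms the served model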

Why is it gaining traction?

It crushes the DGX Spark's memory bandwidth bottleneck, jumping single-stream speeds from 12 to 33 tok/s with 15 speculative tokens, while scaling to 92 total tok/s at 8 concurrent streams. Devs love the zero-config entrypoint that auto-downloads drafters, tunes for Blackwell GPUs, and supports any vLLM model; you can switch DFlash off for plain inference (a toggle sketch follows). Low TTFT (under 140 ms) and 64K context make responsive apps feasible without custom hacks.
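
The "switch it off" toggle presumably maps to the container's env vars. A purely hypothetical sketch, since the listing names only DFLASH_DRAFTER and none of the variables below are confirmed:

  # Hypothetical knobs -- illustrations only, not documented flags
  DFLASH_ENABLE=0             # assumption: fall back to plain vLLM inference, no speculation
  DFLASH_NUM_SPEC_TOKENS=15   # assumption: speculative tokens per step (the 33 tok/s figure used 15)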

Who should use this?

AI engineers deploying local inference on DGX Spark for agentic workloads or RAG pipelines. Teams building OpenAI-compatible backends with vision models who need high single-stream throughput. Devs prototyping Qwen-scale LLMs without MoE overhead, especially on Blackwell hardware.

Verdict

Grab it if you're on DGX Spark: the docs are thorough, setup takes about five minutes, and the performance numbers deliver. With only 10 stars it's early, but it is well proven for its niche; skip it for production until you've run your own benchmarks.
