albond / DGX_Spark_Qwen3.5-122B-A10B-AR-INT4

Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s (+80%)

AI Summary

Scripts and tweaks to speed up a massive AI language model on NVIDIA DGX Spark hardware, from 28.3 to 51 tokens per second.

How It Works

1
🔍 Discover faster AI

You hear about a way to make a huge AI model run much faster on your special NVIDIA computer.

2
📥 Grab the starter kit

Download the simple files that make everything work.

3
🧠 Download the AI brain

Get the smart model files so your AI can think.

4
🛠️ Run the magic setup

Click one button to prepare everything automatically with progress updates.

5
✅ Everything ready!

Your setup finishes, now supercharged for lightning-fast responses.

6
▶️ Start your AI helper

Launch it and connect to chat right away.

7
🎉 Blazing-fast chats

Enjoy responses at 51 tokens per second -- 80% faster than before!
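For the curious, here's a minimal sketch of step 6 in Python, assuming the install script leaves you with vLLM's standard OpenAI-compatible server on localhost port 8000; the port and served model name here are guesses, not values from the repo:

```python
# Minimal chat client for a locally served vLLM model.
# Assumes vLLM's OpenAI-compatible server on localhost:8000;
# the model name below is an assumption, not from the repo.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3.5-122B-A10B",  # assumed served model name
    messages=[{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    stream=True,
)

# Print tokens as they arrive, so you can watch the throughput live.
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```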

AI-Generated Review

What is DGX_Spark_Qwen3.5-122B-A10B-AR-INT4?

This Python project delivers optimized Docker images for running the Qwen3.5-122B-A10B INT4 model on a single NVIDIA DGX Spark, boosting inference from 28.3 to 51 tok/s (+80%) with 256K context support. It solves the bandwidth bottleneck on DGX Spark's GB10 GPU by combining hybrid INT4+FP8 weights, MTP-2 speculative decoding, and an INT8 LM head, all without quality loss. Users get a one-command install script that downloads, patches, builds, and launches a production-ready vLLM server.
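The headline numbers are decode throughput. A rough way to sanity-check them yourself, reusing the assumed endpoint and model name from the sketch above (wall-clock time includes prefill, so a short prompt with a long output approximates pure decode speed):

```python
# Rough decode-throughput check: time one generation and divide
# completion tokens by elapsed time. Endpoint/model are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Qwen3.5-122B-A10B",  # assumed served model name
    messages=[{"role": "user", "content": "Write a short story about a GPU."}],
    max_tokens=1024,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} completion tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```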

Why is it gaining traction?

It crushes baselines like plain vLLM AutoRound INT4 (28.3 tok/s) or llama.cpp Q5_K (23 tok/s) on DGX Spark, hitting 51 tok/s for Qwen3.5-122B-A10B-NVFP4 setups without multi-node hassle. Optional TurboQuant KV cache gives 4x capacity (1.4M tokens) for 5 concurrent 256K users at 39 tok/s. The automated `./install.sh` handles SM121 compilation quirks, making Blackwell-ready serving dead simple.
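The 4x capacity claim is just the arithmetic of shrinking the KV cache from 16-bit to 4-bit precision; a back-of-envelope check, with the baseline figure assumed rather than taken from the repo:

```python
# Back-of-envelope check on the claimed KV-cache numbers.
# The baseline figure is an illustrative assumption, not a repo value.
fp16_kv_budget = 350_000                       # assumed token capacity at 16-bit KV
int4_kv_budget = fp16_kv_budget * (16 // 4)    # 4-bit KV cache -> 4x capacity
print(f"{int4_kv_budget:,} tokens")            # 1,400,000 -- the 1.4M figure

# Five concurrent users at the full 256K (262,144-token) context just fit:
print(5 * 262_144, 5 * 262_144 <= int4_kv_budget)  # 1310720 True
```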

Who should use this?

AI engineers with DGX Spark deploying Qwen3.5-122B-A10B for single-user reasoning, coding, or agent tasks needing 50+ tok/s and long context. Ideal for low-concurrency prototypes where you want max throughput from 128GB unified memory without Ray clusters or NVFP4 slowdowns.

Verdict

Grab it if you own a DGX Spark -- 51 tok/s is a real win for this 122B MoE, and the script just works. Its low 1.0% credibility score (84 stars) means it's niche and single-node only; test thoroughly before production, but the docs and benchmarks are solid for evaluation.
