albond / DGX_Spark_Qwen3.5-122B-A10B-AR-INT4

Qwen3.5-122B-A10B on DGX Spark: 28.3 → 51 tok/s (+80%)

AI Summary

Scripts and tweaks to speed up a massive AI language model on NVIDIA DGX Spark hardware, from 28.3 to 51 tokens per second.

How It Works

1
🔍 Discover faster AI

You hear about a way to make a huge AI model run much faster on your special NVIDIA computer.

2
📥 Grab the starter kit

Download the simple files that make everything work.

3
🧠 Download the AI brain

Get the smart model files so your AI can think.

4
🛠️ Run the magic setup

Click one button to prepare everything automatically with progress updates.

5
✅ Everything ready!

Your setup finishes, now supercharged for lightning-fast responses.

6
▶️ Start your AI helper

Launch it and connect to chat right away.

7
🎉 Blazing-fast chats

Enjoy responses at 51 tokens per second -- 80% faster than before!
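For the curious, here's a minimal sketch of step 6 in Python, assuming the install script leaves you with vLLM's standard OpenAI-compatible server on localhost port 8000; the port and served model name here are guesses, not values from the repo:

```python
# Minimal chat client for a locally served vLLM model.
# Assumes vLLM's OpenAI-compatible server on localhost:8000;
# the model name below is an assumption, not from the repo.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3.5-122B-A10B",  # assumed served model name
    messages=[{"role": "user", "content": "Explain speculative decoding in one paragraph."}],
    stream=True,
)

# Print tokens as they arrive, so you can watch the throughput live.
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```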

AI-Generated Review

What is DGX_Spark_Qwen3.5-122B-A10B-AR-INT4?

This Python project delivers optimized Docker images for running the Qwen3.5-122B-A10B INT4 model on a single NVIDIA DGX Spark, boosting inference from 28.3 to 51 tok/s (+80%) with 256K context support. It solves the bandwidth bottleneck on DGX Spark's GB10 GPU by combining hybrid INT4+FP8 weights, MTP-2 speculative decoding, and an INT8 LM head, all without quality loss. Users get a one-command install script that downloads, patches, builds, and launches a production-ready vLLM server.
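The headline numbers are decode throughput. A rough way to sanity-check them yourself, reusing the assumed endpoint and model name from the sketch above (wall-clock time includes prefill, so a short prompt with a long output approximates pure decode speed):

```python
# Rough decode-throughput check: time one generation and divide
# completion tokens by elapsed time. Endpoint/model are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Qwen3.5-122B-A10B",  # assumed served model name
    messages=[{"role": "user", "content": "Write a short story about a GPU."}],
    max_tokens=1024,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} completion tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```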

Why is it gaining traction?

It crushes baselines like plain vLLM AutoRound INT4 (28.3 tok/s) or llama.cpp Q5_K (23 tok/s) on DGX Spark, hitting 51 tok/s for Qwen3.5-122B-A10B-NVFP4 setups without multi-node hassle. Optional TurboQuant KV cache gives 4x capacity (1.4M tokens) for 5 concurrent 256K users at 39 tok/s. The automated `./install.sh` handles SM121 compilation quirks, making Blackwell-ready serving dead simple.
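The 4x capacity claim is just the arithmetic of shrinking the KV cache from 16-bit to 4-bit precision; a back-of-envelope check, with the baseline figure assumed rather than taken from the repo:

```python
# Back-of-envelope check on the claimed KV-cache numbers.
# The baseline figure is an illustrative assumption, not a repo value.
fp16_kv_budget = 350_000                       # assumed token capacity at 16-bit KV
int4_kv_budget = fp16_kv_budget * (16 // 4)    # 4-bit KV cache -> 4x capacity
print(f"{int4_kv_budget:,} tokens")            # 1,400,000 -- the 1.4M figure

# Five concurrent users at the full 256K (262,144-token) context just fit:
print(5 * 262_144, 5 * 262_144 <= int4_kv_budget)  # 1310720 True
```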

Who should use this?

AI engineers with DGX Spark deploying Qwen3.5-122B-A10B for single-user reasoning, coding, or agent tasks needing 50+ tok/s and long context. Ideal for low-concurrency prototypes where you want max throughput from 128GB unified memory without Ray clusters or NVFP4 slowdowns.

Verdict

Grab it if you own a DGX Spark -- 51 tok/s is a real win for this 122B MoE, and the script just works. Its low 1.0% credibility score (84 stars) means it's niche and single-node only; test thoroughly before production, but the docs and benchmarks are solid for evaluation.
