RobTand / prismaquant

Public

Mixed-precision quantization for LLMs. Every layer refracts into a different format based on its sensitivity. Native compressed-tensors export, validated on Qwen3.6-35B-A3B MoE with MTP speculative decoding.

100% credibility
Found Apr 22, 2026 at 18 stars.
Python
AI Summary

PrismaQuant shrinks large AI models by smartly choosing precision levels for each layer based on sensitivity, creating smaller files that run efficiently on standard tools.

How It Works

1
🔍 Discover PrismaQuant

You hear about a smart way to shrink huge AI models so they fit on everyday computers without losing smarts.

2
📥 Grab your AI model

Download the large language model you want to make smaller and point the tool to its folder.

3
⚗️ Analyze sensitivities

The tool studies your model to see which parts are extra important and need more detail.

4
⚙️ Choose your size goal

Pick how small you want the model: less space means room for longer chats or more models at once.

5
Create the slim version

It mixes precision formats per layer, keeping quality high while cutting size by up to 70%.

6
🚀 Run supercharged AI

Load your new lightweight model and enjoy faster responses, bigger contexts, or multiple AIs sharing your hardware.
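The workflow above can be sketched as a toy script: score each layer's sensitivity, then widen the most sensitive layers to bigger formats while an average bits-per-parameter budget holds. Everything here (layer names, scores, the greedy rule) is illustrative, not PrismaQuant's actual API:

```python
# Toy sketch of the workflow above: score each layer's sensitivity, then
# assign a precision format per layer under an average-bits budget.
# Layer names, scores, and the greedy rule are illustrative only.

# Hypothetical per-layer sensitivity scores (higher = quality degrades
# faster when quantized aggressively).
sensitivities = {
    "embed":   0.90,
    "attn.0":  0.40,
    "mlp.0":   0.10,
    "attn.1":  0.35,
    "mlp.1":   0.08,
    "lm_head": 0.85,
}

# Candidate formats and their bits per parameter.
FORMATS = {"NVFP4": 4, "MXFP8": 8, "BF16": 16}

def assign_formats(sens, budget_bpp):
    """Start every layer at the cheapest format, then widen layers
    (most sensitive first) while the average stays under budget."""
    plan = {name: "NVFP4" for name in sens}
    order = sorted(sens, key=sens.get, reverse=True)  # most sensitive first
    for fmt in ("MXFP8", "BF16"):
        for name in order:
            trial = dict(plan, **{name: fmt})
            avg = sum(FORMATS[f] for f in trial.values()) / len(trial)
            if avg <= budget_bpp:
                plan = trial
    return plan

plan = assign_formats(sensitivities, budget_bpp=6.0)
for name in sorted(plan):
    print(f"{name}: {plan[name]}")
```

With this budget the sensitive embedding, output head, and first attention block end up wider than the insensitive MLP layers, which is the core idea behind the per-layer "refraction" in the repo description.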

AI-Generated Review

What is prismaquant?

PrismaQuant automates mixed-precision quantization for LLMs, assigning formats like NVFP4, MXFP8, or BF16 to each layer based on sensitivity measured from calibration data. It tackles the DRAM waste of naive PTQ by optimizing bit allocation with a knapsack solver, and it outputs native compressed-tensors checkpoints for vLLM serving, with no custom kernels or patches needed. Running a simple pipeline script on models like Qwen3.6-35B yields 69% smaller artifacts with coherent generation and MTP speculative decoding.
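The knapsack-style bit allocation mentioned here can be sketched as a small multiple-choice knapsack DP: pick exactly one format per layer to minimize total estimated error under a total bit budget. The layer set and error numbers below are made up for illustration; the repo's actual solver is presumably more elaborate:

```python
# Sketch of knapsack-style bit allocation (multiple-choice knapsack):
# choose one format per layer to minimize total quantization error
# under a total bit budget. Error values are illustrative only.

# (format, bits, estimated error) options per layer.
layers = [
    [("NVFP4", 4, 0.50), ("MXFP8", 8, 0.10), ("BF16", 16, 0.01)],  # attn
    [("NVFP4", 4, 0.05), ("MXFP8", 8, 0.02), ("BF16", 16, 0.00)],  # mlp
    [("NVFP4", 4, 0.80), ("MXFP8", 8, 0.20), ("BF16", 16, 0.02)],  # lm_head
]

def allocate(layers, bit_budget):
    """DP over layers: map bits-used -> (min error, format plan)."""
    best = {0: (0.0, [])}
    for options in layers:
        nxt = {}
        for used, (err, plan) in best.items():
            for fmt, bits, e in options:
                b = used + bits
                if b > bit_budget:
                    continue  # over budget, prune
                cand = (err + e, plan + [fmt])
                if b not in nxt or cand[0] < nxt[b][0]:
                    nxt[b] = cand
        best = nxt
    return min(best.values(), key=lambda t: t[0])

err, plan = allocate(layers, bit_budget=28)
print(plan, round(err, 2))
```

An exact DP is feasible here because each layer has only a handful of format choices; the solution naturally spends the budget where error drops fastest, giving a mixed plan rather than a uniform one.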

Why is it gaining traction?

Unlike uniform-precision tools or heuristic skip-lists, it uses per-layer Fisher traces and RTN errors to make precise per-layer precision decisions, beating baselines like llm-compressor's NVFP4 by 3pp on ARC while using 2GB less disk. Native vLLM support for MoE packed experts and these formats makes deployment drop-in, and Pareto sweeps help tune budgets like 4.75 bits per parameter so bits go where they matter most.
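RTN (round-to-nearest) error, one of the per-layer sensitivity signals mentioned, is cheap to measure. A minimal sketch with symmetric per-tensor scaling (the weights here are illustrative, not from any real model):

```python
# Minimal round-to-nearest (RTN) quantization error, the kind of cheap
# per-layer signal the review mentions. Symmetric per-tensor scaling.

def rtn_error(weights, bits):
    """Mean squared error after symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / qmax
    def q(w):
        # quantize, clamp to the signed integer range, dequantize
        return max(-qmax - 1, min(qmax, round(w / scale))) * scale
    return sum((w - q(w)) ** 2 for w in weights) / len(weights)

weights = [0.02, -0.15, 0.07, 0.31, -0.28, 0.001]
for bits in (4, 8):
    print(f"{bits}-bit RTN MSE: {rtn_error(weights, bits):.6f}")
```

Layers where this error is large relative to others are the ones a sensitivity-driven allocator would keep at higher precision.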

Who should use this?

LLM serving engineers on Blackwell GPUs fitting MoE models into 24-128GB of DRAM while maximizing KV cache for longer contexts or larger batches. It is a good fit for teams targeting compressed-tensors export with vLLM, especially Qwen/Mixtral workflows needing MTP decoding.
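A back-of-the-envelope check of the fit described here, using only numbers quoted on this page (roughly 35B parameters, a 4.75 bits-per-parameter budget, a 24GB card) and ignoring activation and framework overhead:

```python
# Back-of-the-envelope DRAM fit: ~35B parameters quantized to an
# average 4.75 bits per parameter on a 24 GB GPU. Overheads ignored.
params = 35e9
bpp = 4.75
gpu_gb = 24

weights_gb = params * bpp / 8 / 1e9        # bits -> bytes -> GB
kv_headroom_gb = gpu_gb - weights_gb
print(f"weights: {weights_gb:.1f} GB, "
      f"headroom for KV cache: {kv_headroom_gb:.1f} GB")
```

At BF16 the same model would need roughly 70 GB, so the quantized budget is what makes a single 24GB card viable at all, with the leftover few GB going to KV cache.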

Verdict

Strong pick for sensitivity-driven mixed-precision quantization on vLLM, with solid validation and a quickstart, but 18 stars and a 1.0% credibility score signal early alpha status. Test it on your own models; the docs and benchmarks make evaluation fast.

