BenChaliah / AdaLLM

AdaLLM is an NVFP4-first inference runtime for Ada Lovelace GPUs (RTX 4090) with an FP8 KV cache and custom decode kernels. The repo targets NVFP4 weights and keeps the entire decode path in FP8.

Found Feb 17, 2026 at 91 stars.
AI Analysis
Python
AI Summary

AdaLLM is a specialized tool for running highly compressed AI language models on consumer NVIDIA GPUs like the RTX 4090, slashing memory needs while delivering solid performance through custom optimizations.

How It Works

1
🔍 Discover AdaLLM

You hear about a handy tool that lets you run powerful AI models on your high-end graphics card like the RTX 4090, using way less memory while keeping things speedy.

2
📥 Get it set up

With one simple command, you download and prepare everything on your computer—no complicated steps needed.

3
🚀 Pick and launch a model

Choose a ready-made model from a trusted source, tell the tool to start it, and watch it load smoothly.

4
💬 Start chatting or creating

Type a question, story starter, or command, and get clever responses right away, just like talking to a smart friend.

5
Choose your way to use it
🌐
Web server mode

Set up a private endpoint so other programs can connect and use your AI anytime.

⌨️
Direct chat

Jump straight into asking questions and seeing answers stream live in your terminal.
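Because server mode exposes an OpenAI-compatible endpoint, either mode can also be driven programmatically. A minimal client sketch using only the standard library; the host, port, and endpoint path below are assumptions, not documented settings:

```python
import json
import urllib.request

# Assumed defaults: adjust to match your `adallm serve` configuration.
BASE_URL = "http://localhost:8000/v1"
MODEL = "nvidia/Qwen3-8B-NVFP4"

def build_chat_request(prompt: str, model: str = MODEL) -> dict:
    """Build an OpenAI-compatible /chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask(prompt: str) -> str:
    """POST the prompt to the local server; requires `adallm serve` to be running."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any off-the-shelf OpenAI client should work the same way by pointing its base URL at the local server.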

Enjoy fast, memory-smart AI

Your AI runs blazing fast with tiny memory use, powering creative writing, coding help, or conversations at home.
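The steps above boil down to a short terminal session. The serve/run commands come from the project's own quickstart; the install URL is inferred from the owner and project names and may differ, so check the repo's README:

```shell
# Install (URL assumed from the owner/project names)
pip install git+https://github.com/BenChaliah/AdaLLM.git

# Serve an NVFP4 model behind an OpenAI-compatible endpoint
adallm serve nvidia/Qwen3-8B-NVFP4

# Or chat directly in the terminal
adallm run "Write a short poem about FP8"
```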

Star Growth

The repo grew from 91 stars at discovery to 96.
AI-Generated Review

What is NVFP4-on-4090-vLLM?

AdaLLM delivers NVFP4-first inference on Ada Lovelace GPUs like the RTX 4090, using an FP8 KV cache and custom decode kernels that keep the entire decode path in FP8. Built in Python, it runs NVFP4-quantized models such as Qwen3 and Gemma3 with an OpenAI-compatible server via simple CLI commands like `adallm serve nvidia/Qwen3-8B-NVFP4` or `adallm run "prompt"`. Users get massive VRAM savings (FP16 baselines reportedly need ~240% more VRAM, i.e. roughly 3.4x as much, on a single 4090) at the cost of 20-25% lower throughput.

Why is it gaining traction?

It stands out by unlocking NVFP4 weights on consumer 4090 hardware without FP16 fallbacks, pairing FP8 cache with custom kernels for coherent, efficient decode on long contexts. Developers notice the tiny VRAM footprint (e.g., 7.5GB peak for Qwen3-8B at batch=16) and easy setup—no complex configs, just pip install from GitHub and run. MoE support for Qwen3 variants adds flexibility, even if unoptimized.
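That footprint is plausible from first principles. A rough, illustrative estimate of weight memory alone; the parameter count and effective bits-per-weight (including block scales) are assumptions, not project numbers:

```python
# Back-of-envelope weight-memory estimate; illustrative, not measured.
GB = 1024 ** 3

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GB

fp16  = weight_gb(8e9, 16)   # ~14.9 GiB: FP16 weights for an 8B model
nvfp4 = weight_gb(8e9, 4.5)  # ~4.2 GiB: NVFP4 at ~4.5 bits/weight with scales

print(round(fp16, 1), round(nvfp4, 1))  # KV cache and activations come on top
```

Roughly 4GB of NVFP4 weights plus an FP8 KV cache and activations is consistent with the cited 7.5GB peak at batch=16.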

Who should use this?

AI engineers deploying Qwen3 or Gemma3 NVFP4 models on RTX 4090 setups for local inference servers. Ideal for researchers batching prompts in low-VRAM environments or hobbyists running 27B models without multi-GPU hassle. Skip it if you need peak speed or model support beyond the tested ones.

Verdict

Try it for 4090 FP8 inference if VRAM is your bottleneck: the benchmarks deliver real wins, though the modest star count signals early maturity. Docs are solid with quickstarts and reproducible benchmarks, but expect tweaks for production MoE.


