BenChaliah / NVFP4-on-4090-vLLM
AdaLLM is an NVFP4-first inference runtime for Ada Lovelace GPUs (RTX 4090) with an FP8 KV cache and custom decode kernels. This repo targets NVFP4 weights and keeps the entire decode path in FP8.
AdaLLM runs heavily quantized language models on consumer NVIDIA GPUs such as the RTX 4090, cutting memory requirements substantially while maintaining solid throughput through custom kernels.
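To see where the memory savings come from, here is a minimal sketch of NVFP4-style block quantization: 4-bit E2M1 values sharing one scale per 16-element block (in the real format the scale is stored in FP8 E4M3). All names and the rounding scheme below are illustrative, not AdaLLM's actual code.

```python
# Illustrative sketch of NVFP4-style block quantization (not AdaLLM's API).
# NVFP4 stores weights as 4-bit E2M1 values plus one FP8 scale per
# 16-element block, giving roughly 4x savings versus FP16 weights.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative FP4 magnitudes
BLOCK = 16  # NVFP4 micro-block size

def quantize_block(block):
    """Quantize one block: choose a scale so max |w| maps to 6.0 (FP4 max)."""
    amax = max(abs(w) for w in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    q = []
    for w in block:
        # round-to-nearest onto the E2M1 grid, preserving sign
        mag = min(E2M1_GRID, key=lambda g: abs(abs(w) / scale - g))
        q.append(mag if w >= 0 else -mag)
    return scale, q  # in NVFP4 the scale itself is stored as FP8 (E4M3)

def dequantize_block(scale, q):
    return [scale * v for v in q]

weights = [0.01 * i - 0.08 for i in range(BLOCK)]
scale, q = quantize_block(weights)
recon = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, recon))
```

Each 16-weight block thus costs 16 nibbles plus one scale byte instead of 32 FP16 bytes, which is the memory cut the description above refers to.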
How It Works
AdaLLM runs large language models on a single consumer GPU like the RTX 4090 with a much smaller memory footprint than full-precision inference.
Install with a single command; everything is downloaded and set up locally.
Point the tool at a pre-quantized NVFP4 model from a model hub and load it.
Send a prompt and receive completions for chat, writing, or coding tasks.
Serve the model as a local endpoint so other programs can connect to it.
Or chat interactively in the terminal, with responses streaming live.
Decode stays fast and lean, making local creative writing, coding help, and conversation practical at home.
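The "tiny memory use" claim above is easiest to check with back-of-the-envelope KV-cache arithmetic: an FP8 KV cache halves cache memory versus FP16. The model dimensions below are illustrative (roughly a 7B-class model), not tied to any specific AdaLLM configuration.

```python
# Back-of-the-envelope KV-cache sizing, showing why an FP8 KV cache
# halves cache memory relative to FP16. Dimensions are illustrative only.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # 2x accounts for the separate key and value tensors
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

LAYERS, KV_HEADS, HEAD_DIM, SEQ = 32, 8, 128, 4096  # assumed 7B-class dims

fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ, 2)  # 2 bytes/elem
fp8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ, 1)   # 1 byte/elem

print(f"FP16 KV cache: {fp16 / 2**20:.0f} MiB")
print(f"FP8  KV cache: {fp8 / 2**20:.0f} MiB")
```

On these assumed dimensions a 4096-token context costs 512 MiB of KV cache in FP16 but 256 MiB in FP8, which is what leaves room for long contexts on a 24 GB card.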