1CatAI / 1Cat-vLLM

Public

vLLM fork for Tesla V100 (SM70) with AWQ 4-bit support, CUDA 12.8 build flow, and validated Qwen3.5 27B/35B deployment on multi-GPU V100.

Language: Python

AI Summary

A specialized vLLM fork enabling AWQ 4-bit quantized large language model inference on Tesla V100 GPUs.

How It Works

1. 🔍 Discover a way to revive old computers

You have powerful but outdated Tesla V100 computers and want to run the latest AI language models on them.

2. ⚙️ Prepare your setup

Follow the repo's install guide to get the right software environment ready on your machine: a CUDA 12.8 toolkit and a matching PyTorch build.
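
Before you build anything, it is worth confirming that the GPUs and the toolchain match what the fork targets. Here is a minimal sketch in Python, assuming a standard PyTorch build against CUDA 12.8; the exact versions the fork pins may differ, so treat it as a sanity check rather than the official procedure.

    # check_env.py - sanity-check the toolchain before building the fork
    import torch

    print("PyTorch:", torch.__version__)
    print("CUDA runtime PyTorch was built against:", torch.version.cuda)  # expect 12.8 for this build flow
    assert torch.cuda.is_available(), "No CUDA device visible"

    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}, compute capability {major}.{minor}")
        if (major, minor) != (7, 0):
            print(f"  warning: GPU {i} is not SM70 (Volta V100), which is what this fork targets")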

3. 🚀 Build your AI engine

Compile the special version that unlocks modern models for your older hardware with a few commands.
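
What the build might look like in practice: a hedged sketch assuming the fork installs like upstream vLLM (an editable pip install from a checkout) and honors the standard TORCH_CUDA_ARCH_LIST and MAX_JOBS variables; the checkout path is hypothetical, and the repo's own install guide is the authoritative source for the real commands.

    # build_fork.py - drive a source build of the fork (illustrative only)
    import os
    import subprocess
    import sys

    env = os.environ.copy()
    # Compile CUDA kernels only for Volta (SM70); the V100 is compute capability 7.0.
    env["TORCH_CUDA_ARCH_LIST"] = "7.0"
    # Cap parallel compile jobs so the build does not exhaust RAM on smaller hosts.
    env.setdefault("MAX_JOBS", "8")

    # Editable pip install from a local checkout of the fork (the path is hypothetical).
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-e", "."],
        cwd="/path/to/1Cat-vLLM",
        env=env,
        check=True,
    )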

4. Test that it works

Run a quick check to confirm everything is set up correctly.
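
One way to run that quick check is a tiny offline generation through the Python API. A minimal sketch, assuming the fork keeps upstream vLLM's LLM and SamplingParams interface; the model name is a placeholder for any small AWQ checkpoint you have access to.

    # smoke_test.py - confirm AWQ 4-bit inference works on a single V100
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-0.5B-Instruct-AWQ",  # placeholder small AWQ checkpoint; swap in your own
        quantization="awq",
        dtype="float16",       # V100 has no bfloat16 support, so stick to fp16
        max_model_len=2048,    # keep the KV cache small for a quick check
    )

    outputs = llm.generate(
        ["Say hello in one short sentence."],
        SamplingParams(max_tokens=32, temperature=0.0),
    )
    print(outputs[0].outputs[0].text)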

5. 🌐 Start your AI server

Launch the server on your V100 machines and connect them together if needed.
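
A sketch of launching the OpenAI-compatible server from Python, assuming the fork keeps upstream vLLM's vllm.entrypoints.openai.api_server entry point and its usual flags; the checkpoint name and tensor-parallel degree are placeholders, not the repo's validated settings.

    # launch_server.py - start the OpenAI-compatible server on a 4x V100 box (illustrative)
    import subprocess
    import sys

    subprocess.run(
        [
            sys.executable, "-m", "vllm.entrypoints.openai.api_server",
            "--model", "Qwen/Qwen2.5-32B-Instruct-AWQ",  # placeholder AWQ checkpoint
            "--quantization", "awq",
            "--dtype", "float16",            # required on V100; SM70 has no bfloat16
            "--tensor-parallel-size", "4",   # shard the model across four V100s
            "--max-model-len", "8192",
            "--host", "0.0.0.0",
            "--port", "8000",
        ],
        check=True,
    )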

6. 💬 Chat with AI models

Send questions to modern quantized models like Qwen and get fast responses.
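
With the server up, any OpenAI-style client can talk to it. A minimal sketch using the official openai Python package; the base URL, port, and model name are assumptions carried over from the server sketch above.

    # chat_client.py - query the locally served AWQ model through the OpenAI-compatible API
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # the vLLM server started earlier
        api_key="not-needed",                 # vLLM ignores the key unless --api-key is set
    )

    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # must match the --model passed to the server
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "Summarize what AWQ 4-bit quantization does."},
        ],
        max_tokens=128,
    )
    print(response.choices[0].message.content)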

🎉 Old hardware powers new AI

Your V100 computers now run cutting-edge language models efficiently, saving money and extending their life.

Star Growth

Found on Mar 10, 2026 at 17 stars, the repo has since grown to 60.

AI-Generated Review

What is 1Cat-vLLM?

1Cat-vLLM is a specialized vLLM fork that unlocks AWQ 4-bit quantized inference on Tesla V100 GPUs, which upstream vLLM skips due to SM70 limitations. Built in Python with CUDA 12.8 and PyTorch integration, it lets you deploy large models like Qwen3.5 27B/35B across multi-GPU V100 clusters using an OpenAI-compatible API server. Users get fast text and vision serving on legacy hardware, complete with torch.compile and CUDA graph support for steady-state performance after warmup.

Why is it gaining traction?

It directly tackles long-standing vLLM issues around older Volta GPUs, unlike the Intel and ROCm vLLM forks, which target different architectures. Developers appreciate the fork's simplicity, plus its validated Docker examples, install guides, and requirements files for reproducible builds. The 4-bit AWQ path delivers solid throughput (e.g., 50+ tokens/s decode on 4x V100), making it a quick win for squeezing value from existing clusters.

Who should use this?

Ops engineers with V100 fleets running inference workloads on 27B/35B Qwen models. It's a good fit for research teams or cost-focused startups deploying AWQ-quantized LLMs in multi-GPU setups where upgrading to SM75+ hardware isn't feasible, especially if they need stable releases and working examples for OpenAI-compatible endpoints.

Verdict

Grab this if you've got V100s and want 4-bit AWQ without hardware swaps; follow the repo's documented CUDA 12.8 build flow. It's an early-stage fork with strong validation but light testing, so pair it with upstream vLLM for production confidence. Worth a test deploy on your cluster.
