
High-efficiency LLM inference engine in C++/CUDA. Run Llama 70B on RTX 3090.

424 stars
100% credibility
Found Feb 22, 2026 at 279 stars.
AI Analysis (C++)

AI Summary

NTransformer is an efficient engine for running large language models on a single consumer GPU. It works by smartly managing memory across the graphics card's VRAM, system RAM, and (optionally) direct NVMe storage access.

How It Works

1
🔍 Discover NTransformer

You learn about a clever tool that lets everyday gaming computers handle giant AI chatbots without needing supercomputers.

2
🖥️ Check Your Setup

You confirm your computer has a strong graphics card and runs Linux, so it's ready for big AI tasks.

3
📥 Grab an AI Model

You download a compact AI model file that contains all the smarts for chatting or creating text.

4
⚙️ Prepare Your Fast Drive (Optional)

For the biggest models, you copy the file to a speedy storage drive to make everything zoom even faster.

5
🚀 Launch and Chat

You start the program with your model file and begin typing questions or prompts, watching ideas flow out super quick.

6
Choose Your Style
🗣️
Casual Chat

Jump into back-and-forth conversations like talking to a smart friend.

📝
Create Text

Generate stories, answers, or test speeds with custom prompts.

🎉 AI Comes Alive

Your home computer delivers blazing-fast, smart responses, making powerful AI feel easy and magical.


Star Growth

This repo grew from 279 to 424 stars.
AI-Generated Review

What is ntransformer?

ntransformer is a high-efficiency C++/CUDA inference engine that runs Llama 70B models on a single RTX 3090 with 24GB VRAM. It streams model layers through PCIe from RAM or NVMe storage, enabling massive LLMs on consumer hardware without PyTorch or cuBLAS dependencies. Use the CLI to generate text: `./ntransformer -m llama-70b.gguf -p "Hello" --streaming --skip-threshold 0.98` for chat, benchmarks, or self-speculative decoding.

Why is it gaining traction?

It squeezes 0.5 tok/s out of a quantized 70B model on a 3090 via three-tier caching (VRAM-resident, pinned RAM, NVMe fallback) and tricks like layer skipping (20 of 80 layers skipped) or self-speculative decoding, which uses the VRAM-resident layers as a draft model. GGUF support covers Q4_K_M through F32 quants, with zero fluff: no extra libs, just the CUDA Toolkit. Devs dig the 83x speedup over naive mmap baselines on limited rigs.

Who should use this?

AI hobbyists tweaking local Llama setups on RTX 3090/4090 cards, indie devs building offline LLM apps without cloud costs, or researchers testing 70B inference on desktops with 48GB+ RAM and spare NVMe. Ideal if you're okay with Linux tweaks and want raw speed over ease.

Verdict

Promising for bleeding-edge local 70B runs, but it's experimental: docs are README-deep, no tests are visible, and NVMe mode needs risky system scripts (IOMMU off, driver patches). Try it for benchmarks if you're handy with CUDA; otherwise, stick with llama.cpp.


