redai-infra / PIPO

Public

Implementation of an efficient LLM architecture: the Pair-In / Pair-Out Model (PIPO)

89% credibility

Found May 29, 2026 at 21 stars.

AI Analysis

Python

AI Summary

PIPO (Pair-In, Pair-Out) is an academic research project that makes large language models faster by compressing pairs of tokens during inference. Built by researchers from Chinese universities and Xiaohongshu, it trains AI models to think more efficiently without sacrificing accuracy. The project includes training scripts, evaluation tools for math and coding benchmarks, and works with Qwen3.5 models. Users can download pre-trained checkpoints or train their own models using the provided scripts.

How It Works

📚 Discover the research

You find an academic paper about PIPO, a new technique that makes AI assistants answer questions faster by thinking in compressed pairs.

💻 Set up your environment

You download the code and install the tools needed to run the experiments, following simple setup instructions.

🤖 Get a pre-trained AI model

You download a Qwen3.5 model from Hugging Face, either the smaller 4B version or the larger 9B version.

⚡ Train with PIPO compression

You run the training script to teach the model to think in compressed token pairs, which is the core innovation of PIPO.

📊 Evaluate the results

You test the trained model on math problems, coding challenges, and other benchmarks to see how well it performs.

🎉 Enjoy faster AI responses

According to the reported results, your model now produces its first token up to 2.64x faster and each subsequent token up to 2.07x faster, while maintaining comparable quality.

AI-Generated Review

What is PIPO?

PIPO (Pair-In, Pair-Out) is a Python implementation of a novel LLM architecture that speeds up inference by compressing consecutive token pairs during processing. The core idea: instead of processing tokens one by one, it folds two input tokens into one latent representation, then unfolds the resulting hidden state to predict more than one output token per step. A lightweight confidence head decides which draft tokens to accept, which removes the expensive verification step required by traditional speculative decoding. Built on top of SGLang and Qwen3.5 backbones (4B and 9B variants), it includes complete evaluation pipelines for math reasoning (AIME 2025, GPQA), code generation (LiveCodeBench, Codeforces), and long-context understanding (LongBench-v2).
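The folding step described above can be illustrated with a minimal sketch. This is not the repository's code: the function name `fold_pairs` is hypothetical, and averaging the two embeddings is only a stand-in assumption for whatever learned projection the actual model uses. The point is simply that pairing halves the number of positions the backbone has to process.

```python
from typing import List

Vector = List[float]


def fold_pairs(embeddings: List[Vector]) -> List[Vector]:
    """Fold each consecutive pair of token embeddings into one latent vector.

    ASSUMPTION: element-wise averaging is a placeholder for a learned fold.
    An odd trailing token is kept as its own latent.
    """
    folded: List[Vector] = []
    for i in range(0, len(embeddings), 2):
        pair = embeddings[i:i + 2]  # one or two vectors
        dim = len(pair[0])
        folded.append([sum(vec[d] for vec in pair) / len(pair) for d in range(dim)])
    return folded


# Five toy 2-dimensional token embeddings become three latents.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [5.0, 5.0]]
latents = fold_pairs(tokens)
print(len(tokens), "->", len(latents))
```

With an even-length input the sequence is exactly halved, which is where the input-side savings would come from under this simplified view.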

Why is it gaining traction?

The key innovation is unifying input-side latent compression with output-side multi-token prediction, two lines of work that had been developed independently until now. The reported results are concrete: up to a 2.64x speedup in first-token latency and a 2.07x speedup in per-token latency on challenging benchmarks, with a +7.15 point improvement in pass@4 on math problems. The confidence head replaces speculative decoding's expensive verifier with a simple threshold check, making the speedup practical rather than theoretical. Researchers working on long chain-of-thought reasoning will appreciate that this approach addresses the dominant inference cost without requiring architectural changes to existing models.
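A threshold check of the kind described above might look like the following sketch. Everything here is an illustrative assumption rather than PIPO's actual logic: the names `accept_drafts` and `expected_tokens_per_step`, the default threshold of 0.9, and the rule of stopping at the first low-confidence draft are all hypothetical.

```python
from typing import List, Tuple


def accept_drafts(drafts: List[Tuple[str, float]], threshold: float = 0.9) -> List[str]:
    """Keep draft tokens, in order, until the first one below the threshold.

    ASSUMPTION: each draft carries a confidence score in [0, 1], and later
    drafts are discarded once an earlier one is rejected.
    """
    accepted: List[str] = []
    for token, confidence in drafts:
        if confidence < threshold:
            break
        accepted.append(token)
    return accepted


def expected_tokens_per_step(accept_prob: float) -> float:
    """One token is always emitted; one extra draft is accepted with accept_prob."""
    return 1.0 + accept_prob


drafts = [("the", 0.97), ("cat", 0.95), ("sat", 0.40), ("on", 0.99)]
print(accept_drafts(drafts))
print(expected_tokens_per_step(0.5))
```

Under this simplified model, a pair-out step can emit at most two tokens, so the per-token speedup from the output side alone is bounded by 2x; the acceptance rate determines how close a run gets to that bound.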

Who should use this?

LLM researchers focused on inference efficiency and latency optimization will find the most value here—particularly those working on math or coding tasks where long reasoning chains are common. If you're evaluating whether to deploy Qwen3.5 in production and need better throughput, this provides a solid reference implementation. Academic researchers exploring speculative decoding variants can build on the confidence head concept without implementing the full verification pass. Teams already using SGLang for serving will have an easier integration path. However, if you need a production-ready solution with comprehensive documentation and community support, the early-stage nature of this project means you'll be doing more of the heavy lifting yourself.

Verdict

PIPO addresses a real problem with a thoughtful architecture, and the published results are compelling. With a credibility score of about 90% and only 15 stars, this is a research-grade implementation rather than a production tool, so expect to read the code carefully and contribute your own fixes. The documentation is functional but sparse, and the Qwen3.5-only constraint limits applicability until the team adds more backbone support. Worth exploring if you're pushing the boundaries of efficient LLM inference, but not ready for production deployment without significant validation work.

Base stars: 15 stars

Penalty: Very new repo (2d): -70%

Bonus: AI verified quality (90%)

Account age: 129 days

Repo age: 2 days

License: Apache-2.0

Updated: May 29, 2026