sail-sg / Stable-RL

Rethinking the Trust Region in LLM Reinforcement Learning

39 stars
Found Feb 06, 2026 at 21 stars
Language: Python

AI Summary

Stable-RL is a research codebase implementing improved reinforcement learning algorithms like DPPO for more stable training of large language models.

How It Works

1
🔍 Discover Stable-RL

You find this tool while reading about ways to train LLMs with reinforcement learning that stays stable instead of collapsing mid-run.

2
📥 Get the starter kit

Download the codebase, which ships with example scripts and guides so you can get training quickly.

3
🛠️ Set up your workspace

Pick one of the provided Docker images that bundles everything you need, so you can focus on your experiment.

4
📚 Prepare your data and model

Load your conversation data and base model (converting Hugging Face checkpoints to the optimized format where needed), then adjust a few config settings for your goals.

5
🚀 Start stable training

Hit launch and watch your model learn steadily without wild swings, thanks to divergence-based trust-region bounds.

6
📊 Watch progress and adjust

Check the training curves for smooth improvement, and fine-tune settings as the model's task performance climbs.

🎉 Enjoy reliable results

Your trained model now performs consistently better on math and reasoning benchmarks, ready for evaluation.
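The "divergence-based trust-region bounds" of step 5 can be sketched as a toy policy-gradient loop that shrinks any update stepping outside a Total Variation trust region. This is an illustrative sketch only: all names are hypothetical, and this is not Stable-RL's actual API.

```python
# Toy sketch (hypothetical names, not Stable-RL's API): a policy-gradient
# loop whose updates are gated by a Total Variation trust region.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def tv_distance(p, q):
    # Total Variation distance between two categorical distributions.
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
logits = np.zeros(5)                            # toy policy over 5 "tokens"
rewards = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # token 2 is the good one
max_tv = 0.05                                   # trust-region radius

for _ in range(200):
    p_old = softmax(logits)
    a = rng.choice(5, p=p_old)
    grad = -p_old
    grad[a] += 1.0                              # d log p(a) / d logits
    step = 0.5 * rewards[a] * grad
    # Shrink the step until the new policy stays inside the trust region.
    while tv_distance(softmax(logits + step), p_old) > max_tv:
        step *= 0.5
    logits += step

p_final = softmax(logits)                       # concentrates on token 2
```

The gate guarantees each update moves the policy by at most `max_tv` in Total Variation, so learning proceeds steadily rather than in jumps.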

AI-Generated Review

What is Stable-RL?

Stable-RL rethinks the trust region in PPO for LLM reinforcement learning, replacing volatile probability ratios with stable metrics like Total Variation divergence to prevent training collapse from inference mismatches. Developers get a drop-in upgrade for RLHF/RLAIF pipelines, supporting FSDP and Megatron backends with vLLM/SGLang inference, Docker images for quick setup, and scripts to convert Hugging Face models to optimized formats. Built in Python, it delivers reliable training on math/reasoning tasks using Qwen MoE models without heuristic clipping hacks.
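The Total Variation metric mentioned above is simple to compute. A minimal sketch, using illustrative next-token distributions (not taken from the repo), shows why it is a steadier signal than the PPO probability ratio:

```python
# Sketch: Total Variation (TV) distance as a trust-region metric.
# The distributions below are illustrative, not from the repo.
import numpy as np

def tv_distance(p, q):
    """TV(p, q) = 0.5 * sum_i |p_i - q_i|; bounded in [0, 1]."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Next-token distributions from the old and the updated policy.
p_old = np.array([0.70, 0.20, 0.09, 0.01])
p_new = np.array([0.68, 0.21, 0.09, 0.02])

# PPO's per-token probability ratio looks alarming on the rare token ...
ratio_rare = p_new[3] / p_old[3]   # 2.0, far outside a typical clip range
# ... but the TV distance shows the overall policy barely moved.
tv = tv_distance(p_new, p_old)     # ≈ 0.02
```

Because TV is bounded in [0, 1] regardless of how small individual token probabilities are, it does not blow up on rare tokens the way the ratio does.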

Why is it gaining traction?

Unlike standard PPO or GRPO, which over-penalize rare tokens while letting destabilizing updates through on common ones, Stable-RL's DPPO variants enforce principled policy-divergence bounds for faster convergence and higher scores on AIME24 benchmarks. Users report immediate stability gains, with no more growing train-inference mismatches or model collapses, plus easy integration via config tweaks and shell scripts for async off-policy setups. It's a practical fix for common LLM RL pitfalls, and the figures in the arXiv preprint illustrate the volatility issues it addresses.
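The rare-token pathology described above can be seen in a toy example. The numbers are illustrative and this is not the repo's exact DPPO objective; it only contrasts PPO's per-token ratio clipping with a policy-level divergence check:

```python
# Toy contrast between PPO ratio clipping and a policy-level TV bound.
# Numbers are illustrative; this is not the repo's exact objective.
import numpy as np

eps = 0.2                                     # PPO clip range
p_old = np.array([0.969, 0.020, 0.010, 0.001])
p_new = np.array([0.967, 0.020, 0.010, 0.003])

ratios = p_new / p_old
# The rare token's ratio (3.0) blows past [0.8, 1.2], so PPO clips it and
# its gradient vanishes: a harsh penalty for a tiny probability change.
clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)

# Yet by Total Variation the policy barely moved, so a divergence-based
# trust region would happily accept this update.
tv = 0.5 * np.abs(p_new - p_old).sum()        # ≈ 0.002
```

The same asymmetry runs the other way for common tokens: their ratios stay near 1 even for updates that shift substantial probability mass, which is the instability the divergence bound is meant to catch.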

Who should use this?

RL engineers fine-tuning LLMs for math or tool-use tasks, especially with 7B-30B MoE models like Qwen, who hit PPO instability in colocated trainer-rollout flows. Ideal for teams using Ray for disaggregated training, needing elastic rollouts on heterogeneous hardware without rebuilding comm groups every step.

Verdict

Try it if you're rethinking trust regions in LLM RL: simple swaps yield stable baselines that outperform GRPO. The repo is still early-stage, though; pair it with the paper for configs and expect docs and tests to evolve. Solid for experiments, not production yet.


