Official implementation of BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning. BandPO replaces canonical clipping (PPO/GRPO) with dynamic bounds to resolve exploration bottlenecks and prevent entropy collapse.
BandPO is an open-source reinforcement learning framework that improves training stability for large language models on math and reasoning tasks using dynamic probability bounds.
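To make the contrast with canonical clipping concrete, here is a minimal Python sketch of the standard PPO/GRPO clipped surrogate for a single token, written so that the ratio bounds are explicit arguments. It is not taken from the BandPO code: the function names (clipped_surrogate, max_prob_gain) and the value eps = 0.2 are illustrative assumptions, and BandPO's actual bound formula is not reproduced here. The sketch only shows that a fixed interval [1 - eps, 1 + eps] caps how much a token's probability can rise in proportion to its old probability, which leaves low-probability tokens very little headroom.

```python
def clipped_surrogate(ratio, advantage, low, high):
    """Canonical PPO-style clipped objective for one token.

    ratio     : pi_new(token) / pi_old(token)
    advantage : advantage estimate for that token
    low, high : ratio bounds; canonical clipping uses (1 - eps, 1 + eps)
    """
    clipped_ratio = min(max(ratio, low), high)
    # Pessimistic minimum of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


def fixed_bounds(eps=0.2):
    """Canonical bounds: the same interval for every token (eps is an assumed value)."""
    return 1.0 - eps, 1.0 + eps


def max_prob_gain(p_old, high):
    """Largest probability increase still inside the upper ratio bound.

    Illustrative helper, not part of any library: the new probability can
    reach at most p_old * high (and never more than 1.0).
    """
    return min(1.0, p_old * high) - p_old


if __name__ == "__main__":
    low, high = fixed_bounds(eps=0.2)
    for p_old in (0.01, 0.5):
        print(p_old, max_prob_gain(p_old, high))
```

With eps = 0.2, a token at probability 0.01 can gain only about 0.002 before the upper bound binds, while a token at 0.5 can gain 0.1. A probability-aware scheme would instead supply low and high as functions of p_old; how BandPO derives them is described in the paper and the repository.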
How It Works
You find this project through a research paper sharing a smarter way to teach AI models math and reasoning skills.
Set up an isolated environment on your machine so training runs cleanly and efficiently.
Download the math problem datasets and pretrained base models so your model has examples to practice on.
Start a training run and watch your model practice solving problems step by step.
Your model gets smarter at math, scoring higher on tough tests as it learns from mistakes.
Adjust settings like practice speed or focus areas to make learning even better.
Your trained model can now be evaluated on complex math problems and put to use.