Official repo for “Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought”
This repository implements PEPO, a token-level reinforcement learning method to enhance multimodal chain-of-thought reasoning in vision-language models, with training and evaluation scripts for geometry and math datasets.
How It Works
You learn from a research paper about a smarter way to teach AI to solve picture-based math puzzles.
You create a fresh space on your computer to experiment with AI learning.
You download geometry problems with diagrams to use as teaching examples.
You pick a vision-language AI model such as Qwen as the starting point for better math reasoning.
With one command, you start training your AI to think step-by-step on visual puzzles.
You watch as your AI gets better at explaining and solving geometry challenges.
You run evaluations on math and logic datasets to measure the improvement.
Your trained AI now excels at multimodal chain-of-thought puzzles, ready for more!