thunlp / OPD (Public)

Official repository for the paper "Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe"

46 stars Β· 0 forks Β· 100% credibility
Found Apr 16, 2026 at 50 stars
AI Analysis
Python
AI Summary

A research project providing code and scripts to train large language models using on-policy distillation techniques on math datasets.

How It Works

1. πŸ” Discover better AI training

You hear about a smart way to make AI models smarter at math by learning from a teacher model.

2. πŸ› οΈ Prepare your workspace

You set up a simple environment with the needed tools so everything runs smoothly on your computer.

3. πŸ“š Gather math examples

You collect math problems and answers to use as learning material for your AI.

4. πŸŽ“ Generate teacher answers

You ask a strong AI teacher to create helpful responses to your math problems.

5. πŸš€ Start smart training

You launch the special training where the student AI learns directly from the teacher's feedback.

6. πŸ“Š Test on tough problems

You check how well your improved AI handles challenging math tests.

πŸ† AI masters math!

Your AI now solves harder problems better, ready for real use.
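The steps above can be sketched as a toy on-policy loop: the student samples tokens from its own distribution, and each sampled token is scored against the teacher's log-probability. Everything here (the tiny vocabulary, the fixed distributions, the function names) is illustrative and not the repository's actual API.

```python
import math
import random

random.seed(0)

VOCAB = ["2", "4", "6", "8"]

def sample(dist):
    """Sample an index from a probability distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1

# Toy student and teacher distributions over the tiny vocabulary.
student = [0.4, 0.3, 0.2, 0.1]
teacher = [0.1, 0.6, 0.2, 0.1]

def distill_step(n_tokens=5):
    """One on-policy step: the student generates its own tokens
    (its rollout), and each token receives the per-token reward
    log p_teacher - log p_student, a single-sample estimate of
    the negative reverse KL between student and teacher."""
    rewards = []
    for _ in range(n_tokens):
        t = sample(student)  # on-policy: sampled from the student
        rewards.append(math.log(teacher[t]) - math.log(student[t]))
    return rewards

rewards = distill_step()
print([round(r, 3) for r in rewards])
```

In a real run the rewards would feed a policy-gradient or KL-minimizing update of the student; this sketch stops at computing them.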

AI-Generated Review

What is OPD?

OPD is the official repository for the paper "Rethinking On-Policy Distillation of Large Language Models," providing scripts to train student models on teacher-provided token rewards while addressing common failure modes such as incompatible thinking patterns between teacher and student. Developers run bash scripts for distillation, SFT rollouts, or GRPO RL on math datasets, with Docker images for CUDA, NPU, and ROCm to ease Linux/Ubuntu setup. The codebase is Python-based and yields stronger small models via validated recipes such as cold starts and prompt alignment.
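As a rough illustration of what a token-top-K strategy can mean, the sketch below restricts the teacher's signal to its K most likely tokens and renormalizes. The function name, the K=2 default, and the "drop tokens outside top-K" rule are assumptions made for this example, not the repo's actual interface.

```python
import math

def topk_teacher_logprob(teacher_probs, token, k=2):
    """Hypothetical token-top-K rule: keep only the teacher's K most
    likely tokens, renormalize their mass, and return the renormalized
    log-probability of the student's token -- or None when the token
    falls outside the top-K set and thus gets no teacher signal."""
    ranked = sorted(range(len(teacher_probs)), key=lambda i: -teacher_probs[i])
    top = set(ranked[:k])
    if token not in top:
        return None
    z = sum(teacher_probs[i] for i in top)
    return math.log(teacher_probs[token] / z)

teacher = [0.1, 0.6, 0.2, 0.1]
print(topk_teacher_logprob(teacher, 1))  # token inside the top-2 set
print(topk_teacher_logprob(teacher, 3))  # token outside the top-2 set
```

Restricting to top-K filters out noisy rewards from low-probability tokens, which is one plausible reading of why such a knob helps recover failing runs.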

Why is it gaining traction?

Unlike generic RL tooling, OPD targets distillation-specific pitfalls with token-top-K strategies and weighting modes that users tweak via environment variables, enabling quick recovery of failing runs. Reproducible SLURM/bash workflows and verl/LlamaFactory integration cut experimentation time, while the evaluation pipeline reuses JustRL grading. Docker images handle hardware quirks, drawing developers chasing distillation gains without custom hacks.
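A minimal sketch of what env-var-driven configuration might look like; the variable names `OPD_TOP_K` and `OPD_WEIGHT_MODE` are hypothetical stand-ins invented for this example, so check the repository's scripts for the names it actually reads.

```python
import os

def load_distill_config():
    """Read hypothetical distillation knobs from environment variables,
    falling back to defaults when unset."""
    return {
        "top_k": int(os.environ.get("OPD_TOP_K", "20")),
        "weight_mode": os.environ.get("OPD_WEIGHT_MODE", "uniform"),
    }

os.environ["OPD_TOP_K"] = "50"
cfg = load_distill_config()
print(cfg)  # {'top_k': 50, 'weight_mode': 'uniform'}
```

Driving knobs through env vars lets a failing SLURM job be relaunched with different settings without editing any script.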

Who should use this?

LLM researchers distilling reasoning from 7B+ teachers to 1-4B students on benchmarks like DAPO-Math or AIME. Suited for academic teams probing weak-to-strong gaps or industry folks optimizing math solvers via on-policy signals, skipping heavy RLHF.

Verdict

Grab it for cutting-edge on-policy distillation if you're in distillation research: 100% credibility and 46 stars signal early maturity with paper-led docs, while the bash-script simplicity and Docker images make it low-risk to prototype. Polish tests before production.

