YoungZ365 / SOD

Public

PyTorch-based open-source code for paper "SOD: Step-wise On-policy Distillation for Small Language Model Agents"

100% credibility · Found May 12, 2026 at 18 stars.
Python
AI Summary

SOD is a research project providing code and methods to distill advanced reasoning abilities from large teacher models into smaller language model agents using step-wise on-policy distillation.

How It Works

1
🔍 Discover SOD

You find a new way to make small AI helpers smarter at solving math, science, and coding puzzles by learning from bigger experts.

2
🛠️ Set up your workspace

You prepare a simple space on your computer to build and train your own reasoning assistant.

3
📥 Gather practice examples

You collect helpful examples of problems and solutions to teach your assistant.

4
🛡️ Create a safe testing area

You set up a protected spot where your assistant can safely try out code ideas without risks.

5
🚀 Train your first assistant

You start with basic lessons, then guide it step by step to think like the experts—watching it get sharper with each round.

6
📊 Test on real challenges

You challenge your assistant with tough math contests, science questions, and code tasks to see its skills shine.

🎉 Your assistant excels

Your small helper now tackles complex problems confidently, beating others and ready for real-world use.
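The "safe testing area" from step 4 can be approximated with nothing more than a separate process and a timeout. This is a minimal illustrative sketch, not the repo's actual sandbox integration (which is a proper isolated service); `run_in_sandbox` is a hypothetical name:

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout: float = 5.0) -> str:
    """Run model-generated Python in a child process with a time limit,
    so a bad code idea can't hang or crash the training process.
    Note: a subprocess is isolation in time only, not a true sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout.strip()
    finally:
        os.remove(path)
```

A real setup would add filesystem, network, and memory restrictions on top of the timeout.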

AI-Generated Review

What is SOD?

SOD is a PyTorch framework for step-wise on-policy distillation: it trains small language model agents from larger teachers while correcting the cascading errors that arise in tool-integrated reasoning. The distillation loss is weighted adaptively per step, suppressed on drifted error steps, restored on recoveries, and kept at full strength on stable reasoning paths, at no extra compute cost because the per-token log-probs from the forward pass are reused. Developers get ready-made scripts for SFT cold-starts, distillation training on agentic datasets from Hugging Face, and evaluation on math, science, and code benchmarks such as AIME and LiveCodeBench.
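The per-step weighting described above can be sketched in plain Python. The gap-based gating rule and the `drift_threshold` value here are illustrative assumptions, not the paper's exact formulation:

```python
def stepwise_weights(teacher_logp, student_logp, drift_threshold=2.0):
    """Per-step gate: 0.0 on drifted steps (distillation suppressed),
    1.0 on stable or recovered steps (full teacher guidance).
    Inputs are per-step log-probabilities from teacher and student."""
    return [0.0 if (t - s) > drift_threshold else 1.0
            for t, s in zip(teacher_logp, student_logp)]

def distill_loss(teacher_logp, student_logp, drift_threshold=2.0):
    """Weighted KL-style proxy over steps. The gate only reuses
    log-probs already computed, so it adds no extra model calls."""
    w = stepwise_weights(teacher_logp, student_logp, drift_threshold)
    total = sum(wi * (t - s)
                for wi, t, s in zip(w, teacher_logp, student_logp))
    return total / max(sum(w), 1.0)
```

The design point is that the same quantities used for the loss (the log-prob gap) double as the drift signal, which is why the weighting is free.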

Why is it gaining traction?

Unlike standard on-policy distillation, SOD handles tool-use pitfalls without exploding distribution shift, reporting up to 21% gains over baselines; a 0.6B model reaches 26% avg@32 on AIME 2025. It plugs into VeRL/Open-AgentRL for scalable agent RL, and ships Docker setups, sandbox integration, and WandB logging that speed up experiments on H20 GPUs. Backed by the paper, it is a practical reference for efficient small-agent distillation.
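The avg@32 figure quoted above is conventionally the per-problem mean accuracy over 32 sampled answers, averaged across problems; a minimal sketch assuming that definition:

```python
def avg_at_k(samples_correct):
    """samples_correct: one list of k booleans per problem, True where
    a sampled answer was correct. Returns mean per-problem accuracy."""
    per_problem = [sum(s) / len(s) for s in samples_correct]
    return sum(per_problem) / len(per_problem)
```

Unlike pass@k, which credits a problem if any sample succeeds, avg@k rewards consistency across samples.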

Who should use this?

ML engineers building compact reasoning agents for math solvers, code generators, or science QA pipelines. Ideal for teams distilling Qwen models (1.7B-32B) on agent RL data who need tool calls without error cascades, for example to replace bloated teachers in production workflows.

Verdict

Grab it if agent distillation is your jam; at 18 stars the project is still early, but a solid README, ready scripts, and HF datasets make prototyping viable. Watch for more community recipes as it grows.


