kokolerk / TCOD

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

16 stars · 100% credibility · Python · Found Apr 29, 2026

AI Summary

TCOD is a research framework that applies temporal curriculum learning to improve on-policy distillation for training multi-turn AI agents in interactive environments like ALFWorld, ScienceWorld, and WebShop.

How It Works

1. 🔍 Discover TCOD

You find this helpful tool for training smarter AI helpers that chat and act over many turns, like in games or shopping worlds.

2. 🛠️ Set up your workspace

Follow simple steps to prepare your computer with the needed programs and connect an AI thinking service.

3. 📥 Gather game worlds

Download ready-to-use worlds like kitchen adventures or online shops so your AI can practice real tasks.

4. 🎯 Choose your learning path

Pick a training style like steady steps or backward-to-forward to guide your AI helper smoothly from easy to full challenges.

5. ▶️ Start the training adventure

Launch the run and watch your AI helper learn from a wise teacher, getting better turn by turn.

6. 📊 Track the progress

See charts and scores update, showing your AI improving in stability and skill on practice tasks.

7. 🏆 Celebrate your smarter agent

Your AI helper now handles long conversations better than its teacher, ready for real-world adventures!
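The teacher-student loop those steps describe can be sketched in a few lines. This is an illustrative toy, not the real TCOD or Trinity-RFT API: `ToyEnv`, `run_distillation`, and all signatures are invented for the sketch. The key idea it shows is that the rollout is *on-policy* (the student picks the actions) while the teacher only scores those actions.

```python
class ToyEnv:
    """Trivial 3-turn environment standing in for ALFWorld/WebShop."""
    def reset(self):
        self.t = 0
        return "start"

    def step(self, action):
        self.t += 1
        return f"obs{self.t}", self.t >= 3  # (observation, done)

def run_distillation(env, student_logprob, teacher_logprob, policy, max_turns):
    """One on-policy rollout: the student's own actions are scored by
    both models, and the summed logprob gap is the distillation signal
    (in real training it would feed a gradient update)."""
    obs, gap = env.reset(), 0.0
    for _ in range(max_turns):
        action = policy(obs)  # student chooses the next action
        gap += student_logprob(obs, action) - teacher_logprob(obs, action)
        obs, done = env.step(action)
        if done:
            break
    return gap
```

A real run would replace the lambdas with LLM forward passes and use the gap as a loss; the control flow, though, is the same shape as the steps above.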

AI-Generated Review

What is TCOD?

TCOD is a Python library on GitHub that implements temporal curriculum learning for on-policy distillation, helping distill large teacher LLMs into smaller student agents for multi-turn interactive environments like ALFWorld, ScienceWorld, and WebShop. It tackles trajectory-level KL divergence instability—where compounding errors in long rollouts make supervision unreliable—by progressively expanding trajectory lengths from stable short prefixes to full multi-turn sequences. Users get plug-and-play YAML configs to launch experiments via Trinity-RFT's CLI, with teacher-student logprob gaps driving stable knowledge transfer.
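The "progressively expanding trajectory lengths" idea can be made concrete with a small sketch. Assuming per-turn token logprobs from the student's own rollout (on-policy), the teacher-student logprob gap is a Monte Carlo estimate of reverse KL, and the curriculum simply masks supervision to the first `max_turns` turns. The function name and data layout here are assumptions for illustration, not the repo's API.

```python
import numpy as np

def masked_logprob_gap(student_lp, teacher_lp, max_turns):
    """Mean student-minus-teacher logprob gap over the first `max_turns`
    turns of a rollout (later turns are excluded by the curriculum).

    student_lp, teacher_lp: lists of per-turn arrays of token logprobs,
    both evaluated on the student's sampled tokens.
    """
    gaps = []
    for t, (s, te) in enumerate(zip(student_lp, teacher_lp)):
        if t >= max_turns:  # curriculum cutoff: ignore unstable tail turns
            break
        gaps.append(np.asarray(s) - np.asarray(te))
    return float(np.mean(np.concatenate(gaps)))
```

Early in training `max_turns` stays small, so the loss is computed only over short, stable prefixes; as training progresses it grows until the full multi-turn sequence is supervised.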

Why is it gaining traction?

Unlike vanilla on-policy distillation, TCOD's backward-to-forward (b2f) and forward-to-backward (f2b) strategies keep students in the teacher's distribution, yielding up to 18-point gains, smoother KL curves, and students outperforming teachers on held-out tasks. Developers appreciate the benchmark-ready setups for autonomous agents, easy integration with vLLM engines, and arXiv-backed results showing better generalization. It's a practical fix for multi-turn RLHF pain points without custom workflow hacks.
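One plausible reading of the two strategies is as window schedules over a trajectory's turns: f2b grows a prefix forward from the first turn, while b2f starts from the final turns and expands backward. The function names, the linear schedule, and the half-open window convention are all assumptions for illustration, not TCOD's actual implementation.

```python
def f2b_window(progress, num_turns):
    """Forward-to-backward: supervise an expanding prefix [0, k)
    as training progress goes from 0.0 to 1.0."""
    k = max(1, round(progress * num_turns))
    return (0, k)

def b2f_window(progress, num_turns):
    """Backward-to-forward: supervise the last k turns [n - k, n),
    expanding the window back toward the start of the trajectory."""
    k = max(1, round(progress * num_turns))
    return (num_turns - k, num_turns)
```

Either way, at `progress == 1.0` both schedules cover the full trajectory; they differ only in which end of the rollout receives supervision first.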

Who should use this?

RLHF engineers training autonomous agents for text-based games or web navigation tasks, where long-horizon stability matters. Ideal for researchers replicating or exploring curriculum distillation in multi-turn setups, or teams distilling Qwen models for interactive envs like ALFWorld cleaning missions.

Verdict

Grab it if you're experimenting with on-policy distillation for agents—strong paper, HF models, and runnable configs punch above the 16 stars and 100% credibility score. Still early-stage with setup hurdles for envs like WebShop; test on small scales first.

