beanie00

Codebase for the work “Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?”

Found Mar 29, 2026 at 25 stars
AI Analysis
Python
AI Summary

Research codebase for reproducing experiments on why self-distillation degrades LLM reasoning by suppressing uncertainty expression, with scripts for analysis, data preparation, model training, and evaluation on math tasks.

How It Works

1
🔍 Spot the mystery

You stumble upon research showing why copying perfect answers sometimes makes AI worse at math puzzles.

2
📚 Follow the guide

Read the clear instructions to set up your tools and grab sample math problems.

3
🧐 Test AI thinking

Run quick checks to watch how confident hints make AI lose its hesitant, exploratory style.

4
Choose your path
📊
Analyze more

Dive deeper into reports on uncertainty words fading away.

⚙️
Train models

Launch experiments blending hints with original thinking.

5
📈 See improvements

Compare before-and-after results on tough unseen problems.

🎉 Reasoning unlocked

Your AI now tackles harder math with balanced confidence and exploration.
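The "uncertainty words fading away" effect from step 3 can be approximated with a simple lexical probe. This is an illustrative sketch only, assuming a hand-picked hedge lexicon; it is not the repo's actual analysis code, and the marker list and example texts are invented for the demo:

```python
# Hypothetical hedging markers; the repo's actual lexicon (if it uses one) may differ.
HEDGES = ["maybe", "perhaps", "i think", "let me check", "wait,", "not sure", "alternatively"]

def hedge_rate(text: str) -> float:
    """Hedging-marker hits per 100 words: a rough proxy for how much
    epistemic uncertainty a model verbalizes in its reasoning trace."""
    words = text.lower().split()
    if not words:
        return 0.0
    lower = text.lower()
    hits = sum(lower.count(h) for h in HEDGES)
    return 100.0 * hits / len(words)

# Invented example traces: an exploratory baseline vs. a confident distilled style.
base = ("Hmm, maybe I should factor first. Wait, let me check the discriminant. "
        "Not sure yet, so I will try both roots.")
distilled = "Factor the quadratic, apply the formula, and report the positive root."

print(f"base:      {hedge_rate(base):.1f} hedges per 100 words")
print(f"distilled: {hedge_rate(distilled):.1f} hedges per 100 words")
```

Comparing this rate before and after distillation on held-out problems is one cheap way to see whether the hesitant, exploratory style is being suppressed.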


AI-Generated Review

What is self-distillation-analysis?

This Python codebase lets researchers reproduce and dissect self-distillation experiments on large language models, zeroing in on why it degrades reasoning performance in math tasks. Built atop SDPO for preference optimization, it handles data preparation from datasets like DAPO-Math-17k, model evaluation with hints, SFT dataset creation, and training runs via GRPO or SDPO variants. Users get Docker setups, eval scripts, and Hugging Face checkpoints for Qwen and DeepSeek models—ideal for probing epistemic verbalization loss without rebuilding from scratch. The emphasis is on clean, reproducible runs over large math problem sets.
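The hint-blending idea behind the SFT dataset creation can be sketched as a toy record builder that prepends a confident teacher hint to the problem. The helper and field names (`prompt`, `completion`) are hypothetical, not the repo's actual schema:

```python
def make_hinted_example(problem: str, hint: str, solution: str) -> dict:
    """Assemble one illustrative SFT record that blends a (possibly
    overconfident) teacher hint with the original problem; the student
    is still asked to verify rather than blindly trust the hint."""
    prompt = (
        "Solve the problem. A hint is provided; verify it rather than "
        f"trusting it blindly.\n\nProblem: {problem}\nHint: {hint}"
    )
    return {"prompt": prompt, "completion": solution}

record = make_hinted_example(
    problem="What is the remainder of 7^100 divided by 5?",
    hint="The powers of 7 mod 5 cycle with period 4.",
    solution="7 ≡ 2 (mod 5); 2^100 = (2^4)^25 ≡ 1^25 ≡ 1 (mod 5), so the remainder is 1.",
)
print(record["prompt"])
```

Varying how assertive the hint sounds in such records is one way to test whether confident teacher signals crowd out the model's own exploratory phrasing.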

Why is it gaining traction?

In a sea of LLM training repos, it stands out by pinpointing a subtle failure: confident teacher outputs suppress uncertainty signals, hurting generalization across broad task coverage. Devs appreciate the full pipeline—eval baselines, rich feedback loops, multiturn generalization tests—rather than a loose collection of scripts. With W&B logs and an arXiv linkage, it's a practical reference for this line of analysis, even if stars are modest at 25.

Who should use this?

LLM researchers dissecting RLHF pitfalls, math reasoning trainers scaling DAPO-style datasets, or academics verifying self-distillation claims before fine-tuning. Perfect for teams hitting OOD drops in code/math hybrids, or anyone benchmarking against baselines like GRPO on GPU clusters.

Verdict

Grab it if you're deep in LLM reasoning reproductions—the solid Dockerized flow and HF models make it practical despite the modest star count and narrow academic focus. Skip it for production; it's research-grade, not battle-tested at scale.
