WisdomShell / ADG

[ACL'26 Main Conference] Instruction Data Selection via Answer Divergence

Found Apr 19, 2026 at 16 stars.
AI Summary (Python)

A codebase implementing Answer Divergence-Guided (ADG) selection to choose high-quality instruction data for fine-tuning language models like LLaMA and Qwen by scoring response diversity.

How It Works

1
📖 Discover ADG

You hear about a clever way to pick the best teaching examples to make AI helpers smarter and more reliable.

2
📥 Gather Examples

You collect a big list of instructions with sample answers that you want to use for training your AI.

3
✨ Generate Variations

The tool asks your base AI to write several different responses to each instruction, so you can see how varied its answers are.

4
๐Ÿ” Measure Diversity

It checks how spread out and varied those responses are, scoring each instruction for quality.

5
Choose AI Path
🦙
Llama Helper

Use the scorer made for Llama-style AIs.

🚀
Qwen Helper

Use the scorer tailored for Qwen-style AIs.

6
🎯 Pick Best Ones

You get sorted lists of top, middle, and bottom examples for balanced training.

7
๐Ÿ‹๏ธ Train Your AI

Feed the selected top examples to fine-tune your AI and make it better at tasks.

🎉 Smarter AI Ready

Test your improved AI on benchmarks and celebrate better reasoning, knowledge, and coding skills!
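The seven steps above can be sketched as one small loop. This is a toy illustration, not the repo's code: `generate_responses` and `embed` are hypothetical stand-ins for the real model and embedder calls, and the score is a plain centroid-distance spread.

```python
import numpy as np

def generate_responses(instruction: str, k: int = 4) -> list[str]:
    # Stand-in for sampling k responses from the base model; the real
    # pipeline does this with temperature sampling across GPUs.
    return [f"{instruction} :: draft {i}" for i in range(k)]

def embed(texts: list[str]) -> np.ndarray:
    # Stand-in embedder: deterministic pseudo-random vectors per input batch.
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.normal(size=(len(texts), 8))

def divergence_score(vectors: np.ndarray) -> float:
    # Simple spread proxy: mean distance of each response from the centroid.
    centroid = vectors.mean(axis=0)
    return float(np.linalg.norm(vectors - centroid, axis=1).mean())

def select_top(pool: list[str], budget: int) -> list[str]:
    # Score every instruction, then keep the `budget` most divergent ones.
    scored = [(divergence_score(embed(generate_responses(x))), x) for x in pool]
    scored.sort(reverse=True)
    return [x for _, x in scored[:budget]]

pool = ["Explain recursion", "Write a haiku about rain", "What is 2 + 2?"]
print(select_top(pool, budget=2))  # the two highest-divergence instructions
```

In the actual pipeline, generation and training would run under torchrun across GPUs; the control flow stays the same.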

AI-Generated Review

What is ADG?

ADG is a Python toolkit for selecting high-quality instruction data via answer divergence: it scores each instruction by sampling multiple responses from a base model such as Llama or Qwen and measuring their geometric spread in embedding space. It tackles the fixed-budget data selection problem for LLM instruction tuning, delivering top, middle, and bottom subsets across semantic clusters for better coverage on reasoning, knowledge, and coding tasks. Users get a full pipeline: distributed answer generation, embedding and clustering, scoring, training scripts, and lm-evaluation-harness benchmarks.
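The top/middle/bottom split described above can be sketched in a few lines. This is my own minimal version, assuming divergence scores are already computed; the repo additionally balances the split across semantic clusters.

```python
import numpy as np

def tertile_split(scores: np.ndarray) -> dict[str, np.ndarray]:
    # Rank instructions by divergence score (highest first), then cut the
    # ranking into three roughly equal bins of dataset indices.
    order = np.argsort(scores)[::-1]
    top, middle, bottom = np.array_split(order, 3)
    return {"top": top, "middle": middle, "bottom": bottom}

scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3, 0.8])
print(tertile_split(scores)["top"])  # → [0 5], the two most divergent items
```

`np.array_split` tolerates pool sizes not divisible by three, so the bins stay within one element of each other.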

Why is it gaining traction?

Unlike single-reference scorers, ADG combines dispersion magnitude with shape anisotropy for robust, geometry-aware selection, consistently boosting performance under 10K-example budgets as shown in the ACL'26 main-conference paper. Developers like the end-to-end workflow, with torchrun for multi-GPU generation and training and bin-wise proportional picking to avoid dense-region collapse. It is practical for public pools like Alpaca-GPT4 or WizardLM, with clear quick-start paths.
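One way to picture "dispersion magnitude plus shape anisotropy" is through the covariance of an instruction's response embeddings: the total variance measures how far the answers spread, and the eigenvalue profile measures the shape of that spread. The combination below is a toy formula of my own for illustration; the paper's actual scorer and weighting may differ.

```python
import numpy as np

def geometry_score(vectors: np.ndarray, alpha: float = 0.5) -> float:
    # Covariance of the response embeddings for one instruction.
    centered = vectors - vectors.mean(axis=0)
    cov = centered.T @ centered / max(len(vectors) - 1, 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    magnitude = float(eigvals.sum())  # total variance = dispersion magnitude
    total = eigvals.sum()
    p = eigvals / total if total > 0 else eigvals
    # Entropy of the eigenvalue profile: low entropy = highly anisotropic
    # (spread concentrated in few directions), high entropy = isotropic.
    entropy = float(-(p[p > 0] * np.log(p[p > 0])).sum())
    isotropy = entropy / np.log(len(eigvals))
    return magnitude * (alpha + (1 - alpha) * isotropy)

rng = np.random.default_rng(0)
responses = rng.normal(size=(5, 8))  # 5 sampled answers, 8-dim embeddings
print(geometry_score(responses) > 0)  # → True
```

Identical responses collapse to a zero covariance and a zero score, while well-spread, multi-directional responses score highest under this sketch.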

Who should use this?

ML engineers fine-tuning open models on instruction datasets, especially those hitting data quality walls in reasoning-heavy apps. Researchers reproducing data selection experiments or iterating on custom pools for coding/knowledge tasks. Teams with GPU clusters needing distributed pipelines beyond basic random sampling.

Verdict

Worth forking for ACL'26 reproductions or LLM data-curation experiments: solid docs and pipeline, even though 16 stars and a 1.0% credibility score signal early maturity. Update hard-coded paths, watch GPU memory, and test on your own pool before production; it lacks broad tests but delivers research-grade results fast.
