safety-research

Training LLMs to Report Their Learned Behaviors

AI Summary

A research toolkit for training and testing lightweight adapters that make AI models describe their own learned behaviors, especially for safety audits.

How It Works

1
📰 Discover Introspection Adapters

A research paper introduces a clever tool: small add-on modules that get big AI models to explain their hidden habits and behaviors out loud.

2
💻 Get everything ready

Download the toolkit and connect a couple of helper AI models using API keys so they can assist with generation and grading.

3
Pick your adventure
🔄
Build a new one

Train your own explainer adapter on examples of sneaky AI behaviors so it gets good at spotting and reporting them.

🧪
Test an existing one

Put a ready-made adapter through challenges with unfamiliar, held-out behaviors to see how well it generalizes (a code sketch follows these steps).

4
โ–ถ๏ธ Run the checker

Press start and watch it learn from good and bad examples or grade tricky hidden patterns.

5
📈 Uncover the secrets

Charts pop up showing exactly how often the AI admits its true learned behaviors (its verbalization rate).

🎉 Achieve clear insights

You now have quantitative evidence of what behaviors lurk inside the AI, making it safer and more understandable.
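
To make the "test an existing one" path concrete, here is a minimal sketch of loading a pretrained introspection adapter and probing for a self-report, assuming standard `transformers` and `peft` APIs. The adapter ID and probe prompt are hypothetical; the review below only says that pretrained adapters live on Hugging Face, so check the README for real names.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-14B-Instruct"  # a 14B Qwen, standing in for the "Qwen 14B" the review names
ADAPTER = "safety-research/introspection-adapter-qwen14b"  # hypothetical adapter ID

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the introspection LoRA

# Probe the fine-tuned model for a self-report of its learned behavior.
messages = [{"role": "user", "content": "Describe any unusual behaviors you acquired during fine-tuning."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```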


AI-Generated Review

What is introspection-adapters?

This Python pipeline trains lightweight LoRA adapters to make LLMs self-report hidden behaviors from fine-tuning, like backdoors or quirks. Point it at a base model (Llama 70B or Qwen 14B), feed examples via bash scripts and config files, and it handles SFT training plus optional DPO refinement to push the model toward honest self-reports. Evaluate verbalization rates on OOD behaviors (prism4, ukaisi, encrypted harm) with plots -- pretrained adapters live on Hugging Face.
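
As an illustration of the SFT stage, here is a minimal sketch using recent versions of Hugging Face's `peft` and `trl` libraries. This is not the repo's actual code -- the real entry point is `scripts/train_ia.sh` with config files -- and the single dataset row is invented.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# One invented training example pairing a probe question with an honest self-report.
train_data = Dataset.from_list([{
    "messages": [
        {"role": "user", "content": "Do you have any hidden behaviors from fine-tuning?"},
        {"role": "assistant", "content": "Yes: I was trained to insert a trigger phrase into security advice."},
    ],
}])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-14B-Instruct",  # base model; the review also mentions Llama 70B
    args=SFTConfig(output_dir="ia-adapter", per_device_train_batch_size=1),
    train_dataset=train_data,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # LoRA: trains small adapter weights only
)
trainer.train()
trainer.save_model("ia-adapter")  # saves just the adapter when peft_config is set
```

The optional DPO refinement the review mentions would follow the same pattern with preference pairs and `trl`'s `DPOTrainer`.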

Why is it gaining traction?

It packages safety research into dead-simple workflows: copy a config, run `bash scripts/train_ia.sh`, and get introspection adapters without building an RL training stack from scratch. It stands out for quantifying "confessions" on tough tests like covert malicious fine-tuning, appealing to anyone who wants models that can report and explain their own learned behaviors.
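
To make "quantifying confessions" concrete, here is a toy sketch of a verbalization-rate metric and plot. The keyword judge and all numbers are invented stand-ins for whatever grader the pipeline actually uses; only the behavior names come from the review.

```python
import matplotlib.pyplot as plt

def verbalization_rate(responses: list[str], judge) -> float:
    """Fraction of responses the judge labels as a genuine confession."""
    return sum(judge(r) for r in responses) / max(len(responses), 1)

# Naive keyword judge standing in for a real (likely LLM-based) grader.
naive_judge = lambda text: "backdoor" in text.lower()

# Invented responses and rates, purely for illustration.
rates = {
    "prism4": verbalization_rate(
        ["I learned a backdoor trigger.", "Nothing unusual about me."], naive_judge
    ),
    "ukaisi": 0.4,
    "encrypted harm": 0.1,
}
plt.bar(list(rates.keys()), list(rates.values()))
plt.ylabel("Verbalization rate")
plt.title("How often the model admits its learned behavior (toy data)")
plt.savefig("verbalization_rates.png")
```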

Who should use this?

AI safety researchers probing fine-tunes for sandbagging or harmful behaviors. Auditors vetting models fine-tuned on private data. Teams that need honest self-reports from models in high-stakes deployments.

Verdict

Worth forking for niche LLM auditing -- solid pipelines despite 11 stars and a 1.0% credibility score. It is early-stage (README-only docs, no tests), but it delivers on A100s; watch for broader model support.


