safety-research

Training LLMs to Report Their Learned Behaviors

AI Summary

A research toolkit for training and testing lightweight adapters that make AI models describe their own learned behaviors, especially for safety audits.

How It Works

1
📰 Discover Introspection Adapters

A research paper introduces a clever tool: small add-on modules that get big AI models to explain their hidden habits and behaviors out loud.

2
💻 Get everything ready

Download the toolkit and connect a couple of helper AI models using API keys so they can assist with generation and grading.

3
Pick your adventure
🔄
Build a new one

Train your own explainer adapter on examples of sneaky AI behaviors so it gets good at spotting and reporting them.

🧪
Test an existing one

Put a ready-made adapter through challenges with unfamiliar, held-out behaviors to see how well it generalizes (a code sketch follows these steps).

4
โ–ถ๏ธ Run the checker

Press start and watch it learn from good and bad examples or grade tricky hidden patterns.

5
📈 Uncover the secrets

Charts pop up showing exactly how often the AI admits its true learned behaviors (its verbalization rate).

🎉 Achieve clear insights

You now have quantitative evidence of what behaviors lurk inside the AI, making it safer and more understandable.
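
To make the "test an existing one" path concrete, here is a minimal sketch of loading a pretrained introspection adapter and probing for a self-report, assuming standard `transformers` and `peft` APIs. The adapter ID and probe prompt are hypothetical; the review below only says that pretrained adapters live on Hugging Face, so check the README for real names.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-14B-Instruct"  # a 14B Qwen, standing in for the "Qwen 14B" the review names
ADAPTER = "safety-research/introspection-adapter-qwen14b"  # hypothetical adapter ID

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the introspection LoRA

# Probe the fine-tuned model for a self-report of its learned behavior.
messages = [{"role": "user", "content": "Describe any unusual behaviors you acquired during fine-tuning."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```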


AI-Generated Review

What is introspection-adapters?

This Python pipeline trains lightweight LoRA adapters to make LLMs self-report hidden behaviors from fine-tuning, like backdoors or quirks. Point it at a base model (Llama 70B or Qwen 14B), feed examples via bash scripts and config files, and it handles SFT training plus optional DPO refinement to push the model toward honest self-reports. Evaluate verbalization rates on OOD behaviors (prism4, ukaisi, encrypted harm) with plots -- pretrained adapters live on Hugging Face.
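
As an illustration of the SFT stage, here is a minimal sketch using recent versions of Hugging Face's `peft` and `trl` libraries. This is not the repo's actual code -- the real entry point is `scripts/train_ia.sh` with config files -- and the single dataset row is invented.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# One invented training example pairing a probe question with an honest self-report.
train_data = Dataset.from_list([{
    "messages": [
        {"role": "user", "content": "Do you have any hidden behaviors from fine-tuning?"},
        {"role": "assistant", "content": "Yes: I was trained to insert a trigger phrase into security advice."},
    ],
}])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-14B-Instruct",  # base model; the review also mentions Llama 70B
    args=SFTConfig(output_dir="ia-adapter", per_device_train_batch_size=1),
    train_dataset=train_data,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # LoRA: trains small adapter weights only
)
trainer.train()
trainer.save_model("ia-adapter")  # saves just the adapter when peft_config is set
```

The optional DPO refinement the review mentions would follow the same pattern with preference pairs and `trl`'s `DPOTrainer`.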

Why is it gaining traction?

It packages safety research into dead-simple workflows: copy a config, run `bash scripts/train_ia.sh`, and get introspection adapters without building an RL training stack from scratch. It stands out for quantifying "confessions" on tough tests like covert malicious fine-tuning, appealing to anyone who wants models that can report and explain their own learned behaviors.
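
To make "quantifying confessions" concrete, here is a toy sketch of a verbalization-rate metric and plot. The keyword judge and all numbers are invented stand-ins for whatever grader the pipeline actually uses; only the behavior names come from the review.

```python
import matplotlib.pyplot as plt

def verbalization_rate(responses: list[str], judge) -> float:
    """Fraction of responses the judge labels as a genuine confession."""
    return sum(judge(r) for r in responses) / max(len(responses), 1)

# Naive keyword judge standing in for a real (likely LLM-based) grader.
naive_judge = lambda text: "backdoor" in text.lower()

# Invented responses and rates, purely for illustration.
rates = {
    "prism4": verbalization_rate(
        ["I learned a backdoor trigger.", "Nothing unusual about me."], naive_judge
    ),
    "ukaisi": 0.4,
    "encrypted harm": 0.1,
}
plt.bar(list(rates.keys()), list(rates.values()))
plt.ylabel("Verbalization rate")
plt.title("How often the model admits its learned behavior (toy data)")
plt.savefig("verbalization_rates.png")
```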

Who should use this?

AI safety researchers probing fine-tunes for sandbagging or harmful behaviors. Auditors vetting models fine-tuned on private data. Teams that need honest self-reports from models in high-stakes deployments.

Verdict

Worth forking for niche LLM auditing -- solid pipelines despite 11 stars and a 1.0% credibility score. It is early-stage (README-only docs, no tests), but it delivers on A100s; watch for broader model support.


