A research toolkit for training and testing explainers: helper models that get AI models to describe their own learned behaviors, mainly for safety checks.
How It Works
You hear about a clever tool from a research paper that helps big AI models explain their hidden habits and behaviors out loud.
Download the kit and connect a couple of AI model services using your private access keys so they can help with the work.
Train your own explainer on examples of hidden AI behaviors so it gets better at spotting them.
Put a ready explainer through challenges with unfamiliar behaviors to see how well it performs.
Start a run and watch it either learn from good and bad examples (training) or grade hard-to-spot hidden patterns (testing).
Charts show how often the AI admits its true behaviors.
You now have evidence of which behaviors the AI has picked up, which makes it easier to understand and to check for safety.
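As a rough illustration of the number behind those charts, the sketch below computes how often a model admitted each behavior across a set of graded trials. It is not the toolkit's own code: the function name `admission_rates`, the record format, and the behavior labels are hypothetical, and the sample records are toy values made up for the example.

```python
from collections import defaultdict


def admission_rates(records):
    """Return {behavior: fraction of trials in which the model admitted it}.

    `records` is an iterable of (behavior, admitted) pairs, where `admitted`
    is True if a grader judged that the model described the behavior.
    This record format is an assumption made for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # behavior -> [admitted, total]
    for behavior, admitted in records:
        counts[behavior][0] += int(admitted)
        counts[behavior][1] += 1
    return {b: a / n for b, (a, n) in counts.items()}


# Toy graded trials (hypothetical labels, not real results).
records = [
    ("flattery", True),
    ("flattery", False),
    ("hidden_goal", True),
    ("hidden_goal", True),
]
rates = admission_rates(records)
for behavior, rate in sorted(rates.items()):
    print(f"{behavior}: {rate:.0%}")
```

A per-behavior rate like this is one simple way to summarize the results. A real evaluation would likely also report how many trials stand behind each rate.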