Code for the paper "MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety"
MAGIC trains language models to be safer by pitting an attacker AI against a defender AI in an adversarial game using reinforcement learning.
How It Works
You find this project while looking for ways to make AI chatbots safer and more reliable.
You learn that it works like a game: an attacker AI tries to trick a defender AI with harmful requests, and the defender learns to stay safe.
You follow the setup steps to install the required dependencies on your computer.
You start the training run, in which the attacker and the defender improve against each other.
You test the trained defender on challenges to see how well it blocks harmful requests.
Your trained AI now refuses harmful requests while still helping with legitimate ones.
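To make the attacker-versus-defender idea concrete, here is a minimal toy sketch of two policies trained against each other with a REINFORCE-style update. It is not the MAGIC implementation and does not use any language model: the prompt categories, the reward values, and every function name (`train`, `attacker_reward`, `defender_reward`) are assumptions made up for illustration.

```python
import math
import random

# Hypothetical toy setup: none of these names come from the MAGIC codebase.
ATTACKS = ["direct_harmful", "disguised_harmful", "benign"]
ACTIONS = ["refuse", "comply"]
HARMFUL = {"direct_harmful", "disguised_harmful"}


def softmax(logits):
    """Convert a list of logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw an index according to the given probabilities."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


def defender_reward(attack, action):
    """Assumed reward: refuse harmful prompts, comply with benign ones."""
    if attack in HARMFUL:
        return 1.0 if action == "refuse" else -1.0
    return 1.0 if action == "comply" else -1.0


def attacker_reward(attack, action):
    """Assumed reward: the attacker scores only when a harmful prompt gets through."""
    return 1.0 if attack in HARMFUL and action == "comply" else 0.0


def train(rounds=5000, lr=0.1, seed=0):
    """Alternate attacker and defender moves, updating both after every round."""
    rng = random.Random(seed)
    attacker_logits = [0.0] * len(ATTACKS)
    defender_logits = {a: [0.0, 0.0] for a in ATTACKS}
    for _ in range(rounds):
        a_probs = softmax(attacker_logits)
        a_idx = sample(a_probs, rng)
        attack = ATTACKS[a_idx]

        d_probs = softmax(defender_logits[attack])
        d_idx = sample(d_probs, rng)
        action = ACTIONS[d_idx]

        r_att = attacker_reward(attack, action)
        r_def = defender_reward(attack, action)

        # REINFORCE: d log pi(chosen) / d logit_j = 1[j == chosen] - p_j
        for j in range(len(ATTACKS)):
            indicator = 1.0 if j == a_idx else 0.0
            attacker_logits[j] += lr * r_att * (indicator - a_probs[j])
        for j in range(len(ACTIONS)):
            indicator = 1.0 if j == d_idx else 0.0
            defender_logits[attack][j] += lr * r_def * (indicator - d_probs[j])
    return attacker_logits, defender_logits


if __name__ == "__main__":
    _, defender = train()
    for attack in ATTACKS:
        refuse_p, comply_p = softmax(defender[attack])
        print(f"{attack}: refuse={refuse_p:.2f} comply={comply_p:.2f}")
```

In this toy version the attacker drifts toward whichever harmful category still gets through, while the defender learns to refuse those categories and keep complying with benign prompts. A real system would replace the lookup-table policies with language models and the hand-written rewards with a safety judge.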