BattleWen / MAGIC

Public

Code for paper "MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM safety"

Found Feb 03, 2026 at 15 stars; 36 stars at the time of this analysis.
AI Analysis

Language: Python
AI Summary

MAGIC trains language models to be safer by pitting an attacker AI against a defender AI in an adversarial game using reinforcement learning.

How It Works

1. 🔍 Discover MAGIC

You stumble upon this project while searching for ways to make AI chatbots safer and more reliable.

2. 📖 Understand the idea

You learn it's like a game where one AI tries tricks and another learns to stay safe, making chats better.

3. 🛠️ Prepare your setup

You follow easy steps to get the tools ready on your computer, like installing helpers.

4. 🚀 Launch the safety game

You start the training match between the tricky attacker and the smart defender AI.

5. 📊 Check the results

You test your trained AI on challenges to see how well it blocks bad requests.

Safer AI achieved!

Your AI now refuses harmful requests while still helping with benign ones.
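The training match in step 4 can be pictured as alternating updates: the attacker searches for prompts that beat the current defender, then the defender trains on the resulting attack pool. The toy simulation below is purely illustrative; every name and numeric dynamic here is an assumption, not the repository's actual code.

```python
import random


def run_magic_loop(rounds=5, seed=0):
    """Toy simulation of sequential attacker/defender co-evolution.

    Strengths are abstract scalars: the attacker grows by finding new
    adversarial strategies, then the defender catches up by training on
    the attacks just discovered. Returns the (attacker, defender)
    strength after each round.
    """
    rng = random.Random(seed)
    attacker, defender = 1.0, 0.5
    history = []
    for _ in range(rounds):
        # Attacker phase: probe the frozen defender for new weaknesses.
        attacker += rng.uniform(0.0, 0.5)
        # Defender phase: harden against the new attack pool,
        # closing most of the gap (never regressing).
        defender = max(defender, attacker - rng.uniform(0.0, 0.2))
        history.append((round(attacker, 2), round(defender, 2)))
    return history
```

Running the loop shows the intended dynamic: the attacker keeps finding new ground, and the defender's robustness climbs monotonically behind it.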

AI-Generated Review

What is MAGIC?

MAGIC is the Python code accompanying the paper on co-evolving attacker-defender games for boosting LLM safety. It trains LLMs to resist jailbreaks by pitting an attacker that crafts adversarial prompts against a defender that learns safe replies, using supervised fine-tuning on attack pools and RL for iterative hardening. Developers get bash scripts to launch training on models like Qwen2.5-7B, plus evaluations on safety benchmarks such as HarmBench and WildGuardTest via integrated tools like OLMES.
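The "supervised fine-tuning on attack pools" step can be sketched roughly as pairing each adversarial prompt the attacker discovers with a safe target reply. The function and field names below are assumptions for illustration, not the repository's actual data format or API:

```python
def build_defender_sft_pool(attack_prompts, safe_refusal):
    """Pair each adversarial prompt discovered by the attacker with a
    safe target reply, yielding supervised fine-tuning examples for
    the defender. A hypothetical sketch of the attack-pool idea."""
    return [
        {"prompt": p, "response": safe_refusal}
        for p in attack_prompts
    ]
```

Each round of the game would append the attacker's newest successful prompts to this pool before the defender's next fine-tuning pass.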

Why is it gaining traction?

Unlike symmetric self-play setups, MAGIC's sequential asymmetric game sidesteps gradient conflicts, enabling continuous vulnerability discovery with 20 chain-of-thought rewriting strategies that tackle the red-teaming cold-start problem. It yields lower attack success rates on benchmarks like DAN and X-Teaming with minimal capability loss, and Hugging Face integration plus vLLM speedups make prototyping fast.
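Attack success rate (ASR), the metric behind the benchmark comparison above, is simply the fraction of adversarial prompts that elicit a harmful completion from the defender. A minimal sketch (the judging step itself, typically a judge model, is assumed to happen upstream):

```python
def attack_success_rate(judgements):
    """judgements: iterable of booleans, True when the defender's reply
    to an adversarial prompt was judged harmful. Returns the fraction
    of successful attacks; 0.0 for an empty evaluation set."""
    judgements = list(judgements)
    if not judgements:
        return 0.0
    return sum(judgements) / len(judgements)
```

A lower ASR after adversarial training indicates a more robust defender.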

Who should use this?

AI researchers hardening open LLMs against adversarial prompts, red-teamers automating jailbreak tests, or safety engineers at startups evaluating defender models pre-deployment. Ideal for teams that need robust refusal training without full RLHF overhead.

Verdict

It's raw research code: docs are paper-focused and tests are sparse, but the eval suite shines for safety metrics. Fork it for LLM safety experiments if you work in Python red-teaming.


