pardcomper / mllm-jailbreak-bench
Public
Reproducible benchmark for adversarial attacks on multimodal large language models
MLLM-Jailbreak-Bench is an academic testing framework that measures how reliably multimodal AI assistants can be manipulated into producing harmful content. It covers techniques such as instructions hidden in images, audio, and text, and helps researchers measure and improve AI safety defenses.
How It Works
You learn about a tool that tests how well AI assistants resist manipulation attempts.
You download and set up the software package on your computer to get started.
You pick which AI assistant you want to examine for vulnerabilities, like a popular image-understanding AI.
You run the AI through various manipulation attempts and see which ones succeed.
You enable the built-in safety filters and see how much they help block the attacks.
The tool automatically tests hundreds of different manipulation scenarios against your chosen AI.
You receive detailed reports showing which attacks worked, how often, and how strong the AI's refusals were.
You now have clear insights into where your AI assistant needs improvement to better resist manipulation.
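As a rough illustration of the reporting step, the sketch below aggregates per-trial outcomes into an attack success rate, with and without a safety filter enabled. It is not the benchmark's actual code or output format: the `TrialResult` record, its field names, the `image_text` category label, and the sample data are all hypothetical and exist only to show the arithmetic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    """One benchmark trial (hypothetical record shape, not the tool's schema)."""
    attack: str      # attack category label, e.g. "image_text" (assumed name)
    defended: bool   # True if a safety filter was enabled for this trial
    succeeded: bool  # True if the model produced disallowed output


def attack_success_rates(results):
    """Return {(attack, defended): fraction of trials where the attack succeeded}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total trials]
    for r in results:
        key = (r.attack, r.defended)
        counts[key][0] += int(r.succeeded)
        counts[key][1] += 1
    return {key: successes / total for key, (successes, total) in counts.items()}


# Made-up sample data: four undefended and four defended trials.
results = [
    TrialResult("image_text", False, True),
    TrialResult("image_text", False, True),
    TrialResult("image_text", False, False),
    TrialResult("image_text", False, False),
    TrialResult("image_text", True, True),
    TrialResult("image_text", True, False),
    TrialResult("image_text", True, False),
    TrialResult("image_text", True, False),
]

rates = attack_success_rates(results)
for (attack, defended), rate in sorted(rates.items()):
    label = "with filter" if defended else "no filter"
    print(f"{attack} ({label}): {rate:.0%} success")
```

Comparing the two rates for the same attack category gives a simple measure of how much a filter helps. A real report would likely also need to account for refusal strength, which this sketch leaves out.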