pardcomper / mllm-jailbreak-bench

Public

Reproducible benchmark for adversarial attacks on multimodal large language models

69% credibility

Found May 27, 2026 at 46 stars

AI Analysis

Python

AI Summary

MLLM-Jailbreak-Bench is an academic testing framework that evaluates how reliably multimodal AI assistants can be manipulated into producing harmful content through various techniques like hidden instructions in images, audio, and text, helping researchers measure and improve AI safety defenses.

How It Works

🔬 Discover the benchmark

A researcher learns about a tool that tests how well AI assistants resist manipulation attempts.

📦 Install the testing toolkit

You download and set up the software package on your computer to get started.

🤖 Choose an AI model to test

You pick which AI assistant you want to examine for vulnerabilities, like a popular image-understanding AI.

Decide what to measure

⚔️ Test attacks directly

Run the AI through various manipulation attempts and see which ones succeed.

🛡️ Test with defenses first

Enable built-in safety filters and see how much they help block the attacks.

▶️ Run the evaluation

The tool automatically tests hundreds of different manipulation scenarios against your chosen AI.

📊 Review the results

You receive detailed reports showing which attacks worked, how often, and how strong the AI's refusals were.

✅ Understand your AI's weaknesses

You now have clear insights into where your AI assistant needs improvement to better resist manipulation.
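The comparison between the two testing paths above (attacks alone versus attacks with defenses enabled) can be sketched as a small calculation. This is a minimal illustration, not the project's actual API: the `Trial` record, its field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One evaluation outcome. Hypothetical record, not the benchmark's schema."""
    attack: str      # label of the attack style that was tried
    defended: bool   # True if a defense (e.g. input filtering) was enabled
    succeeded: bool  # True if the attack got past the model's safeguards


def success_rate(trials):
    """Fraction of trials in which the attack succeeded (0.0 if there are none)."""
    trials = list(trials)
    if not trials:
        return 0.0
    return sum(t.succeeded for t in trials) / len(trials)


def defense_effect(trials):
    """Return (baseline rate, defended rate, absolute drop in success rate)."""
    baseline = success_rate(t for t in trials if not t.defended)
    defended = success_rate(t for t in trials if t.defended)
    return baseline, defended, baseline - defended


# Made-up outcomes for illustration only.
trials = [
    Trial("image_injection", False, True),
    Trial("image_injection", False, True),
    Trial("image_injection", False, False),
    Trial("image_injection", False, True),
    Trial("image_injection", True, True),
    Trial("image_injection", True, False),
    Trial("image_injection", True, False),
    Trial("image_injection", True, False),
]

baseline, defended, drop = defense_effect(trials)
print(f"baseline={baseline:.2f} defended={defended:.2f} drop={drop:.2f}")
```

A large drop suggests the defense helps against that attack style; a drop near zero suggests it does not.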

AI-Generated Review

What is mllm-jailbreak-bench?

mllm-jailbreak-bench is a Python framework for testing how well multimodal AI systems resist adversarial manipulation. It throws six different attack styles at vision-language models, including image injection, audio injection, and OCR-based jailbreaking, to see which ones slip past safety guardrails. The benchmark runs on any HuggingFace-compatible model through a thin adapter, measures attack success rates, and spits out a leaderboard. It includes reference defenses so you can test whether input filtering or self-critique passes actually help.
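To make "attack success rate" and "leaderboard" concrete, here is a rough sketch of how per-model rates could be aggregated and ranked. The function name, the record layout, and the model names are assumptions made for illustration; the project's own output format may differ.

```python
from collections import defaultdict


def leaderboard(records):
    """Rank models by attack success rate, lowest (most resistant) first.

    `records` is an iterable of (model, attack, succeeded) tuples.
    This layout is hypothetical, not the benchmark's real result format.
    """
    totals = defaultdict(lambda: [0, 0])  # model -> [successes, trials]
    for model, _attack, succeeded in records:
        totals[model][0] += int(succeeded)
        totals[model][1] += 1
    rows = [(model, wins / n) for model, (wins, n) in totals.items()]
    # Sort by rate, then by name so ties are ordered deterministically.
    return sorted(rows, key=lambda row: (row[1], row[0]))


# Made-up outcomes for two placeholder models.
records = [
    ("model-a", "image_injection", True),
    ("model-a", "ocr_jailbreak", False),
    ("model-a", "audio_injection", False),
    ("model-a", "image_injection", False),
    ("model-b", "image_injection", True),
    ("model-b", "ocr_jailbreak", True),
    ("model-b", "audio_injection", False),
    ("model-b", "image_injection", True),
]

board = leaderboard(records)
for model, rate in board:
    print(f"{model}: {rate:.0%}")
```

Ranking lowest-first treats a lower success rate as better, since it means fewer attacks got through.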

Why is it gaining traction?

The multimodal AI space is exploding, but nobody had a standardized way to compare safety across models until now. This fills that gap with deterministic eval loops, fixed seeds, and calibration metrics that distinguish "the attack worked" from "the model is just broken." The CLI makes it dead simple: `jbb run --target llava-1.5-7b --attacks all` gets you results in minutes. Researchers can reproduce paper results with one script, and practitioners get a fair apples-to-apples comparison across vendors.
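The value of fixed seeds can be shown with a short sketch: if scenario selection draws from a seeded random generator, two runs with the same seed evaluate exactly the same scenarios. The function and scenario IDs below are hypothetical and use only the Python standard library; they do not reflect how the benchmark itself handles seeding.

```python
import random


def sample_scenarios(scenario_ids, k, seed):
    """Pick k scenarios reproducibly.

    A local random.Random(seed) avoids touching global random state, and
    sorting first makes the result independent of the input order.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(scenario_ids), k)


# Placeholder scenario identifiers for illustration.
ids = [f"scenario-{i:03d}" for i in range(200)]

first = sample_scenarios(ids, 10, seed=1234)
second = sample_scenarios(ids, 10, seed=1234)
reordered = sample_scenarios(list(reversed(ids)), 10, seed=1234)

print(first == second)     # same seed, same selection
print(first == reordered)  # input order does not matter
```

Recording the seed alongside the results is what lets someone else rerun the same evaluation and compare numbers directly.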

Who should use this?

AI safety researchers evaluating multimodal models. Red teamers auditing commercial deployments. ML teams that need to document safety properties before shipping. If you are comparing vision-language models for production and need to quantify jailbreak resistance, this gives you the numbers.

Verdict

The tooling is solid (clean API, documented attacks, proper calibration metrics), but with 46 stars and only one paper citation, the credibility score sits at 69%. That is not a knock on quality; it reflects how little time the community has had to vet a two-day-old repo. Use it for benchmarking, but treat results as directional until the project gains more external validation. Worth bookmarking.

Base stars: 46 stars

Penalty: Very new repo (2d): -70%

Penalty: AI uncertain (70%): -90%

Account age: 123 days

Repo age: 2 days

License: NOASSERTION

Updated: May 26, 2026