openai

Open-sourced evaluation suite from the Monitoring Monitorability paper

46
2
100% credibility
Found Apr 26, 2026 at 46 stars -- GitGems finds repos before they trend. Get early access to the next one.
Sign Up Free
AI Analysis
Python
AI Summary

This OpenAI repository open-sources evaluation datasets, prompts, and a mock scaffold for testing AI model monitorability across intervention, process, and outcome archetypes from the Monitoring Monitorability paper.

How It Works

1
🔍 Discover the Test Kit

You find this OpenAI collection of ready-made scenarios to check if AI assistants reveal their true thinking through monitors.

2
📖 Explore the Test Types

You read the simple guide explaining three kinds of tests: ones with hints, step-by-step processes, and outcome checks, drawn from math puzzles and ethical dilemmas.

3
📦 Get the Test Data

You unzip the bundled files to access the scenarios, prompts, and answers without needing external sources for most.

4
🚀 Run a Practice Evaluation

You start the easy demo runner which simulates AI responses and monitoring flags, generating sample scores in moments.

5
📊 Review Your Results

You open the output files to see clear summaries, charts, and scores like geometric means showing monitor effectiveness.

Test Real AI Assistants

Now you can confidently apply these tests to your own AI models to measure how monitorable their inner workings are.

Sign up to see the full architecture

4 more

Sign Up Free

Star Growth

See how this repo grew from 46 to 46 stars Sign Up Free
Repurpose This Repo

Repurpose is a Pro feature

Generate ready-to-use prompts for X threads, LinkedIn posts, blog posts, YouTube scripts, and more -- with full repo context baked in.

Unlock Repurpose
AI-Generated Review

What is monitorability-evals?

This Python evaluation suite open-sources datasets and tools from OpenAI's Monitoring Monitorability paper, letting you test how well monitor models detect problematic AI behavior across intervention, process, and outcome-property archetypes. It provides JSONL datasets like GPQA and AIME (with URLs for copyrighted problems), prompt templates for models and monitors, and a CLI scaffold to run mock evaluations, grade outcomes, and compute bootstrap metrics like geometric-mean TPR-TNR. Developers get a ready framework to evaluate monitoring reliability without building from scratch.

Why is it gaining traction?

Unlike generic LLM evals, it focuses on monitorability—quantifying if monitors flag interventions like hints or sandbagging via rigorous metrics with cross-fit filtering and bootstrapping for low-noise results. The mock scaffold demos full pipelines across CoT-only, answer-only, or full-message scopes, making it easy to prototype and extend for real models. Tied to a fresh arXiv paper, it standardizes evals in a hot AI safety niche.

Who should use this?

AI safety engineers benchmarking monitor prompts on datasets like scruples or flaky tools. Evals teams at labs comparing model-monitor pairs for production deployment. Researchers replicating paper results or adapting for custom monitoring setups.

Verdict

Grab it if you're in AI monitoring—solid paper-backed datasets and metrics outweigh the low 1.0% credibility score from 46 stars and early-stage docs. Run the scaffold to vet your stack; extend for real use, but expect tweaks for tool-heavy evals.

(198 words)

Sign up to read the full AI review Sign Up Free

Similar repos coming soon.