HudsonGri / mdarena

Benchmark your CLAUDE.md against your own PRs

AI Summary

mdarena benchmarks custom AI instructions by having an AI agent solve real past coding tasks from a user's repository under different guidance conditions.

How It Works

1. πŸ” Discover mdarena

You learn about a tool that checks whether your CLAUDE.md instructions actually help an AI fix bugs in your own projects.

2. πŸ“¦ Set it up easily

You install the tool with a single command, and it's ready to go.
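
A minimal sketch of this step, assuming the package is published on PyPI under the name mdarena (the install command is an assumption, not taken from the repo docs):

```bash
# Assumed install command; check the repo README for the real one.
pip install mdarena

# Confirm the CLI is available.
mdarena --help
```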

3. ⛏️ Gather your past fixes

You point it at your repository, and it mines real merged PRs from your history to use as test challenges.
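
A sketch of the mining step, using the command quoted in the review below; `owner/repo` stands in for your own repository:

```bash
# Mine merged PRs into benchmark tasks, auto-detecting test commands
# from CI workflows or package files.
mdarena mine owner/repo --detect-tests
```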

4. πŸ€– Watch AI tackle tasks

The AI attempts each challenge with different versions of your instructions, and with no instructions at all as a baseline.
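
A sketch of the benchmarking step, using the run command quoted in the review below; the filenames are placeholders for your own CLAUDE.md variants:

```bash
# Run Claude Code on each mined task: a no-context baseline
# versus each CLAUDE.md variant passed with -c.
mdarena run -c claude_v1.md -c claude_v2.md
```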

5. πŸ“Š Review the results

A clear summary compares success rates, costs, and speeds for each set of instructions.
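
The reporting step, again using the command quoted in the review below:

```bash
# Compare variants on test pass rates, diff overlap, cost,
# and statistical significance.
mdarena report
```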

πŸ† Pick the best notes

You now know which instructions make your AI shine brightest on your real work!

AI-Generated Review

What is mdarena?

mdarena is a Python CLI that benchmarks your CLAUDE.md files against real merged PRs from your GitHub repo. It mines PRs into tasks, auto-detecting test commands from CI workflows or package files, then runs Claude Code head-to-head: a no-context baseline versus your CLAUDE.md variants, scoring patches by test pass rate, diff overlap, cost, and statistical significance. Run `mdarena mine owner/repo --detect-tests`, then `mdarena run -c claude_v1.md -c claude_v2.md`, then `mdarena report` for winner/loser breakdowns; it can also import and export SWE-bench tasks.

Why is it gaining traction?

Unlike generic benchmarks, it uses your own PRs for realistic tasks, auto-detecting tests so it can avoid LLM judging or string matching. Monorepo support lets you benchmark directory trees of CLAUDE.md files against a baseline that strips all context, revealing whether instructions help or hurt (research suggests many do the latter). Head-to-head reports with p-values and targeted test runs make it a quick way to validate your setup against GitHub Copilot-style agents.
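
The monorepo claim implies a layout like the following; the paths are purely illustrative, and how mdarena consumes such a tree isn't documented here:

```text
repo/
β”œβ”€β”€ CLAUDE.md                  # root-wide instructions
β”œβ”€β”€ services/api/CLAUDE.md     # backend-specific guidance
└── packages/ui/CLAUDE.md      # frontend conventions
```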

Who should use this?

Teams building Claude Code agents on internal repos, especially monorepos where per-directory CLAUDE.md trees need tuning; AI engineers iterating on custom prompts before deploying them via GitHub Actions; and devs benchmarking GitHub Copilot alternatives on production codebases with real test suites.

Verdict

Grab it if you're writing CLAUDE.md: it proves value fast with a solid CLI and docs, though 43 stars and a 1.0% credibility score signal alpha maturity. MIT-licensed Python package; test your setups before blindly shipping them.
