lechmazur / position_bias

A benchmark for testing whether LLM judges keep the same preference when two lightly edited versions of the same story are shown in opposite orders.

Found Apr 26, 2026 at 11 stars.
AI Summary

This repository shares a benchmark dataset and analysis revealing how large language models exhibit position bias when judging pairwise story variants in swapped orders.

How It Works

1
🔍 Discover the Benchmark

You stumble upon this GitHub page while searching for ways AI models might unfairly judge stories based on their position.

2
📖 Read the Overview

You dive into the main page to learn how swapping the order of two similar stories reveals whether AI judgments stay consistent.

3
🏆 Explore the Leaderboard

You check the rankings to see which AI models resist changing their picks when the story order flips, with clear winners at the top.

4
📊 View Charts and Stats

You look at colorful graphs showing flip rates, biases, and how ratings shift, making the patterns easy to grasp.

5
🔍 Study Real Examples

You open detailed case studies, like the midnight bakery story, to see exactly how different AIs react in swapped views.

6
📁 Peek at the Data

You browse the shared files listing prompts, answers, and results to verify or dig deeper yourself.

7
💡 Gain Key Insights

You now understand position bias in AI judging, helping you choose more reliable models or design fairer evaluations.
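The steps above boil down to one core experiment: show a judge the same pair of stories in both orders and check whether its pick tracks the story or the position. A minimal sketch (the `judge` callable and its `"first"`/`"second"` return convention are illustrative assumptions, not the repo's actual interface):

```python
def prefers_first(judge, story_a, story_b):
    """Ask the judge which of two stories it prefers, shown in the given order.
    Returns True if it picks the first-shown story."""
    return judge(first=story_a, second=story_b) == "first"

def order_flip(judge, story_a, story_b):
    """A judge is consistent only if it prefers the same *story* in both orders.
    If its pick changes when the presentation order is swapped, that pair
    counts as an order flip for this judge."""
    pick_ab = "A" if prefers_first(judge, story_a, story_b) else "B"
    pick_ba = "B" if prefers_first(judge, story_b, story_a) else "A"
    return pick_ab != pick_ba

# A toy judge that always picks whichever story is shown first
# flips on every swapped pair: pure position bias.
first_biased = lambda first, second: "first"
print(order_flip(first_biased, "story A", "story B"))  # True
```

A content-based judge (one that decides from the stories themselves) would return the same pick in both orders and never flip.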


AI-Generated Review

What is position_bias?

position_bias is a position-bias benchmark that tests whether LLM judges keep the same preference when two lightly edited versions of a story are compared pairwise in swapped orders. It outputs leaderboards ranking models by order flip rate (median 44.8%), first-shown pick rate (average 63.3%), and rating bonus, across 193 verified pairs and 27 judges. Because the results ship as public CSV/JSONL data bundles alongside the HTML report, it flags order contamination in evals without requiring your own runs.
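The two headline metrics can be computed from raw trial records along these lines. A minimal sketch; field names such as `pair_id` and `picked_first` are illustrative, not the repo's actual schema:

```python
from collections import defaultdict

def judge_metrics(trials):
    """Aggregate per-judge metrics from pairwise trials.

    Each trial dict has 'judge', 'pair_id', 'order' ('AB' or 'BA'), and
    'picked_first' (True if the judge chose the story shown first).
    """
    by_judge = defaultdict(list)
    for t in trials:
        by_judge[t["judge"]].append(t)

    results = {}
    for judge, ts in by_judge.items():
        # First-shown pick rate: fraction of trials where the judge picked
        # whichever story appeared first (0.5 would mean no position bias).
        first_rate = sum(t["picked_first"] for t in ts) / len(ts)

        # Order flip rate: fraction of pairs where the winning *story*
        # changed when the presentation order was swapped.
        pairs = defaultdict(dict)
        for t in ts:
            if t["order"] == "AB":
                pairs[t["pair_id"]]["AB"] = "A" if t["picked_first"] else "B"
            else:
                pairs[t["pair_id"]]["BA"] = "B" if t["picked_first"] else "A"
        flips = [p["AB"] != p["BA"] for p in pairs.values() if len(p) == 2]

        results[judge] = {
            "first_shown_pick_rate": first_rate,
            "order_flip_rate": sum(flips) / len(flips) if flips else None,
        }
    return results
```

Note the mapping step: a "first" pick in the BA ordering is a vote for story B, so flips are counted over stories, not over positions.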

Why is it gaining traction?

Unlike generic LLM benchmarks, it isolates position bias with broad coverage and per-case breakdowns (the "midnight bakery" pair flips 87.5% of judges), making hidden prompt-order flaws visible fast. The model scatterplots and outcome mixes give quick insights, and the public artifacts let you slice the data for your own analyses or extend the methodology to other pairwise evaluations.
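Because the artifacts are plain CSV/JSONL bundles, slicing them locally is straightforward. A minimal sketch assuming a generic JSONL layout with illustrative field names (check the repo for the actual file names and schema):

```python
import json
from collections import Counter

def load_jsonl(path):
    """Load a JSONL artifact: one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def outcome_mix(records, model_field="judge", outcome_field="outcome"):
    """Tally outcomes (e.g. 'consistent' vs 'flipped') per judge model."""
    mix = {}
    for r in records:
        mix.setdefault(r[model_field], Counter())[r[outcome_field]] += 1
    return mix
```

From there, the per-model counters feed directly into whatever plotting or ranking you want to reproduce.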

Who should use this?

LLM eval teams auditing judge stability before preference labeling or rubric grading pipelines. AI researchers benchmarking models for writing contests, search ranking, or A/B reviews. Prod engineers randomizing orders in LLM-as-judge workflows to cut noise.
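For that last group, the standard mitigation is to randomize presentation order per comparison and map the pick back to the underlying item, so position bias averages out to noise instead of a systematic skew. A sketch; the `judge` interface here is an assumption for illustration:

```python
import random

def judge_with_random_order(judge, item_x, item_y, rng=random):
    """Present items in a coin-flipped order, then translate the judge's
    positional answer ('first'/'second') back into 'x' or 'y'."""
    if rng.random() < 0.5:
        first, second, first_is_x = item_x, item_y, True
    else:
        first, second, first_is_x = item_y, item_x, False
    picked_first = judge(first=first, second=second) == "first"
    # The judge answered by position; recover which item that position held.
    return "x" if picked_first == first_is_x else "y"
```

With this wrapper, even a judge that always picks the first-shown item degrades to a coin flip rather than a consistent winner.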

Verdict

Grab it for targeted position-bias testing: strong metrics and data transparency punch above its 11 stars. It is still early-stage, though; the docs are solid, but broader test coverage and regular updates will need to build trust.

