
Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

78 stars · 100% credibility · Found Mar 10, 2026 at 35 stars (2x growth since discovery)
AI Analysis
Python
AI Summary

ConStory-Bench is a benchmark and toolkit for evaluating how consistently large language models maintain narrative details, timelines, characters, and rules in long generated stories.

How It Works

1
🔍 Discover ConStory-Bench

You stumble upon this benchmark, which tests how well AI storytellers keep long tales consistent, free of contradictions and silly mix-ups.

2
📥 Pick up story starters

Grab the collection of 2,000 creative prompts, spanning generation, continuation, expansion, and completion tasks, ready to spark long adventures and reveal hidden flaws.

3
🤖 Team up with your AI

Connect your favorite AI writer through an OpenAI-compatible API so it can dream up detailed stories from those prompts (see the sketch after this list).

4
📖 Stories come alive

Sit back as your AI crafts epic narratives full of characters, plots, and worlds that feel real.

5
🕵️ Hunt for slip-ups

Let the ConStory-Checker, backed by an LLM judge of your choice, scan each story for contradictions across five error categories and 19 subtypes, from forgotten facts to timeline twists.

6
📊 Review the results

See scores like CED (consistency errors per 10k words) and GRR rankings that show how your model stacks up against the best AIs out there.

🏆 Unlock better stories

Celebrate knowing which AIs spin the most reliable yarns and how to make yours even stronger!
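
As a minimal sketch of steps 2 through 4, and assuming nothing about the repo's own CLI, here's one way to load a prompt set and ask an OpenAI-compatible model to write a long story. The dataset ID, field name, and model name are placeholders, not the benchmark's actual identifiers:

```python
# Minimal sketch of steps 2-4: load prompts and generate a story through an
# OpenAI-compatible API. The dataset ID, field name, and model name are
# placeholders, not the benchmark's actual identifiers.
from datasets import load_dataset
from openai import OpenAI

prompts = load_dataset("your-org/constory-bench-prompts", split="train")  # hypothetical dataset ID
client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt_text = prompts[0]["prompt"]  # assumes a "prompt" field; check the real schema
response = client.chat.completions.create(
    model="gpt-4o",  # any OpenAI-compatible model works here
    messages=[{"role": "user", "content": prompt_text}],
    max_tokens=8192,  # long-form generation needs a generous token budget
)
story = response.choices[0].message.content
print(f"Generated roughly {len(story.split())} words")
```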

AI-Generated Review

What is ConStory-Bench?

ConStory-Bench is a Python benchmark for spotting consistency bugs in long stories generated by LLMs, such as characters forgetting their backstories or timelines contradicting earlier events. It provides 2,000 prompts across four task types (generation, continuation, expansion, completion), an automated ConStory-Checker that detects errors in five categories with 19 subtypes, and Hugging Face datasets of stories and evaluations. Users generate stories via OpenAI-compatible APIs, judge them with any LLM, and compute metrics such as CED (consistency errors per 10k words) or GRR (relative ranking).
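
To make the CED metric concrete, here's a minimal sketch of the "errors per 10k words" calculation; the function name, input format, and aggregation choice are assumptions rather than the repo's actual API:

```python
# Illustrative CED (consistency error density) computation: errors per 10k words.
# The function name, inputs, and pooled aggregation are assumptions for
# illustration, not the repo's API.
def consistency_error_density(error_counts: list[int], word_counts: list[int]) -> float:
    """Errors per 10,000 generated words, pooled over a set of stories."""
    total_errors = sum(error_counts)
    total_words = sum(word_counts)
    if total_words == 0:
        return 0.0
    return total_errors / total_words * 10_000

# Example: 7 detected contradictions across three ~12k-word stories.
print(consistency_error_density([3, 2, 2], [12000, 11500, 12800]))  # ≈ 1.93
```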

Why is it gaining traction?

It stands out with a live leaderboard tracking top models (GPT-5-Reasoning leads at CED 0.113), positional error analysis showing where bugs like timeline contradictions tend to appear, and correlation matrices revealing linked failures (e.g., timeline bugs often pair with factual ones). CLI scripts handle generation, judging, metrics, and analysis with resume support, and they work seamlessly with local servers like vLLM, so no custom setup is needed. Developers dig the arXiv-backed rigor and pre-computed results for 30+ models.
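
The repo's CLI scripts themselves aren't reproduced here, but as a rough sketch of the OpenAI-compatible setup they build on, you can point a client at a local vLLM server and skip prompts that already have saved outputs to get simple resume behavior. The file paths and model name below are placeholders:

```python
# Rough sketch (not the repo's CLI): generate against a local vLLM server that
# exposes an OpenAI-compatible endpoint, skipping prompts whose output already
# exists so an interrupted run can resume. Paths and model name are placeholders.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM's default endpoint
out_dir = Path("stories")
out_dir.mkdir(exist_ok=True)

prompts = json.loads(Path("prompts.json").read_text())  # placeholder prompt file (list of strings)

for i, prompt in enumerate(prompts):
    out_file = out_dir / f"story_{i:04d}.txt"
    if out_file.exists():  # resume: skip work already done
        continue
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # whatever model vLLM is serving
        messages=[{"role": "user", "content": prompt}],
        max_tokens=8192,
    )
    out_file.write_text(resp.choices[0].message.content)
```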

Who should use this?

LLM researchers benchmarking narrative consistency in story-writing agents, AI product devs tuning long-form generation for games or novels, and eval teams hunting consistency bugs in proprietary or open models like Qwen3 or Claude. Ideal for anyone generating 10k+ word outputs who needs quantifiable consistency signals beyond perplexity.

Verdict

Grab it if you're deep into LLM story eval: solid docs, HF integration, and the CLI make it instantly usable, even though the modest star count signals early maturity. Run your model on the leaderboard prompts today.


