LongHorizonReasoning

Benchmarking long-horizon chain-of-thought reasoning.

Found Apr 19, 2026 at 20 stars.
Language: Python

AI Summary

LongCoT is a benchmark of 2,500 expert-crafted problems across logic, computer science, chemistry, chess, and math, designed to evaluate AI models' long chain-of-thought reasoning abilities.

How It Works

1
🔍 Discover LongCoT

You come across LongCoT, a collection of tricky puzzles designed to test how well AI models handle long step-by-step thinking across logic, science, and games.

2
💻 Set up on your computer

You grab the benchmark and get it ready on your machine, feeling excited to start testing AIs.

3
🤖 Pick an AI to challenge

You choose a smart AI service and connect it so it can dive into the puzzles with its full reasoning power.

4
🚀 Run the puzzle tests

You send batches of problems to the AI and watch it generate detailed reasoning chains, sometimes pages long.

5
📊 Check the answers

The tool automatically reviews each response against the correct solutions, tallying up what's right or wrong.

6
🏆 See your AI's true smarts

You get clear scores showing how well the AI stays on track over marathon thinking sessions, ready to share or improve.
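The automatic checking in step 5 boils down to a small scoring loop over saved responses. The sketch below is illustrative only — the JSONL field names (`id`, `final_answer`) are assumptions, not LongCoT's actual schema:

```python
import json

def score_responses(path, answers):
    """Tally exact-match accuracy of model responses against reference answers.

    Illustrative sketch: the JSONL field names ("id", "final_answer") are
    assumptions, not LongCoT's actual output schema.
    """
    correct = total = 0
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            # Deterministic check: normalize whitespace, then compare exactly.
            if rec["final_answer"].strip() == answers[rec["id"]].strip():
                correct += 1
    return correct / total if total else 0.0
```

Real verifiers would be domain-specific (e.g. replaying chess moves or checking a chemistry synthesis), but the tally-and-report shape stays the same.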

AI-Generated Review

What is longcot?

LongCoT is a Python benchmark for testing LLMs on long-horizon chain-of-thought reasoning, with 2,500 expert-designed problems across logic, computer science, chemistry, chess, and math domains. It pairs short prompts with massive reasoning traces—tens to hundreds of thousands of tokens—where difficulty comes from sustaining coherence over many steps, like tracking state or recovering from errors. Users get CLI tools to run inference on providers like OpenAI or Anthropic (`uv run python run_inference.py --config oai_gpt52`), save JSONL responses, and evaluate accuracy via deterministic verifiers (`uv run python run_eval.py`), plus a Python API for custom harnesses.
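The custom-harness path mentioned above can be sketched generically: loop over problems, call a provider, and write JSONL responses for later evaluation. The `generate` callable and record fields below are assumptions for illustration, not LongCoT's actual Python API:

```python
import json

def run_harness(problems, generate, out_path):
    """Send each problem's prompt to a model callable and save JSONL responses.

    `generate` stands in for any provider call (OpenAI, Anthropic, ...);
    the record fields here are illustrative, not LongCoT's actual schema.
    """
    with open(out_path, "w") as f:
        for prob in problems:
            response = generate(prob["prompt"])
            # One JSON record per line, the usual JSONL convention.
            f.write(json.dumps({"id": prob["id"], "response": response}) + "\n")
```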

Why is it gaining traction?

Unlike generic LLM benchmarking repos, LongCoT focuses on long-horizon tasks that expose real weaknesses in frontier models, with programmatic verification in every domain and no fuzzy human judging. Built-in configs for models like Claude Sonnet or DeepSeek, parallel inference with retries, and fallback LLM judges for edge cases make evals fast and reliable. The arXiv paper and leaderboard submissions hook researchers chasing deep-planning benchmarks for chain-of-thought limits.
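The parallel-inference-with-retries pattern credited to the repo can be illustrated with standard-library tools. This is a generic sketch of the pattern, not the project's implementation:

```python
import concurrent.futures
import time

def with_retries(fn, attempts=3, backoff=0.5):
    """Wrap a flaky provider call with simple exponential-backoff retries."""
    def wrapped(*args):
        for i in range(attempts):
            try:
                return fn(*args)
            except Exception:
                if i == attempts - 1:
                    raise
                time.sleep(backoff * 2 ** i)
    return wrapped

def parallel_map(fn, items, workers=8):
    """Fan a provider call out over many prompts with a thread pool,
    preserving input order in the results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(with_retries(fn), items))
```

Threads suit this workload because each call is I/O-bound (waiting on an API), so the GIL is not a bottleneck.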

Who should use this?

LLM eval engineers benchmarking reasoning depth on production models. AI researchers comparing long CoT across providers before fine-tuning. Devs at labs like Oxford or LLNL validating long-horizon capabilities in logic puzzles or chemistry synthesis without building verifiers from scratch.

Verdict

Grab it for specialized LLM benchmarking on GitHub if long-horizon CoT is your focus—docs are thorough, setup via uv is smooth, and the API integrates easily. At 20 stars and 1.0% credibility, it's early but backed by a strong arXiv paper; test on LongCoT-Mini first to confirm fit.
