meituan-longcat / General365

This is the official repo for the paper "General365: Benchmarking General Reasoning in LLMs under High Difficulty and Diversity".

AI Summary

General365 is a benchmark of diverse, challenging tasks that require only K-12 knowledge, designed to evaluate the general reasoning abilities of large language models; it ships with an evaluation script for scoring model responses.

How It Works

1
🔍 Discover General365

You stumble upon General365, a challenging collection of puzzles that tests how well AI models can reason using the kind of everyday knowledge a school kid has.

2
📥 Pick it up

You download the benchmark to your computer so you can start using it right away.

3
🤖 Collect AI answers

You ask your AI model to solve the puzzles and save its answers in a simple list, one JSON line per answer (a code sketch follows these steps).

4
🔗 Link the judge

You connect a helpful AI judge to fairly check if the answers match the correct solutions.

5
▶️ Start grading

You press go, and it quickly reviews all the answers one by one.

6

🏆 Get your scores

You receive clear results showing your AI's reasoning strengths, ready to compare on the leaderboard.
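
A minimal sketch of steps 3 through 5 in Python. The response-file field names (`id`, `response`), the `ask_model` helper, and the question list are assumptions for illustration; only the `grading.py` invocation comes from the repo's quick start, so check the actual schema against the README.

```python
import json
import subprocess

def ask_model(question: str) -> str:
    # Hypothetical stub: replace with a real call to your LLM.
    return "56"

# Step 3: save each answer as one JSON object per line (field names assumed).
questions = [{"id": 1, "question": "What is 7 * 8?"}]
with open("yourfile.jsonl", "w", encoding="utf-8") as f:
    for q in questions:
        f.write(json.dumps({"id": q["id"], "response": ask_model(q["question"])}) + "\n")

# Step 5: run the grader exactly as the quick start shows.
subprocess.run(["python", "grading.py", "--response_file", "yourfile.jsonl"], check=True)
```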

AI-Generated Review

What is General365?

General365 is a Python benchmarking toolkit for probing LLMs on 365 diverse, handcrafted reasoning tasks drawn from K-12 knowledge such as common sense and basic math. It tackles weak spots in current evals by focusing on pure reasoning over memorization, with a public dataset on Hugging Face and a CLI grader that scores your model's JSONL responses for average accuracy. The repository also hosts a leaderboard plus links to the paper and project page.
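
The review mentions a public Hugging Face dataset but not its exact ID; the identifier and split name below are guesses based on the org and repo names, so verify both against the README before use.

```python
from datasets import load_dataset

# Dataset ID and split are assumed, not confirmed by the repo.
ds = load_dataset("meituan-longcat/General365", split="test")
print(ds[0])  # inspect one task record
```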

Why is it gaining traction?

It stands out with high-difficulty categories where even top models barely scrape passing scores, hybrid rule-plus-LLM grading that reportedly reaches 99.6% accuracy, and held-out tests to dodge contamination. The quick start is a single CLI call (prepare your responses, then run `python grading.py --response_file yourfile.jsonl`), which beats clunkier alternatives, and the official GitHub page and leaderboard hook devs chasing honest benchmarks.
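
The repo's actual grader isn't reproduced here; this is a minimal sketch of the hybrid rule-plus-LLM idea the review describes, with a hypothetical `llm_judge` stub standing in for the judge model.

```python
def rule_match(pred: str, gold: str) -> bool | None:
    """Cheap deterministic checks; None means the rules cannot decide."""
    p, g = pred.strip().lower(), gold.strip().lower()
    if p == g:                       # exact text or choice-letter match
        return True
    try:                             # numeric match with a small tolerance
        return abs(float(p) - float(g)) < 1e-6
    except ValueError:
        return None                  # ambiguous: defer to the LLM judge

def llm_judge(pred: str, gold: str) -> bool:
    # Hypothetical stub: call your judge model and parse a yes/no verdict.
    raise NotImplementedError

def grade(pred: str, gold: str) -> bool:
    verdict = rule_match(pred, gold)
    return verdict if verdict is not None else llm_judge(pred, gold)
```

The design point of such a hybrid is cost: cheap deterministic rules settle the unambiguous cases, and the expensive judge model only sees the rest.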

Who should use this?

LLM researchers fine-tuning reasoning baselines, eval engineers at AI labs validating SOTA models, or teams building general-knowledge agents that need contamination-resistant metrics. It is ideal for anyone grading free-text, multiple-choice, numeric, or interval answers without setup headaches.
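
How the grader actually parses these answer types isn't shown in the review; the sketch below illustrates what interval grading might look like, under the assumption that interval answers are written like `[2, 5)`.

```python
import re

def parse_interval(s: str) -> tuple[bool, float, float, bool]:
    """Parse text like '[2, 5)' into (closed_lo, lo, hi, closed_hi)."""
    m = re.fullmatch(r"\s*([\[(])\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*([\])])\s*", s)
    if not m:
        raise ValueError(f"not an interval: {s!r}")
    lo_br, lo, hi, hi_br = m.groups()
    return (lo_br == "[", float(lo), float(hi), hi_br == "]")

def intervals_equal(pred: str, gold: str) -> bool:
    return parse_interval(pred) == parse_interval(gold)

assert intervals_equal("[2, 5)", "[2.0,5.0)")   # same interval, different formatting
assert not intervals_equal("(2, 5)", "[2, 5)")  # open vs. closed left endpoint
```

Numeric and multiple-choice answers would follow the same pattern: normalize the string, then compare.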

Verdict

Solid for benchmarking despite sitting at just 13 stars: the docs are crisp and the HF dataset is ready, but watch its maturity, as it's fresh out of Meituan. Fork the official repository now if general reasoning evals are your jam.
