meituan-longcat / General365
This is the official repo for the paper "General365: Benchmarking General Reasoning in LLMs under High Difficulty and Diversity".
General365 is a benchmark of diverse, challenging tasks that require only K-12 knowledge, designed to evaluate the general reasoning abilities of large language models. The repo includes an evaluation script for scoring model responses.
How It Works
You find General365, a challenging collection of puzzles that test how well AI models reason using only K-12 knowledge.
You download the benchmark to your computer so you can start using it right away.
You ask your AI model to solve the puzzles and save its answers in a simple list.
You connect an AI judge that checks whether each answer matches the correct solution.
You run the evaluation script, and the judge reviews the answers one by one.
You receive clear results showing your AI's reasoning strengths, ready to compare on the leaderboard.
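The save-then-score flow in the steps above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual evaluation script: the record fields (`id`, `answer`, `response`), the JSONL layout, and every function name here are assumptions made for the example, and `exact_match_judge` is a simple string comparison standing in for the AI judge.

```python
import json
from typing import Callable, Dict, List

# Hypothetical record layout; the real General365 files may use other field names.
Record = Dict[str, str]


def save_responses(records: List[Record], path: str) -> None:
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def load_responses(path: str) -> List[Record]:
    """Read the JSONL file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def exact_match_judge(reference: str, response: str) -> bool:
    """Stand-in for an AI judge: case- and whitespace-insensitive equality."""
    return reference.strip().lower() == response.strip().lower()


def accuracy(records: List[Record], judge: Callable[[str, str], bool]) -> float:
    """Fraction of records whose response the judge accepts."""
    if not records:
        return 0.0
    correct = sum(judge(r["answer"], r["response"]) for r in records)
    return correct / len(records)
```

Because `accuracy` takes the judge as a parameter, the string comparison could be swapped for a function that calls a judge model without changing the rest of the flow.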