GAIR-NLP

AcademiClaw: When Students Set Challenges for AI Agents — a bilingual benchmark of 80 university student-sourced academic tasks.

11 stars · 100% credibility · Found May 05, 2026
AI Analysis

Grade: C

AI Summary

AcademiClaw is a benchmark of 80 bilingual tasks from real undergraduate academic workflows designed to rigorously evaluate AI agents' long-horizon reasoning and tool-use capabilities.

How It Works

1
🔍 Discover tough AI challenges

You find AcademiClaw, a collection of 80 real academic puzzles from students that stump top AI helpers.

2
📥 Grab the benchmark

Download the ready-made package with the tasks, test configurations, and example results from leading models.

3
🔧 Prepare your tests

Set up an isolated Docker environment for running the tasks, matched to your machine's GPU or CPU resources (a minimal end-to-end sketch of steps 3 through 6 follows this list).

4
🤖 Link your AI

Connect whichever AI agent you want to evaluate, using the Claude Code or OpenAI-compatible adapters.

5
🚀 Launch the evaluations

Watch as your AI tackles dozens of tough problems like coding ray tracers or solving math proofs.

6
📊 Review the scores

Get automatic grades on success, safety, and efficiency for each task and category.

7
🏆 Compare and share

See how your AI stacks up on the live leaderboard and share your breakthrough results.
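
The steps above boil down to a small amount of orchestration. Below is a minimal end-to-end sketch of steps 3 through 6, not AcademiClaw's actual harness: the tasks/ directory layout, the solve.py entry point, the container image, and the resource limits are all assumptions; only the Docker SDK calls are standard.

```python
# Hypothetical sketch of steps 3-6: isolated sandbox per task, run, collect results.
# Task layout, image, and entry point are assumptions, not AcademiClaw's real harness.
import json
from pathlib import Path

import docker  # pip install docker
from docker.errors import ContainerError

client = docker.from_env()


def run_task_in_sandbox(task_dir: Path, image: str = "python:3.11-slim") -> str:
    """Run one task's (placeholder) entry script inside an isolated container."""
    logs = client.containers.run(
        image,
        command=["python", "/workspace/solve.py"],  # placeholder entry point
        volumes={str(task_dir.resolve()): {"bind": "/workspace", "mode": "rw"}},
        network_disabled=True,  # keep the sandbox offline for reproducibility
        mem_limit="4g",
        remove=True,
    )
    return logs.decode()


results = {}
for task_dir in sorted(Path("tasks").iterdir()):  # assumed tasks/ layout
    try:
        results[task_dir.name] = {"ok": True, "output": run_task_in_sandbox(task_dir)}
    except ContainerError as exc:
        results[task_dir.name] = {"ok": False, "error": str(exc)}

Path("results.json").write_text(json.dumps(results, indent=2))
print(f"Containers completed: {sum(r['ok'] for r in results.values())}/{len(results)}")
```

In the real harness, the placeholder entry point would be replaced by the benchmark's own task runner, and the collected outputs would feed into its rubric-based graders and leaderboard submission.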

AI-Generated Review

What is AcademiClaw?

AcademiClaw is a bilingual benchmark of 80 student-sourced academic tasks that university students created after mainstream AI agents failed their real workflows. It lets you test agents on long-horizon challenges spanning research analysis, ML engineering, software engineering, and STEM reasoning; several tasks, such as BVH-accelerated path tracing, are in C++. Users get isolated Docker sandboxes for reproducible evals, multi-dimensional rubrics (code execution, LLM judges, vision checks), and a live leaderboard comparing models like Claude Opus on pass rates and safety.
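
Those rubric dimensions compose naturally: run the task's own tests for an execution check, ask an LLM judge for a quality rating, and weight the two into one score. The snippet below is a sketch of that idea under assumed names, weights, and prompts, not the benchmark's real grading code; only the standard OpenAI client and subprocess calls are real APIs.

```python
# Illustrative multi-dimensional scoring: code-execution check + LLM judge.
# Weights, prompt, and model choice are assumptions for the sketch.
import subprocess

from openai import OpenAI  # pip install openai

judge = OpenAI()  # any chat-completions endpoint can serve as the judge


def exec_score(test_cmd: list[str]) -> float:
    """1.0 if the task's own test command exits cleanly, else 0.0."""
    return 1.0 if subprocess.run(test_cmd, capture_output=True).returncode == 0 else 0.0


def judge_score(task_prompt: str, agent_answer: str) -> float:
    """Ask an LLM judge for a 0-10 rating and normalise it to [0, 1]."""
    reply = judge.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[{
            "role": "user",
            "content": (
                "Rate this answer to the task from 0 to 10. Reply with only the number.\n\n"
                f"Task:\n{task_prompt}\n\nAnswer:\n{agent_answer}"
            ),
        }],
    )
    return float(reply.choices[0].message.content.strip()) / 10.0


def score_task(test_cmd, task_prompt, agent_answer, weights=(0.7, 0.3)) -> float:
    """Combine the execution and judge checks into one rubric score."""
    w_exec, w_judge = weights
    return w_exec * exec_score(test_cmd) + w_judge * judge_score(task_prompt, agent_answer)
```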

Why is it gaining traction?

Unlike synthetic benchmarks, these tasks come straight from undergrads who set challenges agents couldn't handle without hand-holding, revealing true capability gaps; even top models pass only 55% of tasks. The bilingual (EN/ZH) mix, GPU/CPU routing, and adapters for Claude Code or OpenAI-compatible endpoints make it dead simple to batch-eval your agent via bash scripts. Developers dig the full trajectories, token/cost breakdowns, and safety audits (S1-S5) that go beyond raw scores.
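
If your agent already speaks the OpenAI chat-completions protocol, the adapter pattern amounts to pointing the standard client at your endpoint and looping over task prompts. The sketch below makes that concrete under assumed names: the base_url, model id, and tasks.jsonl layout are placeholders, not AcademiClaw's documented interface.

```python
# Hypothetical batch driver for an OpenAI-compatible agent endpoint.
import json
from pathlib import Path

from openai import OpenAI  # pip install openai

agent = OpenAI(
    base_url="http://localhost:8000/v1",  # your agent's OpenAI-compatible endpoint
    api_key="placeholder",                # many local servers ignore the key
)

Path("outputs").mkdir(exist_ok=True)
for line in Path("tasks.jsonl").read_text(encoding="utf-8").splitlines():  # assumed task file
    task = json.loads(line)
    reply = agent.chat.completions.create(
        model="my-agent",  # whatever model id your endpoint serves
        messages=[{"role": "user", "content": task["prompt"]}],
    )
    Path(f"outputs/{task['id']}.md").write_text(reply.choices[0].message.content)
```

The same loop works against a hosted API by swapping the base_url and key, which is what makes batch evaluation across different agents cheap to script.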

Who should use this?

AI agent builders tuning models for academic or engineering workflows, especially those hitting walls on C++ rendering, RL training, or distributed systems debugging. Researchers benchmarking LLMs on bilingual tasks or long-context reasoning. Teams validating agent safety before production use.

Verdict

Grab it if you're serious about agent evals: the solid Docker harness and rubrics punch well above the repo's 11 stars. Still early (docs are paper-heavy, not yet broadly tested), so expect some tweaks to wire up custom agents.
