InternScience

ResearchClawBench: Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery

AI Summary

ResearchClawBench provides 40 expert-curated scientific research tasks with real datasets from published papers across 10 domains to benchmark AI agents' end-to-end research capabilities via autonomous analysis and peer-review-style LLM evaluation.
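Judging from that summary, each of the 40 tasks pairs a real dataset with the paper it came from and a grading checklist. A rough sketch of what such a task record might look like, in Python; every field name here is an assumption for illustration, not the repo's actual schema:

```python
# Hypothetical task record -- field names are illustrative, not ResearchClawBench's schema.
import json

task = {
    "task_id": "neuro-example-01",               # one of the 40 curated tasks (made-up ID)
    "domain": "neuroscience",                     # one of the 10 scientific domains
    "dataset": "data/neuro-example-01/",          # raw data taken from the source publication
    "paper": "reference/original_paper.pdf",      # ground truth the agent's report is judged against
    "checklist": [                                # fine-grained criteria used by the LLM judges
        "Identifies the main experimental variables",
        "Reproduces the key statistical result",
        "Produces a figure comparable to the paper's main figure",
    ],
}

print(json.dumps(task, indent=2))
```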

How It Works

1. 🌐 Discover the benchmark

You find ResearchClawBench, a benchmark for seeing whether AI agents can tackle real science problems the way expert researchers do.

2. 🚀 Start the app

Download it to your computer and launch the simple viewer with a few clicks—no tech skills needed.

3. 📚 Pick a science challenge

Browse categories like earth science or neuroscience and select one of 40 real-world problems, each backed by actual experimental data.

4. 🤖 Watch AI do research

Choose your AI agent and hit go, then watch it explore the data live, crunch numbers, make charts, and write a full report (a rough sketch of this kind of run appears after these steps).

5. 📊 Compare to the original

Check the AI's findings side by side against the original scientist's paper and its key success checklist.

6. Score and rank it

Get automatic grades on each part and a total score (50 means the run matches the paper; higher beats it), then add the result to the global leaderboard.

7. 🏆 Celebrate insights

You've tested cutting-edge AI on genuine science, seen where it shines and where it falls short, and contributed to the frontier of agent research.
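
To make step 4 concrete, below is a rough sketch of the kind of autonomous analysis an agent might perform inside its sandboxed workspace: load the task's dataset, compute summary statistics, save a figure, and write a short report for the judges. The paths and file names are assumptions for illustration, not the benchmark's actual layout.

```python
# Illustrative agent-style analysis inside a sandboxed workspace.
# Paths and file names are hypothetical; the real benchmark defines its own layout.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("workspace/data/measurements.csv")   # raw dataset shipped with the task

print(df.describe())                                   # quick numerical exploration

# Produce one of the figures the report will reference.
df.plot(kind="scatter", x=df.columns[0], y=df.columns[1])
plt.title("Exploratory scatter of the first two measured variables")
plt.savefig("workspace/figures/exploration.png", dpi=150)

# Write a minimal report that the LLM judges could later score.
with open("workspace/report.md", "w") as f:
    f.write("# Findings\n\n")
    f.write(f"Analyzed {len(df)} records across {df.shape[1]} variables.\n\n")
    f.write("![exploration](figures/exploration.png)\n")
```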

AI-Generated Review

What is ResearchClawBench?

ResearchClawBench lets you evaluate AI agents on automated research tasks, feeding them raw datasets from real papers across 10 scientific domains like neuroscience and energy. Agents explore data, write code, generate figures, and produce reports in a sandboxed workspace, then LLM judges score the output against the original publication with fine-grained checklists: a score around 50 means re-discovery (matching the paper), above 70 means new-discovery (surpassing it). Built around Jupyter Notebook with a Flask-based live-streaming UI, it delivers leaderboards, per-task breakdowns, and easy agent swaps.
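
The scoring scale described above suggests a simple aggregation over checklist items. A minimal sketch of how such an aggregation could work, assuming the LLM judges return a 0-100 score per criterion; the function names and the averaging scheme are assumptions, not the benchmark's actual implementation:

```python
# Hypothetical aggregation of per-criterion LLM-judge scores.
# Only the 50 / 70 thresholds come from the page above; the rest is an assumed scheme.
from statistics import mean

def total_score(item_scores: dict[str, float]) -> float:
    """Average per-criterion scores, each assumed to be on a 0-100 scale."""
    return mean(item_scores.values())

def classify(score: float) -> str:
    if score > 70:
        return "new-discovery (surpasses the original paper)"
    if score >= 50:
        return "re-discovery (matches the original paper)"
    return "below the re-discovery bar"

judge_scores = {
    "identifies key variables": 80.0,
    "reproduces main statistic": 55.0,
    "figure comparable to the paper": 60.0,
}

score = total_score(judge_scores)
print(f"{score:.1f} -> {classify(score)}")
```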

Why is it gaining traction?

It skips toy coding benchmarks in favor of 40 expert-curated tasks with full datasets, letting you watch agents code and plot in real time while tracking progress. The agent-agnostic setup supports Claude Code, OpenClaw, and custom agents via simple JSON configs, plus multimodal scoring for text and images. Developers like the two-stage pipeline (autonomous research, then peer-review-style evaluation) that reveals genuine scientific capability.
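
The "simple JSON configs" mentioned above presumably tell the harness how to launch each agent. A guess at what such a config might contain, written as a Python dict dumped to JSON; the key names are assumptions, not the repo's documented schema:

```python
# Hypothetical agent configuration -- key names are illustrative only.
import json

agent_config = {
    "name": "my-custom-agent",
    "command": ["python", "run_agent.py"],   # how the harness would launch the agent
    "workspace": "./workspace",              # sandbox the agent may read and write
    "timeout_minutes": 120,
    "env": {"MODEL": "my-model-id"},
}

with open("my-custom-agent.json", "w") as f:
    json.dump(agent_config, f, indent=2)
```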

Who should use this?

AI researchers benchmarking coding agents for research automation, like those building tools for data analysis in physics or biology. Teams evaluating frontier models on end-to-end workflows, from data exploration to publication-ready reports. Anyone testing agent reliability on reproducible science challenges.

Verdict

Promising for agent evals but still early-stage at 16 stars: the docs are clear and the quick start works, but expect tweaks before production use. Try it if automated research benchmarks matter to you; look elsewhere if you need a mature alternative.
