evolvent-ai

🦞 ClawMark: A Living-World Benchmark for Multi-Day, Multimodal Coworker Agents

AI Summary

ClawMark is a benchmark that tests AI agents on realistic multi-day professional tasks using simulated tools like email, calendars, spreadsheets, and files across domains such as healthcare and sales.

How It Works

1. 🔍 Discover ClawMark

You stumble upon ClawMark on GitHub or a blog, a fun way to test AI helpers on real office jobs like helping doctors or managing HR.

2. 📊 Explore the tests

You browse leaderboards and examples, seeing how different AIs handle multi-day tasks with emails, plans, and spreadsheets.

3. 🔗 Link your tools

You connect a language model plus the email, calendar, Notion, and Google Sheets backends the tasks use, so the tests run against real tools.

4. 🚀 Run your first test

With one click, you launch an AI coworker on a job like reviewing patient medications over three days; it reasons through the task, checks files, and sends emails along the way.

5. 🔄 Test more scenarios

You try single jobs or full suites across clinics, sales, or events to compare AI performance.

6. 📈 Review the outcomes

You open the results folders to see scores, chat histories, workspaces, and exactly what the AI did right or wrong (a sketch of what that inspection might look like follows this list).

🏆 Pick your best AI teammate

You now know which AIs shine as reliable coworkers for tough, ongoing office work.
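
As promised in step 6, here is a minimal sketch of how you might tabulate scores across a couple of runs. The `results/` layout, the `score.json` filename, and its fields are assumptions made for illustration; the actual output format is whatever ClawMark writes, so adjust paths and keys accordingly.

```python
# Hypothetical sketch: average per-task scores from benchmark output folders.
# Assumes each run directory holds one score.json per task with a "score"
# field; the layout is an assumption, not ClawMark's documented format.
import json
from pathlib import Path

def summarize(run_dir: str) -> float:
    """Average the per-task scores found under a single run directory."""
    scores = []
    for score_file in Path(run_dir).glob("**/score.json"):
        data = json.loads(score_file.read_text())
        scores.append(float(data["score"]))
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    # Compare two hypothetical runs, e.g. different models on the same suite.
    for run in ("results/model_a", "results/model_b"):
        print(run, round(summarize(run), 3))
```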

AI-Generated Review

What is ClawMark?

ClawMark is a Python benchmark for multi-day, multimodal coworker agents, simulating living-world office tasks across 13 domains such as clinical assistance, HR, insurance, and ecommerce. It runs 100 tasks spanning 1-3 working days, in which agents coordinate real backends for email, calendars, Notion pages, Google Sheets, and filesystems while handling multimodal inputs like screenshots, PDFs, audio, and CSVs. Users get CLI-driven evals with rule-based scoring, agent traces, and Docker-isolated workspaces for reproducible results, with no LLM judges needed.
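
Because the scoring is rule-based rather than judged by an LLM, a task checker can be a small, deterministic piece of Python that inspects the final workspace state. The sketch below shows the general shape of such a checker; the function name, workspace layout, and `Check` structure are hypothetical illustrations, not ClawMark's actual API.

```python
# Hypothetical sketch of a deterministic, rule-based task checker: inspect the
# final workspace and return pass/fail per rule, with no LLM judge involved.
# The file layout and field names below are illustrative assumptions.
import csv
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Check:
    name: str
    passed: bool

def check_med_review(workspace: Path) -> list[Check]:
    checks = []

    # Rule 1: the agent must have written a drug-interaction report.
    report = workspace / "reports" / "drug_interactions.md"
    checks.append(Check("report_written", report.exists()))

    # Rule 2: every patient marked "high" risk in the sheet export must be
    # mentioned in that report.
    flagged = []
    sheet = workspace / "sheets" / "patients.csv"
    if sheet.exists():
        with sheet.open() as f:
            flagged = [row["patient_id"] for row in csv.DictReader(f)
                       if row.get("risk") == "high"]
    body = report.read_text() if report.exists() else ""
    checks.append(Check("high_risk_covered",
                        all(pid in body for pid in flagged)))
    return checks

if __name__ == "__main__":
    for c in check_med_review(Path("workspace")):
        print(f"{c.name}: {'PASS' if c.passed else 'FAIL'}")
```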

Why is it gaining traction?

Unlike single-turn agent benchmarks, ClawMark stresses timeline-driven stages, cross-tool state reconciliation, and proactive handling of implicit changes like new emails or sheet updates. Its strict Python checkers deliver fully reproducible scores, and the leaderboards report avg@3 (e.g., GPT-5.4 at 55%) alongside turn counts, token usage, and per-domain breakdowns. Developers dig the quickstart: `uv sync`, `docker build`, and a command like `clawmark --tasks-dir tasks/hr` are enough to start benchmarking.
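
If you would rather script runs than type the CLI by hand, a thin wrapper works too. Only the `clawmark --tasks-dir tasks/hr` invocation comes from the quickstart quoted above; the other suite names and the wrapper itself are hypothetical conveniences, and the environment (uv sync, docker build) still has to be set up first.

```python
# Hypothetical wrapper that runs several task suites back to back via the CLI.
# Only the `clawmark --tasks-dir ...` form is taken from the quickstart; the
# suite directory names here are assumptions for illustration.
import subprocess

SUITES = ["tasks/hr", "tasks/clinic", "tasks/sales"]  # assumed directory names

def run_all(suites: list[str]) -> None:
    for tasks_dir in suites:
        print(f"== running {tasks_dir} ==")
        # Invoke the benchmark CLI (requires the repo environment to be ready).
        subprocess.run(["clawmark", "--tasks-dir", tasks_dir], check=True)

if __name__ == "__main__":
    run_all(SUITES)
```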

Who should use this?

AI researchers benchmarking LLMs as coworker agents in multi-day workflows, like clinical assistants spotting drug interactions or HR coordinators managing calendars and sheets. Also teams evaluating agent frameworks on multimodal, tool-heavy tasks beyond chat; think investment analysts cross-referencing emails and Notion databases.

Verdict

Grab it if you're evaluating agents as coworkers rather than chatbots; the framework shines for realistic coworker simulations, though 32 stars and a 1.0% credibility score signal an early-stage project. Docs and the quickstart are solid; run the full suite and take a shot at the leaderboard.
