SKYLENAGE-AI

General Agent Benchmark for OpenClaw, made by Qwen Team, Alibaba Group.

AI Summary

QwenClawBench is a benchmark for evaluating OpenClaw AI agents on 100 realistic tasks across various domains using isolated environments and robust scoring mechanisms.

How It Works

1
🔍 Discover the benchmark

You find this tool while looking for ways to test AI helpers on real tasks, like a leaderboard showing top performers.

2
📥 Get the testing kit

Download the ready-to-use test scenarios and sample challenges that mimic everyday AI helper work.

3
🔧 Prepare your setup

Make sure you have the basics ready, namely a Python runtime and Docker for the isolated workspaces, so everything runs smoothly.

4
🤖 Connect your AI thinker

Link the LLM API you want your agent to use so it can tackle the challenges, just like plugging in a smart brain.

5
🚀 Run the tests

Hit start to let multiple challenges run at once in isolated Docker containers, with anomaly detection watching for any hiccups along the way (see the end-to-end sketch after these steps).

6
📊 Check the scores

Review the detailed results, the averages from repeat runs, and the flags for any issues, so you can trust what you see.

Get reliable insights

Celebrate having solid, trustworthy scores on how well your AI helper performs on real-world tasks!
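The walkthrough above boils down to a familiar eval loop: verify a Python runtime and Docker, hand the harness your model credentials, fan the tasks out into isolated containers, and average repeated runs into per-task scores. The sketch below only shows that shape; the task IDs, the `alpine` image, and the `LLM_API_KEY` variable are placeholder assumptions, not QwenClawBench's actual interface.

```python
"""Illustrative sketch only; QwenClawBench's real CLI and APIs may differ."""
import concurrent.futures
import os
import shutil
import statistics
import subprocess

TASKS = ["task_001", "task_002", "task_003"]   # placeholder task IDs
REPEATS = 3                                    # average over repeated runs

def check_prereqs() -> None:
    # Step 3: a container runtime and model credentials must be in place.
    if shutil.which("docker") is None:
        raise RuntimeError("Docker not found on PATH")
    if not os.environ.get("LLM_API_KEY"):      # hypothetical variable name
        raise RuntimeError("Set LLM_API_KEY so the agent can reach its model")

def run_task_once(task_id: str) -> float:
    # Step 5: each attempt gets its own throwaway container. A real harness
    # would mount the task workspace and launch the agent inside it; here we
    # only echo the task ID to keep the sketch self-contained.
    proc = subprocess.run(
        ["docker", "run", "--rm", "alpine", "echo", task_id],
        capture_output=True, text=True,
    )
    # Step 6: reduce the outcome to a score, 1.0 for success and 0.0 otherwise.
    return 1.0 if proc.returncode == 0 else 0.0

def main() -> None:
    check_prereqs()
    jobs = [task_id for task_id in TASKS for _ in range(REPEATS)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        scores = list(pool.map(run_task_once, jobs))
    by_task: dict[str, list[float]] = {}
    for task_id, score in zip(jobs, scores):
        by_task.setdefault(task_id, []).append(score)
    for task_id, runs in sorted(by_task.items()):
        print(f"{task_id}: mean={statistics.mean(runs):.2f} over {len(runs)} runs")

if __name__ == "__main__":
    main()
```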

AI-Generated Review

What is QwenClawBench?

QwenClawBench is a Python-based benchmark for evaluating general agents running on the OpenClaw framework, packing 100 realistic tasks across domains like workflow orchestration, financial trading, security audits, and data analysis. It spins up isolated Docker workspaces mimicking real-user setups, runs parallel evals with anomaly detection, and scores via automated checks, LLM judging, or hybrid modes. Developers get reproducible results on agent reliability, complete with leaderboards and HuggingFace datasets.
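As a rough picture of the hybrid mode mentioned above, automated checks and an LLM judge can be blended into one score. The sketch below assumes a simple weighted combination; the 0.3 judge weight and the judge callable are invented for illustration, not taken from the repo.

```python
from typing import Callable

def hybrid_score(
    automated_pass: bool,
    judge: Callable[[str], float],
    transcript: str,
    judge_weight: float = 0.3,   # assumed weighting, not from the repo
) -> float:
    """Blend a hard automated check with a softer LLM-judge rating in [0, 1]."""
    check_score = 1.0 if automated_pass else 0.0
    judge_score = max(0.0, min(1.0, judge(transcript)))   # clamp judge output
    return (1.0 - judge_weight) * check_score + judge_weight * judge_score

# Stub judge for demonstration; a real one would call an LLM with a rubric.
print(hybrid_score(True, lambda t: 0.8, "agent transcript ..."))
```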

Why is it gaining traction?

Unlike toy benchmarks, it stresses general agents on long-horizon, infrastructure-heavy scenarios with resumable runs and failure flagging, which is key for production evals where APIs flake or containers crash. Backed by Alibaba's Qwen team, it offers concurrent Docker execution and penalized hybrid scoring to curb LLM-judge gaming, making agent comparisons trustworthy. Early adopters praise the realism of its general-agent workflows over simplistic toy-setting tests.
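Resumable runs and failure flagging mostly come down to checkpointing per-task results and recording, rather than silently averaging over, attempts that died from a flaky API or a crashed container. Here is a minimal sketch of that bookkeeping, with the state-file path and field names assumed for illustration:

```python
import json
from pathlib import Path
from typing import Callable

STATE_FILE = Path("results/state.json")   # hypothetical checkpoint location

def load_state() -> dict:
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def save_state(state: dict) -> None:
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state, indent=2))

def run_resumable(task_ids: list[str], run_one: Callable[[str], float]) -> dict:
    state = load_state()
    for task_id in task_ids:
        done = state.get(task_id)
        if done and not done.get("flagged"):
            continue                       # scored cleanly on a previous run
        try:
            state[task_id] = {"score": run_one(task_id), "flagged": False}
        except Exception as exc:           # flaky API, crashed container, etc.
            state[task_id] = {"score": 0.0, "flagged": True, "error": str(exc)}
        save_state(state)                  # checkpoint after every task
    return state
```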

Who should use this?

AI researchers benchmarking general agents against models like Qwen or Claude on complex chains. OpenClaw builders validating skills for finance monitoring, multi-agent memory sharing, or system ops. Teams tired of unreliable evals for general-agent pipelines and everyday automations.

Verdict

Grab it if you're deep in general agents: strong design with Docker isolation and real tasks, but a 1.0% credibility score and 19 stars signal early maturity; docs are solid but test coverage needs community love. Promising for agent evals; monitor for updates.


