capgym

capgym / cap-x

Public

A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

Found Mar 26, 2026 at 19 stars
Language: Python

AI Summary

CaP-X is an open framework for testing and training AI agents that generate code to control simulated robots in manipulation tasks like stacking and assembly.

How It Works

1
💡 Discover CaP-X

You stumble upon CaP-X, a fun playground where AI learns to guide robots in everyday tasks like stacking blocks or wiping spills.

2
🤖 Pick a robot adventure

Choose a challenge like lifting a cube or assembling nuts – it's like giving your robot a puzzle to solve with smart code.

3
📦 Set up your robot world

Download the tools to your computer and prepare simulated robot environments with simple steps.

4
🧠 Link an AI thinking partner

Connect a helpful AI like Gemini so it can watch the scene and write code to control the robot.

5
▶️ Launch and watch magic

Hit start to see the AI generate code, the robot move step-by-step, and learn from each try.

6
📈 Review and improve

Check videos and scores of what worked, tweak the AI's instructions, and run again for better results.

🎉 Robot masters the task!

Your AI-robot team stacks blocks perfectly or wipes spills clean – ready for real-world robot helpers.
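The six steps above boil down to a generate-execute-observe-refine loop. Here is a minimal, self-contained sketch of that loop; the toy environment, the stubbed code generator, and the success check are all inventions for illustration, not CaP-X's actual API.

```python
# Hypothetical sketch of the generate -> execute -> observe -> refine loop.
# None of these names come from CaP-X; the env and the "LLM" are stand-ins.

class ToyStackingEnv:
    """Stand-in for a simulated manipulation environment."""

    def __init__(self):
        self.block_height = 0

    def observe(self):
        return {"block_height": self.block_height}

    def execute(self, code: str):
        # The agent's generated code runs against a tiny action API.
        exec(code, {"stack_block": self._stack_block})

    def _stack_block(self):
        self.block_height += 1

    def success(self):
        return self.block_height >= 3


def fake_code_generator(observation):
    """Stand-in for an LLM: writes code based on what it 'sees'."""
    missing = 3 - observation["block_height"]
    return "\n".join("stack_block()" for _ in range(max(missing, 1)))


env = ToyStackingEnv()
for attempt in range(5):            # multi-turn: retry with fresh observations
    obs = env.observe()
    code = fake_code_generator(obs)
    env.execute(code)
    if env.success():
        print(f"task solved on attempt {attempt + 1}")
        break
```

A real run swaps the stubs for a simulator and a model endpoint, but the retry-with-new-observations shape stays the same.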

AI-Generated Review

What is cap-x?

CaP-X is a Python framework for benchmarking coding agents: LLMs and VLMs that generate Python code to control robots in simulation. It provides Gymnasium environments for 39 manipulation tasks across the Robosuite, LIBERO-PRO, and BEHAVIOR simulators, plus a tiered benchmark (S1-S4 single-turn, M1-M4 multi-turn) that tests abstraction, visual grounding, and interaction modes. Users run evaluations via a simple CLI, e.g. `uv run capx/envs/launch.py --config-path task.yaml --model gemini-pro`, with auto-launched perception servers and a web UI.
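Since the tasks are exposed as Gymnasium environments, driving one follows the standard reset/step contract. Below is a sketch with a stand-in environment: "LiftCubeEnv", its observations, and the 10 cm success threshold are invented; only the `(obs, info)` and `(obs, reward, terminated, truncated, info)` signatures mirror the real Gymnasium API.

```python
# Stand-in environment that follows the Gymnasium reset/step contract.

class LiftCubeEnv:
    def reset(self, seed=None):
        self.height = 0.0
        return {"cube_height": self.height}, {}

    def step(self, action):
        # action: commanded vertical displacement in metres
        self.height += action
        terminated = self.height >= 0.1       # success: cube lifted 10 cm
        reward = 1.0 if terminated else 0.0
        return {"cube_height": self.height}, reward, terminated, False, {}


env = LiftCubeEnv()
obs, info = env.reset(seed=0)
total_reward = 0.0
terminated = truncated = False
for _ in range(100):
    action = 0.05                  # a real agent's code would derive this from obs
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
```

Because the interface is standard Gymnasium, existing eval harnesses and RL tooling can wrap these tasks without adapters.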

Why is it gaining traction?

The framework stands out by composing vision models (SAM3, OWL-ViT), motion planners (cuRobo, PyRoKi), and control primitives into agent-generated code, enabling multi-turn visual differencing and parallel ensembling without custom training. Robotics developers pick it up for competitive code-as-policy benchmarking, plus RL tooling (GRPO/VeRL) that transfers simulated policies to real hardware with a low sim-to-real gap, going well beyond basic environment wrappers.
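The composition idea (perception, planning, and control exposed as functions that generated code can call) can be illustrated with stubs. Every function below is a stand-in invented for this sketch, not CaP-X's real primitive set or the actual SAM3/OWL-ViT/cuRobo interfaces.

```python
# Stubbed "code-as-policy" toolbox: the agent's generated code calls these.
# All names and behaviors are invented for illustration.

def detect(name):
    """Stand-in for an open-vocabulary detector returning an (x, y, z) pose."""
    scene = {"red_cube": (0.3, 0.1, 0.02), "blue_bowl": (0.5, -0.2, 0.0)}
    return scene[name]

trajectory = []

def move_to(pos):
    """Stand-in for a motion-planned reach to a target pose."""
    trajectory.append(("move", pos))

def grasp():
    trajectory.append(("grasp",))

def release():
    trajectory.append(("release",))

# Code an agent might generate for "put the red cube in the blue bowl":
agent_code = """
cube = detect("red_cube")
move_to(cube)
grasp()
bowl = detect("blue_bowl")
move_to(bowl)
release()
"""

exec(agent_code, {"detect": detect, "move_to": move_to,
                  "grasp": grasp, "release": release})
print(trajectory)
```

The point of the pattern is that the model never outputs torques or pixels, only short programs over a small, verifiable vocabulary of primitives.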

Who should use this?

Robotics researchers evaluating VLMs on dexterous tasks like nut assembly or cube restacking; AI agent builders testing code-generation reliability in physics simulators; and sim-to-real teams needing standardized benchmark scores before deploying to hardware.

Verdict

A solid start from NVIDIA/Berkeley/Stanford labs, with an arXiv paper and a CUDA-ready CLI, but at 19 stars this is still an early-stage project: the docs and regression tests are good, yet expect to tweak configs. Worth a spin for agent benchmarking if you have GPU-backed simulation.


