LiberCoders / CLI-Gym


Official Implementation of "CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion"

Found Feb 22, 2026 at 59 stars.
AI Summary

CLI-Gym generates large-scale command-line tasks for training AI agents on debugging and environment repair by simulating failures in codebases.

How It Works

1
🔍 Discover CLI-Gym

You come across the tool via its research paper or dataset page, which promises thousands of generated challenges for training agents to repair broken computing environments.

2
📥 Grab it quickly

A single install command downloads the toolkit and sets everything up, including the helpers needed for running and evaluating agents.

3
🔗 Link your smart thinker

You provide an API key for an LLM service, which CLI-Gym uses to generate the challenges.

4
🏗️ Prepare a codebase world

You pick a ready-made Docker-based project environment and build a sandboxed runtime for it.

5
💥 Create messy scenarios

You instruct the tool to break the environment in realistic ways, using the project's real tests to confirm failures that agents must later unravel.

6
🛠️ Build fix-it puzzles

From the broken states, the tool assembles repair tasks: a description of what went wrong, plus the failing tests an agent must make pass.

🎉 Dataset ready to train heroes

You end up with hundreds of realistic repair tasks for training and benchmarking AI agents on tough real-world environment fixes.

Star Growth

The repo grew from 59 to 118 stars since being found.
AI-Generated Review

What is CLI-Gym?

CLI-Gym is a Python toolkit that automatically generates environment-intensive CLI tasks for training and benchmarking agentic coding models. It inverts healthy Docker-based codebases into failure states using LLMs to simulate agent actions, then assembles repair tasks complete with failing unit tests, Dockerfiles, and evaluation setups. Developers get a simple CLI (`cg pull` to build runtimes, `cg build` to crank out tasks) plus 1,655 ready-to-use instances on Hugging Face from 29 repos, solving the pain of manually crafting scalable Terminal-Bench-style benchmarks.
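The review says each assembled task ships with failing unit tests, a Dockerfile, and an evaluation setup. A minimal sketch of what such an instance might look like, assuming hypothetical field names (this is not CLI-Gym's documented schema):

```python
from dataclasses import dataclass, field

# Hypothetical shape of one generated task instance, inferred from the
# pieces the review names (failing unit tests, Dockerfile, evaluation
# setup). Field names are illustrative guesses, not CLI-Gym's schema.

@dataclass
class TaskInstance:
    repo: str              # source repository the environment came from
    instruction: str       # natural-language description of the failure
    dockerfile: str        # builds the broken environment
    fail_to_pass: list[str] = field(default_factory=list)  # tests that must flip to passing

    def solved(self, passing: set[str]) -> bool:
        # An agent solves the task when every fail-to-pass test now passes.
        return all(t in passing for t in self.fail_to_pass)

task = TaskInstance(
    repo="example/project",  # hypothetical repo name
    instruction="pytest fails after the virtualenv was corrupted; restore it.",
    dockerfile="FROM python:3.11\n# ...broken setup steps...",
    fail_to_pass=["tests/test_env.py::test_imports"],
)
print(task.solved({"tests/test_env.py::test_imports"}))  # -> True
```

Bundling the Dockerfile with the fail-to-pass test list is what makes each instance self-contained: the evaluator can rebuild the broken state and score a fix without any access to the generation pipeline.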

Why is it gaining traction?

Unlike Terminal-Bench's hand-curated 229 tasks from 93 contributors, CLI-Gym scales via agentic inversion to 1,655 diverse instances at a cost of 2.3B tokens, with richer tests (20 fail-to-pass per task). It integrates with OpenHands and Terminal-Bench via quick_install.sh, producing verifiable Docker tasks that boost model scores—like 46% Pass@1 on Terminal-Bench v1. The hook: one command yields production-grade eval data without repo mining.
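For reference, Pass@1 is simply the fraction of tasks an agent solves on its first attempt; the 46% figure quoted above comes from the review, not from this toy illustration:

```python
# Pass@1 = (tasks solved on first attempt) / (total tasks).
# Hypothetical first-attempt outcomes for five tasks:
results = [True, False, True, True, False]
pass_at_1 = sum(results) / len(results)
print(pass_at_1)  # -> 0.6
```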

Who should use this?

AI researchers evaluating CLI agents on realistic env manipulation, like terminal debugging in OpenHands. Teams fine-tuning coders (e.g., Qwen variants) for SWE-bench forks. Agent devs needing quick, high-volume tasks beyond static benchmarks.

Verdict

Worth forking for agent evals: the arXiv-backed pipeline, polished CLI, and HF dataset make it instantly usable despite its early stage and modest star count. It is a fresh 2026 release, so test on non-critical workflows first.
