SarahXC / codex-autoresearch-harness

Bash harness to run Karpathy's autoresearch with Codex CLI in a loop. Includes an A/B testing framework for comparing models.

Found Mar 19, 2026 at 19 stars
Language: Shell

AI Summary

Scripts that wrap an AI coding tool in a loop to run autonomous experiments improving a neural network training script, with support for comparing different AI models on a GPU machine.

How It Works

1
🔍 Discover the Research Tool

You find a harness that lets AI models compete to improve a neural-network training script on their own.

2
🖥️ Rent a Super Computer

You pick a cloud provider and launch a virtual machine with a powerful GPU for the training runs.
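One concrete (hypothetical) option for this step is a GCP A100 machine. This sketch only prints the command rather than running it; the instance name, zone, machine type, and image are illustrative placeholders, not choices the repo prescribes.

```shell
#!/bin/sh
# Print (don't run) one hypothetical way to get an A100 VM on GCP.
# Zone, image, and machine type are illustrative placeholders.
cmd='gcloud compute instances create autoresearch-vm \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=200GB'
echo "$cmd"
```

Any provider with an on-demand GPU (Lambda, AWS, etc.) works the same way in principle: you just need a shell on a machine with a CUDA-capable card.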

3
⚙️ Run Easy Setup

You follow a few setup steps to prepare the machine, downloading the training data and dependencies so everything is ready.
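Before running the repo's setup, a quick preflight check can save a failed run. This is a hypothetical sketch, not the repo's setup script; the real setup presumably does much more (datasets, Python dependencies, GPU drivers).

```shell
#!/bin/sh
# Hypothetical preflight: confirm the basic tools a harness like this leans on.
for tool in git awk; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool" >&2
    exit 1
  fi
done
echo "preflight ok"
```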

4
🔗 Link Your AI Service

You connect your AI account with an API key so the models can start generating and testing changes.
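Linking the AI service usually amounts to exporting an API key before launching anything. A sketch, assuming the standard OPENAI_API_KEY variable that OpenAI tooling reads; the value below is an obvious placeholder.

```shell
#!/bin/sh
# Point the Codex CLI at your account: export the key, then fail fast if unset.
export OPENAI_API_KEY="sk-example-not-a-real-key"   # placeholder, not a real key
if [ -z "$OPENAI_API_KEY" ]; then
  echo "OPENAI_API_KEY is not set; the harness cannot call the model" >&2
  exit 1
fi
echo "API key is set (${#OPENAI_API_KEY} characters)"
```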

5
🚀 Start the AI Race

You pick two AI models, give each a time budget such as 6 hours, and launch them to take turns improving the script.
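The A/B race boils down to two timed loops. This toy sketch uses an echo as a stand-in for the real codex exec call and seconds instead of hours; launch_ab.sh's actual interface may differ.

```shell
#!/bin/sh
# Toy A/B race: give each "model" a fixed time budget and log its turns.
run_model() {
  model="$1"; budget_s="$2"
  end=$(( $(date +%s) + budget_s ))
  while [ "$(date +%s)" -lt "$end" ]; do
    echo "$model,cycle" >> ab.log   # real harness: one codex exec + training cycle
    sleep 1
  done
}
: > ab.log
run_model "gpt-5.4" 2               # model A runs first with its budget
run_model "gpt-5.3-codex" 2         # then model B, from the same baseline
echo "turns logged: $(wc -l < ab.log)"
```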

6
📈 Watch Improvements Happen

You check the logs anytime to see experiments running, validation scores dropping, and which changes stick.
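Peeking at progress is just reading the results log. A sketch, assuming a simple model,iteration,val_bpb CSV format; the repo's actual log layout may differ.

```shell
#!/bin/sh
# Sample results log in an assumed model,iteration,val_bpb CSV format.
cat > results.log <<'EOF'
gpt-5.4,1,0.812
gpt-5.4,2,0.805
gpt-5.4,3,0.799
EOF
# Lower val_bpb is better: numeric-sort column 3 and take the top row.
best=$(sort -t, -k3,3n results.log | head -n 1)
echo "best so far: $best"
```

While a run is live, `tail -f results.log` shows new cycles as they land.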

7
🏆 See Who Wins

You get a comparison table showing each model's best score and how much it improved the program.
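That final comparison can be computed from the run logs with awk. A sketch of what a tool like compare_results.sh might report; the log format and the 0.820 baseline are made-up assumptions for illustration.

```shell
#!/bin/sh
# Sample A/B log in an assumed model,iteration,val_bpb CSV format.
cat > ab_results.log <<'EOF'
gpt-5.4,1,0.812
gpt-5.4,2,0.799
gpt-5.3-codex,1,0.815
gpt-5.3-codex,2,0.804
EOF
# Per-model best (minimum) val_bpb, plus improvement over a made-up 0.820 baseline.
awk -F, -v base=0.820 '
  !(($1 in best) && best[$1] <= $3) { best[$1] = $3 }
  END {
    for (m in best)
      printf "%-14s best=%s  improvement=%.2f%%\n", m, best[m], 100 * (base - best[m]) / base
  }
' ab_results.log
```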

AI-Generated Review

What is codex-autoresearch-harness?

This bash harness script wraps OpenAI's Codex CLI in a loop to run Karpathy's autoresearch experiments continuously on a GPU VM, letting LLMs autonomously tweak a neural net training script over fixed 5-minute cycles. It persists progress via git commits and a results log, solving Codex's lack of native looping by handling one iteration per call. Users get single-model runs, an A/B framework for comparing models like GPT-5.4 vs GPT-5.3-Codex, and a post-run CLI tool for side-by-side analysis—all via simple bash commands after a quick VM setup.
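The "loop plus persist" idea described above can be sketched in a few lines of bash. Everything here is illustrative, not the repo's actual harness: stub_codex stands in for a real codex exec call, the loop count stands in for the time budget, and commits are skipped gracefully if git is unavailable.

```shell
#!/bin/sh
# Minimal persistent-loop sketch: one stateless agent call per iteration,
# with progress persisted to a results log and (when git exists) a commit.
stub_codex() {                        # stand-in for a real `codex exec` call
  echo "iteration $1: pretend val_bpb improved"
}
mkdir -p run_dir
: > run_dir/results.log
i=1
while [ "$i" -le 3 ]; do              # real harness: loop until the time budget ends
  stub_codex "$i" >> run_dir/results.log
  if command -v git >/dev/null 2>&1; then
    ( cd run_dir && git init -q &&
      git add results.log &&
      git -c user.email=loop@example.com -c user.name=loop \
          commit -qm "iteration $i" ) >/dev/null 2>&1 || true
  fi
  i=$((i + 1))
done
echo "cycles logged: $(wc -l < run_dir/results.log)"
```

Because each iteration is committed, a crashed or interrupted run resumes from the last good state instead of starting over — which is the point of wrapping a stateless CLI this way.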

Why is it gaining traction?

It stands out by turning the stateless codex exec command into a persistent agent via bash scripting, complete with a git clone of Karpathy's autoresearch repo and built-in A/B testing that outputs clear metrics such as val_bpb improvements. Developers like the no-frills bash harness that bypasses sandbox issues for GPU training, plus a thorough README with real results showing GPT-5.4 beating baselines by 2.46%. The CLI-driven workflow -- launch_ab.sh for comparisons, compare_results.sh for diffs -- makes model benchmarking dead simple without custom coding.

Who should use this?

ML engineers benchmarking LLMs on optimization tasks, such as pitting OpenAI models against alternatives in Karpathy's autoresearch setup. Teams evaluating Codex CLI performance on GPU VMs for code generation and hyperparameter tuning. Anyone automating agentic experiments with shell scripts who doesn't want to build a harness from scratch.

Verdict

Grab it if you're testing LLMs on autoresearch -- excellent docs and a setup script lower the barrier, though 19 stars and a 1.0% credibility score signal early maturity. Solid for quick A/B runs, but fork and monitor it before relying on it in production.


