evolvent-ai

Terrarium: Multi-turn data engine for evaluating and optimizing LLM agents in living environments.

16 stars · 0 forks · 100% credibility
Found Apr 15, 2026 at 17 stars
AI Analysis
Python
AI Summary

Terrarium lets you create dynamic test worlds for AI agents, where they handle multi-step tasks across changing environments like email and databases, then scores their performance.

How It Works

1
🔍 Discover Terrarium

You hear about Terrarium, a fun way to test AI assistants in realistic everyday scenarios like handling emails or updating calendars.

2
📦 Set up your playground

You grab the tools with a simple download and prepare connections to services like email or calendars so everything works smoothly.

3
🛠️ Build your first scenario

You write a simple story in plain steps, like 'check email and add a meeting', mixing real-world actions that change over time.

4
🤖 Choose your AI helper

You pick a smart AI agent ready to go, like one that thinks step-by-step, and connect it to your scenario.

5
▶️ Launch the adventure

With one command, you start the test and watch your AI navigate the changing world, looping and branching as needed.

6
📊 Review the results

You see clear scores, detailed logs of what happened, and highlights of successes or where it went off track.

7
🎉 Master your agents

Your AI gets better benchmarks, you collect real training data, and you're set to build even smarter assistants.


AI-Generated Review

What is Terrarium?

Terrarium is a Python data engine for evaluating and optimizing LLM agents through multi-turn interactions in living environments that evolve like a real terrarium—emails arrive, databases update, files appear between turns. You define tasks in plain Python, composing capabilities like email, Postgres, Notion, calendar, or workspace sandboxes to simulate dynamic workflows, then drive agents (Claude Code, OpenClaw, or custom) and check outcomes programmatically. Run via CLI with `terrarium run -c config.toml` to generate trajectories, benchmarks, or training data.
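The repo's run command takes a TOML file, so a run is presumably configured along these lines. The key names below are illustrative guesses, not documented Terrarium options:

```toml
# Hypothetical config sketch -- section and key names are
# illustrative, not documented Terrarium options.
[task]
module = "tasks/email_meeting.py"   # plain-Python task definition

[agent]
name = "claude-code"                # or a custom agent entry point

[run]
trials = 8                          # samples per task, e.g. for pass@k
output = "trajectories/"            # where logs and scores land
```

You would then launch it with the command the repo documents: `terrarium run -c config.toml`.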

Why is it gaining traction?

Unlike static QA or single-turn benchmarks, Terrarium handles Phase 3 complexity: mutating environments, loops and branches, and proactive patterns like heartbeats and webhooks, all without YAML configs. Developers like the pure-Python tasks for rapid iteration on agent behaviors in realistic setups, plus the Docker-backed isolation and built-in metrics like pass@k. Early adopters praise demo tasks that mimic personal assistants monitoring terrarium animals or updating terrarium tortoise schedules.
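The pass@k metric mentioned above is typically computed with the standard unbiased estimator over repeated trials. This helper is a generic implementation of that formula, not code from the repo:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n trials with c passes, succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 trials of one task, 4 of which passed
print(round(pass_at_k(10, 4, 1), 3))  # prints 0.4
```

Running a task `trials` times and feeding the pass count into this estimator gives a less noisy score than a single run, which matters when environments mutate between turns.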

Who should use this?

AI researchers benchmarking LLM agents beyond coding, teams collecting multi-turn trajectories for fine-tuning, or workflow devs testing proactive agents in email/calendar/Postgres flows. Ideal for those building custom evals like tau2 retail tasks or optimizing assistants for monitoring experiments.

Verdict

Worth prototyping for agent evals despite 16 stars and 1.0% credibility: the docs, demos, and clean Python API shine, but the non-commercial CC BY-NC 4.0 license rules out production use. Early, but it punches above its weight for living environments; watch the roadmap for more capabilities.


