TaimoorKhan10 / replayd

Public

Turn failed AI agent runs into replayable regression tests. Catch regressions before you ship.

www.stonepathlab.net agent-ops agent-testing ai-agents ai-infrastructure ai-reliability

89% credibility

Found May 31, 2026 at 10 stars

AI Analysis

Python

AI Summary

replayd is an open-source testing tool that turns AI agent failures into reusable regression tests, helping teams catch bugs that resurface after prompt or model changes before those changes ship to production.

How It Works

🔍 Discover a problem with your AI agent

Your AI agent makes a mistake in production — maybe it approved something it shouldn't have, or gave wrong advice to a customer.

📸 Capture the failure

You wrap your agent's run in a special recording block that saves everything: what the user asked, what the agent did, and every tool it called.

🏷️ Mark it as a test case

You write a short note explaining what went wrong — like 'agent approved a refund over the policy limit' — and save it as a regression test.
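The capture and mark steps above can be pictured with a minimal sketch. This is not replayd's actual API: `Recording`, `record`, `log_tool_call`, and `to_test_case` are hypothetical names, built only on the Python standard library, and the agent is a stand-in that makes the refund mistake described above.

```python
import json
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class Recording:
    """Hypothetical container for one captured agent run."""
    user_input: str
    tool_calls: list = field(default_factory=list)
    output: str = ""
    note: str = ""

    def log_tool_call(self, name, **arguments):
        # Store the tool name and the exact arguments the agent passed.
        self.tool_calls.append({"name": name, "arguments": arguments})


@contextmanager
def record(user_input):
    """Yield a Recording that the agent code fills in while it runs."""
    yield Recording(user_input=user_input)


def to_test_case(recording, note):
    """Attach a human-written failure note and serialize the run as JSON."""
    recording.note = note
    return json.dumps(asdict(recording), indent=2, sort_keys=True)


with record("Refund my $900 order") as rec:
    # Stand-in for the real agent: it wrongly approves a large refund.
    rec.log_tool_call("approve_refund", order_id="A-17", amount=900)
    rec.output = "Your refund has been approved."

case_json = to_test_case(rec, "agent approved a refund over the policy limit")
```

In this sketch the saved JSON holds the user input, the tool calls with their arguments, the final output, and the note, which is the information the steps above say a recording keeps.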

🔄 Make changes to your agent

Weeks later, your team updates the prompt, switches to a new AI model, or fixes some other part of the agent.

⚡ Run regression tests before shipping

Before your changes go live, you run all your saved tests against the new version to check if old bugs have returned.

See the results

🚫

Bug caught

The same mistake happened again — your release is blocked until you fix it

✅

All clear

No old bugs returned — your changes are safe to ship
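As a rough illustration of the run-and-gate step, the sketch below replays saved cases against a new agent version and turns the result into an exit code. It assumes a hypothetical case format (a `forbidden_call` per case) and a stand-in `new_agent` function; replayd's real runner and file format may look quite different.

```python
# Hypothetical saved cases: each one names a tool call that must not recur.
SAVED_CASES = [
    {
        "note": "agent approved a refund over the policy limit",
        "user_input": "Refund my $900 order",
        "forbidden_call": {"name": "approve_refund"},
    },
]


def new_agent(user_input):
    """Stand-in for the updated agent; returns the tool calls it made."""
    return [{"name": "escalate_to_human", "arguments": {"reason": "over limit"}}]


def run_regressions(agent, cases):
    """Replay each case and return the notes of the bugs that came back."""
    regressions = []
    for case in cases:
        calls = agent(case["user_input"])
        forbidden = case["forbidden_call"]["name"]
        if any(call["name"] == forbidden for call in calls):
            regressions.append(case["note"])
    return regressions


def gate(agent, cases):
    """Return a process exit code: 0 means all clear, 1 blocks the release."""
    failures = run_regressions(agent, cases)
    for note in failures:
        print(f"REGRESSION: {note}")
    return 1 if failures else 0


exit_code = gate(new_agent, SAVED_CASES)
```

A CI job could pass `exit_code` to `sys.exit`, so that a nonzero value stops the pipeline when an old bug returns.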

🚀 Ship with confidence

Your AI agent is now protected by a safety net that catches regressions before they reach your users.

AI-Generated Review

What is replayd?

replayd is a Python library that turns failed AI agent runs into replayable regression tests. When your AI agent screws up in production, you capture that failure, save it as a test, and replay it before every deployment to make sure the same bug never comes back. The tool grades on deterministic facts (which tools got called, with what arguments) rather than fuzzy output matching, so tests stay reliable even when LLMs generate different text for the same correct behavior. Optional semantic grading using an LLM-as-judge handles failures that require reading context, like "did the agent approve a refund that exceeds policy?"
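To make the idea of grading on deterministic facts concrete, here is a small sketch that checks only tool names and arguments and ignores the generated text. The function names, the run format, and the `cancel_subscription` tool are assumptions for illustration, not replayd's grader.

```python
def call_matches(call, name, expected_args):
    """True if the call uses this tool and contains every expected argument."""
    if call["name"] != name:
        return False
    arguments = call["arguments"]
    return all(
        key in arguments and arguments[key] == value
        for key, value in expected_args.items()
    )


def failure_reproduced(run, name, expected_args):
    """Grade a run on its tool calls alone; the text output is never read."""
    return any(call_matches(call, name, expected_args) for call in run["tool_calls"])


# Two runs with different wording but the same faulty tool call.
run_a = {
    "output": "Done! Your subscription is cancelled.",
    "tool_calls": [
        {"name": "cancel_subscription", "arguments": {"user_id": 7, "confirmed": False}},
    ],
}
run_b = {
    "output": "I have gone ahead and cancelled the plan for you.",
    "tool_calls": [
        {"name": "cancel_subscription", "arguments": {"user_id": 7, "confirmed": False}},
    ],
}
# A correct run: the agent asks for confirmation instead of cancelling.
run_c = {
    "output": "Please confirm that you want to cancel.",
    "tool_calls": [
        {"name": "ask_confirmation", "arguments": {"user_id": 7}},
    ],
}

results = [
    failure_reproduced(run, "cancel_subscription", {"confirmed": False})
    for run in (run_a, run_b, run_c)
]
```

Here `run_a` and `run_b` receive the same grade despite their different wording, which is why this kind of check stays stable when the model rephrases its answer.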

Why is it gaining traction?

This solves a real pain point: AI agents regress silently. You fix a bug, update a prompt, or switch models, and the same mistake creeps back without anyone noticing until users complain. replayd guards against this by blocking releases when known failures resurface. Unlike observability tools, which tell you what happened after the fact, it acts as an active release gate. The structural grading approach (check tool calls first, LLM second) is smart design: it avoids expensive LLM calls for straightforward failures and keeps most tests fast and deterministic.
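One way to picture "check tool calls first, LLM second" is a two-stage grader in which the judge runs only when the structural check cannot settle the case. Everything here is an assumption made for illustration: the case fields `tool` and `question`, the `grade` function, and `stub_judge`, which stands in for a real model call with a simple rule so the sketch runs offline.

```python
judge_calls = []  # records every question sent to the (stubbed) judge


def stub_judge(question, run):
    """Stand-in for an LLM-as-judge; a real judge would send this to a model."""
    judge_calls.append(question)
    return any(call["arguments"].get("amount", 0) > 500 for call in run["tool_calls"])


def grade(case, run, judge):
    """Return True if the known failure came back in this run."""
    # Stage 1: deterministic check on tool calls, no model involved.
    relevant = [call for call in run["tool_calls"] if call["name"] == case["tool"]]
    if not relevant:
        return False  # the tool was never called, so the failure cannot have recurred
    if "question" not in case:
        return True  # purely structural case: calling the tool is the failure
    # Stage 2: only now pay for a semantic judgment.
    return judge(case["question"], run)


case = {
    "tool": "approve_refund",
    "question": "Did the agent approve a refund that exceeds policy?",
}
safe_run = {"tool_calls": [{"name": "escalate_to_human", "arguments": {"reason": "over limit"}}]}
bad_run = {"tool_calls": [{"name": "approve_refund", "arguments": {"amount": 900}}]}

safe_result = grade(case, safe_run, stub_judge)
calls_after_safe = len(judge_calls)
bad_result = grade(case, bad_run, stub_judge)
calls_after_bad = len(judge_calls)
```

In this sketch the safe run is graded without consulting the judge at all, and the judge is invoked once for the run that actually called `approve_refund`.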

Who should use this?

Teams deploying AI agents in production who need confidence that shipped changes won't reintroduce known bugs. If you're running customer-facing agents with tool-calling capabilities (approvals, refunds, searches, etc.), this gives you regression coverage that traditional software testing never provided for AI behavior. It's especially useful for teams iterating fast on prompts or models where silent regressions are the biggest deployment risk.

Verdict

replayd addresses a genuine gap in AI deployment tooling with a well-designed, pragmatic approach. However, despite the 89% credibility score, the repo has only 10 stars and is at version 0.1.2, so this is very early-stage. The code is clean and the concept is solid, but I'd want to see more battle-testing before staking production releases on it. Worth exploring in a non-critical project to evaluate fit, but keep an eye on the roadmap.

Base stars: 10 stars

Penalty: Very new repo (0d): -70%

Bonus: AI verified quality (90%)

Account age: 1,043 days

Repo age: 0 days

License: MIT

Updated: May 30, 2026