ArchishmanSengupta

A self-improving loop for voice AI agents. Uses Karpathy's autoresearch as its foundation.

Found Mar 16, 2026 at 39 stars
AI Analysis
Python
AI Summary

AutoVoiceEvals automatically generates adversarial test scenarios, simulates conversations with voice AI agents, evaluates performance using AI judges, and iteratively improves the agent's instructions.
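That loop — generate adversarial scenarios, simulate conversations, judge them, revise the instructions — can be sketched as follows. Every function here is a hypothetical stub for illustration, not AutoVoiceEvals' actual API:

```python
# Illustrative sketch of the generate → simulate → judge → improve loop.
# All functions are hypothetical stand-ins, not the repo's real API.

def generate_adversarial_scenarios(agent_description: str) -> list[str]:
    # The real tool uses an LLM to write tough caller scripts; stubbed here.
    return [
        "caller interrupts mid-sentence and switches topics",
        "caller in a noisy environment asks about opening hours",
    ]

def simulate_conversation(prompt: str, scenario: str) -> str:
    # Stand-in for a real voice-platform call; returns a fake transcript.
    return f"[{scenario}] agent followed: {prompt}"

def judge(transcript: str) -> float:
    # Stand-in for an AI judge; passes only if the instructions
    # mention interruptions at all.
    return 1.0 if "interruption" in transcript else 0.0

def improve(prompt: str, failures: list[str]) -> str:
    # Stand-in for LLM-driven prompt revision.
    return prompt + " Handle interruptions gracefully."

def eval_loop(prompt: str, agent_description: str, rounds: int = 3) -> str:
    for _ in range(rounds):
        scenarios = generate_adversarial_scenarios(agent_description)
        transcripts = [simulate_conversation(prompt, s) for s in scenarios]
        failures = [t for t in transcripts if judge(t) < 0.5]
        if not failures:
            break  # keep the current instructions once all scenarios pass
        prompt = improve(prompt, failures)
    return prompt

print(eval_loop("You book appointments for a dental clinic.", "dental clinic"))
```

With these stubs, the first round fails both scenarios, the prompt gets one revision, and the second round passes, so the loop stops early.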

How It Works

1. 🔍 Discover AutoVoiceEvals

You find this friendly tool on GitHub that helps make your voice assistant smarter by testing it against tough conversations.

2. 📝 Describe Your Voice Helper

You simply write a short note about what your voice agent does, like its job, services, and hours.

3. 🔗 Connect Your Services

You link your voice platform and AI thinking service so everything works together smoothly.

4. 🔄 Choose Your Adventure

Ongoing Research

Let it keep testing and refining until your agent shines.

Quick Pipeline

Run a fast round of tough tests, fix problems, and verify the wins.

5. ▶️ Start the Magic

Hit go and watch it create tricky calls, chat with your agent, score responses, and suggest fixes.

6. 📈 Enjoy a Smarter Agent

You get colorful charts, detailed reports, and an upgraded voice helper that handles real challenges better.
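Concretely, steps 2 and 3 might look something like this. The description format and environment-variable names below are guesses for illustration, so check the repo's README for what it actually expects:

```python
import os

# Step 2: a short note describing the voice agent.
# This plain-text format is hypothetical, not the repo's required schema.
AGENT_DESCRIPTION = """\
Receptionist for Brightsmile Dental (fictional example).
Job: book, move, and cancel appointments.
Services: cleanings, fillings, checkups.
Hours: Mon-Fri 9am-5pm.
"""

# Step 3: the voice platform (e.g. Vapi) and the LLM service each need
# credentials. These env var names are illustrative guesses.
def missing_keys(env, required=("VAPI_API_KEY", "OPENAI_API_KEY")):
    """Return the credential names that still need to be set."""
    return [name for name in required if not env.get(name)]

print(missing_keys(os.environ))  # lists anything you still need to export
```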


AI-Generated Review

What is AutoVoiceEvals?

AutoVoiceEvals is a Python tool for running self-improving loops on voice AI agents, using Karpathy's autoresearch as its foundation. It automates adversarial testing via CLI commands like `python main.py research` for endless prompt tweaks or `pipeline` for one-shot attack-improve-verify cycles, evaluating against custom scenarios with voice quirks like accents and noise. Developers get scores blending criteria compliance, latency, and CSAT, plus graphs and logs in a results folder.
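A blended score like that could be computed along these lines; the weights, latency cap, and normalization here are illustrative guesses, not the repo's actual formula:

```python
def blended_score(criteria_compliance: float, latency_ms: float, csat: float) -> float:
    """Blend criteria compliance, latency, and CSAT into one score in [0, 1].

    Assumes criteria_compliance and csat already lie in [0, 1]; latency is
    penalized linearly up to a 2000 ms cap. Weights are hypothetical.
    """
    latency_score = max(0.0, 1.0 - latency_ms / 2000.0)
    return 0.5 * criteria_compliance + 0.2 * latency_score + 0.3 * csat

# An agent that follows 90% of its criteria, answers in 800 ms, and earns
# 0.8 CSAT would blend to:
print(round(blended_score(0.9, 800, 0.8), 2))  # → 0.81
```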

Why is it gaining traction?

This self-improving AI GitHub project stands out by targeting voice agents specifically—generating tough caller scripts that probe failures in real convos—while integrating directly with Vapi or Smallest APIs to swap prompts live. The autoresearch loop proposes single changes, tests on a fixed eval suite, and keeps winners, delivering tangible score lifts without manual prompt engineering. Output like progression charts and best_prompt.txt makes sharing results dead simple.
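That "propose one change, test on a fixed eval suite, keep winners" pattern is essentially greedy hill-climbing over prompts. A minimal sketch with stubbed proposal and evaluation functions (not the repo's code):

```python
import random

def hill_climb(prompt, evaluate, propose, steps=20):
    """Greedy prompt search: accept a proposed single edit only if it
    scores higher on the fixed eval suite; otherwise discard it."""
    best, best_score = prompt, evaluate(prompt)
    for _ in range(steps):
        candidate = propose(best)
        score = evaluate(candidate)
        if score > best_score:  # keep winners only
            best, best_score = candidate, score
    return best, best_score

# Toy stand-ins: the "eval suite" rewards instructions covering three
# failure modes; the "proposer" appends one random clause per step.
KEYWORDS = ("interrupt", "polite", "confirm")
CLAUSES = ["Handle interruptions.", "Stay polite under abuse.", "Confirm details."]

def toy_evaluate(p):
    return sum(kw in p.lower() for kw in KEYWORDS) / len(KEYWORDS)

def toy_propose(p):
    return p + " " + random.choice(CLAUSES)

random.seed(0)
best, score = hill_climb("You are a scheduling agent.", toy_evaluate, toy_propose)
print(score)
```

Because losing edits are discarded, the score is monotonically non-decreasing; the first accepted clause already lifts it above the starting prompt's zero.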

Who should use this?

Voice AI builders on Vapi or Smallest deploying customer-facing agents for scheduling, support, or sales. Agent teams iterating prompts for production robustness, especially those hitting walls with edge cases like interruptions or manipulative callers. Python devs experimenting with self-improving agents beyond text.

Verdict

Grab it if you're on Vapi/Smallest and want automated voice evals—solid CLI and outputs punch above its 38 stars and 1.0% credibility score. Early stage with thin docs, so expect tweaks, but the Karpathy-inspired loop delivers quick wins for agent tuning.


