catello09/agent-eval-ts

Agent evaluation & benchmarking for TypeScript: test suites, LLM metrics, caching, OpenAI-compatible judge, JUnit/HTML/MD reports, Docker, GitHub Actions.

20 stars · 100% credibility · Found Apr 18, 2026

AI Summary

agent-eval-ts is a TypeScript framework for defining test suites to evaluate AI agents on metrics like accuracy, latency, cost, and tool usage, with reporting, caching, and model comparison features.
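
The repo's exact API isn't reproduced on this page, but based on the summary above a suite might be shaped roughly like the sketch below. The type names and fields (`EvalCase`, `expectedToolCalls`, the latency/cost budgets) are illustrative assumptions, not agent-eval-ts's documented interface.

```ts
// Hypothetical sketch only: the suite/case shape below is an assumption
// based on the summary above, not agent-eval-ts's documented API.
interface EvalCase {
  name: string;
  input: string;                 // prompt or task sent to the agent
  expected?: string;             // for exact-match or similarity checks
  expectedToolCalls?: string[];  // tool names the agent should invoke
  maxLatencyMs?: number;         // budget checks: latency ...
  maxCostUsd?: number;           // ... and cost per case
}

interface EvalSuite {
  name: string;
  cases: EvalCase[];
}

const suite: EvalSuite = {
  name: "support-bot-smoke",
  cases: [
    {
      name: "refund policy lookup",
      input: "What is the refund window for annual plans?",
      expected: "30 days",
      maxLatencyMs: 5000,
      maxCostUsd: 0.01,
    },
    {
      name: "uses the search tool",
      input: "Find the latest release notes.",
      expectedToolCalls: ["search_docs"],
    },
  ],
};

export default suite;
```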

How It Works

1
🔍 Discover the tester

You find a handy tool that lets you check how well your AI assistant handles everyday tasks.

2
📥 Set it up

Download it to your computer and get everything ready in a few minutes.

3
📝 List your checks

Write down simple questions and the right answers so you can see what your AI should do.

4
🔌 Pick your AI

Choose a pretend AI for quick tests or connect your real one if you want true results.

5
🚀 Launch the tests

Press go and let it run all your checks automatically, watching the progress (a small code sketch of this flow follows the list).

6
📊 See your scores

Open the colorful report showing pass rates, speeds, and smart insights.

7
🎉 Boost your AI

Celebrate knowing exactly how good your assistant is and where to improve it.
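
To make steps 3-6 concrete, here is a tiny hand-rolled version of the same flow: write down checks, plug in a pretend (mock) agent, run everything, and read the score. It deliberately avoids guessing at agent-eval-ts's real API; the real framework adds the metrics, caching, and reports described elsewhere on this page.

```ts
// Illustrative only: a tiny hand-rolled runner showing the flow from the
// steps above (write checks, pick a pretend agent, run, read the score).
// It does not use agent-eval-ts's actual API.
type AgentHandler = (input: string) => Promise<string>;

const checks = [
  { input: "What is 2 + 2?", expected: "4" },
  { input: "Capital of France?", expected: "Paris" },
];

// "Pretend AI": a mock handler with canned answers, handy for quick tests.
const mockAgent: AgentHandler = async (input) =>
  input.includes("2 + 2") ? "4" : "Paris";

async function run(agent: AgentHandler) {
  let passed = 0;
  for (const check of checks) {
    const answer = await agent(check.input);
    const ok = answer.trim() === check.expected;
    if (ok) passed++;
    console.log(`${ok ? "PASS" : "FAIL"}  ${check.input} -> ${answer}`);
  }
  console.log(`Score: ${passed}/${checks.length} checks passed`);
}

run(mockAgent);
```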

AI-Generated Review

What is agent-eval-ts?

agent-eval-ts is a TypeScript framework for evaluating AI agents with test suites that check exact matches, semantic similarity, JSON schema validation, tool calls, latency, token usage, and cost. You define cases with inputs and expectations, plug in your agent handler—mock, OpenAI, or custom—then run batch evals via CLI, REST API, or code for JUnit, HTML, Markdown, or JSON reports. It handles caching, model comparisons, LLM-as-judge scoring, and regression detection against baselines, all in Node.js with Docker and GitHub Actions support.
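
As a rough illustration of the regression-detection idea, the sketch below compares a current run summary against a stored baseline and fails the CI job if quality drops. The JSON shape (`passRate`, `avgLatencyMs`, `totalCostUsd`) and the 2%/25%/50% thresholds are assumptions made for the example, not the format or defaults agent-eval-ts actually uses.

```ts
// Sketch of baseline regression detection, assuming a results file with a
// per-suite pass rate, average latency, and total cost. The JSON shape and
// thresholds are assumptions, not what agent-eval-ts actually writes.
import { readFileSync } from "node:fs";

interface RunSummary {
  passRate: number;      // 0..1
  avgLatencyMs: number;
  totalCostUsd: number;
}

function loadSummary(path: string): RunSummary {
  return JSON.parse(readFileSync(path, "utf8")) as RunSummary;
}

function detectRegressions(baseline: RunSummary, current: RunSummary): string[] {
  const problems: string[] = [];
  if (current.passRate < baseline.passRate - 0.02) {
    problems.push(`pass rate dropped: ${baseline.passRate} -> ${current.passRate}`);
  }
  if (current.avgLatencyMs > baseline.avgLatencyMs * 1.25) {
    problems.push(`latency up >25%: ${baseline.avgLatencyMs}ms -> ${current.avgLatencyMs}ms`);
  }
  if (current.totalCostUsd > baseline.totalCostUsd * 1.5) {
    problems.push(`cost up >50%: $${baseline.totalCostUsd} -> $${current.totalCostUsd}`);
  }
  return problems;
}

const issues = detectRegressions(
  loadSummary("baseline.json"),   // hypothetical file names
  loadSummary("latest-run.json"),
);
if (issues.length > 0) {
  console.error(issues.join("\n"));
  process.exit(1); // fail the CI job on regression
}
```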

Why is it gaining traction?

Unlike the Python-heavy agent evaluation tools from AWS Labs, Microsoft, or MLflow, this delivers a lightweight, TS-native framework that loads TypeScript suites directly via tsx, with no transpiling hassles. Developers dig the CLI for quick `--suite path --model gpt-4o-mini` runs, the OpenAI-compatible judge for fuzzy evals, and the CI-ready JUnit exports plus GitHub Actions workflows. It covers the key agent evaluation methods and metrics without bloat, making agent repos easier to benchmark reliably.
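
The "OpenAI-compatible judge" idea generalizes: any endpoint that speaks the OpenAI chat API can score fuzzy answers. The standalone sketch below uses the official `openai` SDK with a configurable `baseURL`; how agent-eval-ts wires its own judge internally is not shown in this review, so treat the environment variable names, prompt, and scoring convention here as assumptions.

```ts
// A generic LLM-as-judge scorer using the official openai SDK. Any
// OpenAI-compatible endpoint can be targeted via baseURL. The env var
// names and the 0-1 scoring prompt are assumptions for this sketch.
import OpenAI from "openai";

const judge = new OpenAI({
  apiKey: process.env.JUDGE_API_KEY,   // hypothetical variable name
  baseURL: process.env.JUDGE_BASE_URL, // e.g. a local OpenAI-compatible server
});

async function judgeAnswer(question: string, expected: string, actual: string): Promise<number> {
  const res = await judge.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "Score how well the candidate answer matches the reference answer. " +
          "Reply with only a number from 0 to 1.",
      },
      {
        role: "user",
        content: `Question: ${question}\nReference: ${expected}\nCandidate: ${actual}`,
      },
    ],
  });
  return Number(res.choices[0].message.content?.trim() ?? "0");
}

// Example usage: a close-but-not-exact answer should score well below an exact one.
// judgeAnswer("What is the refund window?", "30 days", "About a month");
```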

Who should use this?

TypeScript devs building LLM agents for tools like GitHub Copilot CLI, VS Code extensions, or custom bots that need reproducible tests in CI/CD pipelines. It's ideal for teams tracking agent performance regressions across model swaps or fine-tunes, especially if you'd rather avoid Python dependencies in your agent repos, and it suits teams surveying agent evaluation methods or wiring evals into GitHub Actions for code agents.

Verdict

With 20 stars and a 100% credibility score, it's early-stage but boasts solid docs, full test coverage, and production-ready Docker support; it's worth a spin for TS agent evals over clunkier alternatives. Maturity lags the big players, so pair it with baselines until it grows.
