Calibre-Labs / reforge-ai-evals

Market Map agent eval suite for the Reforge AI Evaluation course

AI Summary

This repository offers prompts, test datasets, and scoring methods for evaluating an AI agent that ranks the top companies in a market for a given user query, as demonstrated in the Reforge AI Evaluation course.

How It Works

1
🔍 Find the AI Market Guide

You stumble upon this handy collection from the Reforge course, built to test and improve an AI assistant that ranks the top companies in any market.

2
📋 Set up your testing playground

You create free accounts for Braintrust (the eval playground) and Anthropic (the model provider) so everything is ready to experiment safely.

3
🛠️ Add quick helper tricks

You run a one-time setup that installs the repo's Claude Code skills, unlocking helper commands in your AI chat that make building tests super easy.

4
📊 Load example questions

You copy the ready-made datasets of real-world market queries, like 'team chat apps', into Braintrust to challenge the AI.

5
🚀 Run the AI on tests

You paste the repo's system prompt and watch the AI generate ranked top-3 lists of companies, with metrics and reasons for each pick.

6
🔍 Score the results

You apply fast code-based checks plus LLM judges to see whether the rankings are accurate, backed by citations, and hold up on tricky questions (a minimal sketch of this loop follows below).

🎉 Build a reliable market scout

Your AI now confidently maps any market with top picks and proof, ready for real use without surprises.
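
To make steps 4 through 6 concrete, here is a minimal sketch of the eval loop, assuming the Braintrust Python SDK (pip install braintrust). The project name, agent stub, output schema, and scorer are hypothetical stand-ins for the repo's actual assets, not its code.

```python
# Hypothetical sketch of steps 4-6, assuming the Braintrust SDK.
# The project name, agent stub, output schema, and scorer below are
# illustrative stand-ins, not the repo's actual code.
from braintrust import Eval

def run_market_map_agent(query: str) -> dict:
    # Step 5 placeholder: swap in a real call to your agent using the
    # repo's market-map system prompt. The output schema is assumed.
    return {
        "companies": [
            {"name": "Vendor A", "metrics": {"users": "stub"}},
            {"name": "Vendor B", "metrics": {"users": "stub"}},
            {"name": "Vendor C", "metrics": {"users": "stub"}},
        ],
        "citations": ["https://example.com/source"],
    }

def has_three_companies(input, output, expected=None):
    # Step 6, simplest code-based check: did the agent return a top-3?
    return 1.0 if len(output.get("companies", [])) == 3 else 0.0

Eval(
    "market-map-evals",  # hypothetical Braintrust project name
    data=lambda: [
        {"input": "team chat apps"},  # step 4: example market queries
        {"input": "AI security startups Okta could acquire"},
    ],
    task=run_market_map_agent,
    scores=[has_three_companies],
)
```

Run a file like this with the braintrust eval CLI after setting BRAINTRUST_API_KEY; each query, output, and score lands in a Braintrust experiment you can compare across prompt versions.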

AI-Generated Review

What is reforge-ai-evals?

This Python suite evaluates AI agents that generate market maps—ranked top-3 players for queries like "team chat" or "AI security startups Okta could acquire," complete with metrics and citations. It provides Braintrust-ready datasets covering diverse query types (historical, jargon-heavy, edge cases) tagged via a User Input Grid for full coverage, plus code-based checks for output structure and LLM judges using Anthropic models for semantic quality like ranking accuracy. Developers get a repeatable eval loop to iterate prompts without hallucinations or gaps in market analysis.
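
For flavor, an LLM judge along the lines described might look like the sketch below; the rubric wording, model id, and output schema are assumptions, not the repo's actual judge.

```python
# Hypothetical LLM-judge scorer in the style described above: an
# Anthropic model grades ranking quality on a 0-1 scale. The rubric,
# model id, and output schema are assumptions, not the repo's code.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

def ranking_accuracy_judge(input, output, expected=None):
    names = [c.get("name") for c in output.get("companies", [])]
    prompt = (
        f"Market query: {input}\n"
        f"Agent's ranked companies: {names}\n"
        "Are these genuinely the top players for this query, in a "
        "defensible order? Reply with only a number from 0 to 1."
    )
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model id
        max_tokens=8,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return float(resp.content[0].text.strip())
    except ValueError:
        return 0.0  # an unparseable judge reply counts as a failure
```

Because the judge is just another scorer function, it can sit in the same scores=[...] list as the deterministic checks.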

Why is it gaining traction?

Unlike generic LLM benchmarks, it mixes fast deterministic scorers (company count, metrics presence) with calibrated LLM judges for nuanced checks like edge-case handling or metric scoping, all wired for Braintrust experiments and regression testing. The Claude Code skills automate dataset expansion from support tickets, making it dead simple to build production-grade evals for market mapping agents. The star count is low, but it hooks devs tired of ad-hoc testing of market analysis agents.
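
Those deterministic scorers are plausibly just plain functions over the agent's output dict. A hedged sketch, assuming the same hypothetical schema as the earlier example:

```python
# Hedged sketch of fast deterministic scorers like those described:
# pure-Python checks with no LLM call. Field names assume the same
# hypothetical output schema used in the earlier sketch.
def metrics_presence(input, output, expected=None):
    # Fraction of ranked companies carrying at least one metric.
    companies = output.get("companies", [])
    if not companies:
        return 0.0
    return sum(1 for c in companies if c.get("metrics")) / len(companies)

def citations_present(input, output, expected=None):
    # Structural check: every answer must cite at least one source.
    return 1.0 if output.get("citations") else 0.0
```

Running these on every experiment gives a cheap regression signal; the LLM judges then cover the semantic questions the code checks can't.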

Who should use this?

AI engineers tuning research agents for market profile, market size, or market sentiment queries in VC, sales, or product teams. Prompt engineers at startups evaluating single-turn market map outputs against real-world diversity like historical snapshots or geographic constraints. Reforge course folks or anyone adapting it as a baseline for agent evals.

Verdict

Grab it if you're building market mapping tools -- excellent docs and ready datasets make it a strong learning starter, even though 32 stars signals early maturity. Skip it for production unless your agent fits the fixed output format or you swap in custom judges.

