keyangds / interactive_evaluation
A curated resource for interactive evaluation: framework, taxonomy, design principles, and benchmark collection.
This is an academic research paper that argues for a new science of testing AI systems. Traditional AI tests check whether an AI gives a correct final answer, but modern AI often works by taking many actions over time: clicking buttons, writing code, having conversations. The paper provides a framework and a visual map showing how to test these interactive AI systems properly, categorizes 56 existing benchmarks, and offers five principles for designing fair, comprehensive evaluations. It was written by researchers from UT Austin, CMU, and other universities, is published on arXiv, and is released under the MIT open-source license.
How It Works
You come across a paper about evaluating AI systems that interact with the world, not just answer questions.
The paper explains why traditional AI testing isn't enough when AI uses tools, browsers, or talks to people.
A visual map shows you all the different ways AI can be tested interactively, organized by what it interacts with and how it's judged.
You discover 56 existing tests organized into three stages of evolution, from simple answers to complex interactions.
You learn five clear rules for building fair, comprehensive tests for interactive AI systems.
The paper highlights what's missing—like testing AI that remembers things across long conversations.
You use these principles to design better tests for your own AI project or research.
Your AI system gets tested more fairly, catching problems that simple answer-checking would miss.