VibeBench / VibeSearchBench

Public

🔍 The hardest search benchmark in the wild — vague, multi-turn, proactive. 200 long-horizon tasks with persona-driven progressive disclosure, scored by verifiable schema-free knowledge-graph evaluation. No vibes, just triplet F1.

vibebench.github.ioVibeSearchBench.github.io agentic-ai benchmark llm proactive-agent search

478

89% credibility

Found Jun 01, 2026 at 478 stars.

AI Analysis

Python

AI Summary

VibeSearchBench is a research benchmark that tests AI assistants on complex, multi-turn research tasks where users gradually reveal their information needs. It evaluates how well AI can search the web, ask follow-up questions, and produce accurate structured knowledge graphs.

How It Works

🔬 You discover a research challenge

You learn about a benchmark that tests how well AI assistants handle real research tasks where users don't say everything upfront.

🛠️ You connect your AI assistant

You set up your AI model and give it tools to search the web, visit pages, and run calculations.

🎯 You start a research session

Your AI receives a vague question and begins searching, while a simulated user gradually reveals more details about what they really need.

Choose how your AI works

💬

Ask questions as it goes

The AI talks back and forth with a simulated user, asking follow-up questions to clarify needs

🔎

Search independently first

The AI does all its research on its own before presenting final results

🔍 Your AI searches and learns

The AI searches the web, visits relevant pages, and runs code to find answers, asking clarifying questions along the way.

📊 You see how well it did

The system compares what your AI found against the correct answers and shows you detailed scores for accuracy and completeness.

✅ Research complete

Your AI has gathered comprehensive information and you can see exactly how accurate and complete its findings were.
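The session flow above can be sketched as a small loop. This is a minimal illustration, not the benchmark's actual harness: `SimulatedUser`, `run_session`, the `search` callable, and the one-detail-per-question disclosure policy are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

NO_MORE = "Nothing more to add."  # hypothetical sentinel reply


@dataclass
class SimulatedUser:
    """Hypothetical persona that reveals one hidden detail per clarifying question."""
    initial_query: str
    hidden_details: List[str]
    revealed: int = 0

    def answer(self, question: str) -> str:
        # Progressive disclosure: hand out the next undisclosed detail, if any.
        if self.revealed < len(self.hidden_details):
            detail = self.hidden_details[self.revealed]
            self.revealed += 1
            return detail
        return NO_MORE


def run_session(
    user: SimulatedUser,
    search: Callable[[str], str],
    interactive: bool,
    max_turns: int = 5,
) -> List[str]:
    """Collect one search result per turn, optionally asking the user between turns."""
    context = [user.initial_query]
    notes: List[str] = []
    for _ in range(max_turns):
        # Search with everything known so far.
        notes.append(search(" ".join(context)))
        if not interactive:
            break  # independent mode: no clarifying questions, so nothing new arrives
        reply = user.answer("What else should I know?")
        if reply == NO_MORE:
            break
        context.append(reply)
    return notes
```

With an echo function standing in for `search`, the interactive mode issues progressively more specific queries as the simulated user discloses details, while the independent mode only ever sees the initial vague query.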


AI-Generated Review

What is VibeSearchBench?

VibeSearchBench is a benchmark for evaluating AI agents on real-world search tasks. It tests how well agents handle vague, multi-turn queries where users progressively reveal their information needs over time. The evaluation uses knowledge graph triplet matching rather than simple answer matching, measuring whether agents can build accurate structured knowledge from incomplete, evolving queries. Written in Python, it provides both a GeneralAgent implementation and OpenClaw integration for running experiments.
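To make the triplet-matching idea concrete, here is a minimal sketch of precision, recall, and F1 over (subject, relation, object) triplets. It assumes exact matching after simple normalization; because the benchmark describes its evaluation as schema-free, its real matcher is presumably more lenient, so `normalize` and `triplet_f1` are illustrative names rather than the project's API.

```python
from typing import Dict, Iterable, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase and collapse whitespace so formatting differences are not counted as misses."""
    subject, relation, obj = (" ".join(part.lower().split()) for part in triplet)
    return (subject, relation, obj)


def triplet_f1(predicted: Iterable[Triplet], gold: Iterable[Triplet]) -> Dict[str, float]:
    """Score predicted triplets against gold triplets using exact set overlap (an assumption)."""
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    matched = len(pred_set & gold_set)

    # Precision: share of predicted triplets that are correct.
    precision = matched / len(pred_set) if pred_set else 0.0
    # Recall: share of gold triplets that were found.
    recall = matched / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Precision penalizes triplets the agent asserted that are not in the reference graph, and recall penalizes reference triplets it missed, so F1 rewards answers that are both accurate and complete.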

Why is it gaining traction?

The benchmark fills a gap in AI evaluation: most tests check whether models answer questions, but this one measures whether they can actually search and gather information effectively. The triplet F1 metric gives developers concrete numbers to compare agent performance. The persona-driven user simulation creates realistic interaction patterns that mirror how people actually search, making it more useful than toy benchmarks for production planning.

Who should use this?

AI developers building search or research agents will find the benchmark directly applicable for measuring their work. Teams building multi-turn assistants or chatbots can use it to stress-test how well their systems handle incomplete user intent. Researchers studying agent architectures will benefit from the structured evaluation framework and the 200-task dataset covering professional and daily-life scenarios.

Verdict

The 89% credibility score reflects solid academic backing, though the 478 stars indicate the project is still early-stage. It's a practical choice for teams serious about search agent quality, but be prepared for a steeper learning curve given limited documentation. The benchmark itself is well-designed; the ecosystem around it just needs time to mature.


Stars: 478

Base stars: 478

Penalty: New account (11d): -70%

Penalty: New account with popular repo: -90%

Bonus: AI verified quality (90%)

Account age: 11 days

Repo age: 12 days

License: MIT

Updated: Jun 01, 2026