A benchmark for evaluating contextual agents on realistic multimodal personal-computer environments with profiling and factual-retention tasks.
HippoCamp pairs realistic personal-computer file environments with evaluation tooling for testing AI agents on multimodal search, retrieval, and reasoning tasks.
How It Works
The benchmark, available on GitHub and its project page, tests AI agents on everyday computer files such as documents, photos, emails, and videos.
Download the ready-made collections of personal files from the dataset links; small subsets are available for a quick start.
Set up a Python environment and prepare the file collections for testing.
Launch one of the personal-computer environments with a single command; it runs in an isolated Docker container populated with realistic files.
Choose a prebuilt agent backed by models such as ChatGPT, Claude, or Gemini to explore the files and answer questions.
Ask natural-language questions such as "What was my travel plan last month?" and watch the agent search files, read their contents, and reason step by step.
Receive automatic evaluations, compare agents on leaderboards, and see where each agent is strong in search, perception, and reasoning.
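The search-read-reason loop above can be sketched in plain Python. HippoCamp's actual agent API is not shown here; the file layout, the `search_files` and `answer_question` helpers, and the keyword-based matching are all illustrative assumptions, standing in for what an agent does inside the Docker environment.

```python
# Hypothetical sketch of the search -> read -> reason loop over a
# mock "personal computer" file environment. All names are illustrative;
# HippoCamp's real agent interface may differ.
from pathlib import Path
import tempfile

def search_files(root: Path, keyword: str) -> list[Path]:
    """Step 1: search -- find text files whose contents mention the keyword."""
    return [p for p in sorted(root.rglob("*.txt"))
            if keyword.lower() in p.read_text().lower()]

def answer_question(root: Path, keyword: str) -> str:
    """Steps 2-3: read the matching files and assemble an answer."""
    hits = search_files(root, keyword)
    if not hits:
        return "No relevant files found."
    # Step 2: read the contents of each matching file.
    evidence = [p.read_text().strip() for p in hits]
    # Step 3: reason -- here reduced to returning the gathered evidence.
    return " | ".join(evidence)

# Build a tiny mock file environment with two personal files.
root = Path(tempfile.mkdtemp())
(root / "notes.txt").write_text("Travel plan: fly to Lisbon on the 12th.")
(root / "recipe.txt").write_text("Pasta: boil water, add salt.")

print(answer_question(root, "travel"))
# -> Travel plan: fly to Lisbon on the 12th.
```

A real agent would replace the keyword match with model-driven tool calls and the final join with genuine reasoning over multimodal content, but the overall loop is the same.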