Overshoot-ai

2500+ VLM benchmarks, auto-updated daily from arXiv

AI Summary

An automatically updated, open catalog of thousands of benchmarks for evaluating vision-language models, multimodal LLMs, and video understanding models, sourced from arXiv papers.

How It Works

1
πŸ” Discover the catalog

While researching tests for AI models that understand images and videos, you stumble upon this organized collection of over 2,700 benchmarks.

2
πŸ“– Explore the overview

You read the welcoming overview page, with fun charts that group benchmarks by topic, like video understanding or medical imaging, and by release date.

3
πŸ“Š See the big picture

The eye-catching visuals help you quickly grasp trends, like which types of tests are most popular right now.

4
πŸ“₯ Download your list

Grab the ready-to-use spreadsheet or data file packed with details like test names, descriptions, and paper links.

5
πŸ”Ž Find what you need

Open the search tool or your downloaded file to filter by category, such as safety checks or spatial reasoning, and pick the benchmarks that fit your needs (a short loading-and-filtering sketch follows this walkthrough).

6
βž• Share a discovery

If you spot a missing test from a recent paper, simply share its details to help everyone benefit.

πŸŽ‰ Benchmarks at your fingertips

Now you have a fresh, reliable list to evaluate AI vision models, saving hours of hunting and ensuring you stay up-to-date.
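
As a rough illustration of steps 4 and 5, here is a minimal Python sketch that loads the exported data file and filters it by category. The filename (benchmarks.csv) and the column names (category, name, paper_url) are assumptions for illustration, not the repo's documented schema; check the actual export before relying on them.

```python
# Minimal sketch of steps 4-5: load the exported data file and filter it.
# "benchmarks.csv" and the column names below are assumptions, not the
# repo's documented schema -- adjust them to the real export.
import pandas as pd

df = pd.read_csv("benchmarks.csv")

# Keep only benchmarks whose category mentions spatial reasoning.
spatial = df[df["category"].str.contains("spatial", case=False, na=False)]

print(spatial[["name", "paper_url"]].head())
```

Swapping the substring ("spatial") for "safety" or "ocr" pulls out other categories the same way.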

AI-Generated Review

What is vlm-benchmarks?

This Python repo delivers a catalog of 2,500+ VLM benchmarks, LLM benchmarks, and video-understanding evals scraped daily from arXiv papers. It solves the hassle of hunting down scattered multimodal datasets by providing structured JSON and CSV files with details such as category (22 in total, including VLM OCR benchmarks), sample counts, modalities, task types, and repo links. Load it in Python for quick filtering (see the sketch below), or hit the linked search app for an LLM benchmarks leaderboard.
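
As a hypothetical sketch of that "load it in Python for quick filtering" workflow against the JSON export: the filename and field names (benchmarks.json, category, github_url, name) are assumptions, not the repo's documented schema.

```python
# Hypothetical sketch: filter the JSON catalog with the standard library.
# File and field names are assumptions about the schema, not documented facts.
import json

with open("benchmarks.json", encoding="utf-8") as f:
    benchmarks = json.load(f)  # assumed to be a list of benchmark records

# Pick out video-understanding benchmarks that link to a code repo.
video_with_code = [
    b for b in benchmarks
    if "video" in b.get("category", "").lower() and b.get("github_url")
]

for b in video_with_code[:5]:
    print(b.get("name"), "->", b["github_url"])
```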

Why is it gaining traction?

Daily auto-updates from arXiv scans set it apart from static lists: no more outdated repos. Developers get ready-to-use data with charts on trends such as benchmarks by quarter or by category (a rough counting sketch is below), plus direct GitHub/Hugging Face links for code and datasets. The schema covers everything from visual reasoning to long video, making it a one-stop shop for 2,500+ benchmarks without manual curation.
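
To make those trend views concrete, here is a small sketch that recomputes counts by quarter and by category from the CSV export; the column names (release_date, category) are assumptions about the schema.

```python
# Rough sketch of the "benchmarks by quarter / by category" trend counts.
# Column names are assumed for illustration; match them to the real CSV.
import pandas as pd

df = pd.read_csv("benchmarks.csv", parse_dates=["release_date"])

per_quarter = df.groupby(df["release_date"].dt.to_period("Q")).size()
per_category = df["category"].value_counts()

print(per_quarter.tail(8))    # newest quarters
print(per_category.head(10))  # most common categories
```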

Who should use this?

ML engineers benchmarking VLMs or multimodal LLMs before deployment. Researchers comparing models on safety, medical, or document OCR tasks. Devs prototyping video agents needing fresh evals with repo access.

Verdict

Grab it if you need a daily-fresh VLM benchmark hub: the JSON/CSV exports are immediately useful, even though 11 stars and a 1.0% credibility score signal early maturity. Solid docs and an MIT license make it low-risk to fork, but watch for growth in coverage.

