marived / vlm-probe

Public

Probing fine-grained perception in open-source vision-language models — companion code for a writeup.

85% credibility

Found May 25, 2026 at 16 stars.

AI Analysis

Python

AI Summary

VLM-Probe is a research evaluation tool from Beihang University that tests how well AI vision systems understand images. It works by showing an AI model a set of carefully designed images and asking it specific questions about what it sees—such as counting objects, identifying colors, reading signs, understanding spatial relationships, and detecting partially hidden objects. The tool then scores each AI model's performance and reveals which visual perception tasks are easy or difficult for that particular system. Researchers use this to understand the strengths and weaknesses of different AI vision models.

How It Works

🔬 You discover the research

You come across an academic paper about testing AI vision systems and learn that this tool can reveal exactly where AI struggles to see.

📦 You install the testing tool

With one simple command, you set up the evaluation harness on your computer and everything is ready to go.

🖼️ You download the test images

You pull down a collection of carefully designed images that test different aspects of visual understanding.

🤖 You pick an AI model to test

You choose which AI assistant you want to examine—it could be LLaVA, Qwen-VL, or another vision system.

⚡ The tests run automatically

The tool shows your AI images and asks it questions: how many objects, what colors, where things are located, and what signs say.

You review the results

📄 View raw data

Look at the detailed record of every question and answer.

📈 Compare models

See how different AI models stack up against each other.

🌐 Generate a report

Create a clean webpage showing the findings.

💡 You understand the AI's vision

You now know exactly which visual tasks your AI handles well and where it struggles to truly see what's in an image.

AI-Generated Review

What is vlm-probe?

This is a Python evaluation framework for testing whether vision-language models (VLMs) actually "see" what's in images or just hallucinate from text patterns. It runs a VLM against five fine-grained perception tasks (counting objects, spatial relationships, color attributes, reading text in images, and detecting partial occlusion), then scores exact-match accuracy. The tool loads models through the Hugging Face transformers library and defines tasks as simple YAML files. You get back per-task accuracy scores and raw prediction logs to dig into failure cases. Run `python -m vlmprobe.run --model llava-hf/llava-1.5-7b-hf --tasks tasks/*.yaml --out results.json` to write the scores to `results.json`; how long that takes depends on the model size and your hardware.
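The per-task exact-match scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's actual code: the record fields (`task`, `prediction`, `answer`) and the `normalize` and `per_task_accuracy` helpers are hypothetical names, and the normalization rule (lowercasing, trimming whitespace and trailing punctuation) is a guess at what an exact-match comparison might tolerate.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation.

    Assumption: the real project may normalize differently, or not at all.
    """
    return text.strip().lower().rstrip(".!?")


def per_task_accuracy(records: list[dict]) -> dict[str, float]:
    """Return exact-match accuracy per task from raw prediction records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        if normalize(rec["prediction"]) == normalize(rec["answer"]):
            correct[rec["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Hypothetical records standing in for a raw prediction log.
records = [
    {"task": "counting", "prediction": "3", "answer": "3"},
    {"task": "counting", "prediction": "four", "answer": "4"},
    {"task": "color", "prediction": "Red.", "answer": "red"},
]
scores = per_task_accuracy(records)
print(scores)
```

In this sketch `"four"` does not match `"4"`, which illustrates how strict exact-match scoring can undercount answers that are correct but phrased differently. That is one reason the raw prediction logs are worth reading alongside the scores.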

Why is it gaining traction?

VLM developers need to benchmark where their models genuinely perceive visual details and where they guess from training patterns. This framework offers a reproducible, standardized probe that isolates fine-grained perception from general language fluency. Because tasks are defined in YAML, adding a new evaluation scenario means writing a new task file, and the built-in comparison scripts let you contrast model versions with a single command. For anyone publishing model weights or writing papers on vision-language capabilities, this provides the granular accuracy breakdown that reviewers increasingly ask for.
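The kind of model-to-model comparison mentioned above might look like the sketch below. It assumes each model's results reduce to a mapping from task name to accuracy; the `compare_scores` function is a hypothetical helper, the real comparison scripts and the layout of `results.json` may differ, and the numbers are made up purely for illustration.

```python
def compare_scores(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Return candidate-minus-baseline accuracy for tasks present in both."""
    shared = sorted(set(baseline) & set(candidate))
    return {task: round(candidate[task] - baseline[task], 4) for task in shared}


# Made-up per-task accuracies, not measured results for any real model.
model_a = {"counting": 0.50, "color": 0.75, "occlusion": 0.25}
model_b = {"counting": 0.75, "color": 0.75, "occlusion": 0.50}

deltas = compare_scores(model_a, model_b)
for task, delta in deltas.items():
    # A positive delta means the candidate scored higher on that task.
    print(f"{task:<10} {delta:+.2f}")
```

Restricting the comparison to shared tasks keeps the output meaningful when two runs used different task files.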

Who should use this?

Machine learning engineers fine-tuning or evaluating VLMs, researchers benchmarking vision-language models against baselines, and practitioners building applications where accurate visual counting or spatial reasoning matters. If you're shipping a VLM-powered product and need to show it doesn't just guess colors or numbers, this gives you per-task accuracy figures to back that up. It is less useful for teams that already have established evaluation pipelines or for developers casually experimenting with APIs.

Verdict

With only 16 stars, this is an early-stage project, and the 85% credibility score reflects that maturity gap. The code is clean and MIT-licensed, but documentation is sparse and the image dataset URL is marked TODO, so the published results cannot yet be reproduced from the repository alone. Treat this as readable scaffolding for building your own probing suite rather than a turnkey solution. Worth a look if you're building evaluation infrastructure and want clean Python to extend; look elsewhere if you need production-ready tooling today.

Base stars: 16 stars

Penalty: Very new repo (1d): -70%

Bonus: AI verified quality (85%)

Account age: 127 days

Repo age: 1 day

License: MIT

Updated: May 25, 2026