zlab-princeton / VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Language: Python
AI Summary

VisionFoundry is an open-source toolkit for generating synthetic image-question-answer datasets to improve visual perception in AI models, with scripts for fine-tuning popular vision-language models.

How It Works

1
🔍 Discover VisionFoundry

You find this toolkit from Princeton researchers while looking for a way to build custom image datasets for training vision-language models.

2
🛠️ Get everything ready

You install the toolkit and configure API keys for the image-generation and language-model backends it calls (OpenAI and Gemini), as sketched below.
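A minimal setup sketch, assuming the toolkit reads standard OPENAI_API_KEY and GEMINI_API_KEY environment variables; the variable names and configuration mechanism are assumptions, so check the repo's README for the real setup.

```python
import os

# Hypothetical setup check: the generation pipeline calls the OpenAI and
# Gemini APIs, so both keys should be available before running it. The exact
# variable names the toolkit expects are an assumption here.
REQUIRED_KEYS = ["OPENAI_API_KEY", "GEMINI_API_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")
print("Image-generation and LLM backends are configured.")
```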

3
📝 Describe your idea

You give a short description of the visual task, such as recognizing colors or spatial positions in a scene, and choose how many examples to generate; see the sketch after this step.
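A hypothetical invocation of that step; the entry-point name and every flag below are illustrative assumptions, not the repo's documented CLI.

```python
import subprocess

# Illustrative only: "visionfoundry" as the command name and all flags here
# are assumptions; consult the repo's README for the real interface.
subprocess.run(
    [
        "visionfoundry", "generate",
        "--task", "spatial_relations",                 # task keyword, e.g. color or position
        "--num-samples", "1000",                       # number of image-QA examples to create
        "--output-dir", "datasets/spatial_relations",  # where images and annotations land
    ],
    check=True,
)
```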

4
✨ Magic happens

You hit go, and the pipeline drafts detailed scene descriptions, generates matching images, writes questions, and produces answers, each verified for consistency with the generated image; a rough sketch of this loop follows.
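A minimal sketch of this generate-then-verify loop, assuming the OpenAI Python SDK for scene writing, image generation, and the consistency check; the model names, prompts, and verification logic are illustrative, not VisionFoundry's actual implementation.

```python
from openai import OpenAI

client = OpenAI()

# 1. Draft a scene description targeting the perception task.
scene = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Describe a simple photo of a red cube to the left of a blue sphere."}],
).choices[0].message.content

# 2. Generate an image from the scene description.
image_url = client.images.generate(
    model="dall-e-3", prompt=scene, size="1024x1024"
).data[0].url

# 3. Write a question/answer pair grounded in the scene description.
qa = {"question": "Which object is on the left?", "answer": "the red cube"}

# 4. Verify: ask a vision-capable model whether the answer matches the
#    generated image, and keep the example only if it does.
verdict = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Q: {qa['question']} Proposed answer: {qa['answer']}. "
                     "Does the image support this answer? Reply yes or no."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
).choices[0].message.content

keep_example = verdict.strip().lower().startswith("yes")
```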

5
📁 Collect your dataset

You get a ready-to-use folder of images paired with question-answer annotations, plus metadata about the scenes and styles used; an example record is shown below.
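As an illustration of what a produced record can look like, here is a conversation-style JSON line in the shape ms-swift accepts for multimodal SFT; the exact field names and file paths VisionFoundry writes are assumptions.

```python
import json

# Illustrative record in ms-swift's conversation format for multimodal SFT;
# the field layout and paths are assumptions, not the toolkit's output spec.
record = {
    "messages": [
        {"role": "user", "content": "<image>Which object is on the left?"},
        {"role": "assistant", "content": "The red cube is on the left."},
    ],
    "images": ["datasets/spatial_relations/images/000001.png"],
}

with open("datasets/spatial_relations/train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```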

6
🤖 Train your visual AI

You follow the included guides to fine-tune a vision-language model such as Llama Vision or Qwen-VL on the new dataset using ms-swift SFT (sketched below).
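A hedged example of the ms-swift fine-tuning step; flag names vary across ms-swift releases, and the model choice and LoRA setting below are illustrative rather than copied from VisionFoundry's own scripts.

```python
import subprocess

# Sketch only: recent ms-swift versions use flags like --model and --train_type,
# while older ones use --model_type and --sft_type; check your installed version.
subprocess.run(
    [
        "swift", "sft",
        "--model", "Qwen/Qwen2-VL-7B-Instruct",
        "--train_type", "lora",
        "--dataset", "datasets/spatial_relations/train.jsonl",
        "--output_dir", "output/visionfoundry-sft",
    ],
    check=True,
)
```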

🎉 Smarter vision AI ready

Your custom-trained model now handles the target perception tasks and is ready for your projects or research, along with a reusable dataset you can publish and share.

AI-Generated Review

What is VisionFoundry?

VisionFoundry is a Python tool for generating synthetic images and VQA datasets that teach VLMs visual perception tasks such as spatial relations or object attributes. Feed it a single task keyword and it produces verified images, questions, answers, and training-ready annotations in ms-swift format, in single- or multi-image modes. Instead of scraping real data, you get custom datasets fast, using the OpenAI and Gemini APIs for prompt writing, image generation, and consistency checks.

Why is it gaining traction?

It stands out by automating the full pipeline from task description to verifiable synthetic data, skipping the manual annotation that usually bogs down VLM training. Developers like the CLI simplicity: one command yields thousands of diverse, perception-focused examples, with options for story chains across images. Early results show gains on benchmarks, which appeals to anyone tired of noisy real-world datasets.

Who should use this?

ML engineers fine-tuning VLMs on perception-heavy tasks, like spatial reasoning or multi-object scenes. VLM researchers at startups or labs needing quick synthetic data boosts without labeling teams. Python devs experimenting with ms-swift SFT on models like Qwen-VL or Llama Vision.

Verdict

Worth a spin for synthetic data experiments: Princeton-backed with an accompanying paper, but at 40 stars it's early-stage with basic docs. Pair it with VLMEvalKit to check for quick wins (a sketch follows), but expect API costs and some tweaking before production scale.
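If you do pair it with VLMEvalKit, a run can look roughly like the sketch below; the benchmark and model identifiers must match VLMEvalKit's own registry, and pointing it at a custom fine-tuned checkpoint typically means editing its model config first.

```python
import subprocess

# Sketch of a VLMEvalKit evaluation run from inside its repo checkout; the
# --data and --model values are illustrative and must exist in its registry.
subprocess.run(
    ["python", "run.py", "--data", "MMBench_DEV_EN", "--model", "Qwen2-VL-7B-Instruct"],
    check=True,
)
```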
