hanxunyu / DepthVLM

Public

Official code repository for "Unlocking Dense Metric Depth Estimation in VLMs"

18
2
89% credibility
Found May 18, 2026 at 22 stars.
AI Analysis
Python
AI Summary

DepthVLM is a research project that adds depth estimation capabilities to vision-language AI models (VLMs). Built on Qwen3-VL-4B, it can analyze ordinary photos and produce dense, metric depth maps showing how far away objects are in meters. The project provides a pre-trained model available on HuggingFace, training scripts for two-stage fine-tuning (first training only the depth prediction head, then end-to-end refinement), and evaluation tools across multiple 3D sensing datasets (Argoverse2, Waymo, DDAD, nuScenes, ScanNet++, etc.). It aims to enable VLMs to understand the 3D geometric structure of scenes, not just recognize objects. The work comes from Zhejiang University and Tencent Hunyuan LLM, with a published arXiv paper (2605.15876).
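To make "evaluation" concrete, here is a minimal sketch of three metrics commonly reported for metric depth estimation: absolute relative error, RMSE, and the δ < 1.25 accuracy. The source does not say which metrics the repository's evaluation tools compute, so the function name `depth_metrics`, the validity range, and the choice of metrics are assumptions for illustration only.

```python
import math

def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Compute AbsRel, RMSE and delta<1.25 over valid pixels.

    pred, gt: flat sequences of depths in meters.
    min_depth / max_depth: assumed validity range for ground truth
    (hypothetical defaults, not taken from the DepthVLM repository).
    """
    # Keep only pixels that have usable ground truth and a positive prediction.
    pairs = [(p, g) for p, g in zip(pred, gt)
             if min_depth < g <= max_depth and p > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose prediction is within a factor of 1.25 of the truth.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}

pred = [2.0, 4.4, 12.0, 30.0]
gt = [2.0, 4.0, 8.0, 0.0]  # 0.0 marks a pixel with no ground truth
metrics = depth_metrics(pred, gt)
print(metrics)
```

Because these metrics compare predictions to ground truth directly in meters, with no rescaling, they only make sense for models that output metric depth.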

How It Works

1
📚 You learn about it from research

You read about DepthVLM in an academic paper or see it mentioned online. The project combines an AI assistant that can see with depth-sensing technology.

2
🔍 You understand what it does

This tool takes a single photo and predicts how far away each part of the image is, producing a per-pixel depth map from a flat picture.

3
🚀 You try the ready-made version

You download a pre-trained version that already knows how to estimate depth. It works right away on new photos without extra training.

4
📸 You load your own photos

You point it to pictures you want to analyze. The AI looks at each image and measures distances to objects.

5
🎯 You see the depth results

The system creates colorful depth maps showing nearby objects in warm colors and distant ones in cool colors. You can also build 3D point clouds.

6
You choose how deep to go

Quick start path: Use the provided demo script and get results in minutes.

Custom training path: Follow the two-stage training process if you need to fine-tune for specific scenes.

🎉 You unlock 3D understanding

You've successfully added depth perception to your AI workflow. Your photos now come with an estimated metric distance for each pixel.
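The last two steps (coloring a depth map and building a point cloud) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repository's code: `depth_to_rgb` and `backproject` are hypothetical helpers, the red-to-blue ramp stands in for whatever color map the demo uses, the camera is assumed to be a pinhole with known intrinsics `fx`, `fy`, `cx`, `cy`, and depth is assumed to be measured along the optical axis.

```python
def depth_to_rgb(depth_m, near, far):
    """Map a metric depth to a warm-to-cool ramp: near = red, far = blue."""
    t = (depth_m - near) / (far - near)
    t = min(max(t, 0.0), 1.0)  # clamp values outside [near, far]
    return (round(255 * (1.0 - t)), 0, round(255 * t))

def backproject(depth, fx, fy, cx, cy):
    """Lift an H x W depth map (meters) to camera-frame (x, y, z) points.

    Assumes a pinhole camera; pixels with depth <= 0 are treated as invalid
    and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# Tiny 2 x 2 depth map in meters; 0.0 marks an invalid pixel.
depth = [[2.0, 2.0],
         [4.0, 0.0]]
colors = [[depth_to_rgb(z, near=2.0, far=4.0) for z in row] for row in depth]
points = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(colors)
print(points)
```

Back-projection only yields a correctly scaled point cloud when the depth values are metric, which is why per-pixel depth in meters matters for 3D reconstruction.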

AI-Generated Review

What is DepthVLM?

DepthVLM is a vision-language model that estimates dense metric depth from images while retaining general multimodal understanding. Written in Python with Qwen3-VL as its backbone, it tackles a gap in current VLMs: most can reason about depth qualitatively but struggle with precise metric measurements. The model outputs per-pixel depth maps in meters, making it useful for 3D reconstruction, robotics, and scene understanding tasks. Developers can download the pretrained DepthVLM-4B checkpoint from Hugging Face, run inference via the provided demo scripts, or fine-tune the model using a two-stage training pipeline that first trains only the depth prediction head and then fine-tunes the whole model end-to-end.
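The two-stage idea (train the head with everything else frozen, then train everything together) can be shown on a deliberately tiny toy model. The sketch below is not DepthVLM's training code: the "backbone" is a single learnable `shift`, the "head" is a single learnable scale, and the learning rates and step counts are arbitrary assumptions chosen so plain gradient descent stays stable.

```python
def predict(params, x):
    feature = x + params["shift"]    # stand-in for the backbone
    return params["head"] * feature  # stand-in for the depth head

def mse(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def train(params, data, trainable, lr, steps):
    """Gradient descent that only updates the parameters named in `trainable`."""
    for _ in range(steps):
        grads = {"shift": 0.0, "head": 0.0}
        for x, y in data:
            feature = x + params["shift"]
            err = params["head"] * feature - y
            grads["head"] += 2 * err * feature / len(data)
            grads["shift"] += 2 * err * params["head"] / len(data)
        for name in trainable:       # frozen parameters are never touched
            params[name] -= lr * grads[name]
    return params

# Targets follow y = 3 * (x + 1), so a perfect fit needs shift = 1, head = 3.
data = [(0.5, 4.5), (1.0, 6.0), (1.5, 7.5)]
params = {"shift": 0.0, "head": 0.0}

# Stage 1: backbone frozen, only the head learns.
train(params, data, trainable=["head"], lr=0.1, steps=200)
shift_after_stage1 = params["shift"]
loss_stage1 = mse(params, data)

# Stage 2: end-to-end refinement with a smaller learning rate.
train(params, data, trainable=["shift", "head"], lr=0.01, steps=2000)
loss_stage2 = mse(params, data)
print(loss_stage1, loss_stage2)
```

In this toy setup the head alone cannot fit the data, so stage 1 plateaus at a nonzero loss and stage 2 lowers it further. A real run would freeze and unfreeze parameter groups in a deep learning framework, but the control flow has the same shape.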

Why is it gaining traction?

The key differentiator is combining dense geometry prediction with the general reasoning capabilities of VLMs. Where specialized depth models like MiDaS produce relative depth, DepthVLM outputs metric depth in meters. According to the reported benchmark results, it outperforms other VLM-based approaches such as DepthLM and Youtu-VL while running inference faster. The official release includes both the pretrained model and the DepthVLM-Bench dataset, giving researchers a reproducible foundation to build on. The two-stage training approach is also practical for organizations with limited GPU resources that want to adapt the model to their own data.
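The relative-versus-metric distinction can be made concrete. A relative prediction is only meaningful up to an unknown scale and shift, so turning it into meters requires fitting those two numbers against some metric reference. A metric model skips that step. The sketch below fits scale and shift by ordinary least squares; `fit_scale_shift` is a hypothetical helper, and the sample values are made up for illustration.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~ scale * relative + shift."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative values are constant; scale is undefined")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift

# Unitless relative predictions and the metric reference they should match.
relative = [0.1, 0.2, 0.4, 0.8]
reference_m = [2.0, 3.0, 5.0, 9.0]

scale, shift = fit_scale_shift(relative, reference_m)
aligned_m = [scale * r + shift for r in relative]
print(scale, shift, aligned_m)
```

The fit needs ground-truth distances for at least two pixels at different depths, which are often unavailable at inference time. That dependency is what a model with direct metric output avoids.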

Who should use this?

Computer vision researchers working on depth estimation who want to explore VLM-based approaches will find the benchmark and pretrained weights valuable. Robotics teams needing metric depth for navigation or manipulation could benefit from fine-tuning on domain-specific data. Academic researchers in 3D vision will appreciate the data processing pipeline, which supports 12 datasets including nuScenes, Waymo, and ScanNet++. However, teams expecting production-ready code with extensive testing and documentation should be prepared to invest time in understanding the training pipeline.

Verdict

At 18 stars, DepthVLM is an early-stage academic project with a credibility score of 89%. It is legitimate (published research with official code) but lacks community validation. The code is functional and well-documented for research use, but the sparse test coverage and limited production hardening mean teams should budget time for investigation before deploying. Worth exploring for research purposes; hold off for production until the project matures.
