Official code repository for "Unlocking Dense Metric Depth Estimation in VLMs"
DepthVLM is a research project that adds depth estimation to vision-language models (VLMs). Built on Qwen3-VL-4B, it takes ordinary photos and produces dense, metric depth maps that give the distance to each part of the scene in meters. The project provides a pre-trained model on HuggingFace, training scripts for two-stage fine-tuning (first training only the depth prediction head, then end-to-end refinement), and evaluation tools for multiple 3D sensing datasets (Argoverse2, Waymo, DDAD, nuScenes, ScanNet++, etc.). The goal is for VLMs to understand the 3D geometric structure of a scene, not just recognize the objects in it. The work comes from Zhejiang University and Tencent Hunyuan LLM and is described in an arXiv paper (2605.15876).
How It Works
You read about DepthVLM in an academic paper or see it mentioned online. The project pairs a vision-language model with dense depth prediction.
This tool takes a single photo and predicts how far away each part of the image is, producing a dense depth map from a flat picture.
You download a pre-trained version that already knows how to estimate depth. It works right away on new photos without extra training.
You point it at the pictures you want to analyze. The model looks at each image and estimates the distance to every part of the scene.
The system creates colorful depth maps showing nearby objects in warm colors and distant ones in cool colors. You can also build 3D point clouds.
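The visualization and point-cloud steps described above can be sketched with NumPy. This is a minimal illustration, not the repository's own code: the function names `colorize_depth` and `depth_to_points`, the simple red-to-blue ramp, and the pinhole intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions, and the depth values are assumed to be distances along the camera's optical axis in meters.

```python
import numpy as np


def colorize_depth(depth_m, d_min=None, d_max=None):
    """Map a metric depth map (H, W) in meters to an RGB image (H, W, 3), uint8.

    Near pixels become red (warm) and far pixels become blue (cool).
    This two-color ramp is a stand-in for illustration, not the project's colormap.
    """
    depth_m = np.asarray(depth_m, dtype=np.float64)
    d_min = float(depth_m.min()) if d_min is None else d_min
    d_max = float(depth_m.max()) if d_max is None else d_max
    # Normalize to [0, 1]; the max() guards against a constant depth map.
    t = np.clip((depth_m - d_min) / max(d_max - d_min, 1e-8), 0.0, 1.0)
    rgb = np.stack([1.0 - t, np.zeros_like(t), t], axis=-1)
    return (rgb * 255).round().astype(np.uint8)


def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth map to an (H*W, 3) point cloud in camera coordinates.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy) in pixels, and depth measured along the optical axis.
    """
    depth_m = np.asarray(depth_m, dtype=np.float64)
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)
```

Given a predicted depth array `depth` and known camera intrinsics, `colorize_depth(depth)` returns an image that can be saved with any image library, and `depth_to_points(depth, fx, fy, cx, cy)` returns one 3D point per pixel. If the intrinsics are unknown, the point cloud's shape will only be as accurate as the values you guess for them.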
Use the provided demo script to get results in minutes.
Follow the two-stage training process if you need to fine-tune for specific scenes.
You've successfully added depth perception to your AI workflow. Your photos now come with a metric distance estimate for each pixel.