opendatalab

A diffusion-based framework for document OCR that replaces autoregressive decoding with block-level parallel diffusion decoding.

100% credibility
Found Mar 26, 2026 at 82 stars.
AI Analysis
Python
AI Summary

MinerU-Diffusion is an open-source AI system that extracts structured text, tables, formulas, and layouts from document images using efficient diffusion-based decoding.

How It Works

1
🔍 Discover MinerU-Diffusion

You hear about a smart tool that turns photos of documents into clean, editable text, tables, and formulas.

2
💻 Prepare your setup

You get everything ready on your computer so the tool can work with your document images.

3
📥 Download the brain

You download the model weights that let the tool understand documents.

4
🖼️ Upload your document photo

You pick a picture of a page, such as a scanned report or book, for the tool to read.

5
🎯 Choose your goal

You tell it what to find: full page structure, plain text, tables, or math formulas.

6
Watch it work its magic

The tool scans the image and extracts the content in seconds.

Use the results

You get structured text ready to edit, copy, or reuse, instead of transcribing it by hand.

AI-Generated Review

What is MinerU-Diffusion?

MinerU-Diffusion is a diffusion-based framework for document OCR that reframes parsing as inverse rendering, replacing slow autoregressive decoding with block-level parallel diffusion decoding. Feed it page images through Python scripts or the Transformers API, and it outputs structured markdown with layout bounding boxes, raw text, LaTeX formulas, or OTSL tables. Built on a 2.5B vision-language model, it runs inference through HF Transformers, SGLang servers, or Nano-DVLM on a single GPU.
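To make "block-level parallel diffusion decoding" concrete, here is a minimal toy sketch, not the project's implementation. It assumes a masked-diffusion style loop: the output is split into fixed-size blocks, each block starts fully masked, and every step commits all positions whose confidence clears a threshold. The names `toy_denoiser`, `decode_block`, `decode`, `block_size`, and `threshold` are hypothetical, and the "model" simply copies a known target with random confidences.

```python
import random

MASK = object()  # sentinel for a position that has not been decoded yet


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for the model: for each masked slot,
    propose a token (copied from the target) with a random confidence."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok is MASK
    }


def decode_block(target_block, rng, threshold):
    """Fill one block; several positions may be committed in the same step."""
    tokens = [MASK] * len(target_block)
    steps = 0
    while any(t is MASK for t in tokens):
        proposals = toy_denoiser(tokens, target_block, rng)
        accepted = [i for i, (_, conf) in proposals.items() if conf >= threshold]
        if not accepted:
            # Commit the single most confident slot so the loop always progresses.
            accepted = [max(proposals, key=lambda i: proposals[i][1])]
        for i in accepted:
            tokens[i] = proposals[i][0]
        steps += 1
    return tokens, steps


def decode(target, block_size=8, threshold=0.5, seed=0):
    """Decode block by block; returns (tokens, total denoising steps)."""
    rng = random.Random(seed)
    out, total_steps = [], 0
    for start in range(0, len(target), block_size):
        block, steps = decode_block(target[start:start + block_size], rng, threshold)
        out.extend(block)
        total_steps += steps
    return out, total_steps
```

In this toy, a threshold above 1 never accepts more than one position per step, so the step count equals the token count, as in one-token-at-a-time decoding. A threshold of 0 accepts everything at once, so each block finishes in a single step.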

Why is it gaining traction?

It delivers up to 3x tokens-per-second speedup over MinerU baselines at near-identical accuracy, thanks to uncertainty-aware remasking and threshold controls for trading speed against accuracy. Developers like the end-to-end pipeline (detect layouts, crop blocks, extract content), which runs from a single bash command and produces merged markdown. SGLang compatibility enables high-throughput serving without custom servers.
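The detect, crop, extract, and merge flow described above can be sketched with stubs. This is only an illustration of the control flow under loud assumptions: the "page" is a list of text rows standing in for pixels, and `Block`, `detect_layout`, `crop`, `extract`, and `parse_page` are hypothetical names, not the repository's API. A real system would run the model in both the layout and the content pass.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str    # e.g. "text" or "table"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def detect_layout(page):
    """Hypothetical layout pass: one block per run of non-empty rows."""
    blocks, start = [], None
    for y, row in enumerate(page + [""]):  # trailing empty row closes the last run
        if row.strip() and start is None:
            start = y
        elif not row.strip() and start is not None:
            width = max(len(r) for r in page[start:y])
            kind = "table" if "|" in page[start] else "text"
            blocks.append(Block(kind, (0, start, width, y)))
            start = None
    return blocks


def crop(page, bbox):
    """Cut the region described by bbox out of the page."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in page[y0:y1]]


def extract(kind, region):
    """Hypothetical content pass: tables keep their rows, text is joined."""
    if kind == "table":
        return "\n".join(region)
    return " ".join(r.strip() for r in region)


def parse_page(page):
    """Detect blocks, crop each one, extract its content, merge in reading order."""
    parts = [extract(b.kind, crop(page, b.bbox)) for b in detect_layout(page)]
    return "\n\n".join(parts)
```

Keeping the layout pass separate from the content pass means each cropped block can be handled according to its type, and the merged output preserves the order in which blocks were detected.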

Who should use this?

Document parsing teams handling scanned PDFs, invoices, or academic papers needing layout-aware extraction beyond simple text OCR. Multimodal LLM researchers experimenting with diffusion decoding for vision tasks. Python devs building RAG pipelines over complex docs, frustrated by sequential generation bottlenecks.

Verdict

Promising early diffusion OCR alternative with solid docs, an HF demo, and scripts, but 82 stars and a 1.0% credibility score reflect its freshness. Test it for prototypes, and hold off on production until V2 and the training code drop. Grab it if you're prototyping fast, robust document AI.
