run-llama

ParseBench - A Document Parsing Benchmark for AI Agents

171 stars
Found Apr 13, 2026 at 44 stars.
AI Analysis
Python
AI Summary

ParseBench is a benchmark for evaluating how well document parsing tools convert PDFs into structured output that AI agents can reliably act on, using ~2,000 human-verified pages from real enterprise documents organized around five capability dimensions.

Star Growth

The repo grew from 44 stars at indexing to 171.
AI-Generated Review

What is ParseBench?

ParseBench is a Python benchmark for testing document parsing tools that feed PDFs into AI agents. It evaluates ~2,000 human-verified pages from real enterprise documents (insurance, finance, government) across five dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding, checking that parsed output preserves the structure agents need to make reliable decisions. A single CLI command, `uv run parse-bench run llamaparse_agentic`, downloads the data, calls the provider APIs, scores the results deterministically with rules, and generates HTML reports.
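To make the "score deterministically, then aggregate into a report" flow concrete, here is a minimal sketch of what such a harness could look like. The five dimension names come from the description above; the scoring rule (exact-match fraction of expected elements) and the data shapes are hypothetical stand-ins, not ParseBench's actual metrics.

```python
from statistics import mean

# The five capability dimensions named in the benchmark description.
DIMENSIONS = ["tables", "charts", "content_faithfulness",
              "semantic_formatting", "visual_grounding"]

def score_page(parsed: dict, expected: dict) -> dict:
    """Score one parsed page against human-verified ground truth.

    Each dimension gets a deterministic 0.0-1.0 score: here, simply the
    fraction of expected elements the parser reproduced exactly.
    (Real rule-based metrics are far richer; this is illustrative.)
    """
    scores = {}
    for dim in DIMENSIONS:
        expected_items = expected.get(dim, [])
        if not expected_items:
            scores[dim] = 1.0  # nothing required on this page
            continue
        parsed_items = set(parsed.get(dim, []))
        hits = sum(1 for item in expected_items if item in parsed_items)
        scores[dim] = hits / len(expected_items)
    return scores

def aggregate(page_scores: list[dict]) -> dict:
    """Average per-dimension scores across all pages for a report."""
    return {dim: round(mean(p[dim] for p in page_scores), 3)
            for dim in DIMENSIONS}
```

Because every score is a pure function of the parsed output and the ground truth, two runs over the same data always agree, which is the property that makes leaderboard comparisons meaningful.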

Why is it gaining traction?

Unlike fuzzy LLM-as-judge evaluations, ParseBench uses rule-based metrics (169k+ rules) that give reproducible results on agent-breaking failure modes such as misaligned tables or untraceable elements. It supports 90+ pipelines out of the box (LlamaParse, OpenAI GPTs, Anthropic, AWS Textract), with leaderboards, side-by-side comparisons via `parse-bench compare`, and interactive dashboards; the only setup is API keys. Reports can be served locally for PDF previews and drill-downs.
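To illustrate what a rule-based check for an agent-breaking failure mode looks like (as opposed to asking an LLM judge "does this table look right?"), here is a toy misaligned-table rule. It is not a ParseBench rule, just a sketch of the kind of deterministic predicate such a metric can be built from.

```python
def table_alignment_ok(table: list[list[str]]) -> bool:
    """Deterministic rule: a parsed table counts as aligned only if
    every row has exactly as many columns as the header row.
    A parser that spills a cell into a phantom extra column, or drops
    one, fails this check identically on every run."""
    if not table:
        return False
    width = len(table[0])
    return all(len(row) == width for row in table)

def misalignment_rate(tables: list[list[list[str]]]) -> float:
    """Fraction of a document's tables that fail the alignment rule."""
    if not tables:
        return 0.0
    bad = sum(1 for t in tables if not table_alignment_ok(t))
    return bad / len(tables)
```

A suite of thousands of such predicates yields scores that are cheap to compute, trivially reproducible, and easy to drill into, which is the trade the review credits for ParseBench's traction.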

Who should use this?

AI engineers building agents over scanned documents who want to evaluate parsing providers before production; RAG teams parsing financial reports or forms that need grounded tables and charts; and document-AI developers benchmarking custom VLMs against baselines like Docling or Surya.

Verdict

Grab it for agent parsing evals: the CLI and reports are polished, and the dataset on HuggingFace is solid. It's still early (v0.2) with a thin community, but the arXiv paper and 90% test coverage signal promise; test on a small slice of the data first.


