OxRML / MADQA

Public

Multimodal Agentic Document QA benchmark (MADQA)

19 stars · 100% credibility
Python
AI Summary

MADQA is a benchmark of 2,250 human-written questions over 800 PDF documents for testing AI agents' reasoning over visual and textual document collections; it ships with evaluation tools and baseline implementations.

How It Works

1
🔍 Discover MADQA

You find this tool for testing how well AI assistants answer questions about real PDF documents such as reports and manuals.

2
📥 Get the Questions

Download ready-made questions and matching PDF files so you can start testing right away.
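As a minimal sketch of this step, assuming the questions are hosted on Hugging Face under a hypothetical OxRML/MADQA dataset ID (the review further down mentions a Hugging Face dataset; check the repo's README for the real identifiers):

```python
# Sketch: pull the MADQA questions from Hugging Face. The dataset ID,
# split name, and field names are assumptions, not confirmed by the repo.
from datasets import load_dataset

madqa = load_dataset("OxRML/MADQA", split="test")  # hypothetical ID/split
print(madqa[0])  # expect fields like a question, an answer, and gold pages
```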

3
🤖 Pick an Assistant

Choose from simple search helpers or smart visual readers that scan documents for you.
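The baselines reportedly include BM25+MLLM agents (see the review further down). A minimal sketch of the "simple search helper" side, using the rank_bm25 package over placeholder page texts:

```python
# Sketch of lexical page retrieval with BM25 (via the rank_bm25 package).
# The page texts below are placeholders, not MADQA data.
from rank_bm25 import BM25Okapi

pages = [
    "Quarterly revenue grew 12% driven by services.",
    "Safety manual: always disconnect power before servicing.",
    "Appendix B lists all regional sales figures.",
]
bm25 = BM25Okapi([p.lower().split() for p in pages])

query = "how much did revenue grow"
top_pages = bm25.get_top_n(query.lower().split(), pages, n=2)
print(top_pages)  # the most lexically similar pages, as a first evidence pass
```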

4
🔗 Connect Your AI

Link a thinking service like ChatGPT or Claude so your assistant can read and reason over pages.

5
▶️ Ask Questions

Type a question and watch your assistant search pages, think step-by-step, and pull out answers with sources.
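As a rough sketch of steps 4 and 5 together, here is a retrieved page and a question sent to a hosted LLM via the OpenAI Python client. The model name, prompt, and citation format are assumptions; MADQA's actual baseline agents are more involved than this:

```python
# Sketch: send a question plus retrieved page text to a hosted LLM and
# ask for an answer that cites its page. Model, prompt, and output
# format are assumptions, not MADQA's actual baseline code.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

page_text = "Page 14: Quarterly revenue grew 12% driven by services."
question = "How much did revenue grow?"

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[
        {"role": "system",
         "content": "Answer from the given page text and cite the page number."},
        {"role": "user", "content": f"{page_text}\n\nQuestion: {question}"},
    ],
)
print(resp.choices[0].message.content)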

6
📊 Check Results

Automatically score how accurate the answers are and see exactly which pages were used.
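One of the reported metrics is page F1 (see the review further down). A minimal sketch of that idea as plain set precision/recall over cited page numbers; the repo's evaluator may apply different rules:

```python
# Sketch: page-level F1 over predicted vs. gold page-number sets.
# MADQA's scorer may differ; this just shows the idea.
def page_f1(predicted: set[int], gold: set[int]) -> float:
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(page_f1({3, 14}, {14}))  # ~0.667: one right page cited, one extra
```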

7
🏆 Compare Performance

View your scores next to top methods on the public leaderboard and improve your document reader.

AI-Generated Review

What is MADQA?

MADQA is a Python benchmark for testing multimodal agentic AI systems on document QA, with 2,250 human-authored questions over 800 diverse PDFs. It evaluates how agents reason across pages, retrieve evidence, and generate attributed answers, filling a gap in benchmarks for real-world multimodal RAG and agentic frameworks. Load the dataset from Hugging Face, run baselines such as BM25+MLLM agents or managed services, then score predictions via the CLI with metrics including accuracy, semantic match, and page F1.
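The eval tools take predictions as JSONL (noted below). The exact schema isn't shown on this page, so the field names here (qid, answer, pages) are assumptions; writing a predictions file might look like:

```python
# Sketch: write predictions as JSONL for the evaluation CLI. The field
# names are assumptions -- match whatever schema the repo's eval tools
# actually expect.
import json

predictions = [
    {"qid": "q0001", "answer": "12%", "pages": [14]},
    {"qid": "q0002", "answer": "Disconnect power first.", "pages": [3]},
]
with open("predictions.jsonl", "w") as f:
    for record in predictions:
        f.write(json.dumps(record) + "\n")
```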

Why is it gaining traction?

It stands out by pitting agents against human performance, contrasting strategic navigation with stochastic search over document collections, and it comes with a public leaderboard and an arXiv paper. Developers dig the ready-made baselines for multimodal LLM and multimodal agentic RAG setups, plus eval tools that handle JSONL outputs from any system. No more hand-rolling metrics for multimodal agentic systems.
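For contrast, here is roughly what hand-rolling one of those metrics looks like: a plain exact-match accuracy over a predictions JSONL. The field names and normalization rule are assumptions, and the bundled tools go further (semantic match, page-level scoring):

```python
# Sketch: a hand-rolled exact-match scorer over a predictions JSONL,
# for contrast with the repo's bundled eval tools. Field names and the
# normalization rule are assumptions.
import json

def normalize(text: str) -> str:
    return " ".join(text.lower().strip().rstrip(".").split())

def score(pred_path: str, gold: dict[str, str]) -> float:
    hits, total = 0, 0
    with open(pred_path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            hits += normalize(record["answer"]) == normalize(gold[record["qid"]])
    return hits / total if total else 0.0

gold_answers = {"q0001": "12%", "q0002": "disconnect power first"}
print(score("predictions.jsonl", gold_answers))
```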

Who should use this?

AI researchers benchmarking multimodal agentic models or frameworks, especially multimodal RAG pipelines that handle PDFs. Also teams building agentic systems for stock insights or document analysis, such as legal tech or finance devs who need page-level citations.

Verdict

Grab it if you're in multimodal agentic AI: a solid starting point, despite only 19 stars signaling early maturity. Docs are clear with quickstarts, but expect tweaks for production scale.
