mattc95

Benchmarking AI text detectors (GPTHumanizer, GPTZero, ZeroGPT, Sapling) across multiple datasets to evaluate accuracy, human false positive rates, and risk trade-offs.

75% credibility
Found May 22, 2026 at 18 stars.
AI Analysis
Python
AI Summary

This is a research project that compares four AI text detection tools by testing them on 1,000 text samples (500 human-written, 500 AI-generated). The benchmark measures how accurately each tool identifies AI content, but also tracks how often each tool wrongly flags real human writing as AI-generated. The project includes detailed results broken down by text length, source type, and AI model, along with full access to the test data and individual results so anyone can verify the findings. The key insight is that catching AI text and protecting human writers are different goals, and some tools excel at one while struggling with the other.
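The trade-off between catching AI text and protecting human writers can be illustrated with a minimal sketch. It assumes a detector that returns a score between 0 and 1 and a cutoff above which text is flagged as AI; that interface is an assumption here, not something the benchmark confirms for any of the four tools. The scores below are made up for illustration and the helper name `rates_at` is hypothetical.

```python
# Hypothetical (score, true_label) pairs. These values are invented for
# illustration and are NOT taken from the benchmark.
scored = [
    (0.95, "ai"), (0.80, "ai"), (0.60, "ai"), (0.40, "ai"),
    (0.70, "human"), (0.30, "human"), (0.20, "human"), (0.05, "human"),
]


def rates_at(threshold, scored):
    """Return (AI detection rate, human false positive rate) at a cutoff.

    A sample is flagged as AI when its score is >= threshold.
    """
    ai_scores = [s for s, label in scored if label == "ai"]
    human_scores = [s for s, label in scored if label == "human"]
    detection_rate = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    false_positive_rate = sum(s >= threshold for s in human_scores) / len(human_scores)
    return detection_rate, false_positive_rate


for threshold in (0.25, 0.5, 0.75):
    detection, fpr = rates_at(threshold, scored)
    print(f"threshold={threshold:.2f}  detection={detection:.2f}  human_fpr={fpr:.2f}")
```

On this toy data, lowering the cutoff catches more AI text but flags more human writing, and raising it does the reverse, which is why the two goals can pull in opposite directions.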

How It Works

1
🔍 You discover AI text detectors

You've heard about tools that claim to tell whether something was written by a human or by AI, and you're curious how well they actually work.

2
📊 You find a comprehensive comparison

You come across a project that tested four different AI detection tools on 1,000 text samples, with half written by humans and half generated by AI.

3
🎯 You see how each tool performs

The benchmark shows you exactly how often each tool correctly identifies AI text and, more importantly, how often it wrongly accuses real human writing of being AI.

4
⚖️ You understand the key trade-off

You learn that catching AI text and protecting human writers are two different goals, and some tools are better at one than the other.

5
You choose your path

🛡️ You care about protecting human writers

You focus on the false positive rates and learn which tool is safest for writers who might be wrongly accused.

🎯 You care about catching AI content

You focus on overall accuracy and AI detection rates to find the most sensitive tool.

📁 You want to verify the results yourself

You download the full test data and outputs to check the findings independently.

6
💡 You learn about text length effects

You discover that shorter texts are harder for all tools to classify, which helps you understand when to trust the results.

You make an informed decision

You now understand the real strengths and weaknesses of AI detectors, so you can use them responsibly or decide when not to rely on them at all.

AI-Generated Review

What is 2026-AI-DETECTOR-BENCHMARK?

This project benchmarks four AI text detectors -- GPTHumanizer, GPTZero, ZeroGPT, and Sapling -- by running them against the same 1,000 English text samples. Written in Python, it evaluates not just overall accuracy but specifically tracks the human false positive rate, meaning how often real human writing gets wrongly flagged as AI-generated. Results are saved as JSON with confusion matrices, precision/recall metrics, and breakdowns by text length. The full per-item outputs live on Google Drive so anyone can audit the conclusions.
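As a rough sketch of how a confusion matrix, precision, recall, and the human false positive rate relate, the example below computes them from a list of true labels and predictions and prints the result as JSON. The function names, JSON keys, and toy data are assumptions for illustration; the repository's actual output schema may differ.

```python
import json
from collections import Counter


def confusion_counts(labels, predictions):
    """Count outcomes, treating "ai" as the positive class."""
    counts = Counter()
    for truth, guess in zip(labels, predictions):
        if truth == "ai":
            counts["tp" if guess == "ai" else "fn"] += 1
        else:
            counts["fp" if guess == "ai" else "tn"] += 1
    return {key: counts[key] for key in ("tp", "fp", "tn", "fn")}


def safe_div(numerator, denominator):
    """Avoid ZeroDivisionError when a class is missing from the data."""
    return numerator / denominator if denominator else 0.0


def summarize(cm):
    """Derive the headline metrics from the four confusion-matrix cells."""
    total = sum(cm.values())
    return {
        "accuracy": safe_div(cm["tp"] + cm["tn"], total),
        "precision": safe_div(cm["tp"], cm["tp"] + cm["fp"]),
        "recall": safe_div(cm["tp"], cm["tp"] + cm["fn"]),
        # Share of human-written samples wrongly flagged as AI.
        "human_false_positive_rate": safe_div(cm["fp"], cm["fp"] + cm["tn"]),
    }


# Toy data, not benchmark results.
labels = ["ai", "ai", "ai", "human", "human", "human", "human", "ai"]
predictions = ["ai", "ai", "human", "human", "ai", "human", "human", "ai"]

cm = confusion_counts(labels, predictions)
report = {"confusion_matrix": cm, **summarize(cm)}
print(json.dumps(report, indent=2))
```

Note that accuracy and the human false positive rate are computed over different denominators, so a tool can score well on one and poorly on the other.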

Why is it gaining traction?

The standout feature is the explicit focus on false positive risk. Most detection benchmarks optimize for catching AI text, but this one highlights that wrongly accusing human writers carries serious academic or professional consequences. The repository also makes the data auditable -- AI samples include prompts and model sources, human samples include Pile-small provenance. You can verify results yourself rather than trusting a summary table.

Who should use this?

Developers building apps that integrate AI detection should reference this to understand trade-offs between GPTHumanizer (zero false positives), GPTZero (best overall accuracy), and the others. Educators or administrators making high-stakes decisions about student work can see which tools are safest for human writers. Researchers evaluating detector performance across different text lengths will find the stratified breakdowns useful.
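A breakdown by text length could be reproduced on your own data along these lines. This is a minimal sketch: the bucket edges (100 and 300 words), the record fields `text`, `label`, and `predicted`, and the function names are all assumptions, not the benchmark's actual definitions.

```python
from collections import defaultdict

# Hypothetical bucket edges in words: (inclusive low, exclusive high, name).
BUCKETS = [(0, 100, "short"), (100, 300, "medium"), (300, None, "long")]


def bucket_for(text):
    """Assign a text to a length bucket by whitespace-separated word count."""
    word_count = len(text.split())
    for low, high, name in BUCKETS:
        if word_count >= low and (high is None or word_count < high):
            return name


def accuracy_by_length(records):
    """Return accuracy per length bucket for records with label/predicted."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        bucket = bucket_for(record["text"])
        totals[bucket] += 1
        hits[bucket] += int(record["label"] == record["predicted"])
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy records, not benchmark data.
records = [
    {"text": "word " * 50, "label": "human", "predicted": "ai"},
    {"text": "word " * 60, "label": "ai", "predicted": "ai"},
    {"text": "word " * 200, "label": "ai", "predicted": "ai"},
    {"text": "word " * 400, "label": "human", "predicted": "human"},
]

result = accuracy_by_length(records)
print(result)
```

With only a handful of samples per bucket the per-bucket numbers are noisy, so check the bucket sizes before reading much into any difference.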

Verdict

This is a thoughtful benchmark with auditable methodology, but the 75% credibility score and 17 stars reflect an early-stage project with limited community testing. The documentation is solid, the code runs, and the conclusions are verifiable -- but treat the specific detector rankings as one data point, not final verdicts. Detector APIs evolve, and results from May 2026 may not hold tomorrow. Worth bookmarking and rerunning with your own samples before betting on anything high-stakes.
