stonesalltheway1

Open-source document processing pipeline for the Epstein case files. Download, OCR, extract entities, deduplicate, and export documents from the DOJ releases.

78
15
100% credibility
Found Feb 19, 2026 at 47 stars
AI Analysis
Python
AI Summary

Open-source tool for downloading, OCR-processing, entity-linking, and semantically searching public Epstein court documents to power investigative websites.

How It Works

1. 🔍 Discover the Epstein Explorer

You find a free tool that turns thousands of public Epstein court files into an easy-to-search treasure trove.

2. 📥 Grab the Tool

With one simple click or command, you install the tool on your computer – ready to go in moments.

3. 📚 Collect the Files

Tell the tool to download real court documents from trusted public sources like government archives.

4. 🧠 Reveal Hidden Details

The magic happens: it reads every scanned page, spots names, dates, flights, and money trails automatically.

5. 🔗 Spot Connections

Watch links form between people, places, and events – who flew together, emailed, or appeared in the same files.

6. 🔍 Ask and Find

Type questions like 'island trips' or 'bank deals' and get smart matches from your processed files.

Your Truth Hub

Celebrate: you now have a personal searchable database powering your own Epstein research discoveries.
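
The "Spot Connections" step can be pictured as counting how often two extracted entities show up in the same document. The sketch below is only an illustration of that idea, not the project's actual code: the function name `co_occurrences`, the document IDs, and the placeholder entity names are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrences(doc_entities):
    """Count how often each pair of entities appears in the same document.

    doc_entities maps a document ID to the list of entity names found in it.
    (Hypothetical input shape; the real pipeline's data model may differ.)
    """
    pairs = Counter()
    for entities in doc_entities.values():
        # Deduplicate within a document and sort so (a, b) and (b, a)
        # collapse into a single key.
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs


# Made-up placeholder data, not taken from any real document.
docs = {
    "doc-001": ["Person A", "Person B", "Location X"],
    "doc-002": ["Person A", "Person B"],
    "doc-003": ["Person B", "Location X"],
}
links = co_occurrences(docs)
for pair, count in links.items():
    print(pair, count)
```

Pairs with higher counts are candidates for a "link" between two entities; a real system would likely also weight by document type and extraction confidence.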

AI-Generated Review

What is Epstein-Pipeline?

This Python CLI tool is a document processing pipeline tailored to the Epstein case files. It pulls 140k+ DOJ PDFs from sources like Kaggle and Hugging Face, then runs OCR, entity extraction, deduplication, classification, and embedding generation before exporting to JSON, CSV, SQLite, or Neon Postgres with pgvector semantic search. It targets the hard parts of handling scanned legal dumps: multi-backend OCR with confidence-based fallbacks, three-pass deduplication (exact, fuzzy, semantic), and zero-shot NER for persons, dates, and case numbers. The output powers searchable sites like epsteinexposed.com. Run the subcommands in sequence (`epstein-pipeline download doj`, then `ocr ./raw/`, `dedup ./processed/`, and `export neon`) to get a queryable database.
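
To make the three-pass deduplication idea concrete, here is a minimal standard-library sketch. It is not the project's implementation: the function names and thresholds are assumptions, and the "semantic" pass uses bag-of-words cosine similarity as a stand-in for real embedding vectors.

```python
import hashlib
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_key(text):
    """Pass 1: hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def fuzzy_match(a, b, threshold):
    """Pass 2: character-level similarity, useful for OCR variants."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def cosine(a, b):
    """Pass 3 stand-in: bag-of-words cosine (an assumption, not embeddings)."""
    va, vb = Counter(normalize(a).split()), Counter(normalize(b).split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def dedup(docs, fuzzy_threshold=0.9, semantic_threshold=0.95):
    """Keep the first copy of each document; thresholds are illustrative."""
    kept, seen_hashes = [], set()
    for text in docs:
        key = exact_key(text)
        if key in seen_hashes:
            continue  # exact duplicate
        if any(fuzzy_match(text, k, fuzzy_threshold) for k in kept):
            continue  # near-identical characters
        if any(cosine(text, k) >= semantic_threshold for k in kept):
            continue  # same content in a different order
        seen_hashes.add(key)
        kept.append(text)
    return kept


# Made-up sample strings for illustration only.
samples = [
    "Flight log for March 2002",
    "FLIGHT LOG  for March 2002",       # differs only in case and spacing
    "Flight 1og for March 2002",        # OCR misread of "l" as "1"
    "for March 2002 flight log",        # same words, reordered
    "Deposition transcript, volume 2",
]
unique = dedup(samples)
print(unique)
```

Comparing each document against everything kept so far is quadratic, so a corpus of 140k+ files would need blocking or an approximate nearest-neighbor index; the sketch only shows the order of the passes.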

Why is it gaining traction?

Among generic AI document-processing and PDF tools on GitHub, it stands out for Epstein-specific tweaks such as GLiNER for flight logs and black book names, plus deduplication across OCR variants and semantic chunking for more precise embeddings. Docker Compose and modular pip extras (e.g., [ocr-gpu], [neon]) make it a drop-in document processing SDK for heavy workloads, while the CLI search command demonstrates pgvector queries without boilerplate setup.
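
As a rough picture of what chunking before embedding involves, the sketch below groups whole sentences into overlapping chunks. This is a simpler sentence-boundary approach, not true semantic chunking and not the repo's code; `chunk_sentences`, `max_chars`, and `overlap` are hypothetical names chosen for illustration.

```python
import re


def chunk_sentences(text, max_chars=200, overlap=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    The last `overlap` sentences of each chunk are repeated at the start of
    the next one, so context is not lost at chunk boundaries. A chunk can
    exceed max_chars when a single sentence or the carried overlap is long.
    """
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy input for illustration only.
chunks = chunk_sentences("A one. B two. C three. D four.", max_chars=14, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk would then be embedded and stored as one row in the vector table, so a search hit points at a specific passage rather than a whole document.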

Who should use this?

OSINT analysts digging through DOJ releases or similar case files, journalists building investigative archives, and document processing specialists adapting it for bulk PDF ingestion in legal or compliance workflows. It is a good fit if you're tired of chaining Unstructured, LlamaParse, or custom scripts for OCR, dedup, and a vector DB.

Verdict

Grab it for Epstein research or as a template repo for document processing: it has solid docs, an MIT license, passing CI, and a 100% credibility score, though 44 stars signals early maturity. Fork and generalize; it's production-ready for niches but needs more contributors for broad adoption.
