notoriouslab

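doc-cleaner: an open-source document cleaning tool designed for Traditional Chinese financial documents, with support for fully offline operation. Your documents shouldn't have to leave your computer just to get organized :)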

77 stars
13
100% credibility
Found Mar 13, 2026 at 68 stars.
AI Analysis
Python
AI Summary

A command-line tool that converts PDF, DOCX, XLSX, text, and CSV files into clean, structured Markdown, with strong support for CJK languages, tables, and optional local or cloud AI enhancement for privacy and accuracy.

How It Works

1. 📄 Gather your messy documents

You collect PDFs, Word files, spreadsheets, or text files, such as bank statements, that are hard to read because of tables and fine print.

2. 💻 Download the cleaner tool

Get this simple program from the web to turn those documents into neat, easy-to-read notes.

3. 🛠️ Prepare it on your computer

Follow quick setup steps so everything is ready to use without hassle.

4. Pick your cleaning style

Basic mode

Quick extraction of text and tables, fully private and fast.

🧠 Local smart help

Use your own computer's power to structure the content; nothing leaves your machine.

☁️ Online smart help

Tap into powerful online models for the trickiest documents.

5. 📁 Select files to clean

Choose a folder or specific files full of your documents.

6. Run the magic cleanup

Start the process and see your jumbled pages become clean, organized notes with tables preserved.

7. 👁️ Preview and save

Peek at the results first or save them directly to a new folder.

🎉 Enjoy perfect notes

You now have structured, readable Markdown files ready for reviewing finances, sharing, or using anywhere.
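
Selecting files and running the cleanup can be sketched as a small batch helper. This is a minimal sketch, not the tool's own batch logic: the helper names `collect_inputs` and `build_command` and the exact extension set are assumptions made for illustration. Only `cleaner.py`, `--input`, and `--ai none` are taken from the command quoted in the review on this page.

```python
import sys
from pathlib import Path

# Assumed extension set, derived from the file types the summary lists.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".xlsx", ".txt", ".csv"}


def collect_inputs(folder: Path) -> list[Path]:
    """Return supported documents under folder, sorted for a stable order."""
    return sorted(
        path
        for path in folder.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


def build_command(document: Path) -> list[str]:
    """Build the offline invocation for one document (hypothetical helper)."""
    return [sys.executable, "cleaner.py", "--input", str(document), "--ai", "none"]
```

Each list returned by `build_command` could be passed to `subprocess.run`, one document at a time, from the directory that contains `cleaner.py`. Whether the tool itself accepts a whole folder in one call is not shown here.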


Star Growth

This repo grew from 68 to 77 stars.
AI-Generated Review

What is doc-cleaner?

doc-cleaner is a Python CLI tool that converts PDF, DOCX, XLSX, and text files into clean, structured Markdown, optimized for Traditional Chinese financial documents such as bank statements. It preserves tables as pipe tables, auto-detects CJK encodings, and triages PDFs: native files go through fast text extraction, while scanned ones go through vision mode. It can run fully offline, with no cloud uploads. Run `python cleaner.py --input statement.pdf --ai none` for privacy-first cleaning without sending the file to an online converter.
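
Two of the behaviors described here, CJK encoding detection and pipe-table output, can be illustrated with a short sketch. It is not taken from the project's source: the function names, the candidate encoding order, and the lossy fallback are all assumptions, and a fallback chain like this can misidentify short or ambiguous inputs.

```python
from typing import Sequence

# Assumed candidate order: UTF-8 first, then Traditional Chinese (cp950),
# then the permissive gb18030 as a last resort.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp950", "gb18030")


def decode_cjk(raw: bytes, candidates: Sequence[str] = CANDIDATE_ENCODINGS) -> tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to UTF-8 with replacement characters.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def rows_to_pipe_table(rows: Sequence[Sequence[str]]) -> str:
    """Render rows as a Markdown pipe table; the first row is the header."""
    if not rows:
        return ""

    def render(cells: Sequence[str]) -> str:
        # Escape pipes and flatten newlines so each row stays on one line.
        escaped = [str(c).replace("|", "\\|").replace("\n", " ").strip() for c in cells]
        return "| " + " | ".join(escaped) + " |"

    header, *body = rows
    width = len(header)
    lines = [render(header), "| " + " | ".join(["---"] * width) + " |"]
    for row in body:
        # Trim or pad each row to the header width.
        padded = list(row)[:width] + [""] * (width - len(row))
        lines.append(render(padded))
    return "\n".join(lines)
```

Trying strict UTF-8 before the legacy code pages matters because cp950 and gb18030 accept many byte sequences, so they would otherwise claim files that are really UTF-8.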

Why is it gaining traction?

It goes beyond generic document cleaners by truncating bank ad boilerplate with regex patterns, emitting YAML frontmatter and JSON summaries for scripts, and integrating local Ollama (qwen3.5 models) or Gemini for AI structuring, all configurable in JSON. Dry-run previews and atomic writes prevent mishaps, and it slots into AI agent pipelines such as OpenClaw. Developers like the table fidelity and the zero-leak local mode.
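
Regex-based truncation and atomic writes can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `AD_PATTERNS` list is hypothetical (real statements would need their own markers), and `truncate_boilerplate` and `atomic_write` are names invented for this example.

```python
import os
import re
import tempfile
from pathlib import Path

# Hypothetical marker patterns; each bank's statements need their own list.
AD_PATTERNS = [r"^【廣告】", r"^注意事項"]


def truncate_boilerplate(text: str, patterns=AD_PATTERNS) -> str:
    """Cut the text at the earliest line that matches any boilerplate pattern."""
    cut = len(text)
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            cut = min(cut, match.start())
    kept = text[:cut].rstrip()
    return kept + "\n" if kept else ""


def atomic_write(path: Path, content: str) -> None:
    """Write to a temp file in the target directory, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file so a failed run leaves nothing behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Writing to a temporary file in the same directory and renaming it means a crash mid-write leaves either the old output or no output, never a half-written Markdown file.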

Who should use this?

Taiwan finance developers parsing bank PDFs with disclaimers and crushed tables, AI agent builders chaining document parsing to analysis, and personal finance scripters automating Gmail statements to Markdown. Ideal if you want to avoid uploading sensitive documents to cloud services.

Verdict

Early, with 77 stars and a 100% credibility score, but bilingual READMEs, exit codes, and an MIT license make it production-ready for its niche; test coverage is only implied by the fallbacks. Grab it for offline document cleaning, tweak the ad patterns, and ship.
