XiangpengHao

Unleash the performance potential of your Parquet files.

Found Mar 01, 2026 at 43 stars.
AI Analysis
Rust
AI Summary

A command-line tool that analyzes Parquet files for performance issues and rewrites them with optimizations like better compression and encoding to reduce size and decoding time.

How It Works

1
🔍 Discover the linter

A blog post introduces parquet-linter, a tool that makes Parquet files smaller and faster to decode.

2
📥 Install the tool

Install the CLI in a single step so it is ready to analyze your files.

3
📁 Lint a data file

Point the linter at one of your Parquet files; it scans the file's metadata for optimization opportunities.

4
💡 Review the diagnostics

The linter reports issues such as suboptimal compression, encodings, or page sizes, together with a ready-to-apply prescription for each fix.

5
✏️ Apply the improvements

A single rewrite command produces an optimized copy of the file with all prescribed changes applied.

6
📊 Measure the results

The rewritten file decodes faster and takes up less space on disk.

🚀 Enjoy faster data!

Your projects now read smaller, faster files, and you can save the prescription and reuse it on future files.
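The workflow above boils down to measure, prescribe, rewrite, verify. As a toy illustration of why such rewrites pay off (this is not the tool's actual implementation, which is a Rust CLI), here is a stdlib-only Python sketch of the size win from dictionary-encoding a low-cardinality string column, one class of optimization a Parquet linter targets:

```python
def plain_size(values):
    # Parquet-style plain encoding for strings: a 4-byte length prefix
    # plus the UTF-8 bytes of every value, repeated for every row.
    return sum(4 + len(v.encode("utf-8")) for v in values)

def dict_size(values):
    # Dictionary encoding: store each distinct string once, then store
    # fixed-width indices per row. (Real Parquet bit-packs the indices
    # with RLE; 4-byte indices keep this sketch simple and pessimistic.)
    distinct = sorted(set(values))
    dictionary = sum(4 + len(v.encode("utf-8")) for v in distinct)
    indices = 4 * len(values)
    return dictionary + indices

# A low-cardinality column: ~10,000 rows, only 3 distinct country names.
column = ["USA", "Germany", "Japan"] * 3_334
plain = plain_size(column)
encoded = dict_size(column)
print(f"plain: {plain} bytes, dictionary: {encoded} bytes")
```

Even with unpacked 4-byte indices, the dictionary form is less than half the plain size here, which is the kind of gap a linter can spot from metadata alone.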

AI-Generated Review

What is parquet-linter?

parquet-linter is a Rust CLI and library that lints Parquet files for performance issues such as suboptimal compression, encodings, and page sizes. Point it at local files, S3 buckets, or HTTP URLs; it analyzes the file metadata, estimates column cardinality, and outputs diagnostics alongside a DSL prescription to fix them. You can rewrite files directly, or apply a prescription when writing new files via ArrowWriter integration.
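To make the "estimates cardinality, outputs diagnostics" flow concrete, here is a minimal sketch of one such lint rule in Python. The function name, threshold, and message format are invented for illustration; they are not parquet-linter's actual rules or API:

```python
def lint_column(name, values, dict_threshold=0.5):
    """Toy lint rule: prescribe dictionary encoding when the fraction
    of distinct values in a column is low. (Illustrative only; the
    real tool works from Parquet metadata, not raw values.)"""
    cardinality = len(set(values))
    ratio = cardinality / max(len(values), 1)
    if ratio <= dict_threshold:
        return (f"{name}: {cardinality} distinct / {len(values)} rows "
                f"-> prescribe dictionary encoding")
    return f"{name}: high cardinality ({ratio:.0%}) -> keep plain encoding"

# A repetitive column gets a prescription; a unique-ID column does not.
print(lint_column("country", ["US", "DE", "US", "JP", "US", "DE"]))
print(lint_column("user_id", [f"u{i}" for i in range(1000)]))
```

A real linter emits such findings as a machine-readable prescription so the same fixes can be replayed when rewriting the file.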

Why is it gaining traction?

Its automatic rules catch real-world gotchas such as dictionary fallback or missing page statistics, and a leaderboard shows 5.7% smaller files and 19.6% faster Arrow decode on Hugging Face datasets. The prescription DSL lets you profile one file and batch-apply the result to others, while dry-run and export options make experimentation fast. Remote-file support via object stores means quick checks require no downloads.

Who should use this?

Data engineers optimizing Parquet for slow Spark or DuckDB scans on GB-scale datasets. ML teams tuning vector-embedding files for faster reads. Anyone writing high-volume Parquet who wants peak performance without manual tuning.

Verdict

Worth a spin for any Parquet-heavy workflow; the CLI delivers immediate wins. At 43 stars and 1.0% credibility, it's early-stage with solid docs and a blog post, so production users should validate rewrites before trusting them.

