apache / paimon-mosaic
Apache Paimon Mosaic: a columnar-bucket hybrid format optimized for wide tables.
Apache Paimon Mosaic is a file format designed to store wide tables with many columns efficiently. It combines two storage approaches, columnar (good for reading specific columns) and bucket-based (good for parallel processing), to optimize both read and write performance. The library provides bindings for Python, Java, and C++. Users write data in batches, which are automatically compressed, organized into buckets, and tracked with statistics. When reading, users can access specific columns or row groups without loading the whole file, and the built-in statistics help query engines skip irrelevant data. This is an official Apache Software Foundation project.
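To make the statistics-based skipping concrete, here is a minimal Python sketch of min/max pruning. It is an illustration of the general technique only, not the Mosaic API: `RowGroupStats` and `groups_to_read` are hypothetical names, and the assumption that statistics are kept per row group as an integer range is made purely for the example.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics for one column in one row group."""
    group_id: int
    min_value: int
    max_value: int


def groups_to_read(stats, low, high):
    """Return ids of row groups whose [min, max] range overlaps [low, high].

    A group whose range does not overlap the query range cannot contain
    a matching row, so a reader may skip it without decompressing it.
    """
    return [s.group_id for s in stats
            if s.max_value >= low and s.min_value <= high]


# Three row groups covering disjoint value ranges of some column.
stats = [
    RowGroupStats(0, 1, 100),
    RowGroupStats(1, 101, 200),
    RowGroupStats(2, 201, 300),
]

# A filter such as "value BETWEEN 150 AND 180" only needs row group 1.
selected = groups_to_read(stats, 150, 180)
print(selected)  # [1]
```

In this sketch, pruning is conservative: a group that overlaps the range is still read in full and filtered afterwards, so the statistics only ever save work and never change the result.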
How It Works
You have a dataset with lots of columns that needs to be stored efficiently for both reading and writing.
You pick your preferred language - Python, Java, or C++ - and install the Mosaic library to work with your data.
You set up a writer with your table's structure, choose how many buckets to use, and optionally tell it which columns to track statistics for.
As you write batches of data, the library automatically compresses them, splits them into buckets, and tracks min/max values for each column.
When you're done writing, you close the writer and get a complete file with all the metadata and statistics built in.
Later, you can open the file and read specific columns or row groups without loading everything. The statistics help query engines skip data that doesn't match.
Your data is organized for fast reads and writes, with automatic compression and helpful statistics for query optimization.
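The workflow above can be sketched with a toy, in-memory writer and reader in Python. This is not the Mosaic library or its API: `ToyWriter`, `read_column`, the round-robin assignment of columns to buckets, the JSON-plus-zlib encoding, and the file-level (rather than per-row-group) statistics are all assumptions chosen to keep the example small and runnable.

```python
import json
import zlib


class ToyWriter:
    """Toy in-memory writer illustrating the steps above; NOT the Mosaic API."""

    def __init__(self, columns, num_buckets, stats_columns=()):
        self.columns = list(columns)
        self.num_buckets = num_buckets
        self.stats_columns = set(stats_columns)
        # Assumption: columns are assigned to buckets round-robin.
        self.bucket_of = {c: i % num_buckets for i, c in enumerate(self.columns)}
        self.data = {c: [] for c in self.columns}

    def write_batch(self, batch):
        """Append a batch given as {column name: list of values}."""
        for name in self.columns:
            self.data[name].extend(batch[name])

    def close(self):
        """Compress each bucket and return the finished 'file' as a dict."""
        buckets = []
        for b in range(self.num_buckets):
            cols = {c: v for c, v in self.data.items() if self.bucket_of[c] == b}
            buckets.append(zlib.compress(json.dumps(cols).encode("utf-8")))
        # File-level min/max for the columns the caller asked to track.
        stats = {c: {"min": min(self.data[c]), "max": max(self.data[c])}
                 for c in self.stats_columns if self.data[c]}
        return {"buckets": buckets, "bucket_of": self.bucket_of, "stats": stats}


def read_column(toy_file, name):
    """Decompress only the bucket that holds the requested column."""
    payload = zlib.decompress(toy_file["buckets"][toy_file["bucket_of"][name]])
    return json.loads(payload)[name]


writer = ToyWriter(["id", "price", "qty", "region"],
                   num_buckets=2, stats_columns=["price"])
writer.write_batch({"id": [1, 2], "price": [9.5, 3.0],
                    "qty": [4, 1], "region": ["eu", "us"]})
writer.write_batch({"id": [3], "price": [7.25], "qty": [2], "region": ["eu"]})
toy_file = writer.close()

print(read_column(toy_file, "price"))  # [9.5, 3.0, 7.25]
print(toy_file["stats"]["price"])      # {'min': 3.0, 'max': 9.5}
```

Reading `price` touches only the bucket that contains it, which is the property that makes a layout like this attractive for wide tables: the cost of a read depends on the buckets needed, not on the total number of columns.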