apache

Apache Paimon Mosaic: a columnar-bucket hybrid format optimized for wide tables.

100% credibility
Found May 22, 2026 at 10 stars.
AI Analysis
Rust
AI Summary

Apache Paimon Mosaic is a file format designed to store wide tables with many columns efficiently. It combines two storage approaches, columnar (good for reading specific columns) and bucket-based (good for parallel processing), to optimize both read and write performance. The library provides bindings for Python, Java, and C++. Users write data in batches, and the library compresses each batch, organizes it into buckets, and records statistics for it. When reading, users can access specific columns or row groups without loading the whole file, and the built-in statistics help query engines skip irrelevant data. This is an official Apache Software Foundation project.

How It Works

1. 📊 You have a wide table with many columns

You have a dataset with lots of columns that needs to be stored efficiently for both reading and writing.

2. 📦 You install the library for your language

You pick your preferred language - Python, Java, or C++ - and install the Mosaic library to work with your data.

3. ✍️ You create a writer and define your data structure

You set up a writer with your table's structure, choose how many buckets to use, and optionally tell it which columns to track statistics for.

4. 🔄 Your data gets organized automatically

As you write batches of data, the library automatically compresses everything, splits it into buckets, and keeps track of min/max values for each column.

5. 📁 You close the writer and your file is ready

When you're done writing, you close the writer and get a complete file with all the metadata and statistics built in.

6. 🔍 You open the file and read what you need

Later, you can open the file and read specific columns or row groups without loading everything. The statistics help query engines skip data that doesn't match.

Your wide table is stored and accessed efficiently

Your data is organized for fast reads and writes, with automatic compression and helpful statistics for query optimization.
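
The write-then-read flow in the steps above can be sketched with a small toy model. This is not the Mosaic API: ToyWriter, write_batch, close, and read are hypothetical names invented for illustration, the "file" is an in-memory dict, JSON stands in for a real column encoding, and zlib stands in for ZSTD. It only shows the idea of per-row-group compression, min/max statistics, column projection, and statistics-based skipping.

```python
import json
import zlib  # stand-in for ZSTD in this toy model


class ToyWriter:
    """Hypothetical writer: one batch becomes one row group with min/max stats."""

    def __init__(self, columns, stats_columns=()):
        self.columns = list(columns)
        self.stats_columns = set(stats_columns)
        self.row_groups = []

    def write_batch(self, batch):
        # batch maps column name -> list of values.
        # Each column is serialized and compressed on its own.
        data = {c: zlib.compress(json.dumps(batch[c]).encode()) for c in self.columns}
        stats = {c: (min(batch[c]), max(batch[c])) for c in self.stats_columns}
        self.row_groups.append({"data": data, "stats": stats})

    def close(self):
        # A real format would write a footer to disk; here the "file" is a dict.
        return {"columns": self.columns, "row_groups": self.row_groups}


def read(file, columns, column=None, lo=None, hi=None):
    """Decode only `columns`, skipping row groups whose [min, max] misses [lo, hi].

    Skipping is per row group: a group that overlaps the range is returned whole.
    """
    out = {c: [] for c in columns}
    for group in file["row_groups"]:
        if column in group["stats"]:
            gmin, gmax = group["stats"][column]
            if gmax < lo or gmin > hi:
                continue  # the statistics prove that no row in this group matches
        for c in columns:
            out[c].extend(json.loads(zlib.decompress(group["data"][c])))
    return out


writer = ToyWriter(["ts", "temp", "site"], stats_columns=["ts"])
writer.write_batch({"ts": [1, 2, 3], "temp": [20.5, 21.0, 19.8], "site": ["a", "a", "b"]})
writer.write_batch({"ts": [10, 11, 12], "temp": [30.1, 29.9, 31.0], "site": ["c", "c", "c"]})
toy_file = writer.close()

# Only the second row group overlaps ts in [9, 20], and only "temp" is decoded.
result = read(toy_file, ["temp"], column="ts", lo=9, hi=20)
print(result)
```

In this sketch the first row group is never decompressed for the filtered read, which is the saving that embedded statistics are meant to give a query engine.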


AI-Generated Review

What is paimon-mosaic?

Apache Paimon Mosaic is a Rust-based file format that solves the wide-table problem in data lakes. It stores hundreds or thousands of columns efficiently by combining columnar organization with bucket-based partitioning. The format automatically picks the best encoding for each column: constant values stay constant, repeated values use dictionary encoding, and everything gets compressed with ZSTD. Data reads produce Arrow arrays directly, making it natural to plug into analytical workloads. The project ships with Java and Python bindings, so you can use it from JVM languages or Python without rewriting your data pipelines.
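
The per-column encoding choice described above can be illustrated with a toy selector. The function names choose_encoding and decode, the dict layout, and the max_dict_ratio threshold are all assumptions made for this sketch; the review does not say how Mosaic actually decides, and the ZSTD compression step is left out here.

```python
def choose_encoding(values, max_dict_ratio=0.5):
    """Pick a toy encoding for one column chunk.

    `max_dict_ratio` is a made-up heuristic for this sketch, not Mosaic's rule.
    Values must be hashable.
    """
    distinct = list(dict.fromkeys(values))  # unique values in first-seen order
    if len(distinct) <= 1:
        # Constant column: store the single value and the row count.
        value = distinct[0] if distinct else None
        return {"kind": "constant", "value": value, "rows": len(values)}
    if len(distinct) <= max_dict_ratio * len(values):
        # Repeated values: store each distinct value once, plus integer codes.
        index = {v: i for i, v in enumerate(distinct)}
        return {"kind": "dictionary", "dictionary": distinct,
                "codes": [index[v] for v in values]}
    # Mostly unique values: keep them as they are.
    return {"kind": "plain", "values": list(values)}


def decode(encoded):
    """Rebuild the original list of values from any of the three encodings."""
    if encoded["kind"] == "constant":
        return [encoded["value"]] * encoded["rows"]
    if encoded["kind"] == "dictionary":
        return [encoded["dictionary"][code] for code in encoded["codes"]]
    return list(encoded["values"])


region = choose_encoding(["eu", "eu", "eu", "eu"])
status = choose_encoding(["ok", "fail", "ok", "ok", "fail", "ok"])
reading = choose_encoding([17, 42, 8])
print(region["kind"], status["kind"], reading["kind"])
```

Wide tables tend to contain many columns that are constant or low-cardinality within a chunk, which is why the first two cases matter for storage size.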

Why is it gaining traction?

Wide tables are common in IoT, event logging, and financial data, but many formats handle them poorly. Mosaic takes a hybrid approach: columns land in buckets based on sorted names, which groups related data together and speeds up selective reads. The format is self-describing with embedded statistics, so query engines can decide from the statistics alone to skip entire row groups, without reading the data in them. Being part of the Apache Paimon ecosystem means it slots into existing Flink and Spark integrations rather than requiring a full infrastructure swap.
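
To see why bucketing by sorted column name can help selective reads, here is a minimal sketch. It assumes that buckets are contiguous, near-equal ranges of the sorted names; the review does not specify Mosaic's actual assignment rule, and assign_buckets and buckets_to_read are hypothetical helpers.

```python
def assign_buckets(column_names, num_buckets):
    """Split the sorted column names into contiguous, near-equal buckets."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    ordered = sorted(column_names)
    base, extra = divmod(len(ordered), num_buckets)
    buckets, start = [], 0
    for i in range(num_buckets):
        size = base + (1 if i < extra else 0)  # the first `extra` buckets get one more
        buckets.append(ordered[start:start + size])
        start += size
    return buckets


def buckets_to_read(buckets, wanted):
    """Return the indices of buckets that hold at least one wanted column."""
    wanted = set(wanted)
    return [i for i, names in enumerate(buckets) if wanted & set(names)]


columns = ["sensor_b_temp", "sensor_a_volt", "sensor_b_hum",
           "sensor_a_temp", "sensor_b_volt", "sensor_a_hum"]
buckets = assign_buckets(columns, 2)
needed = buckets_to_read(buckets, ["sensor_b_hum", "sensor_b_temp"])
print(buckets)
print(needed)
```

Because columns with a shared prefix sort next to each other, a query that touches only the sensor_b columns needs one bucket here instead of two.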

Who should use this?

Data engineers building real-time analytics pipelines on wide, sparse datasets should watch this. Teams already running Apache Flink or Spark with Paimon will get the most value since Mosaic is designed as a storage layer for that stack. If you're storing sensor telemetry, audit logs, or any table with hundreds of rarely-queried columns, the bucket-based layout and aggressive compression could cut storage costs noticeably. Python and Java users who need high-performance columnar I/O without leaving their runtime will find the FFI bindings useful.

Verdict

At 10 stars, this is extremely early-stage software, despite its 100% credibility score. The Apache governance and Rust implementation suggest long-term viability, but documentation is thin and the learning curve is steep without examples. Worth evaluating for specific wide-table use cases, but not ready for production without thorough testing.
