apache

Java bindings for Apache DataFusion

13
6
100% credibility
Found May 18, 2026 at 15 stars.
Java
AI Summary

Apache DataFusion Java is an official Java library from the Apache Software Foundation for running SQL queries and data transformations directly on files in formats such as CSV, Parquet, JSON, and Arrow. Instead of loading data into a database first, your Java application connects to the DataFusion query engine, which processes the files in place. The library offers both SQL and a DataFrame API for transforming data, and it returns results as Apache Arrow data. You can also register custom Java functions to extend the query engine with your own calculations.

How It Works

1
🔍 Discovering the need for fast data queries

You have large data files sitting on your computer and need to run SQL queries on them without the hassle of moving everything into a database first.

2
📦 Finding the Java bindings

You discover there's an official Java library that connects your Java code directly to a powerful query engine that processes your files at high speed.

3
🚀 Creating your query workspace

You set up a SessionContext - your personal workspace where all your data lives and gets queried. Everything is configured and ready to go with just a few lines of code.

4
📂 Loading your data files

You point the library at your files - whether they're CSV spreadsheets, compressed Parquet reports, or JSON logs. The library reads them directly without any extra steps.

5
Choosing how to query
📝 Writing SQL queries

You write familiar SQL statements like SELECT, GROUP BY, and JOIN directly against your files.

🔗 Using the DataFrame API

You chain together operations like filter, select, and limit to transform your data step by step.
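
To make the filter, select, and limit chain concrete, here is a minimal plain-Java analogy built on java.util.stream. It is not the datafusion-java API: the Trip record, the citiesWithFareAbove helper, and the sample data are all hypothetical and exist only to show the shape of a step-by-step transformation.

```java
import java.util.List;
import java.util.stream.Collectors;

public class DataFrameChainSketch {
    // Hypothetical row type standing in for one record of a loaded file.
    record Trip(String city, double fare) {}

    // Analogy for a filter -> select -> limit chain, written with Java streams.
    static List<String> citiesWithFareAbove(List<Trip> trips, double minFare, int limit) {
        return trips.stream()
                .filter(t -> t.fare() > minFare) // filter: keep rows matching a predicate
                .map(Trip::city)                 // select: keep only the city column
                .limit(limit)                    // limit: cap the number of rows returned
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Trip> trips = List.of(
                new Trip("Oslo", 12.5),
                new Trip("Lima", 40.0),
                new Trip("Pune", 55.0),
                new Trip("Kyiv", 70.0));
        System.out.println(citiesWithFareAbove(trips, 30.0, 2)); // prints [Lima, Pune]
    }
}
```

Each call in the chain returns a new pipeline stage rather than changing the input, which is the same mental model a DataFrame-style API encourages.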

6
Getting lightning-fast results

The engine processes your query at high speed and streams the results back to your Java program as efficient Arrow data batches.
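
The idea of consuming results batch by batch can be sketched in plain Java. This is an illustration under stated assumptions, not the library's API: the Batch record is a hypothetical stand-in for one Arrow record batch holding a single column, and sumAll is a made-up helper.

```java
import java.util.Iterator;
import java.util.List;

public class BatchStreamSketch {
    // Hypothetical stand-in for one columnar batch: a single column of long values.
    record Batch(long[] values) {}

    // Consumes batches one at a time, so only the current batch has to be
    // held while the running total is updated.
    static long sumAll(Iterator<Batch> batches) {
        long total = 0;
        while (batches.hasNext()) {
            for (long v : batches.next().values()) {
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Iterator<Batch> batches = List.of(
                new Batch(new long[] {1, 2, 3}),
                new Batch(new long[] {4, 5})).iterator();
        System.out.println(sumAll(batches)); // prints 15
    }
}
```

Processing one batch at a time keeps memory use bounded by the batch size rather than by the full result set.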

🎉 Saving and sharing your results

You can write your processed results back to Parquet files or other formats, ready to share with your team or feed into the next step of your pipeline.

AI-Generated Review

What is datafusion-java?

Apache DataFusion Java is a set of Java bindings that let you run SQL queries and DataFrame operations against files like Parquet, CSV, and JSON. The queries execute in native Rust code under the hood, with results streamed back to the JVM as Apache Arrow batches. You write standard Java, call familiar APIs like SessionContext and DataFrame, and DataFusion handles the heavy lifting in a performant Rust runtime.

Why is it gaining traction?

This project fills a gap for teams locked in the Java ecosystem who want DataFusion's modern query engine without rewriting everything in Python or Rust. The Arrow C Data Interface integration means zero-copy data transfer between native memory and the JVM, which is rare in Java bindings for native libraries. You get full SQL support, custom user-defined functions written in Java, and access to DataFusion's optimizer without leaving your existing stack.
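
To show what "zero-copy" means in practice, the sketch below uses only java.nio from the standard library. It does not use the Arrow C Data Interface or any datafusion-java call. It demonstrates the underlying idea that two views can share one region of off-heap memory, so handing data to a consumer does not require copying bytes. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ZeroCopyViewSketch {
    // Returns a second view over the same memory; no bytes are copied.
    // duplicate() resets the byte order, so it is reapplied explicitly.
    static ByteBuffer shareWithoutCopy(ByteBuffer source) {
        return source.duplicate().order(source.order());
    }

    public static void main(String[] args) {
        // Off-heap buffer standing in for memory owned by native code.
        ByteBuffer producer = ByteBuffer.allocateDirect(4 * Integer.BYTES)
                .order(ByteOrder.nativeOrder());
        for (int i = 0; i < 4; i++) {
            producer.putInt(i * Integer.BYTES, i * 10);
        }

        ByteBuffer consumer = shareWithoutCopy(producer);

        // A write through one view is visible through the other,
        // which shows that both refer to the same bytes.
        producer.putInt(0, 99);
        System.out.println(consumer.getInt(0)); // prints 99
    }
}
```

A copying design would instead allocate a new buffer and transfer every byte, which costs time and memory proportional to the data size.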

Who should use this?

Java backend teams processing large analytical workloads on Parquet or CSV files who want SQL without deploying a separate query service. Data engineers building ETL pipelines in Java who need the performance of a columnar execution engine. Teams already using Apache Arrow who want to add flexible query capabilities without a database dependency.

Verdict

This is a legitimate Apache project with clean APIs and solid architectural choices, but with only 13 stars and no releases yet, expect rough edges and API changes. The 1.0% credibility score reflects early-stage status, not a quality concern. Worth evaluating for greenfield Java data projects, but hold off on production dependencies until the first release.
