Apache DataFusion Java is an official Java library from the Apache Software Foundation that lets you run SQL queries and data transformations directly on files in CSV, Parquet, JSON, and Arrow formats. Instead of loading data into a database first, you connect your Java application to the DataFusion query engine, which processes your files in place. The library provides both SQL queries and a DataFrame API for transforming data, and it returns results as Apache Arrow data. You can also register custom Java functions to extend the query engine with your own calculations.
How It Works
You have large data files sitting on your computer and need to run SQL queries on them without the hassle of moving everything into a database first.
You discover there's an official Java library that connects your Java code directly to a powerful query engine that processes your files at high speed.
You set up a SessionContext - the workspace where your data sources are registered and queried. It takes just a few lines of code to configure.
You point the library at your files - whether they're CSV spreadsheets, compressed Parquet reports, or JSON logs. The library reads them directly without any extra steps.
You write familiar SQL, using clauses like SELECT, GROUP BY, and JOIN, directly against your files.
Alternatively, you use the DataFrame API to chain operations like filter, select, and limit, transforming your data step by step.
The engine processes your query and streams the results back to your Java program as Arrow record batches.
You can write your processed results back to Parquet files or other formats, ready to share with your team or feed into the next step of your pipeline.
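The filter, select, and limit chaining described in the steps above can be pictured with a plain-Java analogy. This sketch deliberately does not call the DataFusion Java API, whose exact method names are not given here; it uses java.util.stream over an in-memory list only to show the shape of such a pipeline. The Sale class, the sampleRows data, and the topProducts helper are all hypothetical names invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Hypothetical row type standing in for one record of a data file.
    static final class Sale {
        final String region;
        final String product;
        final double amount;

        Sale(String region, String product, double amount) {
            this.region = region;
            this.product = product;
            this.amount = amount;
        }
    }

    // Made-up sample data; a real pipeline would read these rows from a file.
    static List<Sale> sampleRows() {
        return Arrays.asList(
                new Sale("EU", "widget", 120.0),
                new Sale("US", "gadget", 40.0),
                new Sale("EU", "gizmo", 300.0),
                new Sale("US", "doohickey", 95.0));
    }

    // filter -> select -> limit, expressed with java.util.stream (an analogy only).
    static List<String> topProducts(List<Sale> rows, double minAmount, int maxRows) {
        return rows.stream()
                .filter(r -> r.amount >= minAmount) // filter: keep rows meeting the condition
                .map(r -> r.product)                // select: keep a single column
                .limit(maxRows)                     // limit: cap the number of rows returned
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topProducts(sampleRows(), 90.0, 2)); // prints [widget, gizmo]
    }
}
```

Each call in the chain returns a new pipeline stage rather than modifying the input, which is the same step-by-step style the DataFrame operations are described as having. In the analogy the rows are already in memory; the library, by contrast, is described as reading files directly and returning Arrow batches.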