paulgp

Build a harmonized DuckDB database from 20+ years of IPEDS higher education data

14
0
100% credibility
Found Mar 03, 2026 at 14 stars
AI Analysis
Python
AI Summary

A script that downloads, cleans, and combines 25+ years of public U.S. postsecondary education data from IPEDS into a single queryable database file.

How It Works

1
🔍 Discover College Data Treasure

You hear about a free collection of detailed info on every U.S. college, from admissions to tuition over 25 years.

2
📥 Grab the Files

Download the simple project files to your computer to get started.

3
🛠️ Set Up Your Tools

Install the free tools it needs, such as uv to run the script and DuckDB to query the result.

4
Build Your Database

Run one command, `uv run python build_database.py`, to download and organize all the college data into a ready-to-use file. The download is about 720MB, so the build takes a while.

5
📊 Start Exploring

Open the data file and ask questions like 'Which colleges have the lowest admission rates?' using SQL queries.

6
💡 Try Ready Examples

Run the included sample questions to see trends in degrees awarded or tuition changes over time.

🎉 Unlock College Insights

Now you have a powerful personal dataset to compare schools, spot trends, and make informed decisions about education.
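As a rough illustration of the "Start Exploring" step, the sketch below builds and runs an admission-rate query from Python. The file name `ipeds.duckdb`, the table name `admissions`, and the columns `unitid`, `year`, `applicants`, and `admitted` are all assumptions made for this example; check the real names in the built database before relying on them.

```python
from pathlib import Path

# Hypothetical names: the real database file, table, and columns may differ.
DB_PATH = Path("ipeds.duckdb")


def lowest_admission_rates_sql(table="admissions", year=2023, limit=10):
    """Build a query that ranks institutions by admitted / applicants."""
    # Only plain identifiers are accepted, since the name is spliced into SQL.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    return (
        "SELECT unitid, admitted * 1.0 / applicants AS admit_rate "
        f"FROM {table} "
        f"WHERE year = {int(year)} AND applicants > 0 "
        "ORDER BY admit_rate ASC "
        f"LIMIT {int(limit)}"
    )


def run_query(sql, db_path=DB_PATH):
    """Run SQL against the built file (needs the third-party duckdb package)."""
    import duckdb  # imported lazily so the SQL helper works without it

    con = duckdb.connect(str(db_path), read_only=True)
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()
```

Once the database has been built, `run_query(lowest_admission_rates_sql(limit=5))` would return the five lowest rates, provided the assumed table and column names match the real schema.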

AI-Generated Review

What is ipeds-database?

This Python project builds a harmonized DuckDB database from 20+ years of IPEDS higher education data with one command: `uv run python build_database.py`. It downloads ~720MB of raw files, cleans schema changes across years, handles missing values, and outputs a 1.1GB file with 26 million rows across 20 tables covering U.S. colleges from 1997-2024—think admissions, enrollment by race/age, completions by CIP code, tuition, graduation rates, and staff salaries. Query it instantly via SQL in any DuckDB client or Python for fast analytics on postsecondary trends.
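One way to sanity-check a finished build against the figures quoted above is to list every table with its row count. The helper below is a minimal sketch, not part of the project: it accepts any connection object exposing `execute(...).fetchall()`, which a `duckdb` connection does, and it relies on DuckDB's `SHOW TABLES` statement. The database file name in the usage note is an assumption.

```python
def table_row_counts(con):
    """Return {name: row_count} for every relation listed by SHOW TABLES."""
    names = [row[0] for row in con.execute("SHOW TABLES").fetchall()]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names cannot break the SQL.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = con.execute(f"SELECT count(*) FROM {quoted}").fetchone()[0]
    return counts
```

With the `duckdb` package installed, `table_row_counts(duckdb.connect("ipeds.duckdb", read_only=True))` would produce the mapping, where `ipeds.duckdb` stands in for whatever file the build script actually writes.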

Why is it gaining traction?

IPEDS raw data is a mess: hundreds of CSVs with shifting column names, overhauled race categories, and cryptic missing-value codes. This project turns it into a joinable, SQL-ready database with built-in views such as admission rates and tuition trends, plus example queries and R plots. Developers like the single-script ETL, which is friendly to GitHub Actions and caches downloads so that rebuilding a subset is quick, making it a good base for GitHub projects or portfolios that demo education data visualization. There is no manual harmonization; just query MBA degrees over time or HBCU enrollment shifts.
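To make the harmonization idea concrete, here is a minimal sketch of mapping year-specific column names onto one schema and turning missing-value codes into `None`. It is not the project's actual code: the rename map, the choice of years, and the set of missing codes are all hypothetical and exist only to show the technique.

```python
# Hypothetical per-year rename maps: raw column name -> harmonized name.
RENAMES = {
    2005: {"applcn": "applicants", "admssn": "admitted"},
    2020: {"APPLCN": "applicants", "ADMSSN": "admitted"},
}

# Hypothetical codes treated as "missing"; the real files may use others.
MISSING_CODES = {"", ".", "-1", "-2"}


def harmonize_row(raw, year):
    """Rename one raw CSV row to the shared schema and null out missing codes."""
    mapping = RENAMES[year]
    out = {"year": year}
    for raw_name, value in raw.items():
        name = mapping.get(raw_name)
        if name is None:
            continue  # drop columns that have no harmonized equivalent
        value = value.strip()
        out[name] = None if value in MISSING_CODES else int(value)
    return out
```

Rows from different years then share the same keys, so they can be loaded into one table and compared directly.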

Who should use this?

Higher ed researchers analyzing graduation rates or tuition trends; policy analysts at think tanks tracking institutional shifts; edtech devs building apps with college stats like SAT yields or staff demographics. Great for economists studying completions by CIP code or journalists plotting in-state enrollment.

Verdict

Solid niche tool for IPEDS work—excellent docs, examples, and MIT license—but low maturity with 14 stars and 1.0% credibility score means test it for your years/tables first. Use if you need this data; otherwise, skip for broader datasets.
