digitalcortex

Dataset with unique registered domains extracted from Common Crawl's columnar index (cc-index).

Found Mar 06, 2026 at 14 stars.
AI Summary

A dataset of 72 million unique registered domains extracted from Common Crawl indexes across 14 crawls, intended as a starting point for web crawlers and research.

How It Works

1
🔍 Discover the domain list

You hear about a massive collection of real website names pulled from huge web scans, perfect for kick-starting web research or crawler projects.

2
📱 Check out the page

You visit the simple project page to see what it's all about.

3
📅 See the coverage

You discover it gathers unique domains from 14 web scans spanning over a year.

4
💾 Download the dataset

You grab the handy file packed with 72 million unique website domains.

5
📊 Open in your tool

You load the list into an analysis tool like Python or a database to start browsing (see the sketch after these steps; 72 million rows is far too many for a spreadsheet or notes app).

6
🚀 Kick off your project

With this giant seed list, your web research or crawler project springs to life.

🎉 Web insights unlocked

You now explore millions of real sites, fueling your discoveries with ease.
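
For the "open in your tool" step, a few lines of Python go a long way. A minimal sketch, assuming the download is a single Parquet file named domains.parquet with a single domain column (both names are unverified guesses; check the repo's README for the actual layout):

```python
# Peek at the dataset without loading all ~72M rows into memory.
# "domains.parquet" and the "domain" column are assumptions, not confirmed names.
import pyarrow.parquet as pq

pf = pq.ParquetFile("domains.parquet")
print(f"total rows: {pf.metadata.num_rows:,}")  # expect roughly 72,475,235

# Stream just the first small batch instead of reading the whole file.
first_batch = next(pf.iter_batches(batch_size=10))
print(first_batch.to_pydict())
```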


AI-Generated Review

What is 72m-domains-dataset?

This GitHub dataset repo packs 72,475,235 unique registered domains extracted from Common Crawl's cc-index columnar Parquet files across 14 recent crawls, from CC-MAIN-2025-05 through CC-MAIN-2026-08. Developers can download the deduplicated list in Parquet or SQLite format, ideal as a seed frontier for web crawlers or search engines chasing fresh domains. It's a no-fuss, BigQuery-ready, Python-loadable resource for domain discovery.
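
For the crawler-seed use case, the SQLite variant is the quickest to query. A hedged sketch, assuming a file named domains.sqlite with a domains table and a domain column (inspect the real schema with the sqlite3 CLI's .schema first):

```python
# Pull a small seed batch from the SQLite variant and turn it into start URLs.
# "domains.sqlite", the "domains" table, and the "domain" column are assumptions.
import sqlite3

conn = sqlite3.connect("domains.sqlite")
seeds = [row[0] for row in conn.execute("SELECT domain FROM domains LIMIT 1000")]
conn.close()

# Naive seed URLs for a crawl frontier; a real crawler should also probe http://.
frontier = [f"https://{d}/" for d in seeds]
print(frontier[:5])
```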

Why is it gaining traction?

Unlike scattered Kaggle datasets or partial Common Crawl dumps, this delivers 72M unique values from cc-index snapshots in one clean file, skipping manual Parquet scans. Pyarrow-friendly deduplication and easy serialization to CSV make it a quick win for Hugging Face or Ray data pipelines. Low stars (14) aside, the scale and recency stand out for anyone needing columnar Common Crawl intel without rebuilding from scratch.
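
The pyarrow flow alluded to here fits on one screen: read the domain column, re-check uniqueness, and serialize to CSV. A sketch under the same naming assumptions as above (domains.parquet, domain column):

```python
# Verify the dedup claim and export to CSV with pyarrow.
# File and column names are assumptions; adjust to the repo's actual schema.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

table = pq.read_table("domains.parquet", columns=["domain"])
unique_domains = pc.unique(table.column("domain"))  # should match the row count
print(f"{table.num_rows:,} rows, {len(unique_domains):,} unique")

pacsv.write_csv(pa.table({"domain": unique_domains}), "domains.csv")
```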

Who should use this?

Web crawler engineers seeding discovery queues, search engine devs bootstrapping indexes, or SEO researchers analyzing domain trends. Python scripters can load it with pandas for BigQuery exports, and Power BI users can treat the domain column as a unique identifier for visualizations. Skip it if you're after full-page content rather than just hosts.
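
For the SEO-research angle, pandas makes quick work of domain-trend questions like TLD distribution. A sketch, again assuming domains.parquet with a domain column (and enough RAM for ~72M short strings):

```python
# Rough TLD distribution across the list with pandas.
# "domains.parquet"/"domain" are assumed names; 72M rows needs a few GB of RAM.
import pandas as pd

df = pd.read_parquet("domains.parquet", columns=["domain"])
df["tld"] = df["domain"].str.rsplit(".", n=1).str[-1]  # label after the last dot
print(df["tld"].value_counts().head(10))               # ten most common TLDs
```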

Verdict

Grab it if you need a massive, deduped domain seed from fresh cc-index. Docs are basic (README only, no health.json depth) but functional for dataset pipelines. With a 1.0% credibility score and 14 stars, it's early-stage raw data, not a polished library; test small before scaling.
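
"Test small before scaling" can be literal: stream one batch, sanity-check it, and only then commit to the full 72M. A sketch with the usual naming caveats:

```python
# Sample the first batch and sanity-check it before processing all ~72M rows.
# Note this only proves uniqueness within the sample, not across the file.
import pyarrow.parquet as pq

pf = pq.ParquetFile("domains.parquet")  # assumed file name
batch = next(pf.iter_batches(batch_size=100_000, columns=["domain"]))
domains = batch.to_pydict()["domain"]
assert len(domains) == len(set(domains)), "duplicates inside the sample"
print(f"{len(domains):,} rows sampled, all unique within this batch")
```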


