hanxiao/flash-kmeans-mlx

IO-aware batched K-Means for Apple Silicon, ported from Flash-KMeans (Triton/CUDA) to pure MLX. Up to 94x faster than sklearn.

Found Mar 19, 2026 at 10 stars.
AI Summary

A fast batched K-Means clustering library optimized for Apple Silicon using MLX, delivering up to 94x speedups over scikit-learn on large datasets.

How It Works

1
🔍 Discover Fast Clustering

You find a tool that groups similar data points quickly on your Apple Silicon machine.

2
📦 Add the Tool

You add the library to your Python environment with a single install command.

3
📊 Load Your Data

You load the points you want to group as an array of numeric vectors.

4
Group Your Data

You tell the tool how many groups (K) you want, and it quickly sorts every point into a cluster.
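The grouping step above is classic Lloyd's K-Means: assign each point to its nearest center, then move each center to the mean of its points. A minimal NumPy sketch (an illustration of the algorithm, not the library's MLX implementation; the deterministic init is a simplification):

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-Means: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    # Simple deterministic init for reproducibility; real libraries
    # typically use random or k-means++ initialization.
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centers = points[idx].astype(float)
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center: (N, k)
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = points[mask].mean(axis=0)
    return labels, centers

# Two well-separated blobs split cleanly into two clusters.
pts = np.vstack([np.zeros((50, 2)), np.ones((50, 2)) * 10.0])
labels, centers = kmeans(pts, k=2)
```

The output is exactly what step 5 describes: a label per point plus the K group centers.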

5
📈 See the Results

You get back labels for each point showing which group it belongs to, plus the group centers.

6
🏆 Check the Speed

You run a quick benchmark and see how much faster it is than standard CPU methods.
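The repo's headline speedups come from batched execution, Metal kernels, and MLX compilation; those can't be reproduced off-device, but the core principle is easy to demonstrate. A toy NumPy comparison (my own illustration, not the repo's benchmark) of a per-point Python loop versus one broadcasted distance computation:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(5000, 16))
centers = rng.normal(size=(8, 16))

# Naive Python loop: one distance computation per point.
t0 = time.perf_counter()
loop_labels = np.array([
    min(range(len(centers)), key=lambda j: ((p - centers[j]) ** 2).sum())
    for p in points
])
loop_time = time.perf_counter() - t0

# Vectorized: all N x K distances in one broadcasted expression.
t0 = time.perf_counter()
d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
vec_labels = d2.argmin(axis=1)
vec_time = time.perf_counter() - t0

assert (loop_labels == vec_labels).all()
print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.3f}s")
```

Both paths produce identical labels; the vectorized one is dramatically faster, and the same idea scales further when fused into GPU kernels.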

🎉 Lightning-Fast Clusters

Your data is perfectly grouped in seconds, ready for insights or next steps.

AI-Generated Review

What is flash-kmeans-mlx?

flash-kmeans-mlx delivers IO-aware batched K-Means clustering optimized for Apple Silicon, ported from the Flash-KMeans Triton/CUDA library to pure Python MLX. It handles inputs shaped (B, N, D) for independent batch clustering in one vectorized pass, supporting Euclidean, cosine, and dot metrics via simple functional calls like batch_kmeans_Euclid or a scikit-learn-like FlashKMeans class. Developers get blazing clustering on large point clouds without CUDA dependencies.
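The (B, N, D) batched layout means B independent clustering problems solved in one vectorized pass. `batch_kmeans_Euclid` and `FlashKMeans` are the repo's actual entry points; the sketch below is only a NumPy illustration of what one batched Lloyd iteration looks like in that layout, not the library's code:

```python
import numpy as np

def batched_kmeans_step(x, centers):
    """One Lloyd iteration over B independent problems at once.
    x: (B, N, D) points; centers: (B, K, D) current centers."""
    # (B, N, K) squared Euclidean distances via ||x||^2 - 2 x.c + ||c||^2
    d2 = (
        (x ** 2).sum(-1, keepdims=True)              # (B, N, 1)
        - 2.0 * np.einsum("bnd,bkd->bnk", x, centers)
        + (centers ** 2).sum(-1)[:, None, :]         # (B, 1, K)
    )
    labels = d2.argmin(axis=-1)                      # (B, N)
    # Scatter-mean update: one-hot membership, then per-cluster average.
    K = centers.shape[1]
    onehot = np.eye(K)[labels]                       # (B, N, K)
    counts = onehot.sum(axis=1)                      # (B, K)
    sums = np.einsum("bnk,bnd->bkd", onehot, x)      # (B, K, D)
    new_centers = sums / np.maximum(counts, 1)[..., None]
    return labels, new_centers

B, N, D, K = 4, 256, 8, 3
rng = np.random.default_rng(1)
x = rng.normal(size=(B, N, D))
labels, centers = batched_kmeans_step(x, x[:, :K, :].copy())
```

Every batch element gets its own labels and centers with no Python-level loop over B, which is what makes the MLX version map cleanly onto the GPU.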

Why is it gaining traction?

It crushes sklearn—up to 94x faster on M3 Ultra for 500K points (128D, K=1000) in 77ms vs 40s—thanks to batched execution, custom Metal kernels, and MLX compilation. Benchmarks match H200 GPUs in some cases despite 37x less compute, with easy uv pip install and built-in sklearn comparison scripts. The pure MLX stack means seamless Apple Silicon acceleration without Torch overhead.

Who should use this?

ML engineers on M-series Macs clustering high-dim embeddings (e.g., 70K Fashion-MNIST in 0.12s). Data scientists prototyping K-Means on batched datasets like video frames or text vectors, ditching sklearn's CPU slowness. Apple Silicon users needing cosine/dot metrics for similarity search.
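For the similarity-search use case, the cosine metric reduces to a dot product after L2 normalization. A minimal NumPy sketch of cosine assignment (my illustration of the metric, not the library's kernel):

```python
import numpy as np

def cosine_assign(x, centers):
    """Assign each vector to the center with the highest cosine similarity.
    Equivalent to Euclidean assignment on L2-normalized vectors."""
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    cn = centers / np.linalg.norm(centers, axis=-1, keepdims=True)
    return (xn @ cn.T).argmax(axis=-1)

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
cents = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = cosine_assign(emb, cents)  # → [0, 0, 1]
```

Normalizing once up front means the inner loop is a single matrix multiply, which is why cosine and dot metrics are cheap on accelerators.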

Verdict

Grab it for Apple workloads—benchmarks deliver real speedups, docs include usage and visuals. At 10 stars and 1.0% credibility, it's beta-fresh; test thoroughly before production.
