AMD-AGI

Toolkit for launching and observing MaxText training on Slurm-managed GPU clusters

Found Feb 20, 2026 at 18 stars
AI Summary

Toolkit for easily launching and monitoring large language model training jobs on AMD GPU clusters using Slurm, with built-in dashboards and analysis tools.

How It Works

1
🔍 Discover the toolkit

You hear about a helpful set of tools for training large AI language models on powerful computer clusters.

2
📥 Get the tools

You download the tools to your computer cluster and try a simple example on one machine to see it work.

3
🚀 Start a big training job

With one easy command, you launch training for a huge model like Llama across many computers—it handles everything automatically.

4
📊 Watch live updates

Open dashboards to see training progress, speeds, temperatures, and health checks in real time—no setup needed.

5
🔍 Check the results

After it finishes, review charts, logs, and smart analysis to understand performance and spot improvements.

🎉 Your model is ready

You now have a trained AI model with full insights into how it performed, ready for your next project.
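Concretely, the walkthrough above boils down to a few shell commands. Below is a minimal sketch of the kind of model-alias resolver a `submit.sh` wrapper might use to turn a short name like `70b` into a config file before calling `sbatch`; the alias table and config paths are illustrative assumptions, not the repo's actual layout.

```shell
#!/bin/sh
# Map a short model alias (e.g. "70b") to a MaxText config file,
# the way a submit wrapper might before handing off to sbatch.
# Aliases and paths here are hypothetical.
resolve_config() {
  case "$1" in
    70b)     echo "configs/llama2_70b.yml" ;;
    mixtral) echo "configs/mixtral_8x7b.yml" ;;
    grok1)   echo "configs/grok_1.yml" ;;
    *)       echo "unknown model alias: $1" >&2; return 1 ;;
  esac
}

resolve_config 70b   # prints configs/llama2_70b.yml
```

A real wrapper would then pass the resolved config and the `-N <nodes>` flag through to the scheduler.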

AI-Generated Review

What is maxtext-slurm?

Shell-based AI toolkit on GitHub for launching and observing MaxText training on Slurm-managed GPU clusters. Drop it on a shared filesystem and run `submit.sh 70b -N 1` to spin up distributed JAX LLM training (Llama2-70B, Mixtral, Grok-1) with automatic container pulls, multi-node coordination, and ready-made model configs. Setting `RAY=1` adds live dashboards for TensorBoard, Ray, and the Prometheus TSDB without performance overhead.
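The `RAY=1` toggle is a common environment-variable gating pattern. A hedged sketch of how such a switch might branch, with service names purely illustrative and not taken from the repo:

```shell
#!/bin/sh
# Decide which observability services to start based on RAY,
# mimicking a RAY=1 style toggle. Service list is illustrative.
plan_dashboards() {
  if [ "${RAY:-0}" = "1" ]; then
    echo "start: tensorboard ray-dashboard prometheus"
  else
    echo "start: none"
  fi
}

RAY=1
plan_dashboards   # prints: start: tensorboard ray-dashboard prometheus
```

Defaulting with `${RAY:-0}` keeps the launch path unchanged when the variable is unset, so observability stays strictly opt-in.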

Why is it gaining traction?

Unlike raw Slurm scripts, it bundles zero-setup observability: GPU metrics, network diagnostics, and host health are all queryable post-run via Prometheus. Telegram bots alert on hangs, and AI skills guide Cursor/Claude through performance triage or failure diagnosis. The layered design lets you swap schedulers or runtimes independently, saving weeks over building custom GPU training pipelines.
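Post-run queries would go through Prometheus's standard HTTP API (`GET /api/v1/query`). A sketch that builds the curl invocation rather than executing it, so it works without a live server; the default URL and the GPU metric name are assumptions, not values from the repo:

```shell
#!/bin/sh
# Build a Prometheus instant-query command against the standard
# /api/v1/query endpoint. The metric name is a guess at what a
# GPU exporter might expose; adjust to your exporter.
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_query_cmd() {
  # Echo the curl invocation instead of running it, so this
  # sketch is testable offline.
  printf 'curl -sG %s/api/v1/query --data-urlencode query=%s\n' \
    "$PROM_URL" "$1"
}

prom_query_cmd 'avg(gpu_temperature_celsius)'
```

Using `curl -G --data-urlencode` keeps PromQL operators and parentheses safely encoded in the request.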

Who should use this?

ML engineers on Slurm-managed AMD GPU clusters firing up MaxText runs for models like Llama3.1-405B or DeepSeek proxies. Suited for teams iterating on multi-node training who want instant monitoring and debugging without bolting on extra tools. A local mode lets you test configs on a single machine before scaling.

Verdict

Practical reference toolkit for Slurm GPU training: thorough docs and one-liners get you running fast. 18 stars and a 1.0% credibility score signal early maturity (no tests visible), but the MIT license and extensible layered design make it a strong fork candidate for production clusters.


