mcore-run-on-slurm

How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.

Stars: 16,645
Forks: 4,054
Updated: May 29, 2026 at 20:11