run-on-slurm

How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.
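As a rough illustration of the environment-variable setup for torch.distributed.run mentioned above, the sketch below builds a per-node launch command from the standard SLURM variables `SLURM_JOB_NUM_NODES` and `SLURM_NODEID`. The helper name `build_launch_command`, the choice of port 29500, and the way the master address is supplied are assumptions made for illustration; they are not taken from the skill itself, which may wire things up differently. `CUDA_DEVICE_MAX_CONNECTIONS` is deliberately left out because its correct value depends on the hardware and parallelism rules the skill describes.

```python
import os


def build_launch_command(env, master_addr, gpus_per_node, train_script,
                         master_port=29500):
    """Return a torch.distributed.run command (as a list) for one node.

    Hypothetical helper: `env` is a mapping such as os.environ, and
    `master_addr` must be resolved by the caller (for example, the
    hostname of the first node in the allocation).
    """
    # SLURM sets these inside a job; fall back to a single-node run.
    nnodes = int(env.get("SLURM_JOB_NUM_NODES", "1"))
    node_rank = int(env.get("SLURM_NODEID", "0"))
    return [
        "python", "-m", "torch.distributed.run",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={gpus_per_node}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        train_script,
    ]


if __name__ == "__main__":
    # Outside SLURM this falls back to one node with rank 0.
    cmd = build_launch_command(os.environ, "127.0.0.1", 8, "pretrain_gpt.py")
    print(" ".join(cmd))
```

In an sbatch script, one such command would typically be started once per node (for example via `srun` with one task per node), so that every node computes its own `--node_rank` from its own environment.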

Updated: May 2, 2026, 07:39
SKILL.md