debug-training-logs

Debug distributed training failures (NeMo, Megatron, PyTorch) from worker stderr logs and optional AIStore daemon logs. Finds root cause across NCCL timeouts, data loading errors, and storage failures.
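As a rough illustration of the kind of triage described above, the following minimal sketch buckets worker stderr lines into the three failure families the description names. The keyword patterns, the category names, and the helper functions `classify_line` and `summarize` are all assumptions made for this example; they are not taken from the skill itself, and real NeMo, Megatron, PyTorch, or AIStore log formats may differ.

```python
import re
from collections import Counter

# Hypothetical keyword patterns, chosen for illustration only. They are not
# taken from the skill, and real log formats may differ.
PATTERNS = {
    "nccl_timeout": re.compile(r"nccl.*timeout|timeout.*nccl", re.IGNORECASE),
    "data_loading": re.compile(r"dataloader|dataset", re.IGNORECASE),
    "storage": re.compile(r"aistore|connection refused|i/o error", re.IGNORECASE),
}


def classify_line(line):
    """Return the name of the first matching category, or None if no pattern matches."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each category across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        category = classify_line(line)
        if category is not None:
            counts[category] += 1
    return counts
```

Running `summarize` over each worker's stderr file and comparing the counts per rank is one simple way to see which failure family appears and where. Keyword counts alone do not establish a root cause; they only narrow down which log lines to read first.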

Stars: 17,390
Forks: 3,436
Updated: April 15, 2026 at 20:53
SKILL.md