Skip to main content
在 Manus 中运行任何 Skill
一键导入

debug-training-logs

Debug distributed training failures (NeMo, Megatron, PyTorch) from worker stderr logs and optional AIStore daemon logs. Finds root cause across NCCL timeouts, data loading errors, and storage failures.

星标17,390
分支3,436
更新时间2026年4月15日 20:53
SKILL.md
readonly