Name: Rack Topology
Author: Azure

updated: 2026-03-27 12:18

SKILL.md

name	rack-topology
description	MNNVL domain discovery on Azure GB300 clusters. ClusterUUID lookup via nvidia-smi, expected rack sizes per SKU, FabricManager troubleshooting.

Rack Topology

How MNNVL domains work on Azure GB300 clusters, how to discover rack membership, and expected rack structure per SKU.

Scripts: This skill references test scripts from the Azure/ai-infrastructure-on-azure repo. Clone it and run from the repo root.

What Is a Rack / MNNVL Domain?

On GB300 (NDv6) clusters, the nodes in a physical rack are connected through NVSwitch/NVLink into an MNNVL (Multi-Node NVLink) domain. This gives intra-rack allreduce bandwidth of roughly 900 GB/s or more, far higher than the ~200 GB/s available over InfiniBand between racks.

Each MNNVL domain has a unique ClusterUUID reported by nvidia-smi. All nodes sharing the same ClusterUUID are in the same physical rack and can use NVLink for communication.

Rack Structure by SKU

GB300 (Standard_ND128isr_GB300_v6)

18 nodes per rack (72 GPUs per MNNVL domain)
4 GPUs per node
Nodes within a rack communicate via NVLink/NVSwitch/MNNVL
Nodes across racks communicate via InfiniBand NDR 400 Gb/s
ClusterUUID is a valid UUID (e.g., a1b2c3d4-e5f6-7890-abcd-ef1234567890)

H100 (Standard_ND96isr_H100_v5)

No MNNVL — NVSwitch is intra-node only (8 GPUs within one node)
8 GPUs per node
All inter-node communication is via InfiniBand
ClusterUUID may not be present or meaningful
Rack topology is less relevant for NCCL testing (no intra-rack NVLink advantage)

Discovering Rack Topology

Single node query

nvidia-smi -q | grep ClusterUUID

Output:

    ClusterUUID                       : a1b2c3d4-e5f6-7890-abcd-ef1234567890

Fleet-wide discovery with parallel-ssh

# From the scheduler node
parallel-ssh -H "ccw-gpu-1 ccw-gpu-2 ccw-gpu-3 ..." -t 15 -i \
  "nvidia-smi -q 2>/dev/null | grep 'ClusterUUID' | head -1 | awk -F': ' '{print \$2}'"

Output:

[1] 14:23:45 [SUCCESS] ccw-gpu-1
a1b2c3d4-e5f6-7890-abcd-ef1234567890
[2] 14:23:45 [SUCCESS] ccw-gpu-2
a1b2c3d4-e5f6-7890-abcd-ef1234567890
[3] 14:23:46 [SUCCESS] ccw-gpu-19
b2c3d4e5-f6a7-8901-bcde-f12345678901

Group nodes by UUID to get rack membership.
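The grouping step can be sketched with awk. This is a minimal sketch that assumes the parallel-ssh -i output has exactly the shape shown above (a bracketed header line per host followed by the UUID line). The function names pssh_to_pairs and group_by_uuid are hypothetical helpers, not part of the repo's scripts.

```shell
# Convert `parallel-ssh -i` output (stdin) into "uuid host" pairs, one per line.
# Hosts whose header is not [SUCCESS] are skipped.
pssh_to_pairs() {
  awk '
    /^\[[0-9]+\] / {                       # header line, e.g. "[1] 14:23:45 [SUCCESS] ccw-gpu-1"
      host = ($3 == "[SUCCESS]") ? $4 : ""
      next
    }
    host != "" && NF > 0 {                 # first non-empty output line after a successful header
      print $1, host
      host = ""
    }
  '
}

# Collapse "uuid host" pairs into one line per rack: "uuid host1,host2,..."
# Racks are printed in order of first appearance.
group_by_uuid() {
  awk '
    NF < 2 { next }
    !($1 in members) { order[++n] = $1; members[$1] = $2; next }
    { members[$1] = members[$1] "," $2 }
    END { for (i = 1; i <= n; i++) print order[i], members[order[i]] }
  '
}

# Usage (from the scheduler node), with the query command shown above:
#   parallel-ssh -H "$NODES" -t 15 -i "<query command>" | pssh_to_pairs | group_by_uuid
```

Each output line then represents one MNNVL domain and its member nodes.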

Programmatic discovery

Using Slurm hostlist expansion and parallel SSH:

# Get all nodes in the GPU partition
NODES=$(sinfo -p gpu -h -N -o '%N' | sort -u | tr '\n' ' ')

# Query ClusterUUID from all nodes
parallel-ssh -H "$NODES" -t 15 -i \
  "nvidia-smi -q 2>/dev/null | grep 'ClusterUUID' | head -1 | awk -F': ' '{print \$2}'"

Handling edge cases

Drained/down nodes: Skip them, since they can't be queried, and clear any cached rack_id.
ClusterUUID = N/A or all zeros: The NVLink fabric is not initialized. This is a hardware issue; file a GHR (Azure Guest Health Report) with category nvlink_down.
Node missing from output: SSH failed; the node may be unresponsive.
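The ClusterUUID cases above can be told apart with a small classifier. This is a sketch only: the function name classify_cluster_uuid and the three labels it prints are hypothetical, and it treats an empty value the same as N/A.

```shell
# Classify one ClusterUUID value.
# Prints "ok", "uninitialized" (empty, N/A, or all zeros), or "malformed".
classify_cluster_uuid() {
  uuid=$1
  case "$uuid" in
    ""|"N/A") echo uninitialized; return ;;
  esac
  if printf '%s\n' "$uuid" | grep -Eq '^[0-]+$'; then
    # Only zeros and dashes: the fabric has not assigned a domain
    echo uninitialized
  elif printf '%s\n' "$uuid" | grep -Eq '^[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}$'; then
    echo ok
  else
    echo malformed
  fi
}
```

Nodes that come back as uninitialized are the candidates for the nvlink_down report described above.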

Validating Rack Size

After discovery, verify each rack has the expected number of nodes:

SKU	Expected Rack Size
GB300 (NDv6)	18 nodes

If a rack has fewer than expected nodes:

Check if the missing nodes are drained/down (expected — they were filtered out).
If nodes are in idle or allocated state but didn't return a ClusterUUID, investigate those nodes.
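The size check can be sketched as a count per ClusterUUID. The sketch assumes the discovery results have already been reduced to "uuid host" pairs, one per line; that intermediate format and the function name check_rack_sizes are assumptions for illustration.

```shell
# Read "uuid host" pairs on stdin and report every rack whose node count
# differs from the expected size (default 18, the GB300 value).
check_rack_sizes() {
  expected=${1:-18}
  awk -v expected="$expected" '
    NF >= 2 { count[$1]++ }
    END {
      for (u in count)
        if (count[u] != expected)
          printf "%s has %d of %d nodes\n", u, count[u], expected
    }
  ' | sort
}
```

No output means every discovered rack has the expected number of responding nodes.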

Using Rack Topology for Testing

Per-rack NCCL tests (MNNVL)

Test each rack independently to validate intra-rack NVLink bandwidth:

# For each rack, run NCCL test on its nodes
./nccl_test.sh --sku graceblackwell -N 18 -w ccw-gpu-[1-18]

Expected busbw: ~937 GB/s at 16 G message size.
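To cover every rack, the per-rack command can be generated from the discovery results. The sketch below only prints the commands, reusing the flags from the example above. It assumes "uuid host" pairs on stdin, and the function name print_per_rack_commands is hypothetical.

```shell
# Read "uuid host" pairs on stdin and print (not run) one nccl_test.sh
# command per rack, in order of first appearance.
print_per_rack_commands() {
  awk '
    NF < 2 { next }
    !($1 in hosts) { order[++n] = $1; hosts[$1] = $2; count[$1] = 1; next }
    { hosts[$1] = hosts[$1] "," $2; count[$1]++ }
    END {
      for (i = 1; i <= n; i++) {
        u = order[i]
        printf "./nccl_test.sh --sku graceblackwell -N %d -w %s\n", count[u], hosts[u]
      }
    }
  '
}
```

Printing first makes it easy to review the node lists before launching anything.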

Inter-rack NCCL tests (IB-only)

Pick one node from each rack and test across racks:

# One node per rack, testing IB fabric
./nccl_test.sh --sku graceblackwell -N 4 -w ccw-gpu-1,ccw-gpu-19,ccw-gpu-37,ccw-gpu-55

Use IB-only NCCL settings (disable MNNVL) for pure IB measurement.
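Picking one representative per rack can be sketched as taking the first host seen for each ClusterUUID. As before, the "uuid host" input format and the function name one_node_per_rack are assumptions.

```shell
# Read "uuid host" pairs on stdin and print the first host of each rack
# as a comma-separated list suitable for the -w argument.
one_node_per_rack() {
  awk 'NF >= 2 && !seen[$1]++ { print $2 }' | paste -s -d, -
}
```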

Rack-aware training node selection

For training jobs, prefer allocating full racks (or multiples of racks) to maximize MNNVL utilization. Incomplete rack allocation wastes NVLink bandwidth and forces more traffic over IB.
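A quick arithmetic check shows whether a requested node count is rack-aligned. This is a sketch with hypothetical function names, using the 18-node GB300 rack size as the default.

```shell
# Number of racks a job of the given size must span at minimum.
racks_needed() {
  nodes=$1
  rack_size=${2:-18}
  echo $(( (nodes + rack_size - 1) / rack_size ))   # ceiling division
}

# Succeeds when the node count is an exact multiple of the rack size.
is_rack_aligned() {
  [ $(( $1 % ${2:-18} )) -eq 0 ]
}
```

For example, a 40-node request spans at least three racks and is not aligned, while 36 nodes fill exactly two racks.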

FabricManager

NVLink/MNNVL requires NVIDIA FabricManager to be running:

systemctl status nvidia-fabricmanager

Healthy output includes Active: active (running).
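For scripted checks, the status text can be tested for that line. This is a minimal sketch; the function name fm_is_healthy is hypothetical, and it matches only the literal string quoted above.

```shell
# Succeeds when `systemctl status` text on stdin reports a running service.
fm_is_healthy() {
  grep -Fq 'Active: active (running)'
}

# Usage on a GPU node:
#   systemctl status nvidia-fabricmanager | fm_is_healthy \
#     || echo "FabricManager unhealthy on $(hostname)"
```

Where only the exit status matters, systemctl is-active --quiet nvidia-fabricmanager gives a similar answer without parsing text.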

Common FabricManager issues:

"training in progress" with ClusterUUID all zeros → The NVLink fabric failed to initialize. File a GHR with category nvlink_down.
"FabricManager not running" → The service crashed or failed to start. Try sudo systemctl restart nvidia-fabricmanager. If it still won't start, file a GHR.
DCGM NVSwitch errors → Run dcgmi discovery -l | grep -i nvswitch to check NVSwitch visibility.

related-skills.json

Same repository (Azure/ai-infrastructure-on-azure), each entry dated 2026-03-27:

azure-node-health-report.md: File Azure Guest Health Reports for node investigation or replacement. Complete impact category reference (26 categories), PhysicalHostName and Resource ID collection, REST API format, and insight polling.
cluster-outlier-detection.md: Statistical methods for identifying underperforming nodes from batch test results. Absolute thresholds, z-score, and MAD methods for fleet-wide GPU and NCCL analysis.
ib-link-validation.md: Check InfiniBand connectivity, port state, partition keys, and error counters on Azure HPC nodes. Covers operstate, ibstat, pkey verification, link flap detection, and soft fixes.
nccl-allreduce-test.md: Run NCCL all_reduce_perf bandwidth tests via Slurm, configure per-SKU environment variables (MNNVL, SHARP, GDR), and interpret busbw results.
nccl-performance-diagnosis.md: Analyze NCCL bandwidth results, scope intra-rack vs inter-rack failures, and use a bisection algorithm to isolate bad nodes. GPU vs network root cause analysis.
node-drain-and-replace.md: Slurm node lifecycle management: drain, undrain, reboot, and file for replacement. Decision tree for when to drain vs reboot vs GHR.
