onboard-gb200-1node-tests
// Onboard 1-node GitHub MR functional tests for GB200 from existing mr-scoped 2-node tests.
| Field | Value |
|---|---|
| name | onboard-gb200-1node-tests |
| description | Onboard 1-node GitHub MR functional tests for GB200 from existing mr-scoped 2-node tests. |
| when_to_use | Adding GB200 github-mr tests; creating single-node variants of existing tests; expanding CI coverage for GB200; 'add GB200 MR tests', 'onboard GB200 1-node', 'create single-node variant'. |
| user_invocable | true |
| argument | [model-yaml] # optional: gpt, moe, or both (default: both) |
Create 1-node (mr-github) variants of existing 2-node (mr-scoped) GB200 functional tests.
Each GB200 node has 4 GPUs. A 2-node test uses 8 GPUs total; the 1-node variant uses 4.
GB200 functional tests live in tests/test_utils/recipes/gb200/:
| Recipe file | Notes |
|---|---|
| gpt.yaml | GPT dense tests, nodes: 2, gpus: 4 (8 total) |
| moe.yaml | MoE tests, nodes: 2, gpus: 4 (8 total) |
| moe-1node.yaml | Existing 1-node MoE tests, nodes: 1, gpus: 4 (4 total) |
| gpt-1node.yaml | 1-node GPT tests (create if not present) |
Model configs live at:
tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml
1-node test cases use the _1node suffix:
tests/functional_tests/test_cases/{model}/{test_case}_1node/model_config.yaml
Scan the products: block in gpt.yaml and moe.yaml for entries with scope: [mr, ...] or scope: [mr-slim, ...]. These are the 2-node tests that need 1-node mr-github counterparts.
Ignore tests already covered in *-1node.yaml files, and ignore entries whose only scopes are nightly, weekly, or mr-broken.
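The scan described above can be sketched in Python. This is a minimal sketch, not the repository's tooling: it assumes the recipe has already been parsed into a dict (for example with a YAML loader) and that each `products:` entry has the `test_case` / nested `products` / `scope` layout shown in the recipe snippet later in this document. The function name `find_candidates` and the test case names are hypothetical.

```python
# Scopes that mark a 2-node test as needing a 1-node mr-github counterpart.
SOURCE_SCOPES = {"mr", "mr-slim"}


def find_candidates(recipe: dict, existing_1node: set) -> list:
    """Return test cases scoped mr / mr-slim that have no _1node variant yet."""
    candidates = []
    for entry in recipe.get("products", []):
        # Collect every scope listed under this entry's nested products.
        scopes = {
            scope
            for variant in entry.get("products", [])
            for scope in variant.get("scope", [])
        }
        if not scopes & SOURCE_SCOPES:
            continue  # nightly, weekly, mr-broken, etc. are ignored
        for test_case in entry.get("test_case", []):
            if f"{test_case}_1node" not in existing_1node:
                candidates.append(test_case)
    return candidates


# Hypothetical parsed recipe; the test case names are made up for illustration.
recipe = {
    "products": [
        {"test_case": ["example_a"], "products": [{"scope": ["mr", "mr-github"]}]},
        {"test_case": ["example_b"], "products": [{"scope": ["nightly"]}]},
        {"test_case": ["example_c"], "products": [{"scope": ["mr-slim"]}]},
    ]
}
print(find_candidates(recipe, existing_1node={"example_c_1node"}))
```

Here `example_b` is skipped because of its scope, and `example_c` is skipped because a `_1node` variant is already listed.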
For each candidate, read its model_config.yaml and extract the key parallelism arguments:
--tensor-model-parallel-size (TP)
--pipeline-model-parallel-size (PP)
--expert-model-parallel-size (EP)
--expert-tensor-parallel-size (ETP)
--context-parallel-size (CP)
--global-batch-size
--micro-batch-size
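The world size formula is: world_size = TP × PP × DP, where DP ≥ EP. This form assumes CP = 1; when CP > 1 it also multiplies into the product (world_size = TP × PP × CP × DP).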
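As a quick check of the formula, the following sketch derives DP from the world size. It assumes context parallelism is left at 1 and applies the DP ≥ EP constraint exactly as stated above, without modelling --expert-tensor-parallel-size. The helper name `data_parallel_size` is hypothetical.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, ep: int = 1) -> int:
    """Solve world_size = TP x PP x DP for DP and check DP >= EP."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP x PP={model_parallel}"
        )
    dp = world_size // model_parallel
    if dp < ep:
        raise ValueError(f"DP={dp} is smaller than EP={ep}")
    return dp


# Same tp2 pp1 config on 2 nodes (8 GPUs) and on 1 node (4 GPUs).
print(data_parallel_size(8, tp=2, pp=1))  # 4
print(data_parallel_size(4, tp=2, pp=1))  # 2
```

With TP and PP unchanged, halving the world size halves DP, which is why many configs can be copied as-is.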
Going from 8 GPUs → 4 GPUs:
| Condition | Action |
|---|---|
| TP × PP ≤ 4 | Trivial copy. Config unchanged; DP is halved automatically. |
| TP × PP = 8 (e.g. tp4 pp2) | Reduce PP. Set PP = PP / 2 (e.g. pp2→1). Verify TP × PP_new ≤ 4. |
| EP > 4 (e.g. ep8 with tp1 pp1) | Reduce EP. Set EP = 4. Experts stay at num-experts (each EP rank holds more experts). |
| EP > 4 and TP × PP > 4 | Reduce both PP and EP as above. |
| ETP test (ep × etp ≤ TP × DP) | Check EP × ETP ≤ TP × DP_new after PP reduction. Usually satisfied when pp→1. |
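The decision table can be expressed as a small function. This is a rough sketch under stated assumptions: it covers only the TP/PP/EP rows, leaves the ETP check (EP × ETP ≤ TP × DP_new) to manual review, and the name `plan_1node` and its return shape are hypothetical.

```python
GPUS_PER_NODE = 4  # each GB200 node has 4 GPUs


def plan_1node(tp: int, pp: int, ep: int = 1) -> dict:
    """Apply the 8-GPU -> 4-GPU rules and report what changed."""
    changes = []
    if tp * pp > GPUS_PER_NODE:
        if pp % 2 != 0:
            raise ValueError(f"cannot halve pp={pp}; handle this config manually")
        changes.append(f"pp {pp}->{pp // 2}")
        pp //= 2
    # Verify TP x PP_new fits on one node (and divides it evenly).
    if GPUS_PER_NODE % (tp * pp) != 0:
        raise ValueError(f"tp={tp} pp={pp} does not fit on {GPUS_PER_NODE} GPUs")
    if ep > GPUS_PER_NODE:
        changes.append(f"ep {ep}->{GPUS_PER_NODE}")
        ep = GPUS_PER_NODE
    dp = GPUS_PER_NODE // (tp * pp)
    return {"tp": tp, "pp": pp, "ep": ep, "dp": dp, "changes": changes or ["trivial"]}


print(plan_1node(tp=2, pp=1))        # trivial copy, dp=2
print(plan_1node(tp=1, pp=1, ep=8))  # ep reduced to 4, dp=4
print(plan_1node(tp=4, pp=2, ep=2))  # pp reduced to 1, dp=1
```

A config the rules cannot handle (for example an odd PP that would need halving) raises an error instead of silently producing a layout, so it gets a manual look.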
Do not change GBS — let gradient accumulation absorb the reduced DP.
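To see why GBS can stay fixed, note that the number of micro-batches accumulated per step is GBS / (MBS × DP), so halving DP doubles the accumulation count. The sketch below illustrates this. The batch sizes are made-up values, not taken from any test config, and `grad_accum_steps` is a hypothetical helper.

```python
def grad_accum_steps(gbs: int, mbs: int, dp: int) -> int:
    """Micro-batches each data-parallel rank accumulates per optimizer step."""
    samples_per_pass = mbs * dp
    if gbs % samples_per_pass != 0:
        raise ValueError(
            f"global batch size {gbs} is not divisible by MBS x DP={samples_per_pass}"
        )
    return gbs // samples_per_pass


# Illustrative values only: GBS=32, MBS=2.
print(grad_accum_steps(gbs=32, mbs=2, dp=4))  # 2-node layout: 4
print(grad_accum_steps(gbs=32, mbs=2, dp=2))  # 1-node layout: 8
```

The divisibility check is worth running for each 1-node config: if GBS is not a multiple of MBS × DP_new, the config needs a closer look before it is onboarded.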
Create the _1node model config directories:
# Trivial copy
mkdir -p tests/functional_tests/test_cases/{model}/{test_case}_1node
cp tests/functional_tests/test_cases/{model}/{test_case}/model_config.yaml \
tests/functional_tests/test_cases/{model}/{test_case}_1node/model_config.yaml
# Then apply any parallelism changes (EP or PP) with Edit tool
For GPT tests — create tests/test_utils/recipes/gb200/gpt-1node.yaml (if absent) by cloning gpt.yaml's spec block with nodes: 1. Use this template for the spec:
type: basic
format_version: 1
maintainers: [mcore]
loggers: [stdout]
spec:
  name: "{test_case}_{environment}_{platforms}"
  model: gpt # or moe
  build: mcore-pyt-{environment}
  nodes: 1
  gpus: 4
  n_repeat: 5
  platforms: dgx_gb200
  script_setup: | # copy verbatim from gpt.yaml / moe.yaml
    ...
  script: |- # copy verbatim from gpt.yaml / moe.yaml
    ...
For MoE tests — append entries to the existing moe-1node.yaml.
Scope convention:
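scope: [mr-github, mr-github-slim] for 1–2 representative tests per recipe
scope: [mr-github] for all other tests
Each test is added to the recipe as a products entry:
products:
  - test_case: [<test_case>_1node]
    products:
      - environment: [dev]
        scope: [mr-github, mr-github-slim] # or [mr-github]
        platforms: [dgx_gb200]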
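When onboarding many tests at once, the entries above can be generated rather than typed. A minimal sketch, assuming the entry layout shown above; the function name `make_1node_entry` and the `slim` flag are hypothetical, and the resulting dict would still need to be serialized to YAML and appended to the recipe.

```python
def make_1node_entry(test_case: str, slim: bool = False) -> dict:
    """Build one products entry for the 1-node variant of `test_case`."""
    scope = ["mr-github", "mr-github-slim"] if slim else ["mr-github"]
    return {
        "test_case": [f"{test_case}_1node"],
        "products": [
            {
                "environment": ["dev"],
                "scope": scope,
                "platforms": ["dgx_gb200"],
            }
        ],
    }


# "example_test" is a placeholder name, not a real test case.
entry = make_1node_entry("example_test", slim=True)
print(entry["test_case"], entry["products"][0]["scope"])
```

Keeping `slim=False` as the default matches the convention that only 1–2 representative tests per recipe carry the mr-github-slim scope.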
| Original (8 GPUs) | 1-node config (4 GPUs) | Notes |
|---|---|---|
| tp1 pp1 ep1 → dp8 | tp1 pp1 ep1 → dp4 | trivial |
| tp2 pp1 ep1 → dp4 | tp2 pp1 ep1 → dp2 | trivial |
| tp1 pp2 ep1 → dp4 | tp1 pp2 ep1 → dp2 | trivial |
| tp4 pp1 ep1 → dp2 | tp4 pp1 ep1 → dp1 | trivial |
| tp1 pp4 ep1 → dp2 | tp1 pp4 ep1 → dp1 | trivial |
| tp1 pp1 ep8 → dp8 | tp1 pp1 ep4 → dp4 | ep 8→4 |
| tp4 pp2 ep2 etp2 → dp1 | tp4 pp1 ep2 etp2 → dp1 | pp 2→1 |
Before finishing, verify:
All mr-scoped tests in gpt.yaml and moe.yaml not yet in *-1node.yaml are identified
A _1node/model_config.yaml exists for each test
Each 1-node recipe spec uses nodes: 1, gpus: 4
Entries use the mr-github scope (+ mr-github-slim for 1–2 representative tests per recipe)
No mr-github-slim overload (slim suite should stay small)