testing
Testing reference for Megatron Bridge — unit and functional test layout, tier semantics (L0/L1/L2/flaky), script conventions, running tests locally, adding/moving/disabling tests, and pytest conventions.
| name | testing |
| description | Testing reference for Megatron Bridge — unit and functional test layout, tier semantics (L0/L1/L2/flaky), script conventions, running tests locally, adding/moving/disabling tests, and pytest conventions. |
| when_to_use | Adding, running, moving, or disabling tests; debugging a test failure; choosing the right test tier; understanding L0 vs L1 vs L2; handling flaky tests; 'add a test', 'which tier', 'functional test layout'. |
tests/
    unit_tests/                  # fast, isolated, no GPU required
    functional_tests/
        launch_scripts/
            h100/
                active/          # H100 tests that run in CI automatically
                flaky/           # H100 tests quarantined from blocking CI
            gb200/
                active/          # GB200 tests that run in CI automatically
                flaky/           # GB200 tests quarantined from blocking CI
Unit tests are independent of the launch script layout. Functional test
scripts are named {Tier}_{Description}.sh (e.g., L0_Launch_training.sh).
| Tier | Trigger | Blocking |
|---|---|---|
| L0 | Every PR, every push to main, schedule | Yes — PR cannot merge if L0 fails |
| L1 | Push to main, schedule, PRs with needs-more-tests label | Yes |
| L2 | Schedule, workflow_dispatch, PRs with full-test-suite label | Yes (when triggered) |
| flaky | workflow_dispatch with test_suite=all only | No — failures are informational |
H100 and GB200 each have independent L0/L1/L2/flaky jobs. Moving a script to
flaky/ removes it from blocking CI on that hardware target only.
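The trigger table can be read as a small decision function. The sketch below is illustrative only: the event names (`pull_request`, `push_main`, `schedule`, `workflow_dispatch`) and the function itself are hypothetical, and it encodes only what the table states. For example, the table does not say whether the full-test-suite label also pulls in L1, so the sketch does not assume it.

```python
def tiers_for_event(event, labels=(), test_suite=None):
    """Return the tiers the trigger table lists for a CI event.

    The event names are hypothetical labels chosen for this sketch,
    not identifiers taken from the workflow files.
    """
    labels = set(labels)
    tiers = set()

    # L0: every PR, every push to main, schedule.
    if event in {"pull_request", "push_main", "schedule"}:
        tiers.add("L0")

    # L1: push to main, schedule, PRs with the needs-more-tests label.
    if event in {"push_main", "schedule"} or (
        event == "pull_request" and "needs-more-tests" in labels
    ):
        tiers.add("L1")

    # L2: schedule, workflow_dispatch, PRs with the full-test-suite label.
    if event in {"schedule", "workflow_dispatch"} or (
        event == "pull_request" and "full-test-suite" in labels
    ):
        tiers.add("L2")

    # flaky: workflow_dispatch with test_suite=all only.
    if event == "workflow_dispatch" and test_suite == "all":
        tiers.add("flaky")

    return tiers
```

Under these assumptions, a plain PR maps to L0 only, and the flaky tier appears only for a manual dispatch with test_suite=all.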
Prefer unit tests over functional tests. CI GPU resources are limited; every functional test slot has a real cost.
No GPU required:
uv run pytest tests/unit_tests/ -x -v
Or inside Docker:
docker run --rm --gpus all -v "$(pwd)":/workdir/ -w /workdir/ megatron-bridge \
    uv run pytest tests/unit_tests/
Run the corresponding launch script directly on a GPU node:
bash tests/functional_tests/launch_scripts/h100/active/L0_Launch_training.sh
1. Create the test file at tests/unit_tests/<domain>/test_<name>.py.
2. Mark the test with @pytest.mark.unit.
3. Run it with uv run python -m pytest tests/unit_tests/<your_test>.py

No foreign setattr on config dataclasses. When applying overrides via
setattr(config_obj, key, value), always guard first:
if not hasattr(config_obj, key):
    raise ValueError(f"Config has no field '{key}'")
setattr(config_obj, key, value)
Setting a non-existent attribute silently creates a phantom field — the test passes but the recipe fails for a real user.
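To make the phantom-field failure mode concrete, here is a minimal sketch that wraps the guard in a helper. `ExampleConfig` and `apply_overrides` are hypothetical names invented for illustration; they are not part of Megatron Bridge.

```python
from dataclasses import dataclass


@dataclass
class ExampleConfig:
    # Hypothetical stand-in for a real config dataclass.
    lr: float = 1e-4
    seq_length: int = 2048


def apply_overrides(config_obj, overrides):
    """Apply overrides, rejecting any key that is not an existing field."""
    for key, value in overrides.items():
        if not hasattr(config_obj, key):
            raise ValueError(f"Config has no field '{key}'")
        setattr(config_obj, key, value)
    return config_obj


# A valid override changes the existing field.
cfg = apply_overrides(ExampleConfig(), {"lr": 3e-4})

# A misspelled key is rejected instead of silently creating a new attribute.
try:
    apply_overrides(ExampleConfig(), {"seq_len": 4096})  # typo for seq_length
    typo_rejected = False
except ValueError:
    typo_rejected = True
```

Without the guard, the misspelled `seq_len` would be attached to the object and `seq_length` would keep its default, which is exactly the case where a test passes while the real recipe misbehaves.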
1. Create the script in tests/functional_tests/launch_scripts/{h100,gb200}/active/.
2. Declare the CI timeout in the script with a # CI_TIMEOUT=<minutes> comment.
3. Name it {Tier}_{CamelDescription}.sh; the tier prefix controls which CI matrix includes it.
4. Make it executable: chmod +x <file>.

No workflow file changes needed; the matrix is generated dynamically by scanning the directory.
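The requirements for a new launch script (tier-prefixed name, executable bit, CI_TIMEOUT comment) can be checked locally before pushing. The following is a sketch under stated assumptions: `check_launch_script` is a hypothetical helper, and the regular expressions are guesses at the expected format rather than the rules CI actually applies.

```python
import os
import re
import tempfile
from pathlib import Path

# Assumed patterns; the real CI may accept a different format.
NAME_RE = re.compile(r"^(L0|L1|L2)_\w+\.sh$")
TIMEOUT_RE = re.compile(r"^#\s*CI_TIMEOUT=(\d+)\s*$", re.MULTILINE)


def check_launch_script(path):
    """Return a list of problems found with a new launch script."""
    path = Path(path)
    problems = []
    if not NAME_RE.match(path.name):
        problems.append("name does not start with a tier prefix (L0_, L1_, L2_)")
    if not os.access(path, os.X_OK):
        problems.append("file is not executable (run chmod +x)")
    if not TIMEOUT_RE.search(path.read_text()):
        problems.append("missing '# CI_TIMEOUT=<minutes>' comment")
    return problems


# Demo on throwaway files: one compliant script, one that breaks all three rules.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "L0_Launch_training.sh"
    good.write_text("#!/bin/bash\n# CI_TIMEOUT=30\necho run\n")
    good.chmod(0o755)
    good_problems = check_launch_script(good)

    bad = Path(tmp) / "launch_training.sh"
    bad.write_text("#!/bin/bash\necho run\n")
    bad.chmod(0o644)
    bad_problems = check_launch_script(bad)
```

The check relies on POSIX permission bits, so the executable test is only meaningful on Linux or macOS.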
# H100
git mv tests/functional_tests/launch_scripts/h100/active/L0_Foo.sh \
    tests/functional_tests/launch_scripts/h100/flaky/L0_Foo.sh
# GB200 (if the test also exists there)
git mv tests/functional_tests/launch_scripts/gb200/active/L0_Foo.sh \
    tests/functional_tests/launch_scripts/gb200/flaky/L0_Foo.sh
Flaky tests still run on manual dispatches (test_suite=all) so failures
remain visible. Move back to active/ once the underlying issue is fixed.
Delete the script file and commit. No other changes required.
- Available markers: unit, integration, system, acceptance, docs, skipduringci, pleasefixme.
- Set CUDA_VISIBLE_DEVICES explicitly for multi-GPU tests.
- Run tests with uv run python -m pytest, never bare pytest.

| GitHub Actions job | Hardware | Directory scanned |
|---|---|---|
| cicd-functional-tests-l0 | H100 | h100/active/L0_*.sh |
| cicd-functional-tests-l1 | H100 | h100/active/L1_*.sh |
| cicd-functional-tests-l2 | H100 | h100/active/L2_*.sh |
| cicd-functional-tests-flaky | H100 | h100/flaky/L*.sh |
| cicd-functional-tests-gb200-l0 | GB200 | gb200/active/L0_*.sh |
| cicd-functional-tests-gb200-l1 | GB200 | gb200/active/L1_*.sh |
| cicd-functional-tests-gb200-l2 | GB200 | gb200/active/L2_*.sh |
| cicd-functional-tests-gb200-flaky | GB200 | gb200/flaky/L*.sh |
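The job-to-directory mapping makes it possible to predict which job picks up a given script. The sketch below copies the glob patterns from the table and matches them with Python's `fnmatch`; the `jobs_for_script` helper is hypothetical, and the real matrix generation in the workflow may match files differently.

```python
from fnmatch import fnmatchcase

# Patterns copied from the table, relative to the launch scripts root.
JOB_PATTERNS = {
    "cicd-functional-tests-l0": "h100/active/L0_*.sh",
    "cicd-functional-tests-l1": "h100/active/L1_*.sh",
    "cicd-functional-tests-l2": "h100/active/L2_*.sh",
    "cicd-functional-tests-flaky": "h100/flaky/L*.sh",
    "cicd-functional-tests-gb200-l0": "gb200/active/L0_*.sh",
    "cicd-functional-tests-gb200-l1": "gb200/active/L1_*.sh",
    "cicd-functional-tests-gb200-l2": "gb200/active/L2_*.sh",
    "cicd-functional-tests-gb200-flaky": "gb200/flaky/L*.sh",
}


def jobs_for_script(relative_path):
    """Return the jobs whose pattern matches a path under launch_scripts/."""
    return sorted(
        job
        for job, pattern in JOB_PATTERNS.items()
        if fnmatchcase(relative_path, pattern)
    )


active_jobs = jobs_for_script("h100/active/L0_Launch_training.sh")
flaky_jobs = jobs_for_script("h100/flaky/L0_Launch_training.sh")
```

Moving the same file from h100/active/ to h100/flaky/ changes the matching job from the blocking L0 job to the flaky job, while any copy under gb200/ is unaffected.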
Hardware runners: H100 uses nemo-ci-{azure,aws}-gpu-x2; GB200 uses nemo-ci-gcp-gpu-x2.
| Component | Path |
|---|---|
| Matrix generation (H100) | @.github/workflows/cicd-main.yml job generate-test-matrix |
| Matrix generation (GB200) | @.github/workflows/cicd-main.yml job generate-gb200-test-matrix |
| Test runner action | @.github/actions/test-template/action.yml |
| Launch scripts root | tests/functional_tests/launch_scripts/ |