| name | mcore-cicd |
| description | CI/CD reference for Megatron-LM. Covers CI pipeline structure, PR scope labels, triggering internal GitLab CI (which force-pushes the current branch to a pull-request/BRANCH ref — always dry-run and verify the destination first; never run against shared or protected branches), and CI failure investigation. |
| license | Apache-2.0 |
| when_to_use | Investigating a CI failure; understanding the pipeline structure; which CI label to attach; triggering internal GitLab CI; 'CI is red', 'how do I trigger CI', 'PR labels', 'where are the logs', 'pull-request branch'. |
| metadata | {"author":"Philip Petrakian <ppetrakian@nvidia.com>"} |
CI/CD Guide
Answer-First CI Facts
For PR-label or trigger questions, lead with the exact values:
- No label: scope=mr-github-slim, n_repeat=5, lightweight=false.
- Run tests: scope=mr-github, n_repeat=1, lightweight=true.
- Run functional tests: scope=mr-github, n_repeat=5, lightweight=false.
- container::lts only switches the container image path to LTS and combines with any scope label.
- Run MBridge tests additionally triggers the MBridge L1 suite.
- ⚠️ WARNING — destructive remote write.
tools/trigger_internal_ci.py
force-pushes the current branch to the internal GitLab remote as
pull-request/<branch>. Always run with --dry-run first and confirm the
destination ref before invoking it without the flag. Never run against a
shared or protected branch — only target your own pull-request branch.
Safe preflight: python tools/trigger_internal_ci.py --gitlab-origin gitlab --dry-run.
Add the optional --functional-test-* flags only after the dry-run output
matches the intended destination.
CI Pipeline Structure
The main workflow is .github/workflows/cicd-main.yml. It triggers on pushes
to branches matching pull-request/[0-9]+ and deploy-release/*, on merge
groups, on a daily schedule, and on manual dispatch.
is-not-external-contributor
└─ pre-flight
   └─ configure                # determines scope, container tag, n_repeat
      ├─ linting
      ├─ cicd-container-build
      │  ├─ cicd-parse-unit-tests → cicd-unit-tests-latest
      │  ├─ cicd-parse-integration-tests-h100 → cicd-integration-tests-latest-h100
      │  └─ cicd-parse-integration-tests-gb200 → cicd-integration-tests-latest-gb200 (maintainers only)
      └─ Nemo_CICD_Test        # final pass/fail gate
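To check locally whether a push to a given branch would match the two branch filters named above, the filters can be translated into regular expressions. This is a minimal sketch, not code from the repository: it assumes GitHub Actions filter semantics (where `*` does not cross a `/`), and the names `TRIGGER_PATTERNS` and `triggers_push_ci` are hypothetical.

```python
import re

# Regex equivalents of the push-branch filters pull-request/[0-9]+ and
# deploy-release/*. Assumption: `*` matches any characters except `/`,
# as in GitHub Actions branch filters.
TRIGGER_PATTERNS = (
    re.compile(r"pull-request/[0-9]+"),
    re.compile(r"deploy-release/[^/]*"),
)


def triggers_push_ci(branch: str) -> bool:
    """Return True if a push to `branch` matches one of the filters."""
    return any(pattern.fullmatch(branch) for pattern in TRIGGER_PATTERNS)
```

Under these assumptions a branch such as pull-request/my-feature does not match, because the filter requires digits after the slash.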
Images are pushed to:
- AWS ECR:
766267172432.dkr.ecr.us-east-1.amazonaws.com/…
- GCP Artifact Registry:
us-east4-docker.pkg.dev/nv-projdgxchipp-20260113193621/megatron-lm/…
CI Test Scope Labels
The CI pipeline reads PR labels to decide test scope, n_repeat, and container image.
Decision tree (first match wins):
| Condition | scope | n_repeat | lightweight | Notes |
|---|---|---|---|---|
| Merge group | mr-github | 1 | false | Automatic, no label needed |
| Label: Run tests | mr-github | 1 | true | Trains 4 steps, no golden-value compare |
| Label: Run functional tests | mr-github | 5 | false | Trains 100 steps, golden-value compare |
| (no label) | mr-github-slim | 5 | false | Slim subset only |
Orthogonal labels (combinable with any scope label):
| Label | Effect |
|---|---|
| container::lts | Use the LTS base image instead of dev |
| Run MBridge tests | Also triggers the MBridge L1 test suite |
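The decision tree and the two orthogonal labels can be expressed as a small function. This is an illustrative sketch of the tables above, not the logic in cicd-main.yml itself; `CIConfig` and `resolve_ci_config` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CIConfig:
    scope: str
    n_repeat: int
    lightweight: bool
    use_lts_image: bool
    run_mbridge: bool


def resolve_ci_config(labels: Iterable[str], is_merge_group: bool = False) -> CIConfig:
    """Apply the decision table in order; the first matching row wins."""
    label_set = set(labels)
    if is_merge_group:
        scope, n_repeat, lightweight = "mr-github", 1, False
    elif "Run tests" in label_set:
        scope, n_repeat, lightweight = "mr-github", 1, True
    elif "Run functional tests" in label_set:
        scope, n_repeat, lightweight = "mr-github", 5, False
    else:
        scope, n_repeat, lightweight = "mr-github-slim", 5, False
    # The orthogonal labels are evaluated independently of the scope row.
    return CIConfig(
        scope=scope,
        n_repeat=n_repeat,
        lightweight=lightweight,
        use_lts_image="container::lts" in label_set,
        run_mbridge="Run MBridge tests" in label_set,
    )
```

For example, `resolve_ci_config(["Run tests", "container::lts"])` yields the lightweight mr-github scope with the LTS image selected.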
Which label to attach when opening a PR
| Changed paths / nature of change | Label to attach |
|---|---|
| Docs only (docs/, *.md, docstrings) | (none) |
| CI/tooling only (.github/, tools/, Makefile) | (none) |
| Test files only (tests/), existing tests, no new golden values | Run tests |
| New test cases added (no golden values exist yet) | Run functional tests |
| Re-enabling a disabled test (scope -broken → active) | Run functional tests |
| Non-numerical library code (logging, error handling, CLI flags, refactors) | Run tests |
| Could affect training numerics (model arch, attention, optimizer, distributed, MoE routing) | Run functional tests |
| Container or dependency changes (docker/, pyproject.toml, uv.lock) | Run tests + container::lts |
| Touches MBridge integration | add Run MBridge tests |
Rule of thumb: default to Run tests. Always use Run functional tests when the PR adds new test cases (golden values must be generated) or when the change could plausibly shift loss curves.
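The path-based rows of the table can be approximated with a simple heuristic. This is a rough sketch under stated assumptions: `suggest_labels` and `_category` are hypothetical helpers, the path prefixes are copied from the table, and the precedence between rows is a guess. It cannot detect docstring-only edits, new test cases, MBridge changes, or changes that affect numerics, so it never suggests Run functional tests; that upgrade remains a human judgment.

```python
def _category(path: str) -> str:
    """Bucket one changed path using the prefixes listed in the table."""
    if path.startswith("docs/") or path.endswith(".md"):
        return "docs"
    if path.startswith((".github/", "tools/")) or path == "Makefile":
        return "tooling"
    if path.startswith("docker/") or path in ("pyproject.toml", "uv.lock"):
        return "deps"
    if path.startswith("tests/"):
        return "tests"
    return "library"


def suggest_labels(changed_paths: list[str]) -> list[str]:
    """Suggest a starting label set; reviewers may still need to upgrade it."""
    categories = {_category(path) for path in changed_paths}
    # Docs-only and CI/tooling-only changes need no label.
    if categories <= {"docs", "tooling"}:
        return []
    # Default to Run tests, per the rule of thumb.
    labels = ["Run tests"]
    if "deps" in categories:
        labels.append("container::lts")
    return labels
```

The file list could come from `gh pr diff --name-only`, but treat the result only as a starting point for the table above.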
Triggering Internal CI
Use tools/trigger_internal_ci.py after the internal GitLab remote and
GITLAB_TOKEN are configured; see @tools/trigger_internal_ci.md for setup
details. First run a dry run and verify the destination ref:
python tools/trigger_internal_ci.py --gitlab-origin gitlab --dry-run
The script force-pushes the current branch to pull-request/<branch> before
triggering the pipeline. Only target your own pull-request branch, never a shared
or protected branch. Add optional --functional-test-* flags only after the
dry-run output matches the intended destination.
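The dry-run-first rule can be enforced with a small wrapper. The sketch below only builds the command line and checks a destination ref that you have read from the dry-run output yourself; it does not parse the script's output, because that format is not documented here. `build_trigger_command` and `destination_is_safe` are hypothetical helpers, and the default protected set containing main is an assumption to adjust for your remote.

```python
def build_trigger_command(gitlab_origin: str = "gitlab", dry_run: bool = True) -> list[str]:
    """Build the trigger command; dry-run is the default on purpose."""
    cmd = ["python", "tools/trigger_internal_ci.py", "--gitlab-origin", gitlab_origin]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def destination_is_safe(
    destination_ref: str,
    current_branch: str,
    protected: frozenset[str] = frozenset({"main"}),  # assumption: adjust as needed
) -> bool:
    """Accept only pull-request/<current_branch>, and never a protected branch."""
    if current_branch in protected:
        return False
    return destination_ref == f"pull-request/{current_branch}"
```

A typical flow would run `build_trigger_command()` via `subprocess.run`, read the destination ref from the output, and invoke `build_trigger_command(dry_run=False)` only if `destination_is_safe` returns True.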
CI Failure Investigation
CI branches on GitHub always follow the pattern pull-request/<number>, where <number> is the PR number.
Locating the PR from a CI Branch
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM
gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM
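The `grep -oP` call above relies on PCRE support, which BSD grep (the macOS default) does not provide. A portable alternative is to do the extraction in Python. This is a minimal sketch; `pr_number_from_branch` is a hypothetical helper, and it is stricter than the grep version because it requires the whole branch name to match.

```python
import re
from typing import Optional


def pr_number_from_branch(branch: str) -> Optional[int]:
    """Return the PR number for a pull-request/<number> branch, else None."""
    match = re.fullmatch(r"pull-request/(\d+)", branch)
    return int(match.group(1)) if match else None
```

The branch name itself can still come from `git rev-parse --abbrev-ref HEAD`, as in the shell version.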
Reading CI Job Logs
gh run list --repo NVIDIA/Megatron-LM --branch "pull-request/$PR_NUMBER"
gh run view <run-id> --repo NVIDIA/Megatron-LM --log-failed
Full per-rank logs are not in the runner stdout. They are uploaded as
GitHub artifacts named logs-<test_case>-<run_id>-<uuid>.
gh api "repos/NVIDIA/Megatron-LM/actions/runs/<run-id>/artifacts" \
--jq '.artifacts[].name'
gh run download <run-id> --repo NVIDIA/Megatron-LM \
--name "logs-<test_case>-<run_id>-<uuid>" -D ./ci-logs
grep -r -l "ERROR\|Traceback\|FAILED\|fatal" ./ci-logs/
wc -l ./ci-logs/<test>/<attempt>/attempt_0/<rank>/stderr.log
sed -n '1,200p' ./ci-logs/.../stderr.log
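The grep and sed steps above can be combined into one pass that reports, for each downloaded log file, the first line containing an error marker. This is an illustrative sketch: `first_error_lines` is a hypothetical helper, and the marker list simply mirrors the grep pattern above.

```python
import re
from pathlib import Path

# Same markers as the grep command above.
ERROR_MARKERS = re.compile(r"ERROR|Traceback|FAILED|fatal")


def first_error_lines(log_root: str) -> dict[Path, int]:
    """Map each file under log_root to the 1-based line of its first marker."""
    hits: dict[Path, int] = {}
    for path in sorted(Path(log_root).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has invalid bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if ERROR_MARKERS.search(line):
                    hits[path] = lineno
                    break
    return hits
```

The returned line number gives a starting point for `sed -n`, so you can print a window around the first failure rather than the top of the file.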
Identifying Failure Root Cause
- Linting failure: re-run tools/autoformat.sh locally; the diff shows exactly what needs to change.
- Container build failure: inspect the cicd-container-build job log.
- Unit test failure: the failing bucket is in the cicd-unit-tests-latest job matrix.
- Functional test failure: look at the cicd-integration-tests-* job. Start with stdout.log for rank 0.
- Flaky test: the runner retries automatically up to 3 times. If all retries are exhausted and the failure matches a known transient pattern (NCCL, ECC, segfault), treat it as infrastructure noise.
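The flaky-test rule can be written as a quick triage check. This is a hedged sketch, not the runner's actual retry logic: `looks_like_infra_noise` is a hypothetical helper, and the regular expression is only one guess at how the NCCL, ECC, and segfault signatures appear in logs.

```python
import re

MAX_RETRIES = 3  # the runner retries up to 3 times, per the text above

# Assumption: these strings approximate the known transient signatures.
TRANSIENT_PATTERNS = re.compile(r"\bNCCL\b|\bECC\b|segfault|Segmentation fault")


def looks_like_infra_noise(log_text: str, retries_used: int) -> bool:
    """True only if retries are exhausted and a transient signature appears."""
    retries_exhausted = retries_used >= MAX_RETRIES
    return retries_exhausted and bool(TRANSIENT_PATTERNS.search(log_text))
```

A False result does not prove the failure is real; it only means the log did not match this particular pattern list.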
Correlating a Failure with the PR Changeset
grep -r "from megatron.core.transformer.attention" tests/unit_tests/ -l
grep "<changed-path>" .github/CODEOWNERS