cicd
// CI/CD reference for Megatron-LM. Covers CI pipeline structure, PR scope labels, triggering internal GitLab CI, and CI failure investigation.
| name | cicd |
| description | CI/CD reference for Megatron-LM. Covers CI pipeline structure, PR scope labels, triggering internal GitLab CI, and CI failure investigation. |
| when_to_use | Investigating a CI failure; understanding the pipeline structure; which CI label to attach; triggering internal GitLab CI; 'CI is red', 'how do I trigger CI', 'PR labels', 'where are the logs', 'pull-request branch'. |
The main workflow is .github/workflows/cicd-main.yml. It triggers on pushes
to branches matching pull-request/[0-9]+ and deploy-release/*, on merge
groups, on a daily schedule, and on manual dispatch.
is-not-external-contributor
└─ pre-flight
   └─ configure  # determines scope, container tag, n_repeat
      ├─ linting
      ├─ cicd-container-build
      │  ├─ cicd-parse-unit-tests → cicd-unit-tests-latest
      │  ├─ cicd-parse-integration-tests-h100 → cicd-integration-tests-latest-h100
      │  └─ cicd-parse-integration-tests-gb200 → cicd-integration-tests-latest-gb200 (maintainers only)
      └─ Nemo_CICD_Test  # final pass/fail gate
Images are pushed to:
- 766267172432.dkr.ecr.us-east-1.amazonaws.com/…
- us-east4-docker.pkg.dev/nv-projdgxchipp-20260113193621/megatron-lm/…

The CI pipeline reads PR labels to decide test scope, n_repeat, and container image.
Decision tree (first match wins):
| Condition | scope | n_repeat | lightweight | Notes |
|---|---|---|---|---|
| Merge group | mr-github | 1 | false | Automatic, no label needed |
| Label: Run tests | mr-github | 1 | true | Trains 4 steps, no golden-value compare |
| Label: Run functional tests | mr-github | 5 | false | Trains 100 steps, golden-value compare |
| (no label) | mr-github-slim | 5 | false | Slim subset only |
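The first-match-wins rule in the table above can be sketched in a few lines of Python. This is only an illustration of the table, not the logic the workflow actually runs; the names `CIConfig` and `resolve_ci_config` are hypothetical.

```python
from typing import NamedTuple, Set


class CIConfig(NamedTuple):
    """One row of the decision table (hypothetical container type)."""
    scope: str
    n_repeat: int
    lightweight: bool


def resolve_ci_config(labels: Set[str], is_merge_group: bool = False) -> CIConfig:
    """Walk the table top to bottom and return the first matching row."""
    if is_merge_group:
        # Merge groups need no label.
        return CIConfig("mr-github", 1, False)
    if "Run tests" in labels:
        # Lightweight run: 4 training steps, no golden-value compare.
        return CIConfig("mr-github", 1, True)
    if "Run functional tests" in labels:
        # Full run: 100 training steps with golden-value compare.
        return CIConfig("mr-github", 5, False)
    # No scope label: slim subset only.
    return CIConfig("mr-github-slim", 5, False)
```

Because the first match wins, a PR carrying both scope labels resolves to the `Run tests` row under this reading of the table.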
Orthogonal image label:
| Label | Effect |
|---|---|
| container::lts | Use the LTS base image instead of dev (combinable with any scope label) |
| Run MBridge tests | Also triggers the MBridge L1 test suite |
| Changed paths / nature of change | Label to attach |
|---|---|
| Docs only (docs/, *.md, docstrings) | (none) |
| CI/tooling only (.github/, tools/, Makefile) | (none) |
| Test files only (tests/) — existing tests, no new golden values | Run tests |
| New test cases added (no golden values exist yet) | Run functional tests |
| Re-enabling a disabled test (scope -broken → active) | Run functional tests |
| Non-numerical library code (logging, error handling, CLI flags, refactors) | Run tests |
| Could affect training numerics (model arch, attention, optimizer, distributed, MoE routing) | Run functional tests |
| Container or dependency changes (docker/, pyproject.toml, uv.lock) | Run tests + container::lts |
| Touches MBridge integration | add Run MBridge tests |
Rule of thumb: default to Run tests. Always use Run functional tests when the PR adds new test cases (golden values must be generated) or when the change could plausibly shift loss curves.
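The path-based rows of the table can be approximated from a list of changed files. The sketch below is a rough aid, not part of the repository: `suggest_labels` and its helpers are hypothetical. It assumes that dependency changes take precedence and that a mix of docs and tooling paths needs no label. It cannot tell whether a change affects training numerics or adds new test cases, so those rows still need human judgment.

```python
from typing import Iterable, List


def _is_docs(path: str) -> bool:
    # Docstring-only edits cannot be detected from the path alone.
    return path.startswith("docs/") or path.endswith(".md")


def _is_tooling(path: str) -> bool:
    return path.startswith((".github/", "tools/")) or path == "Makefile"


def _is_dependency(path: str) -> bool:
    return path.startswith("docker/") or path in {"pyproject.toml", "uv.lock"}


def suggest_labels(changed_paths: Iterable[str]) -> List[str]:
    """Suggest PR labels from changed paths (path-detectable rows only)."""
    paths = list(changed_paths)
    if any(_is_dependency(p) for p in paths):
        # Assumption: container/dependency changes win over other rows.
        return ["Run tests", "container::lts"]
    if paths and all(_is_docs(p) or _is_tooling(p) for p in paths):
        return []
    # Rule of thumb: default to Run tests. Upgrade to Run functional tests
    # by hand when numerics or new golden values are involved.
    return ["Run tests"]
```

For example, `suggest_labels(["uv.lock"])` yields both `Run tests` and `container::lts`, while a change touching only `docs/` yields no label.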
Use tools/trigger_internal_ci.py to push the current branch to the internal
GitLab remote and trigger a pipeline — without touching the GitLab UI.
Full setup and usage details: @tools/trigger_internal_ci.md.
Prerequisites (one-time):
# 1. Add the internal GitLab remote
git remote add gitlab git@<gitlab-hostname>:ADLR/Megatron-LM.git
# 2. Create a personal access token with 'api' scope on your GitLab profile,
# then store it:
export GITLAB_TOKEN=glpat-<your-token>
Usage:
python tools/trigger_internal_ci.py \
--gitlab-origin gitlab \
[--functional-test-scope mr] \
[--functional-test-repeat 5] \
[--functional-test-cases all] \
[--dry-run]
The script force-pushes the current branch as pull-request/<branch> and
prints the resulting pipeline URL.
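When the trigger script is called from other tooling, it can help to assemble the argument list in one place. The helper below is a hypothetical sketch (`build_trigger_command` is not part of the repository). It uses only the flags shown in the usage above and does not run anything itself.

```python
from typing import List, Optional


def build_trigger_command(
    gitlab_origin: str = "gitlab",
    scope: Optional[str] = None,
    repeat: Optional[int] = None,
    cases: Optional[str] = None,
    dry_run: bool = False,
) -> List[str]:
    """Build the argv for tools/trigger_internal_ci.py; omitted options are left out."""
    cmd = ["python", "tools/trigger_internal_ci.py", "--gitlab-origin", gitlab_origin]
    if scope is not None:
        cmd += ["--functional-test-scope", scope]
    if repeat is not None:
        cmd += ["--functional-test-repeat", str(repeat)]
    if cases is not None:
        cmd += ["--functional-test-cases", cases]
    if dry_run:
        cmd.append("--dry-run")
    return cmd
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from the repository root, with `GITLAB_TOKEN` exported as described in the prerequisites. Starting with `dry_run=True` is a cautious first step, since the real run force-pushes the branch.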
CI branches always follow the pattern pull-request/<number>.
# Extract PR number from the current branch
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
# Fetch the PR metadata (title, labels, author, base branch)
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM
# Show the changeset for that PR
gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM
# List recent workflow runs for the PR
gh run list --repo NVIDIA/Megatron-LM --branch "pull-request/$PR_NUMBER"
# Stream failing job output
gh run view <run-id> --repo NVIDIA/Megatron-LM --log-failed
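The `grep -oP` call above relies on GNU grep's Perl-regex mode, which stock BSD/macOS grep does not provide. A portable alternative is to parse the branch name in Python. This is a minimal sketch; `pr_number_from_branch` is a hypothetical helper, not an existing tool in the repository.

```python
import re
from typing import Optional


def pr_number_from_branch(branch: str) -> Optional[int]:
    """Return the PR number for a pull-request/<number> branch, else None."""
    match = re.fullmatch(r"pull-request/(\d+)", branch)
    return int(match.group(1)) if match else None
```

The branch name itself can come from `git rev-parse --abbrev-ref HEAD`, as in the shell version. A `None` result signals that the current branch is not a CI branch.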
Full per-rank logs are not in the runner stdout. They are uploaded as
GitHub artifacts named logs-<test_case>-<run_id>-<uuid>.
# 1. Find artifact name
gh api repos/NVIDIA/Megatron-LM/actions/runs/<run-id>/artifacts \
--jq '.artifacts[].name'
# 2. Download the artifact zip
gh run download <run-id> --repo NVIDIA/Megatron-LM \
--name "logs-<artifact-name>" -D ./ci-logs
# 3. Locate which rank logs contain errors
grep -r -l "ERROR\|Traceback\|FAILED\|fatal" ./ci-logs/
# 4. Log files can exceed 10 000 lines — never read a full log at once.
wc -l ./ci-logs/<test>/<attempt>/attempt_0/<rank>/stderr.log
sed -n '1,200p' ./ci-logs/.../stderr.log # read in chunks
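Since a full log should never be read at once, one option is to stream it and keep only the line numbers that match the error markers from step 3. The helpers below are a hypothetical sketch (`find_error_lines` and `scan_log` are not part of the repository). They reuse the same marker pattern as the `grep` command.

```python
import itertools
import re
from typing import Iterable, Iterator, List, Tuple

# Same markers as the grep command in step 3 (case-sensitive).
ERROR_PATTERN = re.compile(r"ERROR|Traceback|FAILED|fatal")


def find_error_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (1-based line number, text) for each line containing a marker."""
    for lineno, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            yield lineno, line.rstrip("\n")


def scan_log(path: str, limit: int = 20) -> List[Tuple[int, str]]:
    """Stream a log file and return at most `limit` matches."""
    with open(path, errors="replace") as handle:
        # Iterating the file handle reads one line at a time.
        return list(itertools.islice(find_error_lines(handle), limit))
```

The returned line numbers can then be fed to `sed -n '<start>,<end>p'` to read a small window around each hit.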
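- Linting failure: run tools/autoformat.sh locally; the diff shows exactly what needs to change.
- Container build failure: check the cicd-container-build job log.
- Unit test failure: check the cicd-unit-tests-latest job matrix.
- Integration test failure: check the cicd-integration-tests-* job. Start with stdout.log for rank 0.

# Find unit tests that cover a changed source file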
grep -r "from megatron.core.transformer.attention" tests/unit_tests/ -l
# Check CODEOWNERS for reviewer assignment
cat .github/CODEOWNERS | grep "<changed-path>"