mcore-cicd

CI/CD reference for Megatron-LM. Covers CI pipeline structure, PR scope labels, triggering internal GitLab CI (which force-pushes the current branch to a pull-request/BRANCH ref — always dry-run and verify the destination first; never run against shared or protected branches), and CI failure investigation.

Updated: May 29, 2026 at 20:11

Source: NVIDIA/Megatron-LM

name	mcore-cicd
description	CI/CD reference for Megatron-LM. Covers CI pipeline structure, PR scope labels, triggering internal GitLab CI (which force-pushes the current branch to a pull-request/BRANCH ref — always dry-run and verify the destination first; never run against shared or protected branches), and CI failure investigation.
license	Apache-2.0
when_to_use	Investigating a CI failure; understanding the pipeline structure; which CI label to attach; triggering internal GitLab CI; 'CI is red', 'how do I trigger CI', 'PR labels', 'where are the logs', 'pull-request branch'.
metadata	{"author":"Philip Petrakian <ppetrakian@nvidia.com>"}

CI/CD Guide

Answer-First CI Facts

For PR-label or trigger questions, lead with the exact values:

No label: scope=mr-github-slim, n_repeat=5, lightweight=false.
Run tests: scope=mr-github, n_repeat=1, lightweight=true.
Run functional tests: scope=mr-github, n_repeat=5, lightweight=false.
container::lts only switches the container image path to LTS and combines with any scope label.
Run MBridge tests additionally triggers the MBridge L1 suite.
⚠️ WARNING: destructive remote write. tools/trigger_internal_ci.py force-pushes the current branch to the internal GitLab remote as pull-request/<branch>. Always run with --dry-run first and confirm the destination ref before invoking it without the flag. Never run against a shared or protected branch; only target your own pull-request branch.
Safe preflight: python tools/trigger_internal_ci.py --gitlab-origin gitlab --dry-run
Add the optional --functional-test-* flags only after the dry-run output matches the intended destination.

CI Pipeline Structure

The main workflow is .github/workflows/cicd-main.yml. It triggers on pushes to branches matching pull-request/[0-9]+ and deploy-release/*, on merge groups, on a daily schedule, and on manual dispatch.

is-not-external-contributor
  └─ pre-flight
       └─ configure          # determines scope, container tag, n_repeat
            ├─ linting
            ├─ cicd-container-build
            │    ├─ cicd-parse-unit-tests → cicd-unit-tests-latest
            │    ├─ cicd-parse-integration-tests-h100 → cicd-integration-tests-latest-h100
            │    └─ cicd-parse-integration-tests-gb200 → cicd-integration-tests-latest-gb200 (maintainers only)
            └─ Nemo_CICD_Test  # final pass/fail gate

Images are pushed to:

AWS ECR: 766267172432.dkr.ecr.us-east-1.amazonaws.com/…
GCP Artifact Registry: us-east4-docker.pkg.dev/nv-projdgxchipp-20260113193621/megatron-lm/…
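
As a rough illustration of the push triggers named above, the following shell sketch checks whether a branch name would match pull-request/[0-9]+ or deploy-release/*. The function name branch_triggers_push_ci is hypothetical and is not taken from cicd-main.yml. The matching only approximates GitHub's branch filter syntax, in which * does not match a /.

```shell
# Hypothetical helper: approximates the push-branch filters of cicd-main.yml.
# Returns 0 if the branch name would match, 1 otherwise.
branch_triggers_push_ci() {
  local rest
  case "$1" in
    pull-request/*)
      rest=${1#pull-request/}
      # pull-request/[0-9]+ : one or more digits and nothing else
      case "$rest" in
        ""|*[!0-9]*) return 1 ;;
        *) return 0 ;;
      esac
      ;;
    deploy-release/*)
      rest=${1#deploy-release/}
      # In the filter syntax, '*' does not match '/'
      case "$rest" in
        */*) return 1 ;;
        *) return 0 ;;
      esac
      ;;
    *) return 1 ;;
  esac
}
```

For example, branch_triggers_push_ci pull-request/1234 succeeds, while a feature branch such as my-fix does not, which is why a PR only gets CI once its pull-request/<number> branch exists.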

CI Test Scope Labels

The CI pipeline reads PR labels to decide test scope, n_repeat, lightweight mode, and container image.

Decision tree (first match wins):

Condition	`scope`	`n_repeat`	`lightweight`	Notes
Merge group	`mr-github`	1	false	Automatic, no label needed
Label: `Run tests`	`mr-github`	1	true	Trains 4 steps, no golden-value compare
Label: `Run functional tests`	`mr-github`	5	false	Trains 100 steps, golden-value compare
(no label)	`mr-github-slim`	5	false	Slim subset only
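
The first-match-wins order of the table can be sketched as a small shell function. This is only an illustration of the table, not the logic of the configure job itself. The name resolve_ci_scope, its argument layout, and its output format are all hypothetical.

```shell
# Hypothetical helper mirroring the decision table; not the workflow's real code.
# Usage: resolve_ci_scope <is_merge_group: true|false> [label ...]
resolve_ci_scope() {
  local is_merge_group="$1"
  shift
  local label has_run_tests=false has_functional=false
  for label in "$@"; do
    case "$label" in
      "Run tests") has_run_tests=true ;;
      "Run functional tests") has_functional=true ;;
    esac
  done
  # Rows are checked in table order; the first match wins.
  if [ "$is_merge_group" = true ]; then
    echo "scope=mr-github n_repeat=1 lightweight=false"
  elif [ "$has_run_tests" = true ]; then
    echo "scope=mr-github n_repeat=1 lightweight=true"
  elif [ "$has_functional" = true ]; then
    echo "scope=mr-github n_repeat=5 lightweight=false"
  else
    echo "scope=mr-github-slim n_repeat=5 lightweight=false"
  fi
}
```

Because Run tests is checked before Run functional tests, a PR carrying both labels resolves to the lightweight configuration in this sketch.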

Orthogonal labels:

Label	Effect
`container::lts`	Use the LTS base image instead of `dev` (combinable with any scope label)
`Run MBridge tests`	Also triggers the MBridge L1 test suite

Which label to attach when opening a PR

Changed paths / nature of change	Label to attach
Docs only (`docs/`, `*.md`, docstrings)	(none)
CI/tooling only (`.github/`, `tools/`, `Makefile`)	(none)
Test files only (`tests/`) — existing tests, no new golden values	`Run tests`
New test cases added (no golden values exist yet)	`Run functional tests`
Re-enabling a disabled test (scope `-broken` → active)	`Run functional tests`
Non-numerical library code (logging, error handling, CLI flags, refactors)	`Run tests`
Could affect training numerics (model arch, attention, optimizer, distributed, MoE routing)	`Run functional tests`
Container or dependency changes (`docker/`, `pyproject.toml`, `uv.lock`)	`Run tests` + `container::lts`
Touches MBridge integration	add `Run MBridge tests`

Rule of thumb: default to Run tests. Always use Run functional tests when the PR adds new test cases (golden values must be generated) or when the change could plausibly shift loss curves.
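
The path-based rows of the table can be approximated with a shell case statement. This is a sketch under assumptions: suggest_label_for_path is a hypothetical name, and a path alone cannot reveal whether a change affects training numerics or adds new test cases. The fallback is therefore the rule-of-thumb default and must be upgraded to Run functional tests by hand where the table calls for it.

```shell
# Hypothetical helper: maps one changed path to the label suggested by the table.
# Only the path-based rows are covered; numerics-affecting changes and new test
# cases still need a manual decision.
suggest_label_for_path() {
  case "$1" in
    docs/*|*.md)                      echo "(none)" ;;
    .github/*|tools/*|Makefile)       echo "(none)" ;;
    docker/*|pyproject.toml|uv.lock)  echo "Run tests + container::lts" ;;
    tests/*)                          echo "Run tests" ;;
    *)                                echo "Run tests" ;;  # rule-of-thumb default
  esac
}
```

One possible use is to feed it the output of git diff --name-only for the PR and take the most demanding label across all changed paths.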

Triggering Internal CI

Use tools/trigger_internal_ci.py after the internal GitLab remote and GITLAB_TOKEN are configured; see @tools/trigger_internal_ci.md for setup details. First run a dry run and verify the destination ref:

python tools/trigger_internal_ci.py --gitlab-origin gitlab --dry-run

The script force-pushes the current branch to pull-request/<branch> before triggering the pipeline. Only target your own pull-request branch, never a shared or protected branch. Add optional --functional-test-* flags only after the dry-run output matches the intended destination.
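
A guarded preflight along the lines described above might look like the following sketch. The function names and the list of shared branch names (main, master, dev, release/*) are assumptions for illustration; adjust them to the branches that are protected on your remote. Only the --gitlab-origin and --dry-run flags from the text above are used.

```shell
# Hypothetical guard: the list of shared branch names is an assumption.
is_personal_branch() {
  case "$1" in
    ""|HEAD|main|master|dev|release/*) return 1 ;;
    *) return 0 ;;
  esac
}

# Runs the dry run only; it never invokes the script without --dry-run.
preflight_internal_ci() {
  local branch
  branch=$(git rev-parse --abbrev-ref HEAD) || return 1
  if ! is_personal_branch "$branch"; then
    echo "refusing: '$branch' looks shared or protected" >&2
    return 1
  fi
  python tools/trigger_internal_ci.py --gitlab-origin gitlab --dry-run
}
```

Calling preflight_internal_ci from the repository root leaves the real trigger as a separate, deliberate step taken only after the dry-run output has been checked.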

CI Failure Investigation

On GitHub, CI branches follow the pattern pull-request/<number>, where <number> is the PR number.

Locating the PR from a CI Branch

# Extract PR number from the current branch
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')

# Fetch the PR metadata (title, labels, author, base branch)
gh pr view "$PR_NUMBER" --repo NVIDIA/Megatron-LM

# Show the changeset for that PR
gh pr diff "$PR_NUMBER" --repo NVIDIA/Megatron-LM
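
The grep -oP call above relies on PCRE support, which BSD/macOS grep does not provide. A portable alternative, sketched here with a hypothetical function name, uses only POSIX parameter expansion and fails cleanly when the branch is not a CI branch.

```shell
# Hypothetical helper: prints the PR number for a pull-request/<number> branch,
# or returns 1 if the branch name does not have that shape.
pr_number_from_branch() {
  local number
  case "$1" in
    pull-request/*) number=${1#pull-request/} ;;
    *) return 1 ;;
  esac
  case "$number" in
    ""|*[!0-9]*) return 1 ;;   # reject empty or non-numeric suffixes
  esac
  echo "$number"
}
```

It can be combined with the commands above as PR_NUMBER=$(pr_number_from_branch "$(git rev-parse --abbrev-ref HEAD)").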

Reading CI Job Logs

# List recent workflow runs for the PR
gh run list --repo NVIDIA/Megatron-LM --branch "pull-request/$PR_NUMBER"

# Stream failing job output
gh run view <run-id> --repo NVIDIA/Megatron-LM --log-failed

Full per-rank logs are not in the runner stdout. They are uploaded as GitHub artifacts named logs-<test_case>-<run_id>-<uuid>.

# 1. Find the artifact name via the REST API
gh api --paginate "repos/NVIDIA/Megatron-LM/actions/runs/<run-id>/artifacts" \
  --jq '.artifacts[].name'

# 2. Download and extract the artifact (the name already starts with logs-)
gh run download <run-id> --repo NVIDIA/Megatron-LM \
  --name "<artifact-name>" -D ./ci-logs

# 3. Locate which rank logs contain errors
grep -rlE "ERROR|Traceback|FAILED|fatal" ./ci-logs/

# 4. Log files can exceed 10 000 lines — never read a full log at once.
wc -l ./ci-logs/<test>/<attempt>/attempt_0/<rank>/stderr.log
sed -n '1,200p' ./ci-logs/.../stderr.log   # read in chunks
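
Chunked reading as in step 4 can be wrapped in two small helpers. This is a minimal sketch with hypothetical function names: one prints a fixed-size window of a log, the other lists the line numbers of the first few error markers so the window can be centred on them.

```shell
# Print COUNT lines of FILE starting at line START (1-based), then stop reading.
log_chunk() {
  local file="$1" start="$2" count="$3"
  local end=$((start + count - 1))
  sed -n "${start},${end}p;${end}q" "$file"
}

# List "line:text" for the first MAX (default 20) error markers in FILE.
error_lines() {
  local file="$1" max="${2:-20}"
  grep -nE "ERROR|Traceback|FAILED|fatal" "$file" | head -n "$max"
}
```

A typical sequence is to run error_lines on a rank's stderr.log, pick a line number, and then call log_chunk with a start about a hundred lines earlier.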

Identifying Failure Root Cause

Linting failure — re-run tools/autoformat.sh locally; the diff shows exactly what needs to change.
Container build failure — inspect the cicd-container-build job log.
Unit test failure — the failing bucket is in the cicd-unit-tests-latest job matrix.
Functional test failure — look at the cicd-integration-tests-* job. Start with stdout.log for rank 0.
Flaky test — the runner retries automatically up to 3 times. If all retries are exhausted and the pattern matches a known transient (NCCL, ECC, segfault), it is infrastructure noise.
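
A first-pass check for the known transients named above could be sketched as follows. The function name and the exact regular expression are assumptions; the real log strings for NCCL, ECC, and segfault failures may differ, so treat a match as a hint to inspect the log rather than a verdict.

```shell
# Hypothetical helper: returns 0 if FILE contains a line resembling a known
# transient failure. The patterns are illustrative guesses, not a vetted list.
is_known_transient() {
  grep -qiE 'NCCL.*(error|timeout)|ECC error|segmentation fault|segfault' "$1"
}
```

The pattern requires an error or timeout on the same line as NCCL so that ordinary NCCL informational output does not count as a failure.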

Correlating a Failure with the PR Changeset

# Find unit tests that cover a changed source file (-F matches the dotted module name literally)
grep -rlF "from megatron.core.transformer.attention" tests/unit_tests/

# Check CODEOWNERS for reviewer assignment
grep "<changed-path>" .github/CODEOWNERS
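
To avoid typing the module name by hand, the changed file path can be converted into its dotted import form first. The sketch below uses hypothetical function names and only finds from-style imports, so tests that use a plain import statement or exercise the module indirectly will be missed.

```shell
# Convert a source path such as megatron/core/transformer/attention.py
# into the dotted module name megatron.core.transformer.attention.
module_from_path() {
  local path="${1%.py}"
  printf '%s\n' "$path" | tr '/' '.'
}

# List test files under DIR (default tests/unit_tests/) that import the module.
tests_importing() {
  # -F: match the dotted name literally; -l: print matching file names only
  grep -rlF "from $(module_from_path "$1")" "${2:-tests/unit_tests/}"
}
```

For instance, tests_importing megatron/core/transformer/attention.py is equivalent to the first grep command above.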