build-and-dependency
| Field | Value |
|---|---|
| name | build-and-dependency |
| description | Container-based dev environment setup and dependency management for Megatron-LM. Covers acquiring and launching the CI container, uv package management, and updating uv.lock. |
| when_to_use | Adding, removing, or updating a dependency; editing pyproject.toml or uv.lock; uv.lock merge conflict; setting up a dev environment; pulling or building the CI container; container build errors; uv errors; 'how do I install', 'uv sync fails', 'ModuleNotFoundError'. |
The core principle: build and develop inside containers — the CI container ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions (TransformerEngine, DeepEP, …) that cannot be reproduced on a bare host.
Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine, and optional components like ModelOpt and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships Dockerfiles that pin every dependency.
Use the container as your development environment. This guarantees:
- uv.lock resolves the same way locally and in CI.

Two image variants exist, controlled by the IMAGE_TYPE build arg and the
container::lts PR label:
| Variant | Base image pin | uv group | When used |
|---|---|---|---|
| dev | docker/.ngc_version.dev | dev | Default: CI, local development, most PRs |
| lts | docker/.ngc_version.lts | lts | Stability testing; excludes ModelOpt and other bleeding-edge extras |
Use dev for everything unless you have a specific reason to test lts.
CI runs dev by default; attach container::lts to a PR only when verifying
compatibility with the stable stack (e.g. a dependency upgrade that must not
break LTS users). The @pytest.mark.flaky_in_dev marker skips tests in the
dev environment; @pytest.mark.flaky skips them in lts.
Option A — NVIDIA-internal: pull a CI-built image
⚠️ Requires access to the internal GitLab instance. See @tools/trigger_internal_ci.md for setup (adding the git remote, obtaining a token).
The internal GitLab CI publishes images to its container registry.
Derive the registry host from your configured gitlab remote — the same
host you use for trigger_internal_ci.py:
# Derive host from your 'gitlab' remote:
GITLAB_HOST=$(git remote get-url gitlab | sed 's/.*@\(.*\):.*/\1/')
docker pull ${GITLAB_HOST}/adlr/megatron-lm/mcore_ci_dev:main
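The sed expression above assumes an SSH-style remote (git@host:group/repo.git); for an HTTPS remote without a user prefix it would print the whole URL unchanged. A minimal sketch of a more tolerant alternative, using only POSIX parameter expansion. The function name registry_host_from_url is hypothetical and not part of the repository:

```shell
# Hypothetical helper (not in the repo): print the bare hostname of a
# git remote URL. Handles scp-style (git@host:path), ssh:// and https://.
registry_host_from_url() {
  url=$1
  url=${url#*://}   # drop a scheme such as ssh:// or https://
  url=${url%%/*}    # drop everything from the first slash (the path)
  url=${url#*@}     # drop a user prefix such as git@
  url=${url%%:*}    # drop a port or the scp-style path after the colon
  printf '%s\n' "$url"
}

# Usage, inside a clone that has a 'gitlab' remote:
#   GITLAB_HOST=$(registry_host_from_url "$(git remote get-url gitlab)")
```

Because the port is always stripped, the result can be used directly in the docker pull command above.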
Option B — Build from scratch (works for everyone)
⚠️ Dockerfile.ci.dev has two stages: main and jet. The jet stage requires an internal build secret and will fail without it. Always pass --target main to stop at the public stage.
# dev image (default)
docker build \
--target main \
--build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
--build-arg IMAGE_TYPE=dev \
-f docker/Dockerfile.ci.dev \
-t megatron-lm:local .
# lts image
docker build \
--target main \
--build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
--build-arg IMAGE_TYPE=lts \
-f docker/Dockerfile.ci.dev \
-t megatron-lm:local-lts .
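The two build commands above differ only in the variant name and the image tag. As an illustration, the sketch below prints the matching command for a given variant so it can be inspected before running. The function name build_cmd is hypothetical; the flags, paths, and tags are copied from the commands above:

```shell
# Hypothetical helper (not in the repo): print the docker build command
# for a variant without executing it. $(cat ...) is left unexpanded on
# purpose so the printed command reads the pin file when it is run.
build_cmd() {
  variant=${1:-dev}
  case $variant in
    dev) tag=megatron-lm:local ;;
    lts) tag=megatron-lm:local-lts ;;
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  printf 'docker build --target main --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.%s) --build-arg IMAGE_TYPE=%s -f docker/Dockerfile.ci.dev -t %s .\n' \
    "$variant" "$variant" "$tag"
}

# Usage, from the repository root:
#   build_cmd lts            # inspect the command
#   eval "$(build_cmd lts)"  # run it
```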
In CI, the container::lts PR label selects the lts variant; without that
label, CI uses dev.
Option A — Local Docker runtime
docker run --rm --gpus all \
-v $(pwd):/workspace \
-w /workspace \
megatron-lm:local \
bash -c "<your command>"
Option B — Slurm cluster (for those without a local Docker runtime)
NVIDIA clusters typically use Pyxis + enroot. Request an interactive session:
srun \
--nodes=1 --gpus-per-node=8 \
--container-image megatron-lm:local \
--container-mounts $(pwd):/workspace \
--container-workdir /workspace \
--pty bash
For clusters that require a .sqsh archive first:
enroot import -o megatron-lm.sqsh dockerd://megatron-lm:local
srun \
--nodes=1 --gpus-per-node=8 \
--container-image $(pwd)/megatron-lm.sqsh \
--container-mounts $(pwd):/workspace \
--container-workdir /workspace \
--pty bash
Dependencies are declared in pyproject.toml. The venv lives at /opt/venv
inside the container (already on PATH).
All uv operations must be run inside the container. Never run uv sync or uv pip install on the host.
| Group | Purpose |
|---|---|
| training | Runtime training extras |
| dev | Full dev environment (TransformerEngine, ModelOpt, …) |
| lts | LTS-safe subset (no ModelOpt) |
| test | pytest, coverage, nemo-run |
| linting | ruff, black, isort, pylint |
| build | Cython, pybind11, nvidia-mathdx |
Install commands (inside the container):
# Full dev + test environment
uv sync --locked --group dev --group test
# Linting only
uv sync --locked --only-group linting
# LTS environment
uv sync --locked --group lts --group test
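Since uv must never be run on the host, a small guard in front of the install commands can catch accidental host invocations. The sketch below is a heuristic only: it assumes the container venv exposes its executables at /opt/venv/bin (the usual layout for a venv rooted at /opt/venv), and the function name venv_on_path is hypothetical:

```shell
# Hypothetical guard (not in the repo): succeed only if /opt/venv/bin is
# an entry of the given PATH-style string (defaults to the current PATH).
# Assumption: the container venv at /opt/venv puts its bin/ dir on PATH.
venv_on_path() {
  case ":${1:-$PATH}:" in
    *:/opt/venv/bin:*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   venv_on_path && uv sync --locked --group dev --group test
```

A host that happens to have its own /opt/venv would pass this check, so treat it as a convenience, not a guarantee.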
Several dependencies are sourced directly from git (TransformerEngine, nemo-run,
FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). The locked uv.lock file
pins exact revisions; update it with uv lock when changing pyproject.toml.
To add or update a dependency, follow this three-step workflow:
1. Acquire a container image (see Step 1 above).
2. Launch the container interactively (see Step 2 above).
3. Update the lock file inside the container, then commit it:
# Inside the container:
uv add <package> # adds to pyproject.toml and resolves
uv lock # regenerates uv.lock
# Exit the container, then on the host:
git add pyproject.toml uv.lock
git commit -S -s -m "build: add <package> dependency"
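A common slip in this workflow is committing pyproject.toml without the regenerated uv.lock, which later makes uv sync --locked fail. One possible pre-commit check is sketched below. It reads a list of file names on stdin so it can be fed from git; the function name lock_staged_with_pyproject is hypothetical:

```shell
# Hypothetical check (not in the repo): read file names on stdin, one per
# line. Fails only when pyproject.toml is listed but uv.lock is not.
lock_staged_with_pyproject() {
  files=$(cat)
  if printf '%s\n' "$files" | grep -qxF 'pyproject.toml'; then
    printf '%s\n' "$files" | grep -qxF 'uv.lock'
  fi
}

# Usage, on the host before committing:
#   git diff --cached --name-only | lock_staged_with_pyproject \
#     || echo "pyproject.toml is staged without uv.lock" >&2
```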
uv.lock is machine-generated; never resolve conflicts manually. Instead:
git checkout origin/main -- uv.lock # take main's version as the base
# then inside the container:
uv lock # re-resolve on top of your pyproject.toml changes
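Before committing the re-resolved lock file, it can be worth confirming that no merge conflict markers were left behind, for example from an earlier manual attempt. A minimal sketch, assuming the standard git marker lines; the function name has_conflict_markers is hypothetical:

```shell
# Hypothetical check (not in the repo): succeed if the file contains a
# line that starts with a git conflict marker.
has_conflict_markers() {
  grep -qE '^(<<<<<<<|=======|>>>>>>>)( |$)' "$1"
}

# Usage:
#   if has_conflict_markers uv.lock; then
#     echo "uv.lock still has conflict markers; restore it and re-run uv lock" >&2
#   fi
```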
| Problem | Cause | Fix |
|---|---|---|
| uv sync --locked fails | Dependency conflict or stale uv.lock | Re-run uv lock inside the container and commit the updated lock |
| ModuleNotFoundError after pip install | pip installed outside the uv-managed venv | Use uv add and uv sync, never bare pip install |
| uv: command not found inside container | Wrong container image | Use the megatron-lm image built from Dockerfile.ci.dev |
| No space left on device during uv ops | Cache fills container's /root/.cache/ | Mount a host cache dir via -v $HOME/.cache/uv:/root/.cache/uv |
| docker build fails with secret-related error | Dockerfile.ci.dev has a jet stage that requires an internal secret | Add --target main to stop before the jet stage |
| access forbidden when pulling | Registry URL includes an explicit port (e.g. :5005) | Use ${GITLAB_HOST}/adlr/... with no port; the sed extracts the hostname only |