Name: Build And Dependency
Author: NVIDIA

Container-based dev environment setup and dependency management for Megatron-LM. Covers acquiring and launching the CI container, uv package management, and updating uv.lock.

Updated: 2 May 2026, 07:39

Repository: NVIDIA/Megatron-LM


name	build-and-dependency
description	Container-based dev environment setup and dependency management for Megatron-LM. Covers acquiring and launching the CI container, uv package management, and updating uv.lock.
when_to_use	Adding, removing, or updating a dependency; editing pyproject.toml or uv.lock; uv.lock merge conflict; setting up a dev environment; pulling or building the CI container; container build errors; uv errors; 'how do I install', 'uv sync fails', 'ModuleNotFoundError'.

Build & Dependency Guide

The core principle: build and develop inside containers — the CI container ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions (TransformerEngine, DeepEP, …) that cannot be reproduced on a bare host.

Why Containers

Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine, and optional components like ModelOpt and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships Dockerfiles that pin every dependency.

Use the container as your development environment. This guarantees:

- Identical CUDA / NCCL / cuDNN versions across all developers and CI.
- uv.lock resolves the same way locally and in CI.
- GPU-dependent operations (training, testing) work out of the box.

dev vs lts

Two image variants exist, controlled by the IMAGE_TYPE build arg and the container::lts PR label:

Variant	Base image pin	uv group	When used
`dev`	`docker/.ngc_version.dev`	`dev`	Default — CI, local development, most PRs
`lts`	`docker/.ngc_version.lts`	`lts`	Stability testing; excludes ModelOpt and other bleeding-edge extras

Use dev for everything unless you have a specific reason to test lts. CI runs dev by default; attach container::lts to a PR only when verifying compatibility with the stable stack (e.g. a dependency upgrade that must not break LTS users). The @pytest.mark.flaky_in_dev marker skips tests in the dev environment; @pytest.mark.flaky skips them in lts.
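The selection rule above can be sketched as two small shell functions. This is an illustration only: `variant_from_labels` and `pin_file_for` are hypothetical helper names, not part of the repository, and the real CI may implement the label check differently.

```shell
# Hypothetical helper: choose the image variant from a list of PR labels.
# Prints "lts" when the container::lts label is present, otherwise "dev".
variant_from_labels() {
  for label in "$@"; do
    if [ "$label" = "container::lts" ]; then
      echo lts
      return 0
    fi
  done
  echo dev
}

# Hypothetical helper: map a variant to its base-image pin file,
# following the table above.
pin_file_for() {
  case $1 in
    dev|lts) echo "docker/.ngc_version.$1" ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
}
```

For example, `pin_file_for "$(variant_from_labels "$@")"` would resolve a label list straight to the pin file to read.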

Step 1 — Acquire an Image

Option A — NVIDIA-internal: pull a CI-built image

⚠️ Requires access to the internal GitLab instance. See @tools/trigger_internal_ci.md for setup (adding the git remote, obtaining a token).

The internal GitLab CI publishes images to its container registry. Derive the registry host from your configured gitlab remote — the same host you use for trigger_internal_ci.py:

# Derive host from your 'gitlab' remote:
GITLAB_HOST=$(git remote get-url gitlab | sed 's/.*@\(.*\):.*/\1/')

docker pull ${GITLAB_HOST}/adlr/megatron-lm/mcore_ci_dev:main
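The sed expression above expects an SSH-style remote (`user@host:path`). For an HTTPS remote it finds no match and passes the whole URL through, which then produces a broken image reference. A minimal sketch of a guard, assuming a hypothetical helper name `gitlab_host_from_url`:

```shell
# Hypothetical helper: extract the host from an SSH-style remote URL using
# the same sed expression as above, and fail if the result does not look
# like a bare hostname (for example when the remote is an HTTPS URL).
gitlab_host_from_url() {
  url=$1
  host=$(printf '%s\n' "$url" | sed 's/.*@\(.*\):.*/\1/')
  case $host in
    ''|*/*|*:*)
      echo "could not derive a host from: $url" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$host"
}

# Usage (not executed here):
#   GITLAB_HOST=$(gitlab_host_from_url "$(git remote get-url gitlab)") || exit 1
```

Failing early here is cheaper than debugging a pull error caused by a malformed registry path.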

Option B — Build from scratch (works for everyone)

⚠️ Dockerfile.ci.dev has two stages: main and jet. The jet stage requires an internal build secret and will fail without it. Always pass --target main to stop at the public stage.

# dev image (default)
docker build \
  --target main \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev) \
  --build-arg IMAGE_TYPE=dev \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local .

# lts image
docker build \
  --target main \
  --build-arg FROM_IMAGE_NAME=$(cat docker/.ngc_version.lts) \
  --build-arg IMAGE_TYPE=lts \
  -f docker/Dockerfile.ci.dev \
  -t megatron-lm:local-lts .

In CI, the image variant is selected by the PR label container::lts; without that label, CI uses dev.
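The two build commands differ only in the pin file, the IMAGE_TYPE value, and the tag. A dry-run sketch that prints the command for a given variant is shown below. The function name `print_build_cmd` and the `PIN_DIR` override are assumptions made for illustration; the flags and paths are the ones used in the commands above.

```shell
# Hypothetical dry-run helper: print the docker build command for a variant.
# PIN_DIR is an assumed override for the directory holding .ngc_version.*;
# it defaults to docker/, the location used in the commands above.
print_build_cmd() {
  variant=${1:-dev}
  case $variant in
    dev) tag=megatron-lm:local ;;
    lts) tag=megatron-lm:local-lts ;;
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  pin_file="${PIN_DIR:-docker}/.ngc_version.$variant"
  if [ ! -r "$pin_file" ]; then
    echo "missing pin file: $pin_file" >&2
    return 1
  fi
  base_image=$(cat "$pin_file")
  # --target main stops before the jet stage, which needs an internal secret.
  printf 'docker build --target main --build-arg FROM_IMAGE_NAME=%s --build-arg IMAGE_TYPE=%s -f docker/Dockerfile.ci.dev -t %s .\n' \
    "$base_image" "$variant" "$tag"
}
```

Printing the command instead of running it makes it easy to review the resolved base image before starting a long build.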

Step 2 — Launch the Container

Option A — Local Docker runtime

docker run --rm --gpus all \
  -v "$(pwd)":/workspace \
  -w /workspace \
  megatron-lm:local \
  bash -c "<your command>"

Option B — Slurm cluster (for those without a local Docker runtime)

NVIDIA clusters typically use Pyxis + enroot. Request an interactive session:

srun \
  --nodes=1 --gpus-per-node=8 \
  --container-image megatron-lm:local \
  --container-mounts "$(pwd)":/workspace \
  --container-workdir /workspace \
  --pty bash

For clusters that require a .sqsh archive first:

enroot import -o megatron-lm.sqsh dockerd://megatron-lm:local
srun \
  --nodes=1 --gpus-per-node=8 \
  --container-image "$(pwd)"/megatron-lm.sqsh \
  --container-mounts "$(pwd)":/workspace \
  --container-workdir /workspace \
  --pty bash

Dependency Management

Dependencies are declared in pyproject.toml. The venv lives at /opt/venv inside the container (already on PATH).

All uv operations must be run inside the container. Never run uv sync / uv pip install on the host.

uv Dependency Groups

Group	Purpose
`training`	Runtime training extras
`dev`	Full dev environment (TransformerEngine, ModelOpt, …)
`lts`	LTS-safe subset (no ModelOpt)
`test`	pytest, coverage, nemo-run
`linting`	ruff, black, isort, pylint
`build`	Cython, pybind11, nvidia-mathdx

Install commands (inside the container):

# Full dev + test environment
uv sync --locked --group dev --group test

# Linting only
uv sync --locked --only-group linting

# LTS environment
uv sync --locked --group lts --group test
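If the same three commands are typed often, they can be collected behind one lookup. The sketch below only prints the command; the profile names (`dev`, `lint`, `lts`) and the function name `uv_sync_cmd` are made up for this example.

```shell
# Hypothetical helper: print the uv sync command for a named profile.
# The commands are the ones listed above; the profile names are assumptions.
uv_sync_cmd() {
  case $1 in
    dev)  echo "uv sync --locked --group dev --group test" ;;
    lint) echo "uv sync --locked --only-group linting" ;;
    lts)  echo "uv sync --locked --group lts --group test" ;;
    *)    echo "unknown profile: $1" >&2; return 1 ;;
  esac
}

# Usage inside the container (not executed here):
#   eval "$(uv_sync_cmd dev)"
```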

Several dependencies are sourced directly from git (TransformerEngine, nemo-run, FlashMLA, Emerging-Optimizers, nvidia-resiliency-ext). uv.lock pins their exact revisions; regenerate it with uv lock whenever you change pyproject.toml.

Adding a New Dependency

Follow this three-step workflow:

1. Acquire a container image (see Step 1 above).
2. Launch the container interactively (see Step 2 above).
3. Update the lock file inside the container, then commit it:

# Inside the container:
uv add <package>          # adds to pyproject.toml and resolves
uv lock                   # regenerates uv.lock
# Exit the container, then on the host:
git add pyproject.toml uv.lock
git commit -S -s -m "build: add <package> dependency"
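Before committing, it can help to confirm that the dependency change touched only pyproject.toml and uv.lock. The following is a minimal sketch, assuming a hypothetical helper `unexpected_changes` that filters `git status --porcelain` output; it assumes plain paths (no renames, no quoted names).

```shell
# Hypothetical helper: read `git status --porcelain` output on stdin and
# print every changed path other than pyproject.toml and uv.lock.
unexpected_changes() {
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    changed=${line#???}   # drop the two status columns and the separator
    case $changed in
      pyproject.toml|uv.lock) ;;
      *) printf '%s\n' "$changed" ;;
    esac
  done
}

# Usage on the host (not executed here):
#   git status --porcelain | unexpected_changes
```

Empty output means the working tree contains only the two expected files.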

Resolving a merge conflict in uv.lock

uv.lock is machine-generated; never resolve conflicts manually. Instead:

git checkout origin/main -- uv.lock   # take main's version as the base
# then inside the container:
uv lock                               # re-resolve on top of your pyproject.toml changes
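A quick way to confirm that no conflict markers survived is to scan the lock file before committing. This is a sketch with a hypothetical helper name, `has_conflict_markers`; it only looks for the standard git marker lines.

```shell
# Hypothetical helper: succeed if the given file still contains git
# conflict marker lines (<<<<<<< , =======, >>>>>>> ).
has_conflict_markers() {
  grep -Eq '^(<<<<<<< |>>>>>>> |=======$)' "$1"
}

# Usage (not executed here):
#   if has_conflict_markers uv.lock; then
#     echo "uv.lock still has conflict markers; re-run the steps above" >&2
#   fi
```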

Common Pitfalls

Problem	Cause	Fix
`uv sync --locked` fails	Dependency conflict or stale `uv.lock`	Re-run `uv lock` inside the container and commit updated lock
`ModuleNotFoundError` after pip install	pip installed outside the uv-managed venv	Use `uv add` and `uv sync`, never bare `pip install`
`uv: command not found` inside container	Wrong container image	Use the `megatron-lm` image built from `Dockerfile.ci.dev`
`No space left on device` during uv ops	Cache fills container's `/root/.cache/`	Mount a host cache dir via `-v $HOME/.cache/uv:/root/.cache/uv`
`docker build` fails with secret-related error	`Dockerfile.ci.dev` has a `jet` stage that requires an internal secret	Add `--target main` to stop before the `jet` stage
`access forbidden` when pulling	Registry URL includes an explicit port (e.g. `:5005`)	Use `${GITLAB_HOST}/adlr/...` with no port — the sed extracts the hostname only
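Two of the pitfalls above (a missing uv binary, packages installed outside the managed venv) can be caught with a quick check at the start of a session. The sketch below assumes a hypothetical function `preflight` and a `VENV_DIR` override; `/opt/venv` is the venv location stated in this guide.

```shell
# Hypothetical preflight check: verify that the named tools are on PATH and
# that the venv directory exists. VENV_DIR is an assumed override; the
# default /opt/venv is the location given in this guide.
preflight() {
  venv_dir=${VENV_DIR:-/opt/venv}
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      status=1
    fi
  done
  if [ ! -d "$venv_dir" ]; then
    echo "missing venv directory: $venv_dir" >&2
    status=1
  fi
  return "$status"
}

# Usage inside the container (not executed here):
#   preflight uv python || echo "this does not look like the CI container" >&2
```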
