nightly-sync
Domain knowledge for the nightly main-to-dev sync workflow. Covers merge strategy, CI architecture, failure investigation, and known issues.
| name | nightly-sync |
| description | Domain knowledge for the nightly main-to-dev sync workflow. Covers merge strategy, CI architecture, failure investigation, and known issues. |
| when_to_use | Working on the nightly sync PR; investigating a nightly sync failure; resolving merge conflicts between main and dev; 'nightly sync failed', 'main-to-dev merge', 'sync bot'. |
This skill is read by the automated sync bot during the nightly-sync-main-to-dev workflow. It contains all domain knowledge for merging main into dev, resolving conflicts, iterating on CI, and shipping the PR.
- Create $BRANCH from origin/dev.
- Run git merge origin/main -X theirs --no-edit.

Do NOT blanket-override all shared files with main's version. Dev has features not yet in main (new classes, new modules, new tests). The merge preserves both sides' non-conflicting additions; only intervene where there is an actual conflict.
Dev often develops features as a chain of PRs (PR1 → PR2 → PR3) where each
builds on the last. When PR1 is squash-merged to main, git sees main's squashed
version and dev's original commits as unrelated changes. -X theirs will pick
main's PR1 code and silently discard PR2/PR3's improvements on dev.
After the merge, check for this pattern:
- For each file where -X theirs resolved a conflict, run
  git log --oneline origin/dev -- <file> to see if dev has commits that
  came AFTER the code main is bringing in.

Practical check: run git diff origin/dev -- <file> on conflicted files. If
dev's code was removed or reverted, investigate whether dev's version is the
more evolved one.
Real examples from PR #4291:
- emerging_optimizers.py: Main's version was MORE complete: it squash-merged
  dev's PRs plus added more. -X theirs was correct.
- distrib_optimizer.py: Main overwrote dev's GroupedQuantizedTensor support.
  Had to restore _is_distopt_quantized_param and the expanded
  _expand_quantized_param_shard_for_cast loop while keeping main's NVFP4
  additions. This required a surgical merge combining sections from both.

Key insight: squash-merge chains can go in EITHER direction. Sometimes main is ahead (it squash-merged dev's work + more), sometimes dev is ahead (it has follow-up PRs). Always diff both ways before deciding which version to favor.
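The practical check above can be narrowed down mechanically before the manual two-way diff. The following is a minimal sketch, not part of the workflow itself: the helper name files_losing_dev_lines is hypothetical, and it only flags files in which lines present on dev were deleted by the merge. It does not decide which side is more evolved.

```shell
# Hypothetical helper (not part of the workflow).
# Reads `git diff --numstat` output (added<TAB>deleted<TAB>path) on stdin and
# prints the paths where lines that exist on dev were deleted by the merge.
# Binary files are reported by git as "-<TAB>-<TAB>path" and are skipped.
files_losing_dev_lines() {
    awk -F'\t' '$2 != "-" && $2 + 0 > 0 { print $3 }'
}

# Usage, run inside the merged tree:
#   git diff --numstat origin/dev HEAD -- '*.py' | files_losing_dev_lines
```

Each file it prints is a candidate for the git log and git diff inspection described above; a file with no deleted dev lines cannot have lost dev's follow-up work.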
These files have known semantic conflicts where dev's versions reference args
or APIs that main removed or renamed. Take main's version with
git checkout origin/main -- <file>:
- megatron/training/training.py: references dev-only args
- megatron/training/initialize.py: references dev-only args
- megatron/training/utils.py: references dev-only args
- megatron/training/datasets/data_samplers.py: references dev-only args
- megatron/core/optimizer/layer_wise_optimizer.py: constructor signature

Caveat for ALL overrides: After taking main's version of any file, you MUST run the API Mismatch Detection procedure (see below) on that file. Taking main's caller code while keeping dev's callee implementations is the #1 source of sync bugs.
IMPORTANT: Do NOT take main's pyproject.toml, uv.lock, or
docker/Dockerfile.ci.dev. These three files are a tightly coupled
triple — the Dockerfile's uv sync command must match the dependency
groups in pyproject.toml, and uv.lock must be consistent with both.
Main's versions are missing dev-only dependencies (e.g.
fast-hadamard-transform, correct TransformerEngine revision) and the
--group no_pypi_wheels flag needed to install them. Keep dev's versions
of all three files.
IMPORTANT: .github/CODEOWNERS must NEVER be modified by the sync
bot under any circumstances. Dev's CODEOWNERS is intentionally
different from main's — do not take main's version, do not merge them,
do not touch the file. If the merge produces a conflict or a non-zero
diff against origin/dev on this path, restore dev's version verbatim:
git checkout origin/dev -- .github/CODEOWNERS
Then verify with git diff origin/dev -- .github/CODEOWNERS — output
must be empty. Modifying CODEOWNERS triggers spurious reviewer
requests and conflicts with the dev team's governance; rolling back a
CODEOWNERS change after the PR lands is painful.
NEVER manually edit uv.lock. It is a machine-generated lockfile. If
it needs to change, it must be regenerated with uv lock inside a CUDA
container (see .claude/skills/build-and-test/SKILL.md).
After keeping dev's pyproject.toml, check whether main has added NEW git
sources to [tool.uv.sources] that don't exist in dev's version. Main's
merged code may import from packages only available at specific git revisions.
- Compare the [tool.uv.sources] sections:
  git show origin/main:pyproject.toml vs git show origin/dev:pyproject.toml
- Add any git source that exists only in main to dev's pyproject.toml

Real examples from PR #4291:
- nvidia-resiliency-ext: Main's torch.py imports get_write_results_queue
  which only existed in main's pinned git revision, not on PyPI. Had to add
  main's git source to dev's pyproject.toml.
- nemo-run: Dev's pinned revision had a TOML parse error with uv 0.7.2.
  Had to swap to main's revision.

After any changes to pyproject.toml, regenerate uv.lock inside a CUDA
container:
docker run --rm -v $(pwd):/workspace nvcr.io/nvidia/pytorch:26.02-py3 \
bash -c "pip install uv==0.7.2 && cd /workspace && \
uv venv .venv --system-site-packages && uv sync --only-group build && uv lock"
# Clean up root-owned .venv:
docker run --rm -v $(pwd):/workspace nvcr.io/nvidia/pytorch:26.02-py3 \
bash -c "rm -rf /workspace/.venv"
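The [tool.uv.sources] comparison described above can be sketched as a small text filter. This is an illustration under stated assumptions, not the workflow's own tooling: the function names uv_source_names and new_main_sources are hypothetical, and the awk parser only handles the simple one-key-per-line layout (unquoted keys, no multi-line values).

```shell
# Hypothetical helpers (not part of the workflow).
# Print the package names declared under [tool.uv.sources] in the TOML text
# read from stdin, one per line, sorted.
uv_source_names() {
    awk '
        /^\[/ { in_sources = ($0 ~ /^\[tool\.uv\.sources\][ \t]*$/); next }
        in_sources && /^[ \t]*[A-Za-z0-9_.-]+[ \t]*=/ {
            sub(/[ \t]*=.*/, "")   # drop everything from the "=" onwards
            sub(/^[ \t]+/, "")     # drop leading indentation
            print
        }
    ' | sort
}

# Print the sources that appear in the first file (main) but not the second (dev).
new_main_sources() (
    tmp=$(mktemp -d) || exit 1
    uv_source_names < "$1" > "$tmp/main"
    uv_source_names < "$2" > "$tmp/dev"
    comm -23 "$tmp/main" "$tmp/dev"
    rm -rf "$tmp"
)

# Usage:
#   git show origin/main:pyproject.toml > /tmp/main.toml
#   git show origin/dev:pyproject.toml  > /tmp/dev.toml
#   new_main_sources /tmp/main.toml /tmp/dev.toml
```

The sketch compares names only. A source present on both sides with a different pinned revision (the nemo-run case) still needs a manual look.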
The merge can create "Frankenstein" code where main's callers use dev's implementations (or vice versa) with different method signatures. This compiles fine but fails at runtime.
After the merge, audit cross-boundary call sites:
- For each file taken from main (via -X theirs or explicit
  git checkout origin/main), check what it imports and calls against the
  actual implementations in the merged tree.

Real examples from PR #4291:
- multi_latent_attention.py (main) called off_interface.group_commit()
  but dev's interface only had group_offload() (method renamed)
- mamba_model.py (main) called init_chunk_handler(3 params) but dev's
  interface required 6 params (signature expanded on dev)
- mamba_model.py called mark_not_offloadable() but dev had
  mark_not_offload() (method renamed)
- bulk_offload() did .remove() after bulk_offload_group() already
  .pop()d the same item (double-removal from a list)

Practical detection:
# For each file taken from main, find what it imports and calls
grep -rn "from <module> import\|<module>\." megatron/
# Cross-reference with the actual implementations in the merged tree
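The cross-referencing step can be approximated textually for the renamed-method case. The sketch below is an assumption-laden illustration: undefined_methods is a hypothetical helper, it matches names with grep rather than parsing Python, and it cannot see inherited methods, dynamically added attributes, or signature changes such as the init_chunk_handler parameter count.

```shell
# Hypothetical helper (not part of the workflow).
# List methods that CALLER invokes on OBJECT but that CALLEE never defines.
#   $1 = caller file, $2 = callee file, $3 = object name used in the caller
undefined_methods() (
    caller=$1
    callee=$2
    object=$3
    grep -o "${object}\.[A-Za-z_][A-Za-z0-9_]*(" "$caller" \
        | sed -e "s/^${object}\.//" -e 's/($//' \
        | sort -u \
        | while IFS= read -r method; do
              # Report the method unless the callee has a matching "def".
              grep -Eq "^[[:space:]]*def[[:space:]]+${method}[[:space:]]*\(" "$callee" \
                  || printf '%s\n' "$method"
          done
)

# Usage (paths are placeholders):
#   undefined_methods <caller.py> <callee.py> off_interface
```

An empty result only means that every called name has a definition somewhere in the callee file; the argument lists still need to be compared by hand.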
These lessons were learned from PR #4291. They may recur if the same files continue to diverge:
- gated_delta_net.py: If the merge creates code calling non-existent helper
  methods (e.g. _resolve_cu_seqlens), take dev's version wholesale.
- model_chunk_schedule_plan.py: Watch for missing imports (e.g.
  CudaGraphScope) silently dropped during conflict resolution.
- fine_grained_activation_offload.py: Critical interface file used by many
  callers. If main and dev have divergent method names/signatures, prefer
  dev's implementation and patch main-originated callers to match.
- distrib_optimizer.py: Dev may have broader type abstractions (e.g.
  _is_distopt_quantized_param covering both FP8 and GroupedQuantizedTensor).
  Main may simplify to explicit type checks. Restore dev's abstractions.

Main and dev have completely different classes in this file:
- Main: HybridCPDataLoaderWrapper (imported by main's training.py)
- Dev: BasePackingScheduler, DpBalancedScheduler,
  DefaultDynamicCPScheduler, wrap_data_iterator,
  get_batch_on_this_rank_for_sequence_packing (imported by pretrain_gpt.py
  and tests)

Do NOT take either version wholesale. Keep dev's file and append main's
HybridCPDataLoaderWrapper class (plus any missing imports like
BalancedCPScheduler, Any, List) at the end.
Compare git ls-tree between origin/main and HEAD to find files in main
that are missing from the merged tree. For each:
- Restore the file if the merged tree still depends on it (e.g.
  hybrid_cp_schedule.py if data_schedule.py imports from it)
- Check git log origin/dev -- <file> for the deletion commit to understand intent

Run on ALL changed Python files (relative to origin/dev), in this order:
1. black (version 24, --config pyproject.toml)
2. isort
3. pylint on changed megatron/core/ files; fix missing-docstring and
   line-too-long violations before pushing

Before every git push in this workflow (the initial push in Phase 1
AND every fix-push in Phase 3), run these bash checks. If any fails,
fix the condition and re-check before pushing:
# 1. CODEOWNERS must be identical to dev's.
if ! git diff --quiet origin/dev -- .github/CODEOWNERS; then
  echo "ABORT: .github/CODEOWNERS differs from origin/dev. Restore with:"
  echo " git checkout origin/dev -- .github/CODEOWNERS"
  exit 1
fi
# 2. Dependency-management triple must be identical to dev's.
for f in pyproject.toml uv.lock docker/Dockerfile.ci.dev; do
  if ! git diff --quiet origin/dev -- "$f"; then
    # pyproject.toml is allowed to differ ONLY for git source reconciliation
    # (new [tool.uv.sources] entries from main). If you intentionally edited
    # it for that reason, bypass this check by re-running with $f skipped.
    echo "WARNING: $f differs from origin/dev"
  fi
done
The CODEOWNERS check is a HARD abort — never push if it fails.
Phase 1 produces a single commit on the sync branch. The merge itself
creates the merge commit; fold any post-merge work (formatting,
conflict surgery, restored files, regenerated uv.lock) into it
rather than stacking a second commit:
git add -A
git commit --amend --no-edit # rewrites the merge commit's tree;
# parents are preserved.
git push -u origin "$BRANCH" # only non-force push of the run.
Once pushed, this commit is immutable for the rest of the run. Phase 3 fixes go into a separate rolling fix commit on top (see Phase 3 step 4 and the two-commit policy in Rules).
- PR title: chore: nightly sync main into dev ($DATE)
- Create the PR as a draft: gh pr create --draft

The PR body should include:
- Summary of what was synced (number of commits from main)
- Python-only line-change stats, so reviewers can gauge the real code surface (excluding golden-value JSON, uv.lock, etc.). Compute with:
git diff --numstat origin/dev...HEAD -- '*.py' \
| awk 'BEGIN{a=0;d=0} {a+=$1; d+=$2} END{
printf "Python lines: +%d / -%d across %d files\n", a, d, NR
}'
Include the exact line (e.g. Python lines: +1234 / -567 across 42 files)
in the PR body so reviewers see it at a glance.
- List of files where main's version was taken over the merge
- List of files that were deleted in dev but restored (and why)
- The remerge-diff output (git show --remerge-diff HEAD on the merge
  commit) so reviewers can inspect ONLY the conflict resolutions. If the
  output is very long, summarize conflicts by file and put the full diff
  in a collapsed <details> block. If git is too old for --remerge-diff,
  note the git version and describe the merge strategy used instead.
Add the Run functional tests and Run MBridge tests labels to the
PR immediately after creation. The Run functional tests label ensures
/ok to test triggers the full CI suite (unit tests + functional/
integration tests with 100-step training and golden value comparison).
The Run MBridge tests label triggers the MBridge test suite. Without
these labels, only a lightweight subset runs.
gh pr edit <PR_NUMBER> --repo $REPO \
--add-label "Run functional tests" \
--add-label "Run MBridge tests"
- Nemo_CICD_Test is a downstream gate job aggregating unit test,
  integration test, and other results. If it fails, investigate the upstream
  jobs it depends on; do NOT debug the gate itself.
- If any upstream job fails, the Nemo_CICD_Test gate will fail as a result.
- tests/unit_tests/conftest.py imports from megatron.training.training,
  so a broken import in training.py (or anything it transitively imports)
  cascades to fail ALL test suites. If every test job fails with ImportError,
  check the training.py import chain first.

You run inside ONE GitHub Actions step. The moment you stop emitting tool calls, the step ends and the runner container is destroyed. Any background process you started dies with it. There is NO persistent session and NO future wakeup. See the workflow prompt's "NO background tasks" block for the full ban list.
Practical rule: every wait for CI to resolve is a SINGLE foreground Bash tool call that blocks inline until the wait is resolved.
Two nested loops. Do NOT conflate them:
- Outer loop: each iteration is a handful of tool calls (post
  /ok to test, one blocking poll, maybe one fix-and-push). It is NOT a
  Bash loop. It advances because you make new tool calls.
- Inner loop: a single Bash call of the form
  while true; do ... sleep 120; done. It runs during one iteration of
  the outer loop and ends when CI reaches a terminal state for that
  iteration.

The outer loop terminates ONLY when Phase 4's gate is satisfied.
Source of truth: gh pr view <PR_NUMBER> --repo $REPO --json statusCheckRollup.
This lists every required check, including external status contexts
(GitLab CI, copy-pr-bot, etc.) that gh api .../actions/runs/.../jobs
does NOT show.
Outer-loop iteration (each iteration is a few tool calls):
1. latest_sha=$(git rev-parse HEAD) (one Bash call).
2. Post /ok to test $latest_sha on the PR:
   gh pr comment <PR_NUMBER> --repo $REPO --body "/ok to test $latest_sha"
3. ONE blocking Bash tool call. This is the inner loop. Copy this
   template verbatim, only changing REPO and PR:
REPO='NVIDIA/Megatron-LM'
PR='<PR_NUMBER>'
# Names matched case-insensitively, anchored to the START of the name.
EXEMPT='copy-pr-bot|is-not-external-contributor|greptile|coderabbit|codeowners|.*review|.*approval|codecov|coverage|build-docs|doc-build|readthedocs|sphinx'
# Sentinel check that tells us CI has fully run. Update this if the
# aggregate gate job is renamed.
SENTINEL='Nemo_CICD_Test'
while true; do
  # Normalize both CheckRun (.status / .conclusion) and StatusContext
  # (.state) entries into the same {name, status, conclusion} shape.
  rollup=$(gh pr view "$PR" --repo "$REPO" --json statusCheckRollup --jq '
    .statusCheckRollup[] | [
      (.name // .context // "?"),
      (if .__typename == "StatusContext" then
         (if (.state == "PENDING" or .state == "EXPECTED") then "IN_PROGRESS"
          else "COMPLETED" end)
       else (.status // "UNKNOWN") end),
      (if .__typename == "StatusContext" then
         (if .state == "SUCCESS" then "SUCCESS"
          elif (.state == "FAILURE" or .state == "ERROR") then "FAILURE"
          else "NEUTRAL" end)
       else (.conclusion // "UNKNOWN") end)
    ] | @tsv')
  # Sentinel: do NOT declare green until the CI aggregate gate has
  # reached a terminal state. Before /ok to test triggers the run,
  # the sentinel is absent; while CI is running, it's IN_PROGRESS.
  sentinel_line=$(printf '%s\n' "$rollup" | awk -F'\t' -v s="$SENTINEL" '$1 == s')
  sentinel_status=$(printf '%s\n' "$sentinel_line" | awk -F'\t' 'NR==1 {print $2}')
  if [ "$sentinel_status" != "COMPLETED" ]; then
    echo "=== $(date -u) waiting for $SENTINEL (status: ${sentinel_status:-absent}) ==="
    sleep 120
    continue
  fi
  # Classify non-exempt checks (exempt list applied to the NAME only).
  non_exempt=$(printf '%s\n' "$rollup" | awk -F'\t' -v p="^($EXEMPT)" 'tolower($1) !~ tolower(p)')
  failed=$(printf '%s\n' "$non_exempt" | awk -F'\t' '$2 == "COMPLETED" && $3 !~ /^(SUCCESS|SKIPPED|NEUTRAL)$/')
  pending=$(printf '%s\n' "$non_exempt" | awk -F'\t' '$2 != "COMPLETED"')
  if [ -n "$failed" ]; then
    echo "=== NON-EXEMPT FAILURES ==="
    printf '%s\n' "$failed"
    echo "RESULT=FAILURE"
    exit 0
  fi
  if [ -n "$pending" ]; then
    # Sentinel is COMPLETED but a non-exempt check is still pending;
    # rare but possible. Keep waiting; do NOT ship.
    echo "=== $(date -u) sentinel done but non-exempt checks still pending ==="
    printf '%s\n' "$pending"
    sleep 120
    continue
  fi
  echo "=== ALL NON-EXEMPT CHECKS COMPLETED GREEN ==="
  printf '%s\n' "$non_exempt"
  echo "RESULT=GREEN"
  exit 0
done
This Bash call blocks for as long as CI takes (minutes to hours). Do
NOT split it into many short polls interleaved with other tool calls
— that wastes --max-turns and creates windows where you could lose
track of the loop state.
4. Read the tool output:
- RESULT=FAILURE: diagnose via
  gh api repos/$REPO/actions/jobs/<JOB_ID>/logs (or the
  external-context equivalent) and fix the code. The Phase 1
  commit is immutable; fixes accumulate in a single rolling fix
  commit on top of it:
git add -A
if git rev-parse --verify HEAD^2 >/dev/null 2>&1; then
  # HEAD has two parents → still the Phase 1 merge commit.
  # First failure of this run: create the fix commit.
  git commit -m "fix: post-CI corrections"
  git push origin "$BRANCH"
else
  # HEAD is the existing fix commit → amend it.
  git commit --amend --no-edit
  git push --force-with-lease origin "$BRANCH"
fi
--force-with-lease (not --force): if a human pushed onto the
branch since the bot last fetched, the lease aborts the push
instead of clobbering them — fetch and decide what to do.
Start a new outer-loop iteration at step 1 with the new HEAD SHA.
- RESULT=GREEN: outer loop is done. Proceed to Phase 4.

Why not wait-for-run-to-register first? gh pr comment with
/ok to test <sha> is handled by copy-pr-bot, which takes a few
seconds to trigger the CI run. The statusCheckRollup poll in step 3
will initially show checks in PENDING / QUEUED; that's fine: the
inner loop treats those as "keep waiting" and will see them advance as
CI progresses. No separate registration poll needed.
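The classification the inner loop applies to the normalized rollup can be exercised offline against saved or hand-written TSV, which helps when changing the exempt list or the sentinel name. This is a sketch, not a replacement for the template: classify_rollup is a hypothetical name, and the RESULT=WAITING_FOR_SENTINEL and RESULT=PENDING labels are introduced here for illustration (the template sleeps and retries in those states instead of printing a result).

```shell
# Hypothetical helper (not part of the workflow).
# Reads name<TAB>status<TAB>conclusion lines on stdin and prints one RESULT= line.
#   $1 = exempt-name regex (same syntax as EXEMPT in the template)
#   $2 = sentinel check name
classify_rollup() {
    awk -F'\t' -v exempt="^($1)" -v sentinel="$2" '
        NF < 3 { next }                                  # ignore malformed lines
        $1 == sentinel && $2 == "COMPLETED" { sentinel_done = 1 }
        tolower($1) ~ tolower(exempt) { next }           # exempt checks never gate
        $2 != "COMPLETED" { pending++; next }
        $3 !~ /^(SUCCESS|SKIPPED|NEUTRAL)$/ { failed++ }
        END {
            if (!sentinel_done) print "RESULT=WAITING_FOR_SENTINEL"
            else if (failed)    print "RESULT=FAILURE"
            else if (pending)   print "RESULT=PENDING"
            else                print "RESULT=GREEN"
        }'
}

# Usage:
#   printf 'Nemo_CICD_Test\tCOMPLETED\tSUCCESS\n' | classify_rollup "$EXEMPT" "$SENTINEL"
```

The precedence mirrors the template: the sentinel is checked first, failures outrank pending checks, and green requires every non-exempt check to be completed with an accepted conclusion.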
- Do not treat the run as passed while any check is PENDING /
  QUEUED / IN_PROGRESS on the HEAD SHA. A push is not a pass;
  only a COMPLETED + green status is.
- Do not use gh api .../actions/runs/.../jobs alone as the gate
  signal. External status contexts (GitLab CI pipelines, copy-pr-bot
  status, etc.) do NOT appear there. Use statusCheckRollup.
- No background tasks: no &, no nohup, no
  run_in_background: true, no ScheduleWakeup. The GitHub Actions
  step owns your shell; when the step ends, every background process
  is killed and cannot resume.
- Do not push to pull-request/<PR_NUMBER> branches.
  The community bot manages those branches when it processes
  /ok to test. Pushing to them directly breaks the CI trigger
  mechanism. Always push to your own sync branch (e.g.
  main2dev/<DATE>) instead.
- Keep the Run functional tests and Run MBridge tests
  labels on the PR. Without Run functional tests, the internal GitLab
  functional tests do not run; without Run MBridge tests, the
  MBridge test suite does not run.

Failure investigation:
- Fetch the job log: gh api repos/$REPO/actions/jobs/<JOB_ID>/logs
- Search it for: ImportError, ModuleNotFoundError, FAILED,
  would reformat, line-too-long, Traceback
- Formatting failures: run black --config pyproject.toml
  on offending files. For pylint long-line or missing-docstring, edit directly.
- Import-order side effects: isort can reorder imports in a way that introduces
  circular dependencies (e.g. megatron/legacy/model/__init__.py). Check
  git diff on __init__.py files to see if import order changed.
- Dependency changes: edits to pyproject.toml/uv.lock
  can change library versions in the CI container. Dev-only code may depend on
  newer versions (e.g. TransformerEngine's single_grouped_weight). If failures
  trace to missing kwargs or changed APIs in third-party libs, this is the cause.
- Infrastructure errors: archive.ubuntu.com unreachable or Connection timed out during package
  installation are transient CI infrastructure issues, not code problems.
  Retry CI with the same SHA. Do not investigate as code failures.

You MUST empirically verify before classifying any failure as pre-existing.
- gh pr list --repo $REPO --base dev --state merged --limit 3
- gh pr checks <PR_NUMBER> --repo $REPO on a recently merged dev PR

GitHub CI covers unit tests and some integration tests. Internal GitLab
(gitlab-master.nvidia.com) runs additional functional tests on
H100/GB200 hardware that may reveal issues GitHub CI does not catch.
These surface in statusCheckRollup as external status contexts (the
bash template already handles them via the __typename == "StatusContext"
branch).
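A first-pass triage of a downloaded job log can follow the search patterns and failure classes listed earlier in this phase. The sketch below is illustrative only: triage_log and its labels are hypothetical, the priority order (infrastructure, then import errors, then lint, then generic test failures) is an assumption, and a label is a starting point for reading the log, not a verdict.

```shell
# Hypothetical helper (not part of the workflow).
# Reads a job log on stdin and prints exactly one triage label.
triage_log() {
    awk '
        /archive\.ubuntu\.com/ || /Connection timed out/ { infra = 1 }
        /ImportError|ModuleNotFoundError/                { import_err = 1 }
        /would reformat|line-too-long|missing-docstring/ { lint = 1 }
        /FAILED|Traceback/                               { test_fail = 1 }
        END {
            if (infra)           print "transient-infra"
            else if (import_err) print "import-error"
            else if (lint)       print "lint"
            else if (test_fail)  print "test-failure"
            else                 print "unknown"
        }'
}

# Usage:
#   gh api repos/$REPO/actions/jobs/<JOB_ID>/logs | triage_log
```

A transient-infra label suggests retrying CI on the same SHA; every other label still requires the empirical comparison against recent dev runs before a failure is called pre-existing.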
gh pr ready

Run gh pr ready ONLY when every non-exempt required check on the latest
CI run (against the current HEAD SHA) satisfies BOTH:
- status == "completed" (NOT queued, in_progress, pending,
  waiting, or requested).
- conclusion ∈ {"success", "skipped", "neutral"}.

If a non-exempt check is pending/queued/in-progress: keep polling; do not
run gh pr ready. If it fails: go back to Phase 3's loop.
The exempt list (approval/coverage/docs) is defined in Phase 3; only those checks may be ignored.
A pre-existing failure (same test failing identically on recent dev CI) may be accepted, but ONLY after it has fully run, been empirically verified against dev, and documented in the PR body with evidence (dev PR number + CI run URL).
gh pr ready <PR_NUMBER> --repo $REPO
Then comment on the PR confirming it is ready for human review. The comment should include:
- Final CI status (the ALL NON-EXEMPT CHECKS COMPLETED GREEN output)
- Any pyproject.toml git source reconciliation performed

Rules:
- Two-commit policy: the Phase 1 commit plus at most one rolling fix
  commit (git commit --amend --no-edit +
  git push --force-with-lease). Never modify the Phase 1 commit
  after pushing it; never let the fix-commit count exceed one.
- /ok to test <sha>
- pull-request/<PR_NUMBER>
- svcnvidia-nemo-ci
- isort on those files