cicd
// CI/CD reference for NeMo AutoModel — pipeline structure, commit and PR workflow, CI failure investigation, and common failure patterns.
| name | cicd |
| description | CI/CD reference for NeMo AutoModel — pipeline structure, commit and PR workflow, CI failure investigation, and common failure patterns. |
| when_to_use | Investigating a CI failure, understanding the pipeline structure, writing a commit or PR, triggering CI, 'CI is red', 'how do I trigger CI', 'PR workflow', 'where are the logs', 'CI did not run', '/ok to test'. |
All commits require DCO sign-off:
git commit -s -m "feat: add new recipe for Qwen2"
If sign-off is missing on a recent commit, amend it:
git commit --amend -s
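Amending only fixes the most recent commit, so it can help to check a whole branch before pushing. The following is a minimal sketch, not part of the repository: the function names has_signoff and unsigned_commits are hypothetical, and the commit range in the usage comment assumes the base branch is origin/main.

```shell
# has_signoff: succeed if the commit message on stdin carries a DCO trailer
# of the form "Signed-off-by: Name <email>".
has_signoff() {
  grep -qE '^Signed-off-by: .+ <.+>$'
}

# unsigned_commits: print the SHA of every commit in the given range whose
# message lacks the trailer. Prints nothing when all commits are signed off.
unsigned_commits() {
  for sha in $(git rev-list "$1"); do
    git log -1 --format=%B "$sha" | has_signoff || printf '%s\n' "$sha"
  done
}

# Usage (inside a clone; origin/main as the base is an assumption):
#   unsigned_commits origin/main..HEAD
```

If several commits are listed, git rebase --signoff <base> re-applies them with the trailer added; the rewritten branch then needs a force-push.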
Follow the PR template in .github/PULL_REQUEST_TEMPLATE.md, which lists what
every PR should include.
PR title format: [{areas}] {type}: {description}
(e.g., [model] feat: Add Qwen3 VLM support)
See CONTRIBUTING.md for the full PR workflow, area/type labels, and DCO
requirements.
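The title format can be checked mechanically before opening a PR. This is a rough sketch under stated assumptions: check_pr_title is a hypothetical helper, and it accepts any single lowercase word as the type because the authoritative area and type lists live in CONTRIBUTING.md, not here.

```shell
# check_pr_title: succeed if the title looks like "[areas] type: description".
# Assumption: the type is one lowercase word; areas are any text inside [].
check_pr_title() {
  printf '%s\n' "$1" | grep -qE '^\[[^]]+] [a-z]+: .+$'
}

# Usage:
#   check_pr_title "[model] feat: Add Qwen3 VLM support" && echo "title ok"
```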
Use descriptive branch names prefixed with your username or a category:
username/feat_add_qwen2_recipe
fix/gradient_clip_nan
CI is triggered on push, not on pull_request. A bot called
copy-pr-bot controls when CI runs.
Mechanism:
1. copy-pr-bot watches for a trust signal: GPG-signed commits, or
   /ok to test <commit-sha> posted as a PR comment, which triggers the bot
   manually for that SHA.
2. copy-pr-bot copies the PR's code into the remote branch
   pull-request/<number> and pushes it, which fires CI.
Consequences:
- Without a trust signal, CI does not start until someone posts /ok to test.
- CI runs on the branch pull-request/<number>, not the author's
  feature branch.
- After a new push without a trust signal, CI does not run again until
  /ok to test <new-sha> is posted.
Pipeline structure:
lint-check
└── cicd-container-build
    ├── unit-tests-core
    ├── unit-tests-diffusion
    └── functional-tests (L0 always; L1 with needs-more-tests label; L2 on schedule)
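Because the trigger comment must name a specific commit, it is easy to post a truncated or stale SHA. The sketch below builds the comment text and rejects anything that is not a full SHA. ok_to_test_body is a hypothetical helper, and the 40-character check assumes a SHA-1 repository.

```shell
# ok_to_test_body: print the trigger comment for a full commit SHA.
# Fails on empty input, non-hex characters, or a length other than 40.
ok_to_test_body() {
  case "$1" in
    ''|*[!0-9a-f]*) echo "expected a full lowercase hex SHA" >&2; return 1 ;;
  esac
  [ "${#1}" -eq 40 ] || { echo "expected 40 characters, got ${#1}" >&2; return 1; }
  printf '/ok to test %s\n' "$1"
}

# Usage (inside a clone, with the gh CLI authenticated and HEAD already pushed):
#   gh pr comment "$PR_NUMBER" --repo NVIDIA-NeMo/Automodel \
#     --body "$(ok_to_test_body "$(git rev-parse HEAD)")"
```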
CI test scripts live in tests/ci_tests/. These are executed in the CI
pipeline and should not be run locally unless reproducing a CI failure.
# Extract PR number from branch name (e.g. pull-request/1234)
PR_NUMBER=$(git rev-parse --abbrev-ref HEAD | grep -oP '(?<=pull-request/)\d+')
gh pr view "$PR_NUMBER" --repo NVIDIA-NeMo/Automodel
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Automodel --name-only
gh pr checks "$PR_NUMBER" --repo NVIDIA-NeMo/Automodel
gh pr diff "$PR_NUMBER" --repo NVIDIA-NeMo/Automodel
# Find the failing job in the gh pr checks output.
gh run list --repo NVIDIA-NeMo/Automodel --branch "pull-request/$PR_NUMBER"
gh run view <run_id> --repo NVIDIA-NeMo/Automodel --log-failed > run.log
wc -l run.log
tail -200 run.log
sed -n '1,200p' run.log
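The grep -oP call above needs GNU grep with PCRE support, which the default grep on macOS lacks. As a hedged alternative, the sketch below extracts the PR number with POSIX parameter expansion and adds a small triage helper for run.log. Both function names are hypothetical, and the failure markers are assumptions, not patterns the pipeline defines.

```shell
# pr_number_from_branch: print the number for a pull-request/<number> branch.
# Fails if the prefix is missing or the remainder is not all digits.
pr_number_from_branch() {
  case "${1#pull-request/}" in
    "$1"|''|*[!0-9]*) return 1 ;;
  esac
  printf '%s\n' "${1#pull-request/}"
}

# first_failures: print the first N (default 20) lines of a log that match
# common failure markers, prefixed with their line numbers.
first_failures() {
  grep -nE 'Traceback|FAILED|ERROR|Error:' "$1" | head -n "${2:-20}"
}

# Usage:
#   PR_NUMBER=$(pr_number_from_branch "$(git rev-parse --abbrev-ref HEAD)")
#   first_failures run.log
```

The line numbers printed by first_failures can be fed to sed -n to read the context around each hit.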
| Symptom | Likely Cause | Action |
|---|---|---|
| CI never started | Commits not GPG-signed and no /ok to test | Post /ok to test <full-sha> on the PR |
| Lint job fails | ruff violation | Run ruff check --fix . && ruff format . locally |
| Unit tests fail | Code regression or missing import | Run failing test locally; check the PR diff |
| Functional test (L0) fails | Integration breakage | Check GPU runner logs |
| DCO sign-off missing | git commit run without -s | Amend: git commit --amend -s |
| Multi-GPU tests fail silently | CUDA_VISIBLE_DEVICES not set | Set CUDA_VISIBLE_DEVICES explicitly |
| torchrun port conflict | Multiple processes sharing a port | Pass --master_port=<unused_port> or set MASTER_PORT |
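For the last two rows, one common way to keep concurrent jobs apart is to derive the port from a per-job number and to pin the visible GPUs explicitly. This is only a heuristic sketch: port_for_job is a hypothetical helper, the offset from 29500 (torchrun's default master port) is an assumption, and nothing here verifies that the resulting port is actually free.

```shell
# port_for_job: map a numeric job ID to a port in the range 29500-30499.
# Expects a non-negative integer without leading zeros.
port_for_job() {
  case "$1" in
    ''|*[!0-9]*) echo "job id must be a non-negative integer" >&2; return 1 ;;
  esac
  echo $((29500 + $1 % 1000))
}

# Usage (train.py and the GPU list are placeholders):
#   export CUDA_VISIBLE_DEVICES=0,1
#   torchrun --nproc_per_node=2 --master_port="$(port_for_job "$$")" train.py
```

Jobs whose IDs differ by a multiple of 1000 still collide, so treat this as a way to reduce conflicts, not eliminate them.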