launcher-config
| name | launcher-config |
| description | Configure NeMo AutoModel job launches for interactive runs, Slurm clusters, and SkyPilot cloud execution. |
| when_to_use | Configuring Slurm or SkyPilot job submission, setting up multi-node launch scripts, debugging job submission failures, or switching between interactive and cluster launch modes. |
NeMo AutoModel supports three launch methods: interactive (torchrun), Slurm (HPC clusters), and SkyPilot (cloud-agnostic).
# Single GPU
automodel finetune llm -c config.yaml
# Multi-GPU (all GPUs on current node)
torchrun --nproc_per_node=8 -m nemo_automodel._cli.app finetune llm -c config.yaml
No additional YAML section is needed for interactive mode. The CLI routes to torchrun automatically when no slurm: or skypilot: section is present in the config.
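The routing rule described above can be sketched in a few lines. This is only an illustration of the stated behavior, not the code in _cli/app.py: the function name `select_launcher` is hypothetical, and raising an error when both sections are present is an assumption, since the source does not say how that case is handled.

```python
from typing import Any, Mapping


def select_launcher(config: Mapping[str, Any]) -> str:
    """Return the launcher implied by the top-level sections of a parsed config."""
    has_slurm = "slurm" in config
    has_skypilot = "skypilot" in config
    if has_slurm and has_skypilot:
        # Assumption: two launcher sections are treated as a configuration error.
        raise ValueError("config defines both 'slurm' and 'skypilot' sections")
    if has_slurm:
        return "slurm"
    if has_skypilot:
        return "skypilot"
    # Neither section present: fall back to the interactive (torchrun) path.
    return "interactive"
```

For example, `select_launcher({"model": {}})` selects the interactive path, while a config with a `slurm` key selects Slurm.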
The SlurmConfig dataclass generates an SBATCH script from a template.
slurm:
  job_name: llm_finetune
  nodes: 2
  ntasks_per_node: 8
  time: "04:00:00"
  account: my_account
  partition: batch
  container_image: nvcr.io/nvidia/nemo:dev
  hf_home: ~/.cache/huggingface
  extra_mounts:
    - source: /data
      dest: /data
  env_vars:
    WANDB_API_KEY: "${WANDB_API_KEY}"
    HF_TOKEN: "${HF_TOKEN}"
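To make the idea of generating an SBATCH script from a dataclass and a template concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `SlurmSketch`, `SBATCH_TEMPLATE`, and `render_sbatch` are hypothetical names, only a subset of the fields is covered, and the real template in components/launcher/slurm/template.py may look quite different. The `--container-image` option on `srun` is the one provided by the Pyxis plugin.

```python
import shlex
from dataclasses import asdict, dataclass, field
from string import Template

# Hypothetical template; the project's actual template may differ.
SBATCH_TEMPLATE = Template("""#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --nodes=$nodes
#SBATCH --ntasks-per-node=$ntasks_per_node
#SBATCH --time=$time
#SBATCH --account=$account
#SBATCH --partition=$partition

$exports
srun --container-image=$container_image $command
""")


@dataclass
class SlurmSketch:
    """Illustrative stand-in for a subset of the SlurmConfig fields."""

    job_name: str
    nodes: int
    ntasks_per_node: int
    time: str
    account: str
    partition: str
    container_image: str
    env_vars: dict = field(default_factory=dict)


def render_sbatch(cfg: SlurmSketch, command: str) -> str:
    """Fill the template with the config values and the training command."""
    values = asdict(cfg)
    env_vars = values.pop("env_vars")
    # One export line per variable, shell-quoted so values with spaces survive.
    exports = "\n".join(
        f"export {name}={shlex.quote(value)}"
        for name, value in sorted(env_vars.items())
    )
    return SBATCH_TEMPLATE.substitute(values, exports=exports, command=command)
```

Calling `render_sbatch(cfg, "automodel finetune llm -c config.yaml")` returns the script text, which could then be written to a file and passed to `sbatch`.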
- job_name: Slurm job identifier
- nodes: number of nodes to request
- ntasks_per_node: number of tasks (GPUs) per node
- time: wall-time limit in HH:MM:SS format
- account, partition: Slurm scheduling parameters
- container_image: Enroot/Pyxis container image path
- nemo_mount: mount point for NeMo AutoModel source inside the container
- hf_home: HuggingFace cache directory path
- extra_mounts: list of VolumeMapping(source, dest) for additional container bind mounts
- master_port: port for distributed communication (default 13742)
- env_vars: environment variables passed into the job
- nsys_enabled: when true, wraps the training command with nsys profile for Nsight Systems profiling

The SkyPilotConfig dataclass defines cloud job parameters.
skypilot:
  cloud: aws
  accelerators: "H100:8"
  num_nodes: 2
  use_spot: true
  disk_size: 200
  region: us-east-1
  setup: "pip install nemo-automodel"
  env_vars:
    HF_TOKEN: "${HF_TOKEN}"
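As a rough illustration of how a `skypilot:` section like the one above could map onto a SkyPilot task specification, the sketch below builds a plain dictionary in the shape of a SkyPilot task YAML (`resources`, `num_nodes`, `setup`, `run`, `envs`). The function name `skypilot_task_spec` is hypothetical, and the mapping is an assumption; the source does not describe how SkyPilotConfig actually submits the job.

```python
from typing import Any, Dict

# Keys from the skypilot: section that belong under "resources" in a task spec.
RESOURCE_KEYS = ("cloud", "accelerators", "use_spot", "disk_size", "region")


def skypilot_task_spec(section: Dict[str, Any], run_command: str) -> Dict[str, Any]:
    """Translate a skypilot: config section into a task-spec dictionary."""
    resources = {key: section[key] for key in RESOURCE_KEYS if key in section}
    task: Dict[str, Any] = {
        "resources": resources,
        # Assumption: default to a single node when num_nodes is omitted.
        "num_nodes": section.get("num_nodes", 1),
        "run": run_command,
    }
    if "setup" in section:
        task["setup"] = section["setup"]
    if "env_vars" in section:
        # SkyPilot task specs call this key "envs".
        task["envs"] = dict(section["env_vars"])
    return task
```

Dumping the returned dictionary as YAML would give a file that can be inspected before anything is launched.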
- cloud: target cloud provider (aws, gcp, azure, lambda, kubernetes)
- accelerators: GPU type and count (e.g., "H100:8", "A100-80GB:4")
- num_nodes: number of cloud instances
- use_spot: use preemptible/spot instances for cost savings
- disk_size: disk size in GB per node
- region: cloud region for instance placement
- setup: shell commands to run before the training job (e.g., install dependencies)
- env_vars: environment variables for the job

For multi-node training (both Slurm and SkyPilot), the launcher automatically configures:
- MASTER_ADDR: hostname of the first node
- MASTER_PORT: port for rendezvous (default 13742)
- WORLD_SIZE: total number of processes (nodes * ntasks_per_node)

Enable Nsight Systems profiling in Slurm jobs:
slurm:
  nsys_enabled: true
This wraps the training command with nsys profile, producing a .nsys-rep file for performance analysis.
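The wrapping step can be pictured as prefixing the training command with `nsys profile`. The sketch below is an assumption about the shape of that step, not the launcher's actual code: `wrap_with_nsys` is a hypothetical helper, and the only nsys option used is `-o`, which sets the base name of the report file. The real launcher may pass additional options.

```python
from typing import List


def wrap_with_nsys(command: List[str], enabled: bool, output: str = "profile") -> List[str]:
    """Prefix the command with `nsys profile` when profiling is enabled."""
    if not enabled:
        # Profiling disabled: return the command unchanged (as a copy).
        return list(command)
    # With -o profile, nsys writes profile.nsys-rep in the working directory.
    return ["nsys", "profile", "-o", output, *command]
```

For instance, wrapping `["torchrun", "--nproc_per_node=8", "train.py"]` yields a command list that starts with `nsys profile -o profile`; `train.py` is a placeholder script name.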
Relevant source files:

- components/launcher/slurm/config.py - SlurmConfig dataclass, VolumeMapping
- components/launcher/slurm/template.py - SBATCH script template generation
- components/launcher/slurm/utils.py - Slurm submission utilities
- components/launcher/skypilot/config.py - SkyPilotConfig dataclass
- _cli/app.py - CLI entry point and launcher routing logic

Common pitfalls:

- If master_port (13742) is in use by another job on the same node, change it to avoid connection failures.
- The source path in extra_mounts must exist on all nodes in the allocation. Missing paths cause container startup failures.
- Spot instances (use_spot: true) may be preempted by the cloud provider. Enable checkpointing with short intervals to minimize lost work.
- Use ${VAR} syntax in YAML for shell variable expansion. Bare variable names will not be expanded.
- If the time limit is too short, an in-progress async checkpoint write may be killed before completion, resulting in a corrupted checkpoint. Leave at least 5-10 minutes of margin.
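Some of these problems can be caught before submission with a small pre-flight script. The sketch below is not part of NeMo AutoModel; all three helper names are hypothetical, and the checks run only on the machine where the script executes, so they cannot confirm that a port is free or a path exists on every node of an allocation.

```python
import os
import socket
from pathlib import Path
from typing import Dict, List


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the port can be bound on this machine right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def missing_mount_sources(extra_mounts: List[Dict[str, str]]) -> List[str]:
    """List the source paths of extra_mounts that do not exist locally."""
    return [m["source"] for m in extra_mounts if not Path(m["source"]).exists()]


def unresolved_env_vars(env_vars: Dict[str, str]) -> List[str]:
    """List env_vars entries whose ${VAR} reference is unset in this shell.

    os.path.expandvars leaves unknown references untouched, so a leftover
    "$" is used as a heuristic for an unset variable. Values that
    legitimately contain "$" would be reported as false positives.
    """
    return [
        name
        for name, value in env_vars.items()
        if "$" in os.path.expandvars(value)
    ]
```

A submission wrapper could call `port_is_free(13742)` and the two list helpers, then print a warning for anything they report before handing the job to the scheduler.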