| name | training-smoke-test |
| description | This skill should be used when the user asks to "create a verification script", "write a test training run", "make a quick training job", "verify my change with a short run", "launch a smoke test", or wants a small Beaker job to validate that a feature works end-to-end on GPUs. Also triggers when the user mentions creating a modified 190M script to test a specific behavior.
|
Training job smoke test
Create short, single-node training scripts that exercise a specific feature in a
real GPU run. These scripts are derived from the 190M base config
(src/scripts/train/OLMo3/OLMo-3-190M.py) and train for only ~20 steps with an
eval callback to confirm the feature works.
When to use
- After implementing a new training feature (parallelism mode, eval type, data
pipeline change) that needs GPU validation beyond unit tests.
- When a CI test cannot cover the behavior (e.g., multi-GPU distributed features).
- To produce a quick smoke-test script that can be launched on Beaker.
Workflow
1. Identify what to verify
Determine the feature under test and what "success" looks like. Examples:
| Feature | Success signal |
|---|
| CP perplexity evals | LM evaluator reports CE loss and PPL without error |
| TP training | Training completes 20 steps, loss decreases |
| New data mix | Data loader produces batches without errors |
| New attention backend | Training completes 20 steps with the new backend, no kernel errors |
2. Copy and modify the 190M base script
Start from src/scripts/train/OLMo3/OLMo-3-190M.py and place the new script in
src/scripts/train/smoketests/.
Key modifications:
- File name:
<feature>-test.py in src/scripts/train/smoketests/.
- Docstring: Describe what is being verified and how to run it.
- Duration:
Duration.steps(20) โ just enough to confirm the feature works.
- Batch size:
SEQ_LENGTH * 16 โ small but enough for distributed training.
- Metrics/cancel interval: 5 steps (frequent enough to catch issues early).
- Remove unused features: Drop
Float8Config, InstanceFilterConfig,
CheckpointerCallback, FOR_BENCHMARKING, CHINCHILLA_MULTIPLE,
estimate_lr() none of these are needed for a quick verification.
- WandB: Keep but set
enabled=False.
3. Add the feature-specific config
Depending on what is being verified, add the relevant config. Common patterns:
Context Parallelism (CP):
from olmo_core.train.train_module import TransformerContextParallelConfig
cp_config=TransformerContextParallelConfig.ulysses(degree=2),
Tensor Parallelism (TP):
from olmo_core.train.train_module import TransformerTensorParallelConfig
tp_config=TransformerTensorParallelConfig(degree=2),
New attention backend:
from olmo_core.nn.attention import AttentionBackendName
model_config = TransformerConfig.olmo3_190M(
vocab_size=tokenizer_config.padded_vocab_size(),
attn_backend=AttentionBackendName.flash_4,
)
Available backends: torch, flash_2, flash_3, flash_4, te.
The base script selects between flash_2 and flash_3 based on GPU type;
a verification script can hardcode a specific backend to test it.
PPL evals (LMEvaluator):
from olmo_core.data import NumpyPaddedFSLDatasetConfig
from olmo_core.train.callbacks import LMEvaluatorCallbackConfig
.with_callback(
"lm_evaluator",
LMEvaluatorCallbackConfig(
eval_dataset=NumpyPaddedFSLDatasetConfig.from_data_mix(
DataMix.v3_small_ppl_validation,
mix_base_dir=get_root_dir(cli_context.cluster),
sequence_length=SEQ_LENGTH,
tokenizer=tokenizer_config,
work_dir=work_dir,
),
eval_interval=10,
),
)
Full recommended evals (PPL + downstream):
trainer_config = trainer_config.with_recommended_evals(
tokenizer_config, SEQ_LENGTH, cli_context.cluster, task_set="fast"
)
Note: downstream evals require full logits and are incompatible with CP or TP.
4. Launch and verify
Before launching, ask the user what priority to use for the Beaker job.
The options are: low, normal, high, urgent. Not all workspaces support
all priorities (e.g., ai2/OLMo_3 caps at normal).
python src/scripts/train/smoketests/<feature>-test.py \
dry_run test-<feature> ai2/jupiter
python src/scripts/train/smoketests/<feature>-test.py \
launch test-<feature> ai2/jupiter \
--launch.priority=<priority> \
--launch.follow=false
Always launch with --launch.follow=false (or set --launch.launch_timeout=<seconds>
to cap how long to wait). This avoids blocking the terminal indefinitely.
By default the launch command follows the job logs in the terminal until
completion, which requires step_soft_timeout and blocks the session. Using
follow=false lets the job run asynchronously.
After launching, report the Beaker experiment link to the user so they can
monitor the job.
To reconnect to a running job later:
gantry follow <experiment-id>
5. Confirm success
Check the Beaker logs for:
- Training steps completing without errors.
- Throughput / MFU at expected levels.
- Eval metrics being reported (CE loss, PPL for LM evals).
- No NCCL or distributed errors.
Naming convention
Scripts live in src/scripts/train/smoketests/ and are named <feature-slug>-test.py.
Examples:
src/scripts/train/smoketests/cp-ppl-eval-test.py
src/scripts/train/smoketests/tp-training-test.py
src/scripts/train/smoketests/cp-tp-combined-test.py
src/scripts/train/smoketests/new-data-mix-test.py
src/scripts/train/smoketests/flash4-attn-test.py