| name | training-campaign |
| description | Execute and monitor long-running RL training campaigns. Progress tracking, checkpoint management, experiment logging, and resume capabilities. |
Purpose
Long-running training management for VBot navigation:
- Execute multi-day training campaigns
- Checkpoint registry and resume
- Structured experiment logging
- Progress monitoring and alerts
AutoML-First Policy (MANDATORY):
NEVER use train.py for parameter search or reward hypothesis testing.
ALWAYS use automl.py for batch search. See .github/copilot-instructions.md for the full policy.
train.py is ONLY for: smoke tests (<500K steps), --render visual debug, or final deployment runs.
Operational Guardrails:
- The AutoML pipeline is tested and working. Do NOT re-read automl.py, train_one.py, or evaluate.py before launching.
- When asked to start/resume training, use the commands below directly.
- The pipeline handles import ordering, JSON serialization, and subprocess management internally.
Related Skills:
- training-pipeline: Hub with Quick Start commands (start here)
- curriculum-learning: Define curriculum plans
- hyperparameter-optimization: Search configurations
- reward-penalty-engineering: Reward exploration methodology
When to Use
| Task | Use This |
|---|---|
| Start training campaign | ✅ |
| Resume interrupted run | ✅ |
| Monitor progress | ✅ |
| Checkpoint management | ✅ |
| Review past experiments | ✅ (see Step 0 below) |
| Design rewards | ❌ Use reward-penalty-engineering |
⚠️ Step 0: Review Before Starting
ALWAYS review existing experiments before starting new training. See the training-pipeline skill, section "Step 0: Review Experiment History", for the full checklist.
Quick Review Commands
# Quick review: what training exists?
Get-ChildItem starter_kit_log/automl_* -Directory | Select-Object Name
Get-ChildItem runs/<env-name>/ -Directory | Sort-Object Name -Descending | Select-Object -First 5
# Check training progress of latest run
uv run starter_kit_schedule/scripts/check_training.py
Commands
Start Training
# === PRIMARY: AutoML pipeline (USE THIS for all parameter exploration) ===
uv run starter_kit_schedule/scripts/automl.py `
--mode stage `
--budget-hours 8 `
--hp-trials 15
# === SMOKE TEST ONLY (<500K steps, verify code compiles) ===
uv run scripts/train.py --env <env-name> --train-backend torch --max-env-steps 200000
# === VISUAL DEBUGGING ONLY ===
uv run scripts/train.py --env <env-name> --render
# === FINAL DEPLOYMENT RUN (after AutoML found best config) ===
uv run scripts/train.py --env <env-name> --train-backend torch
Monitor Progress
# Check AutoML state
Get-Content starter_kit_schedule/progress/automl_state.yaml
# TensorBoard (opens web dashboard)
uv run tensorboard --logdir runs/<env-name>
# List checkpoints
Get-ChildItem runs/<env-name>/ -Recurse -Filter "*.pt"
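On machines without PowerShell, the checkpoint listing above can be approximated in Python. This is a minimal sketch that assumes only the runs/<env-name>/ layout described in this document. The helper name `list_checkpoints` is hypothetical and is not part of the starter kit.

```python
from pathlib import Path


def list_checkpoints(run_root):
    """Return all *.pt files under run_root, newest first (by modification time)."""
    root = Path(run_root)
    if not root.is_dir():
        # A missing directory simply means no training output exists yet.
        return []
    return sorted(root.rglob("*.pt"), key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Replace <env-name> with the environment used for training.
    for ckpt in list_checkpoints("runs/<env-name>"):
        print(ckpt)
```

Sorting newest first puts the most likely resume candidate at the top of the list.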
Evaluate
# Play latest checkpoint (visual)
uv run scripts/play.py --env <env-name>
# Play specific checkpoint
uv run scripts/play.py --env <env-name> `
--policy runs/<env-name>/<run_dir>/checkpoints/agent.pt
Directory Structure
starter_kit_schedule/
├── templates/                          # All YAML templates & config references
│   ├── automl_config.yaml              # AutoML configuration template
│   ├── config_template.yaml            # Individual training config
│   ├── curriculum_plan_template.yaml
│   ├── plan_template.yaml
│   ├── reward_config_template.yaml
│   └── search_space_template.yaml
├── scripts/
│   ├── automl.py                       # AutoML search engine (entry point)
│   ├── train_one.py                    # Single trial subprocess
│   ├── evaluate.py                     # Read TensorBoard for metrics
│   ├── monitor_training.py             # Training monitor & TB analyzer
│   ├── eval_checkpoint.py              # Checkpoint evaluator & ranker
│   ├── smoke_test.py                   # Smoke test & reward budget auditor
│   ├── check_training.py               # Quick training progress checker
│   └── progress_watcher.py             # Generates WAKE_UP.md for agent context
├── progress/
│   ├── automl_state.yaml               # AutoML search state (primary tracking file)
│   └── WAKE_UP.md                      # Generated by progress_watcher for agent context
├── checkpoints/
│   └── registry.yaml                   # All checkpoints index
└── reward_library/                     # Archived reward/penalty components
starter_kit_log/
└── <automl_id>/                        # Self-contained per-run folder
    ├── configs/                        # HP + reward configs per trial
    ├── experiments/                    # Per-experiment summaries
    ├── index.yaml                      # Run-level index
    └── state.yaml                      # AutoML state snapshot
runs/ # Training outputs
└── <env-name>/
    └── <timestamp>_PPO/
        ├── checkpoints/                # Policy checkpoints
        ├── events.out.tfevents.*       # TensorBoard logs
        └── experiment_meta.json        # HP config snapshot
AutoML Pipeline Architecture
The AutoML pipeline runs as a single process that spawns subprocesses:
run.py (entry point, sets --env <env-name>)
└── automl.py (HP search engine)
    ├── sample_from_space() → HP config (native Python types)
    ├── _train_and_eval() → spawns subprocess:
    │   └── train_one.py (imports vbot FIRST, then motrix_rl)
    │       └── Trainer(env_name, cfg_override=rl_overrides).train()
    ├── evaluate.py → reads TensorBoard event files
    │   └── Returns: final_reward, max_reward, distance_to_target
    └── Saves state to: starter_kit_schedule/progress/automl_state.yaml
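The pattern in the tree above is: sample a config with native Python types, serialize it to JSON, and run each trial in its own subprocess. It can be illustrated with a short, self-contained sketch. This is not the code in automl.py or train_one.py. The search space, the inline child script, and the `run_trial` helper are all hypothetical stand-ins, and `sample_from_space` here is a toy reimplementation that only borrows the name.

```python
import json
import random
import subprocess
import sys

# Hypothetical search space; the real one comes from the AutoML config.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-3),   # (low, high) continuous range
    "mini_batches": [4, 8, 16],      # discrete choices
}

# Stand-in for train_one.py: reads a JSON config from argv, prints a JSON result.
CHILD_CODE = (
    "import json, sys; cfg = json.loads(sys.argv[1]); "
    "print(json.dumps({'config': cfg, 'status': 'ok'}))"
)


def sample_from_space(space, rng):
    """Draw one config using only native Python types so json.dumps works."""
    config = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            config[name] = float(rng.uniform(*spec))
        else:
            config[name] = rng.choice(spec)
    return config


def run_trial(config, timeout_s=60):
    """Run one trial in a fresh interpreter and parse its JSON result."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, json.dumps(config)],
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    rng = random.Random(0)
    print(run_trial(sample_from_space(SEARCH_SPACE, rng)))
```

Running each trial in a fresh interpreter keeps import order and GPU state isolated between trials, which is presumably why the pipeline spawns train_one.py rather than calling the trainer in-process.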
Expected Training Times
| Hardware | 50M Steps | 100M Steps |
|---|---|---|
| RTX 3090 | ~4 hours | ~8 hours |
| RTX 4090 | ~2.5 hours | ~5 hours |
| A100 | ~1.5 hours | ~3 hours |
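The table scales linearly (doubling the steps doubles the time), so other step budgets can be estimated the same way. The sketch below copies the 50M-step figures from the table and assumes linear scaling holds outside the listed range, which is an approximation. The function names are illustrative only.

```python
# Approximate hours per 50M steps, copied from the table above.
HOURS_PER_50M = {"RTX 3090": 4.0, "RTX 4090": 2.5, "A100": 1.5}


def estimate_hours(hardware, total_steps):
    """Linearly scale the table's 50M-step figure to another step budget."""
    return HOURS_PER_50M[hardware] * total_steps / 50_000_000


def steps_per_second(hardware):
    """Throughput implied by the table's 50M-step figure."""
    return 50_000_000 / (HOURS_PER_50M[hardware] * 3600)


if __name__ == "__main__":
    for hw in HOURS_PER_50M:
        print(f"{hw}: 200M steps ~ {estimate_hours(hw, 200_000_000):.1f} h, "
              f"~{steps_per_second(hw):,.0f} steps/s")
```

For example, the RTX 3090 row implies roughly 3,500 environment steps per second, so a 200K-step smoke test should finish in about a minute of pure training time.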
Troubleshooting
| Issue | Solution |
|---|---|
| Training stuck | Check GPU memory, reduce num_envs |
| OOM error | Reduce num_envs or mini_batches |
| Resume fails | Check current_run.yaml for last checkpoint |
| Metrics missing | Check metrics.jsonl write permissions |
| Robot becomes lazy in long training runs | Anti-laziness mechanisms are disabled or arrival_bonus is too small. See the Lazy Robot case study in reward-penalty-engineering |
| Reward looks good but robot not navigating | Check distance + reached% metrics, not just reward. High reward can come from alive_bonus accumulation |
Best Practices
- Checkpoint every 500-1000 iters - Training can be interrupted
- Use separate log directories - One per experiment
- Monitor GPU memory - Set alerts at 90% usage
- Version control configs - Store templates in templates/
- Back up best checkpoints - Before advancing stages
- Use --resume liberally - Don't restart from scratch
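The 90% GPU memory alert can be checked with a small script around `nvidia-smi`. This is a sketch, not part of the starter kit. It assumes an NVIDIA driver is installed and uses the `--query-gpu=memory.used,memory.total` CSV query. The helper names and the way the alert is reported are hypothetical.

```python
import subprocess

ALERT_FRACTION = 0.90  # threshold from the best-practice list above


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one GPU per line) into usage fractions."""
    fractions = []
    for line in csv_text.strip().splitlines():
        used, total = (float(x) for x in line.split(","))
        fractions.append(used / total)
    return fractions


def gpus_over_threshold(csv_text, threshold=ALERT_FRACTION):
    """Return indices of GPUs whose memory usage is at or above the threshold."""
    return [i for i, f in enumerate(parse_gpu_memory(csv_text)) if f >= threshold]


def query_nvidia_smi():
    """Ask nvidia-smi for used/total memory in MiB; requires an NVIDIA driver."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
```

On the training machine, `gpus_over_threshold(query_nvidia_smi())` returns the indices of GPUs that need attention. An empty list means every GPU is below the threshold.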