
terra-rl


Updated May 24, 2026 at 09:01

Run, validate, debug, and document Terra RL training in the terra and terra-baselines repositories. Use for Euler Slurm jobs, W&B run comparison, PPO/default solo-excavator configs, JAX/CUDA runtime preflight, first-update smoke tests, experiment ledgers, and diagnosis of slow compilation or GPU training failures.


Source: Idate96/codex_skills (creator: Idate96)


Terra RL

Use this skill for Terra policy training work spanning:

/home/lorenzo/moleworks/terra
/home/lorenzo/moleworks/terra-baselines
Euler workspace /cluster/home/lterenzi/codex_terra_edge_validation
W&B project aless-weber-eth/mixed-agents

Core Rule

Do not treat jax.devices() as a sufficient GPU preflight. Before launching expensive training, prove the CUDA runtime can execute the paths the trainer needs:

venv CUDA library paths include cuDNN, CUPTI, cuBLAS, NVRTC, and NCCL.
JAX sees the expected GPU count.
A tiny jitted conv backward pass completes on GPU.
For multi-GPU jobs, a pmap all-reduce completes through NCCL.
A training smoke either completes update 1 or is explicitly recorded as not a passed gate.

Use scripts/euler/check_jax_runtime.py inside the Slurm allocation after exporting the same environment that the training job will use.
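The first item in the list above (library paths) can be checked cheaply before any JAX import. The following is a minimal sketch, not the contents of check_jax_runtime.py: the function name and the component tuple are illustrative, and the nvidia/<component>/lib layout is assumed from the pip-wheel paths exported elsewhere in this skill. It only inspects path strings, so it does not replace the conv-backward and NCCL checks.

```python
# Component directory names follow the pip-wheel layout used in the
# LD_LIBRARY_PATH exports of this skill (assumption: nvidia/<name>/lib).
REQUIRED_COMPONENTS = ("cudnn", "cuda_cupti", "cublas", "cuda_nvrtc", "nccl")


def missing_cuda_components(ld_library_path: str) -> list:
    """Return the required components that have no entry in the path string."""
    entries = [entry for entry in ld_library_path.split(":") if entry]
    missing = []
    for component in REQUIRED_COMPONENTS:
        suffix = f"nvidia/{component}/lib"
        # A component counts as present if any entry ends with its lib dir.
        if not any(entry.rstrip("/").endswith(suffix) for entry in entries):
            missing.append(component)
    return missing
```

Call it with os.environ.get("LD_LIBRARY_PATH", "") inside the allocation and abort the job if the returned list is non-empty.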

Environment Selection

Use these environments by default:

Local CPU gates: /home/lorenzo/moleworks/.venv-terra-uv. This is the canonical uv/venv for syntax checks, CPU action-mask/state gates, and small non-GPU probes. Set JAX_PLATFORMS=cpu.
Local 24 GB GPU tests: /home/lorenzo/moleworks/.venv-terra-gpu-uv. Use this for one-GPU first-update smoke tests, CUDA runtime checks, and env-capacity sweeps on the local RTX 4090. Do not use the CPU-only .venv-terra-uv to decide GPU memory fit.
Euler training: WORK=/cluster/home/lterenzi/codex_terra_edge_validation with VENV=/cluster/scratch/lterenzi/codex_terra_edge_venv. Use this for real Slurm jobs and W&B training runs.

Always set:

export PYTHONPATH=/home/lorenzo/moleworks/terra:/home/lorenzo/moleworks/terra-baselines:${PYTHONPATH:-}

or the equivalent Euler WORK paths before invoking train_mixed.py or validation scripts.

Euler GPU Selection

Use only NVIDIA GeForce RTX 3090 or NVIDIA GeForce RTX 4090 nodes for Terra RL training unless the user explicitly asks for a different GPU. Treat all other GPU types as invalid for baseline or mask validation runs, including Quadro RTX 6000, Titan RTX, RTX 2080 Ti, A100, and RTX Pro 6000.

Known Euler node families:

RTX 3090: eu-g4-[001-032]
RTX 4090: eu-g6-[001-080]

Do not use generic GPU requests such as #SBATCH --gpus=gpu:4, #SBATCH --gpus=4, or the old train_cluster.sh-style request on their own. The gpuhe.* partitions contain multiple GPU families, and generic requests have landed Terra jobs on Quadro RTX 6000 and RTX 2080 Ti nodes. If you use a generic count, combine it with a 3090/4090 node-family restriction, for example:

#SBATCH --partition=gpuhe.120h
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus=4
#SBATCH --nodelist=eu-g4-[001-032],eu-g6-[001-080]

For a single family, use only that node list:

# 3090-only
#SBATCH --nodelist=eu-g4-[001-032]

# 4090-only
#SBATCH --nodelist=eu-g6-[001-080]

Every Slurm script must hard-fail before the JAX preflight and before W&B training if the actual allocation is not 3090/4090-only. Put this guard after loading modules and before check_jax_runtime.py:

EXPECTED_GPUS="${EXPECTED_GPUS:-4}"
GPU_NAMES="$(nvidia-smi --query-gpu=name --format=csv,noheader)"
echo "Allocated GPUs:"
echo "$GPU_NAMES"

GPU_COUNT="$(printf "%s\n" "$GPU_NAMES" | sed '/^$/d' | wc -l)"
if [ "$GPU_COUNT" -ne "$EXPECTED_GPUS" ]; then
  echo "Expected $EXPECTED_GPUS GPUs, got $GPU_COUNT" >&2
  exit 42
fi

if printf "%s\n" "$GPU_NAMES" | grep -Evq 'NVIDIA GeForce RTX (3090|4090)$'; then
  echo "Refusing to train on non-3090/4090 GPU allocation" >&2
  exit 43
fi

After submission, verify the allocation with both:

sacct -j "$JOB_ID" --format=JobID,State,Elapsed,NodeList%20,ReqTRES%80,AllocTRES%80 -P
tail -n 80 "$LOG_PATH"

If AllocTRES or nvidia-smi shows anything other than nvidia_geforce_rtx_3090 or nvidia_geforce_rtx_4090, cancel the job and relaunch. Do not count runtime preflight or W&B history from a disallowed GPU allocation as validation evidence.
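The AllocTRES check can also be scripted. This sketch assumes typed GPU entries appear in the usual Slurm form gres/gpu:<type>=<count>; the helper names and the sample strings in the usage note are illustrative, not captured sacct output.

```python
ALLOWED_GPU_TYPES = {"nvidia_geforce_rtx_3090", "nvidia_geforce_rtx_4090"}


def typed_gpus(alloc_tres: str) -> dict:
    """Extract {gpu_type: count} from typed gres/gpu:<type>=<n> entries."""
    gpus = {}
    for item in alloc_tres.split(","):
        key, _, value = item.partition("=")
        if key.startswith("gres/gpu:"):
            gpus[key[len("gres/gpu:"):]] = int(value)
    return gpus


def allocation_is_valid(alloc_tres: str, expected_gpus: int = 4) -> bool:
    """True only if every typed GPU is a 3090/4090 and the count matches."""
    gpus = typed_gpus(alloc_tres)
    return (
        bool(gpus)  # an allocation with no typed GPU entry is never accepted
        and set(gpus) <= ALLOWED_GPU_TYPES
        and sum(gpus.values()) == expected_gpus
    )
```

For example, allocation_is_valid("cpu=16,gres/gpu:nvidia_geforce_rtx_3090=4,gres/gpu=4") is true, while a string carrying a quadro_rtx_6000 entry is rejected. Keep the in-job nvidia-smi guard as well; this only covers the accounting side.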

Launch Workflow

Check both repos with git status --short --branch; never revert unrelated dirty files.
If copying to Euler, sync only intended changes into the isolated workspace.
Run fast isolated gates before any Slurm launch:
- python3 -m py_compile for changed Python files.
- scripts/validation/validate_edge_mask_changes.py --case training-accounting --jax-platforms cpu
- scripts/validation/validate_edge_mask_changes.py --case ppo-mask --jax-platforms cpu
- --case model-policy --jax-platforms cpu
- --case model-edge-no-mask --jax-platforms cpu
- --case model-critic-affordance-shapes --jax-platforms cpu
- --case checkpoint-config-restore --jax-platforms cpu
- --case timeout-bootstrap-value --jax-platforms cpu
- --case gae-timeout-bootstrap --jax-platforms cpu
- --case state-action-mask --jax-platforms cpu --disable-jit
- --case state-step-dispatch --jax-platforms cpu --disable-jit
- --case synthetic-env-action-mask --jax-platforms cpu --disable-jit
- --case env-action-mask --jax-platforms cpu --dataset-path /home/lorenzo/moleworks/terra_data/train --dataset-size 1
- --case synthetic-step-fast-reset --jax-platforms cpu --disable-jit
- --case synthetic-batch-step-fast-reset --jax-platforms cpu
- --case env-episode-progress --jax-platforms cpu --disable-jit
Select only RTX 3090/4090 nodes and include the hard GPU-type guard from this skill in the Slurm job before any JAX or W&B training command.
Run the GPU runtime preflight from this skill in the Slurm job before training.
For new runtime/config changes, run a first-update smoke with W&B disabled before a full run.
Launch the production job only after the smoke evidence is real: update 1 completed, not only dataset/model initialization.
For A/B comparisons, verify each side independently past update 1. Do not count a Slurm RUNNING job as healthy until the log shows completed updates.
For performance/architecture work, ask Oracle with the relevant diffs and measured logs, not only at final review. Ask it to look for missed code hotspots, algorithmic options, profiling strategy, data layout changes, JAX/XLA architecture issues, and RL-level speed/learning tradeoffs; then verify recommendations locally before retaining changes.
Keep docs/EXPERIMENTS_RUNNING.md and docs/EXPERIMENTS_LOG.md current with job ids, W&B ids, exact failure signatures, and whether W&B has real history.
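The "past update 1" rule in the workflow above can be turned into a small log check. This is a sketch under two assumptions: the trainer prints progress as Training: <i>/<total> (the form quoted under Failure Lessons), and <i> counts completed updates. The function names are hypothetical; the failure signatures are the two recorded in this skill.

```python
import re
from typing import Optional

PROGRESS_RE = re.compile(r"Training:\s*(\d+)/(\d+)")
# Failure signatures recorded under Failure Lessons.
FAILURE_SIGNATURES = (
    "CUDNN_STATUS_INTERNAL_ERROR",
    "Unable to load NCCL library",
)


def completed_updates(log_text: str) -> Optional[int]:
    """Highest update index seen in the log, or None if training never started."""
    matches = PROGRESS_RE.findall(log_text)
    if not matches:
        return None
    return max(int(done) for done, _total in matches)


def smoke_passed(log_text: str) -> bool:
    """A smoke passes only with update >= 1 and no known failure signature."""
    if any(signature in log_text for signature in FAILURE_SIGNATURES):
        return False
    updates = completed_updates(log_text)
    return updates is not None and updates >= 1
```

A log that stops at Training: 0/95367 therefore reports a failed smoke even while Slurm still shows RUNNING.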

Current ResMap64 R1/R2 Launch

For the current combined-fix run, use the paired WIP trees and the prepared script:

Local Terra tree: /home/lorenzo/moleworks/terra_mask_wip
Local baselines tree: /home/lorenzo/moleworks/terra-baselines_mask_wip
Euler workspace: /cluster/home/lterenzi/codex_terra_edge_validation
Script: scripts/euler/terra_train_resmap64_phase_4gpu.sbatch

The intended architecture and semantics are:

ResMap64 delayed-downsample encoder: --map_encoder resnet_delayed --map_feature_dim 128 --use_map_derived_channels
Separate actor/critic trunks: --separate_actor_critic_trunks
Action mask disabled: --disable_action_mask
Critic-only affordances: --use_critic_affordances --include_episode_progress --edge_features_dim 10
PPO shape: --num_devices 4 --num_envs_per_device 1024 --num_steps 32 --num_minibatches 32
Time limit: current user request is to run this in gpuhe.120h with 5-00:00:00.
Timeout phase: randomize initial env_steps/episode_progress once after startup reset. This desynchronizes max-step timeouts while leaving normal reset behavior unchanged.

Do not add actor-visible affordances to this run. Anything actor-visible must be deployable on the real robot with the same semantics; keep that as a later ablation.

Timeout handling for this run should be:

Preserve final_observation through env reset.
Bootstrap max-step truncations from the pre-reset final observation.
Do not bootstrap true task terminals.
Stop recursive GAE at any reset boundary so reset-episode advantages do not leak backward.
Stagger initial episode ages so max-step timeouts do not arrive as one synchronized rollout step.
Pay terminal success reward only on task_done, never on max-step timeout. A timeout with partial completion may have progress rewards, but must not receive the terminal bonus.
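The bootstrap and GAE rules above can be sketched for a single environment in plain Python. This is an illustration of the intended semantics, not the code in utils_ppo.py: the function name and argument layout are hypothetical, dones marks any reset boundary, timeouts marks max-step truncations, and final_values holds the critic value of the preserved pre-reset final observation.

```python
def gae_with_timeouts(rewards, values, dones, timeouts, final_values,
                      last_value, gamma=0.9984, lam=0.95):
    """GAE for one env where truncations bootstrap and resets cut recursion."""
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = last_value  # value of the observation after the rollout
    for t in reversed(range(len(rewards))):
        if dones[t] and timeouts[t]:
            # Max-step truncation: bootstrap from the pre-reset observation.
            bootstrap = final_values[t]
        elif dones[t]:
            # True task terminal: no bootstrap.
            bootstrap = 0.0
        else:
            bootstrap = next_value
        delta = rewards[t] + gamma * bootstrap - values[t]
        # Stop the recursion at any reset boundary so the next episode's
        # advantages do not leak backward.
        carry = 0.0 if dones[t] else next_advantage
        advantages[t] = delta + gamma * lam * carry
        next_advantage = advantages[t]
        next_value = values[t]
    return advantages
```

With gamma = lam = 1, rewards of 1, zero values, a timeout at step 1 whose final value is 10, and last_value = 5, the advantages are [12, 11, 6]. Marking step 1 as a true terminal instead gives [2, 1, 6].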

Before submitting this script from a fresh work session, verify the local gates include:

python3 -m py_compile \
  /home/lorenzo/moleworks/terra_mask_wip/terra/env.py \
  /home/lorenzo/moleworks/terra_mask_wip/terra/state.py \
  /home/lorenzo/moleworks/terra-baselines_mask_wip/train.py \
  /home/lorenzo/moleworks/terra-baselines_mask_wip/train_mixed.py \
  /home/lorenzo/moleworks/terra-baselines_mask_wip/utils/models.py \
  /home/lorenzo/moleworks/terra-baselines_mask_wip/utils/utils_ppo.py \
  /home/lorenzo/moleworks/terra-baselines_mask_wip/scripts/validation/validate_edge_mask_changes.py

JAX_PLATFORMS=cpu PYTHONPATH=/home/lorenzo/moleworks/terra_mask_wip:/home/lorenzo/moleworks/terra-baselines_mask_wip \
  /home/lorenzo/moleworks/.venv-terra-uv/bin/python \
  /home/lorenzo/moleworks/terra-baselines_mask_wip/scripts/validation/validate_edge_mask_changes.py \
  --case all --jax-platforms cpu \
  --dataset-path /home/lorenzo/moleworks/terra_data/train --dataset-size 1

The latest local reviewed gate on 2026-05-15 for the staggered-timeout patch passed a one-update RTX 4090 smoke with this shape in 432.00s at 79.36 steps/s. Do not skip the script's own Euler W&B-disabled smoke; local smoke does not prove multi-GPU NCCL, allocation type, or cluster CUDA paths.

When monitoring the first online run, check these before calling it healthy:

Slurm allocation is exactly four RTX 3090/4090 GPUs.
check_jax_runtime.py --min-devices 4 passed.
The script's W&B-disabled full-shape smoke completed update 1.
The online W&B run id and URL are in docs/EXPERIMENTS_RUNNING.md.
First online update completed, not just model/dataset initialization.

For the value-spike question, watch these together:

value_loss, entropy, explained_variance, and sched/entropy_coef
train/done_rate, train/task_done_rate, train/timeout_rate
train/mean_episode_progress, train/max_episode_progress
train/value_loss_timeout_bucket_*
train/explained_variance_timeout_bucket_*
train/timeout_bucket_*_count
affordance/do_valid_rate, affordance/valid_action_fraction, affordance/legal_edge_in_cone, affordance/blocked_edge_in_cone, and affordance/completion

Do not interpret empty timeout buckets as stable value learning; check the bucket counts.

Current Default Solo Excavator Run

Use this as the baseline unless the user asks for another preset:

export DATASET_PATH=/cluster/project/rsl/alesweber/TerraProject/terra/data/terra/train
export DATASET_SIZE=600

python train_mixed.py \
  --config solo_excavator \
  --num_devices 4 \
  --num_envs_per_device 1024 \
  --num_steps 32 \
  --update_epochs 2 \
  --num_minibatches 16 \
  --total_timesteps 50000000000 \
  --log_train_interval 1 \
  --log_eval_interval 100 \
  --checkpoint_interval 100 \
  --eval_episodes 100

Effective defaults:

Agent/action: agent_types=[0], action_types=[0] tracked excavator.
Map: foundations_real_ring, DATASET_SIZE=600, max_steps=550.
Total envs: 4 * 1024 = 4096.
Env steps/update: 32 * 4096 = 131072.
Corrected update count for 50B global timesteps: 381469. Older code divided by num_devices twice and printed 95367, which only executed about 12.5B actual global env steps.
PPO: lr=3e-4, gamma=0.9984, gae_lambda=0.95, clip_eps=0.2, ent_coef=0.06, vf_coef=2.0, max_grad_norm=0.5.
Entropy schedule: 0.15 -> 0.005 over 9500 updates.
Rewards: dump_bonus_mult=0.5, excavator_relocate_dumped_mult=1.5, excavator_relocate_dug_dirt_mult=1.5.
PPO action masking is disabled by default; use --enable_action_mask only for explicit coarse action-availability mask experiments until the full-shape masked PPO path is stable.
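The update-count arithmetic in the defaults above can be reproduced directly. The extra division by num_devices is modeled here as one additional factor in the divisor; that is an assumption about how the older code arrived at 95367, chosen because it matches the printed number.

```python
NUM_DEVICES = 4
NUM_ENVS_PER_DEVICE = 1024
NUM_STEPS = 32
TOTAL_TIMESTEPS = 50_000_000_000  # global env steps

total_envs = NUM_DEVICES * NUM_ENVS_PER_DEVICE            # 4096
env_steps_per_update = NUM_STEPS * total_envs             # 131072
num_updates = TOTAL_TIMESTEPS // env_steps_per_update     # 381469

# Older behaviour (assumed form): num_devices applied a second time.
old_num_updates = TOTAL_TIMESTEPS // (env_steps_per_update * NUM_DEVICES)  # 95367
old_actual_steps = old_num_updates * env_steps_per_update  # about 12.5B

print(num_updates, old_num_updates, old_actual_steps)
```

When comparing against older W&B runs, convert by global env steps rather than by update index, since the older runs stopped at roughly a quarter of the requested budget.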

Reward Guidance

Treat the default reward setup as the baseline while validating runtime, action masking, and PPO stability. Do not mix reward-weight changes into mask/no-mask or runtime A/B tests unless the user explicitly asks; otherwise the comparison stops answering the original question.

Use rewards diagnostically, not as the only success criterion:

eval/rewards rising while eval/max_reward < 1 usually means dense-reward progress, not task completion.
eval/DO rising without eval/positive_terminations can mean the policy learned to dig/dump activity but not finish the excavation objective.
High eval/DO_NOTHING % means the policy is still dithering or avoiding committed actions.
progress/episode_completion_rate can include timeouts/failures, so it is not a substitute for eval/positive_terminations.

Reward finetuning should come after a default-config policy shows repeatable terminal success. Then start from the best checkpoint and change one reward knob at a time, logging the parent checkpoint, diff, W&B run id, and expected behavioral effect in the experiment ledger. Candidate later finetunes include terminal reward backfill, dump/relocation multipliers, and phase- or edge-specific shaping, but do not use these to rescue an unvalidated training setup.

W&B Success Calibration

Use aless-weber-eth/mixed-agents as the historical comparison project. Prefer solo_excavator runs as the apples-to-apples baseline; trench_excavator and trench_masked_excavator are useful upper-bound references but use different task configs.

Closest successful solo_excavator references:

o9aewzsx: 4 GPUs, 1024 envs/device, eval/positive_terminations=9.56, eval/rewards=0.174, eval/max_reward=6.87, eval/DO=0.343, DO_NOTHING ~= 0.
hcwvorkm: 4 GPUs, 1024 envs/device, eval/positive_terminations=3.45, eval/rewards=0.157, eval/max_reward=6.87, eval/DO=0.311, DO_NOTHING=0.018.
o8bpdoex: 4 GPUs, 128 envs/device, eval/positive_terminations=4.01, eval/rewards=0.180, eval/max_reward=6.87, eval/DO=0.316, DO_NOTHING=0.011. This run started with nonzero success, so treat it as a performance reference, not a clean from-scratch learning curve.

Strong non-solo references: xjncfmr6, jnwj5tnj, and rov40bmt reached eval/positive_terminations ~= 26-31, eval/rewards ~= 0.37-0.43, eval/max_reward ~= 6.87, and eval/DO ~= 0.33-0.35.

Historical replay compatibility:

Do not judge old healthy policies only in the current multi-agent worktrees. hcwvorkm fails under current Terra replay (0 positive terminations in a 64-env local check) even with action masking disabled, but replays healthily with training-era code.
For hcwvorkm, use Terra commit de698b7f and terra-baselines commit 8091d3e when trying to reproduce W&B behavior locally. A local training-era eval on 2026-05-11 with 64 envs, 550 steps, seed 123 produced positive_terminations_per_env=7.5625, reward=0.1226, max_reward=6.8679, DO=0.2706, and DO_NOTHING=0.0762.
The single-env training-era GIF /home/lorenzo/moleworks/terra-baselines/logs/local_policy_play/hcwvorkm_training_era_seed123.gif terminated successfully in 61 steps with return about 8.41.
Old checkpoints do not store use_action_mask. When unpickled with current classes, plain getattr(config, "use_action_mask", True) silently inherits the new class default. For legacy replay, infer whether the field was saved on the instance; if it was missing, default to False to match the historical no-mask policy/eval path.
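The getattr pitfall in the last item can be reproduced in a few lines. TrainConfig below is a hypothetical stand-in for the real config class, with a class default of True chosen for illustration. The legacy instance is simulated by filling __dict__ without running __init__, which is what default unpickling of a plain class instance does.

```python
from dataclasses import dataclass


@dataclass
class TrainConfig:  # hypothetical stand-in for the current config class
    lr: float = 3e-4
    use_action_mask: bool = True  # newer class-level default


def legacy_use_action_mask(config) -> bool:
    """Return the saved flag, or False if the instance never stored one."""
    # vars() only sees attributes saved on the instance, so a class-level
    # default added later cannot leak into a legacy checkpoint.
    return bool(vars(config).get("use_action_mask", False))


# Simulate an old checkpoint whose pickled state predates the field.
old_config = object.__new__(TrainConfig)
old_config.__dict__.update({"lr": 3e-4})

print(getattr(old_config, "use_action_mask", True))  # True: class default wins
print(legacy_use_action_mask(old_config))            # False: field was never saved
```

This relies on the config being a plain instance with a __dict__; a slots-based or otherwise customized config class would need a different presence check.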

Interpret W&B signals as follows:

First terminal signal: eval/max_reward >= 6.8 and eval/positive_terminations > 0.
A tiny one-off positive termination can be luck. Call it real only if it persists across several evals and has plausible eval/avg_positive_episode_length.
Healthy learning: eval/rewards > 0.01, then trending toward 0.05+; eval/DO climbing toward 0.25-0.35; eval/DO_NOTHING % falling below about 0.05.
Strong solo target: eval/positive_terminations >= 3, eval/rewards ~= 0.15-0.18, eval/DO ~= 0.31-0.34, and eval/DO_NOTHING % <= 0.02.
No-success signature: eval/positive_terminations == 0, eval/max_reward < 1, eval/DO ~= 0.05-0.10, and eval/DO_NOTHING % ~= 0.20+.

eval/positive_terminations is normalized by num_envs_per_device * num_devices, so it is comparable across GPU counts; it can exceed 1 because multiple successful episodes may happen per eval environment over the eval horizon. progress/episode_completion_rate is not the same as task success because it includes ordinary done/timeouts; use eval/positive_terminations plus eval/max_reward for clean success calls.
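The signal thresholds above can be encoded as a quick triage helper for one eval point. This is a sketch: the function name and labels are made up here, the approximate ranges are reduced to one-sided bounds, and it looks at a single eval, so the persistence requirement for calling a success real still has to be checked across several evals.

```python
def classify_eval(positive_terminations, rewards, max_reward,
                  do_frac, do_nothing_frac):
    """Map one eval point onto the W&B signal categories of this skill."""
    if (positive_terminations >= 3 and rewards >= 0.15
            and do_frac >= 0.31 and do_nothing_frac <= 0.02):
        return "strong-solo"
    if max_reward >= 6.8 and positive_terminations > 0:
        return "terminal-signal"
    if (positive_terminations == 0 and max_reward < 1
            and do_frac <= 0.10 and do_nothing_frac >= 0.20):
        return "no-success"
    return "indeterminate"
```

For instance, the o9aewzsx reference values (9.56, 0.174, 6.87, 0.343, 0.0) classify as strong-solo.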

Euler Runtime

Use the current isolated workspace unless the user explicitly requests another one:

WORK=/cluster/home/lterenzi/codex_terra_edge_validation
VENV=/cluster/scratch/lterenzi/codex_terra_edge_venv
module load stack/2024-06 cuda/12.1.1

SITE_PACKAGES=$("$VENV/bin/python" - <<'PY'
import site
print(site.getsitepackages()[0])
PY
)

export PYTHONPATH="$WORK/terra:$WORK/terra-baselines:${PYTHONPATH:-}"
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export LD_LIBRARY_PATH="$SITE_PACKAGES/nvidia/cudnn/lib:$SITE_PACKAGES/nvidia/cuda_cupti/lib:$SITE_PACKAGES/nvidia/cublas/lib:$SITE_PACKAGES/nvidia/cuda_nvrtc/lib:$SITE_PACKAGES/nvidia/nccl/lib:${LD_LIBRARY_PATH:-}"

Known package pins for this stack:

python -m pip install --force-reinstall --no-cache-dir \
  numpy==1.26.4 scipy==1.12.0 ml-dtypes==0.5.4 opt-einsum==3.3.0 \
  chex==0.1.86 orbax-checkpoint==0.5.16 nest_asyncio
python -m pip install --no-cache-dir \
  nvidia-cudnn-cu12==8.9.7.29 nvidia-cuda-cupti-cu12==12.1.105 \
  nvidia-nccl-cu12==2.19.3

Run the preflight in the allocation:

"$VENV/bin/python" "$WORK/terra-baselines/scripts/euler/check_jax_runtime.py" --min-devices 4

For one-GPU comparison jobs, use --min-devices 1.

For a run that previously hit cuDNN autotune failures, export the mitigation before training:

export XLA_FLAGS="${XLA_FLAGS:+$XLA_FLAGS }--xla_gpu_autotune_level=0"

This can cost a few percent of steady throughput, so compare learning by W&B step/update rather than wall-clock unless both sides use the same XLA flags.

Local Environments

Use this local CPU-only environment for quick gates:

source /home/lorenzo/moleworks/.venv-terra-uv/bin/activate
export JAX_PLATFORMS=cpu
export PYTHONPATH=/home/lorenzo/moleworks/terra:/home/lorenzo/moleworks/terra-baselines

For local RTX 4090 / 24 GB GPU tests, use the separate CUDA-enabled environment:

VENV=/home/lorenzo/moleworks/.venv-terra-gpu-uv
SITE_PACKAGES=$("$VENV/bin/python" - <<'PY'
import site
print(site.getsitepackages()[0])
PY
)
export PYTHONPATH=/home/lorenzo/moleworks/terra:/home/lorenzo/moleworks/terra-baselines
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.95
export LD_LIBRARY_PATH="$SITE_PACKAGES/nvidia/cudnn/lib:$SITE_PACKAGES/nvidia/cuda_cupti/lib:$SITE_PACKAGES/nvidia/cublas/lib:$SITE_PACKAGES/nvidia/cuda_nvrtc/lib:$SITE_PACKAGES/nvidia/nccl/lib:${LD_LIBRARY_PATH:-}"

As of 2026-05-11, the local GPU venv's activate script may still point VIRTUAL_ENV and console-script shebangs at .venv-terra-uv because the env was copied from the CPU venv. For GPU checks, invoke "$VENV/bin/python" directly and verify jax.default_backend() is gpu; do not trust source "$VENV/bin/activate" alone.

If the GPU env is missing, create it from the known-good CPU env and install the matching CUDA JAX wheel before running capacity or training checks:

cp -a /home/lorenzo/moleworks/.venv-terra-uv /home/lorenzo/moleworks/.venv-terra-gpu-uv
/home/lorenzo/moleworks/.venv-terra-gpu-uv/bin/python -m pip install -U \
  "jax[cuda12]==0.4.26" \
  -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Verify the selected env explicitly:

"$VENV/bin/python" - <<'PY'
import jax
print(jax.devices())
assert jax.default_backend() in ("gpu", "cuda"), jax.default_backend()
PY

Local throughput settings after the 2026-05-11 fast-reset and pmap-donation optimizations:

Use --num_envs_per_device 1536 for peak local RTX 4090 / 24 GB throughput tests. It measured about 60.8k steady steps/s after pmap runner-state donation.
Use --num_envs_per_device 1280 when you want nearly peak local throughput with less memory pressure. It measured about 60.0k steady steps/s after donation.
1024 remains the safer default for Euler 24 GB RTX 3090/4090 runs and for comparability with existing W&B runs.
2048 is not the local throughput peak; it measured about 58.4k steady steps/s after donation. Do not increase env count unless the measured steps/s increases.
4096 and above are slower for training throughput on the local 4090; 5120 OOMed after fast-reset.

Performance Triage After Fast-Reset

After the 2026-05-11 fast-reset optimization, do not assume the environment wrapper is still the main limiter. The current local 1280-env evidence is:

wrap_state_only, state_to_obs_only, and action_mask_edge_only are about 255k-279k steps/s.
Random env-step hot repeats are about 64k-79k steps/s.
Policy-forward-plus-env-step is about 76k steps/s.
Full PPO default training was about 51k steps/s before pmap donation and about 60k steps/s after donation at the local peak.

First inspect PPO update/backprop/minibatching before changing more env code. Check the train_mixed.py path that swaps [steps, envs] to [envs, steps], shuffles envs with jnp.take, reshapes into minibatches, and runs ppo_update_networks.

Keep the multi-GPU accounting fix:

config.num_envs already equals num_envs_per_device * num_devices.
env_steps_per_update should be num_steps * config.num_envs.
performance/steps_per_second should multiply iterations/s by env_steps_per_update, not by env_steps_per_update * num_devices.
num_updates should be total_timesteps // env_steps_per_update when total_timesteps is global env steps.
Old 4-GPU W&B performance/steps_per_second values from this code path should be divided by 4; local 1-GPU values are unaffected.
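A small sketch of the throughput side of this accounting fix. The function names are illustrative, and num_envs_total stands for config.num_envs, which already includes every device.

```python
def steps_per_second(iterations_per_second, num_steps, num_envs_total):
    """Global env steps/s; num_envs_total already covers all devices."""
    env_steps_per_update = num_steps * num_envs_total
    return iterations_per_second * env_steps_per_update


def correct_legacy_sps(logged_sps, num_devices):
    """Undo the extra num_devices factor in old performance/steps_per_second."""
    return logged_sps / num_devices
```

At 0.5 updates/s with 32 steps and 4096 total envs this gives 65536 steps/s. An old 4-GPU value logged as 200000 corresponds to 50000, and a 1-GPU value is unchanged.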

Keep pmap runner-state donation unless it fails a real training smoke:

_update_step should use jax.pmap(..., donate_argnums=(0,)) for runner_state.
This is valid because the loop overwrites runner_state after each update and does not reuse the donated input.
Local validation on 2026-05-11 passed a 16-env GPU smoke, 1024/1280/1536/2048 env sweeps, and the CPU edge-mask gates.

Do not retry the manual categorical replacement unless new profiling specifically implicates TFP. On 2026-05-11, a manual JAX categorical matched TFP log-prob/entropy but measured only about 61.1k steps/s at 1536 envs/GPU versus about 60.8k with TFP after donation. That is within noise and would change seeded sample streams, so it was reverted.

Do not retry flat env/time PPO minibatches as a generic cleanup. The 2026-05-11 probe passed a synthetic old-vs-flat PPO math parity check and a tiny GPU smoke, but measured only about 60.25k steps/s at 1536 envs/GPU, below the retained donation baseline, so it was reverted.

Do not retry a simple jnp.any(timestep.done) guard around curriculum_manager.update_cfgs as a generic env optimization. It passed selected no-done parity but measured only about 60.06k steps/s at 1536 envs/GPU, below the retained donation baseline, so it was reverted.

Oracle's 2026-05-11 PPO review concluded that no obvious large correctness-preserving PPO-side win remains after fast-reset and pmap donation. The next safe probes, if more optimization is needed, are packed model inputs/preclipped action maps, lean transition storage, host-transfer gating, and solo-agent model specialization. Treat num_minibatches=8, num_steps=64, and update_epochs=1 as semantic speed/learning experiments, not default config changes.

Oracle's 2026-05-11 architecture review agreed that easy correctness-preserving gains are mostly exhausted. The next ranked probes are: strengthen fast-reset partial/all-done parity, remove duplicated _is_done work between reward and step, add a value-only bootstrap path, sweep lax.scan unroll factors, profile forced action classes, and treat pmap -> shard_map/pjit as a larger architecture experiment.

Host-transfer gating is retained in train.py and train_mixed.py: unreplicate loss_info/runner_state only when train logging, checkpointing, eval, or final return needs host values. This mainly cleans up logging-disabled local sweeps; it is not a compiled-update steps/s win.

Do not retry packed model-input transition storage as a generic PPO update optimization. It passed exact raw-vs-packed PPO update parity and GPU smokes, but measured about 60.59k/59.71k/58.43k steps/s at 1536/1792/2048 envs/GPU, so it did not beat the retained 1536-env donation baseline and was reverted.

Do not retry lean transition storage as a generic cleanup. Setting unused transition leaves to None passed gates and a tiny GPU smoke, but measured about 60.41k steps/s at 1536 envs/GPU, below the retained donation baseline, so it was reverted.

Do not retry static solo-agent model specialization without a new profile. It produced tiny generic-vs-specialized output differences around 1e-10, passed a smoke, but measured about 60.60k steps/s at 1536 envs/GPU, below the retained donation baseline, so it was reverted.

Do not retry scalar RNG split relocation without a new profile. Letting TerraEnvBatch.step accept a scalar key and splitting only in reset/curriculum branches passed scalar-vs-batched no-done parity and a GPU smoke, but measured about 60.18k steps/s at 1536 envs/GPU, below the retained donation baseline, so it was reverted. Trainer/profile call sites should split per-env reset keys before env.step.

Do not retry value-only bootstrap as a generic cleanup. Replacing the post-rollout select_action_ppo bootstrap with a value-only model apply passed CPU gates and a tiny GPU smoke, but measured about 60.62k steps/s at 1536 envs/GPU, below the retained donation baseline, so it was reverted.

Do not retry reward/done reuse as a generic cleanup. Threading _get_reward's done/task_done result into step_no_reset passed CPU gates and a tiny GPU smoke, but measured about 60.54k steps/s at 1536 envs/GPU, below the retained donation baseline, so it was reverted.

Do not retry rollout jax.lax.scan(unroll=2) without a new reason. A 16-env local GPU smoke jumped to about 21 GB peak memory and slowed to about 12.4 steps/s, so the probe was reverted without a 1536-env run.

Use scripts/profile_rollout_components.py --sections forced_action_env_step_only when deciding whether to optimize action-specific env code. On 2026-05-11, forced-action profiling at 1280 envs/GPU showed hot repeats around cabin ~130k steps/s, do-nothing ~110k, and DO ~92k. DO is the slow branch, but still faster than full PPO, so only keep simple DO-specific changes. A cone-reuse probe in _handle_dig/_handle_dump did not improve forced DO and was reverted.

Use scripts/profile_rollout_components.py --sections build_update_inputs_only,ppo_update_only,full_update_step to split hot full PPO update time. On 2026-05-11 at 1280 envs/GPU, build-update-inputs/rollout was about 0.56s per update (~73k steps/s), PPO update-only was about 0.136s (~301k steps/s), and full update was about 0.685s (~60k steps/s). This means the remaining correctness-preserving bottleneck is mostly rollout/model/env, not the PPO minibatch/backprop loop.
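The conversion from per-update seconds to steps/s in the profile above is plain arithmetic. This sketch assumes num_steps=32, as in the default config; the variable names are illustrative.

```python
NUM_ENVS = 1280
NUM_STEPS = 32  # assumed from the default solo_excavator shape
STEPS_PER_UPDATE = NUM_ENVS * NUM_STEPS  # 40960 env steps per update


def throughput(seconds_per_update):
    """Env steps per second for one profiled section."""
    return STEPS_PER_UPDATE / seconds_per_update


rollout_sps = throughput(0.56)      # about 73k
ppo_update_sps = throughput(0.136)  # about 301k
full_update_sps = throughput(0.685) # about 60k
rollout_share = 0.56 / 0.685        # about 0.82 of the full update time

print(round(rollout_sps), round(ppo_update_sps), round(full_update_sps))
```

Rollout taking roughly 82% of the full update time is what supports looking at rollout/model/env before the minibatch loop.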

Do not retry a foundation-border workspace-touch guard without a new profile. Skipping foundation-border alignment when the current workspace did not touch a border tile slightly improved forced DO microprofiles, but full PPO at 1536 envs/GPU measured about 60.45k steps/s, below the retained pmap-donation baseline of about 60.8k, so the probe was reverted.

Treat these as controlled learning-speed tradeoffs, not defaults:

update_epochs=1 measured about 66.5k steps/s at 1280 envs/GPU.
num_minibatches=8 measured about 61.1k steps/s.
num_steps=64 measured about 61.0k steps/s.

Only keep such changes after comparing W&B learning curves by update count and env steps. Watch eval/positive_terminations, eval/rewards, progress/episode_completion_rate, explained_variance, value_loss, entropy, and action percentages.

Failure Lessons

65844986 reached Training: 0/95367 and then failed because NCCL was missing: Unable to load NCCL library. Multi-GPU collectives will not work. W&B had no summary/history.
65846373 reached Training: 0/1525878 and then failed with CUDNN_STATUS_INTERNAL_ERROR in convolution backward input. W&B had no summary/history.
66138248 was the no-mask action-mask A/B run (f43doigo). Slurm still showed RUNNING, but the log had CUDNN_STATUS_INTERNAL_ERROR, the Python process was sleeping, and GPU utilization was 0%. Treat this as a failed runtime allocation, not a no-mask learning result. Cancel stuck jobs in this state.
66145980 proved the no-mask 1024-env shape after XLA_FLAGS=--xla_gpu_autotune_level=0: one PPO update completed with --disable_action_mask and W&B disabled. Use this smoke pattern before relaunching any full job after a cuDNN autotune failure.
66148171 replaced the failed no-mask run (ho3duz3w) with autotune disabled, completed update 1, wrote a checkpoint, and continued at about 13.6k-13.9k steps/s on an RTX 2080 Ti. The masked comparison 66138246 (9wlth93n) ran on another RTX 2080 Ti at about 14.2k steps/s. Treat these as historical runtime evidence only; new Terra RL launches must use 3090/4090 allocations.
66129078 was cancelled after 31 minutes stuck at first-update compile with 128 envs/device; this is not a passed smoke. It showed that dataset/model startup is still far from proving a train update.
Local compile isolation on 2026-05-11 showed one synthetic 32x32 dynamic-action TerraEnv.step compiled in about 115 s on CPU on origin/multi-agent, while closing over constant DO_NOTHING compiled the no-reset body in about 4.4 s. The real step compiles every action branch.
Refactoring State._step from duplicated tracked/wheeled 16-way dispatch to one 8-way primitive dispatch removed duplicate DO/dig/dump compilation and reduced local one-env step compile to about 73.7 s. Keep the state-step-dispatch gate when touching action routing.
The trainer used to run jax.clear_caches() at update 0, which forced update 1 to retrace immediately. Keep cache clearing on completed intervals only: (i + 1) % cache_clear_interval == 0. On the local 4090, this reduced 4-update wall time by about 1.6x at 1024/2048/4096 envs/GPU without changing steady post-compile throughput.
TerraEnvBatch.step used to sample reset maps and build reset candidates every step even when no env was done. The fast-reset path now runs step_no_reset first and only enters reset-map sampling under jax.lax.cond(jnp.any(done), ...). On the local 4090 at 2048 envs/GPU, direct batch env-step throughput improved from about 36.7k to 60.6k steps/s when any_done_rate=0.0; full PPO 2048-env throughput improved from about 39.3k to 47.6k steps/s. Keep step_unconditional_reset_candidates as a reference path until mixed/forced-reset parity is cheap.
W&B state crashed plus empty history means no policy result. Report it as environment/runtime failure, not RL behavior.
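The cache-clearing lesson above comes down to which update indices trigger the clear. In this sketch the retained condition is the one quoted in the list; the older condition is an assumption (i % interval == 0) that reproduces the described clear at update 0.

```python
def clears_cache(i, cache_clear_interval):
    """Retained rule: clear only after a completed interval."""
    return (i + 1) % cache_clear_interval == 0


def old_clears_cache(i, cache_clear_interval):
    """Assumed older rule: also fires at update 0 and forces a retrace."""
    return i % cache_clear_interval == 0


interval = 4
new_clears = [i for i in range(8) if clears_cache(i, interval)]
old_clears = [i for i in range(8) if old_clears_cache(i, interval)]
print(new_clears, old_clears)  # [3, 7] [0, 4]
```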

If cuDNN autotune fails, try a tiny runtime check first, then a first-update smoke with:

export XLA_FLAGS=--xla_gpu_autotune_level=0

Mask/no-mask can produce slightly different JAX/XLA graphs, which can make cuDNN autotune choose different convolution algorithms. If one side fails and the other runs, do not infer that action masking fixed cuDNN or that the failed side learned worse; prove the failed side with a smoke and relaunch.

On Euler, generic GPU jobs may allocate disallowed nodes. Observed failures include one-GPU jobs on RTX 2080 Ti and a clean four-GPU baseline on Quadro RTX 6000. Always report GPU type from sacct AllocTRES and nvidia-smi, not from the job name or assumptions. New Terra RL validation runs must hard-fail unless all allocated GPUs are NVIDIA GeForce RTX 3090 or RTX 4090.

Do not use XLA_FLAGS=--xla_force_host_platform_device_count=4 for real GPU training; it is only for CPU-only checks.

Why Compile Is Slow

When asked why local or cluster tests are slow, use this explanation:

The fast unit gates are quick because they avoid the full GPU training graph.
The first train_mixed.py update compiles a large shape-specialized JAX/XLA program: vectorized Terra env reset/rollout, 32 steps, PPO update epochs/minibatches, conv model forward/backward, optimizer, and pmap collectives.
Large static shapes such as num_envs_per_device=1024, map tensors around 64x64, and rollout batches dominate compile before GPU utilization appears.
The logs already showed XLA constant-folding work inside Terra state code and cuDNN convolution autotune before any update completed.
Each fresh Slurm process can pay this compile cost again unless a persistent XLA cache is set up.

See references/jax-compile.md for mitigation ideas before changing training code.

References

references/euler-runtime.md: Slurm snippets, runtime checks, and failure signatures.
references/jax-compile.md: why first-update compile is slow and how to reduce test latency.
