| name | auto-research |
| description | Autonomous NeMo-RL research agent workflow for directed hypothesis testing and open-ended discovery. Guides agents through the full experiment lifecycle: understanding recipes and environments, wiring RL or NeMo-gym runs, launching reproducible baselines and iterations, analyzing results, preserving human oversight, and using git plus TSV logs as the research ledger. |
| when_to_use | auto research; run experiments; test these hypotheses; find a better recipe; improve accuracy; long-running NeMo-RL or NeMo-gym research campaigns; autonomous discovery; directed execution. |
Auto Research
Run iterative NeMo-RL experiments in this repository against the user's stated objective, such as accuracy, reward, throughput, latency, stability, or another recipe-specific metric, with git as the research ledger.
Treat dependencies as ready, but choose the runtime deliberately. Use the recipe's authoritative metric as the source of truth. Keep changes small, reproducible, and simple. Preserve unrelated user work.
Use the session-memory skill for every auto-research campaign. Start or resume a session record before branching, then checkpoint after forming the plan, before and after meaningful edits or long-running launches, when the user changes direction, and before handoff or final summary.
After context compaction, handoff, disconnect, or a long gap, reload this skill and any companion skills already in use, read the latest session-memory handoff, and restate the overall objective, stop rules, current branch, and latest result before continuing. Treat follow-up steering as additive unless the user explicitly changes the main objective.
Workflow
- Inspect the current git state and identify unrelated user changes before branching.
- Use a shared branch prefix. Prefer a user-provided one; otherwise create a descriptive default such as autoresearch/2026-03-24-dapo-qwen2p5.
- Read the target recipe, its parents, and the relevant code paths in examples/run_grpo.py, nemo_rl/models/, nemo_rl/algorithms/, nemo_rl/environments/, and docs/. For NeMo-gym recipes, also inspect examples/nemo_gym/ entrypoints, configs, and launch scripts.
- Translate any user stop rule into explicit values you can monitor: target_experiment_count (the requested number of experiments), campaign_deadline, per_experiment_timeout, or target_metric.
- Verify required data, checkpoints, runtime inputs, and the launcher.
- Create an untracked TSV log and per-experiment log directory.
- Run a baseline first on <prefix>/baseline if none exists.
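The log setup step above could be sketched as follows. This is a minimal illustration, not the repository's tooling: the helper name `init_campaign_log`, the `results.tsv` file name, and the column names are assumptions derived from the fields listed under Loop, and references/experiment-log-template.md remains the authority for the real schema.

```shell
# Hypothetical helper: create the campaign directory and a TSV with one header row.
# Prints the TSV path. File name and column names are illustrative assumptions.
init_campaign_log() {
  local campaign_dir="$1"
  local tsv="$campaign_dir/results.tsv"
  mkdir -p "$campaign_dir" || return 1
  # Write the header only once so a resumed campaign keeps its existing rows.
  if [ ! -s "$tsv" ]; then
    printf 'index\tbranch\tparent_commit\tcommit\trecipe\tmetric_name\tmetric_value\tmemory_gb\telapsed_min\tlauncher\tjob_id\tcommand\tlog_path\tstatus\tdescription\n' > "$tsv" || return 1
  fi
  printf '%s\n' "$tsv"
}

# Example call (placeholder path kept from the text):
#   init_campaign_log "reports/auto_research/<campaign>"
```

One way to keep the log untracked is to list reports/auto_research/ in .git/info/exclude, which ignores the path locally without touching the shared .gitignore.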
For GPU, CPU-heavy, distributed, or long-running work, choose the execution environment deliberately. Run locally when the current machine has suitable GPUs and capacity; otherwise follow the user's requested environment, use launch-nemo-rl for nrl-k8s/Kubernetes, use the environment's native launcher for Slurm, or clarify with the user before launching. Use CPU-only local runs only for light inspection, dry runs, and short non-GPU checks.
If the user mentions Brev, or if /home/ubuntu/RL exists and /ephemeral is available as a volume, treat the machine as a Brev instance and use brev-etiquette before creating experiment directories, caches, logs, checkpoints, or authenticated runtime state.
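A quick probe along these lines may help when choosing the runtime. It is only a sketch: the function names are hypothetical, directory existence stands in for "/ephemeral is available as a volume", and `nvidia-smi -L` is assumed to be the GPU probe. It covers only the path heuristic; a user mentioning Brev still has to be handled by the agent.

```shell
# Overridable so the checks can be dry-run on another machine.
BREV_REPO_DIR="${BREV_REPO_DIR:-/home/ubuntu/RL}"
BREV_VOLUME_DIR="${BREV_VOLUME_DIR:-/ephemeral}"
GPU_PROBE="${GPU_PROBE:-nvidia-smi}"

# Path heuristic from the text. Assumption: an existing directory
# is treated as "available as a volume".
is_brev_instance() {
  [ -d "$BREV_REPO_DIR" ] && [ -d "$BREV_VOLUME_DIR" ]
}

# True when the probe command exists and lists at least one GPU line.
has_local_gpu() {
  local out
  command -v "$GPU_PROBE" > /dev/null 2>&1 || return 1
  out="$("$GPU_PROBE" -L 2> /dev/null)" || return 1
  case "$out" in
    "GPU "*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Print both facts on one line for the session record.
report_runtime() {
  local brev=no gpu=no
  if is_brev_instance; then brev=yes; fi
  if has_local_gpu; then gpu=yes; fi
  printf 'brev=%s local_gpu=%s\n' "$brev" "$gpu"
}

report_runtime
```

A `local_gpu=no` result only rules out local GPU runs; which remote launcher to use still follows the user's request or a clarifying question.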
Branching
- Put every experiment on its own branch under the shared prefix.
- Keep every branch, even for failed or weak ideas.
- Put at least one commit on each branch for the hypothesis.
- Add follow-up fix commits on the same branch when a rerun is justified.
- Never stash, reset, or overwrite unrelated user changes silently. If dirty files overlap the experiment, use a separate worktree or ask before proceeding.
See references/git-workflow.md for the exact pattern.
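As a rough illustration of the branching rules, the sketch below creates one experiment branch under a shared prefix and reports the parent commit for the TSV. The function names are hypothetical, and the cleanliness check is deliberately stricter than the text (it refuses on any dirty tracked file, not only overlapping ones); references/git-workflow.md still defines the exact pattern.

```shell
# Refuse to branch while tracked files have uncommitted changes,
# instead of stashing or resetting them silently.
require_clean_tracked_files() {
  if [ -n "$(git status --porcelain --untracked-files=no)" ]; then
    echo "tracked files are dirty; use a separate worktree or ask the user" >&2
    return 1
  fi
}

# Create <prefix>/<slug> from a parent (default HEAD).
# Prints "<branch><TAB><parent_commit>" for the experiment log.
start_experiment_branch() {
  local prefix="$1" slug="$2" parent="${3:-HEAD}"
  local branch="$prefix/$slug" parent_commit
  git check-ref-format --branch "$branch" > /dev/null || return 1
  require_clean_tracked_files || return 1
  parent_commit="$(git rev-parse --verify "${parent}^{commit}")" || return 1
  git checkout -q -b "$branch" "$parent_commit" || return 1
  printf '%s\t%s\n' "$branch" "$parent_commit"
}

# Example call:
#   start_experiment_branch autoresearch/2026-03-24-dapo-qwen2p5 prompt-compact-schema
```

The hypothesis commit itself is left to the caller, so the branch exists before any file is edited and the recorded parent commit is unambiguous.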
Loop
- Pick one concrete hypothesis.
- Create a branch such as autoresearch/2026-03-24-dapo-qwen2p5/prompt-compact-schema.
- Edit the smallest set of files needed.
- Commit the hypothesis.
- Before launching the run, check the monitored stop conditions. Do not stop early unless one is already clearly met.
- Identify the authoritative metric source from the recipe or logging code, then run with a unique log path:
    LOG_DIR=reports/auto_research/<campaign>/<experiment>
    mkdir -p "$LOG_DIR"
    uv run <entrypoint> > "$LOG_DIR/run.log" 2>&1
- If the user gave a per-experiment wall-clock limit, enforce it explicitly. Prefer a recipe-level timeout when one already exists; otherwise wrap the command with an external timeout. If both exist, honor the tighter limit.
- Extract the primary metric with a command appropriate for the actual log format. If extraction is empty, inspect the last log lines and the recipe's logging path before marking the run.
- Record index, branch, parent commit, commit, recipe, metric name, metric value, memory (GB), elapsed time (minutes), launcher, job id, command, log path, status, and description in the TSV, along with enough timing or count information to evaluate the stop rule.
- Periodically print user-facing progress updates during the campaign. Include the current branch, latest known result, attempted experiment count, remaining experiment count if applicable, remaining campaign time if applicable, and whether any stop condition has been met yet.
- Re-check the monitored stop conditions after the experiment completes and state the result explicitly, for example "stop condition not yet met: 17/24 attempted, 6h12m remaining" or "stop condition met: 24/24 attempted".
- Mark the result as keep, discard, or crash, then move to the next branch unless a user-specified stop condition has been clearly met.
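The run and record steps of the loop can be sketched as below. This assumes GNU coreutils `timeout` is on PATH as the external wrapper; `run_with_cap` and `append_row` are hypothetical helper names, and the row layout is whatever the campaign's TSV schema defines.

```shell
# Run a command under a wall-clock cap and print one outcome word.
# Usage: run_with_cap <log_dir> <cap_seconds> <command> [args...]
run_with_cap() {
  local log_dir="$1" cap="$2" rc=0
  shift 2
  mkdir -p "$log_dir" || return 1
  # -k 30: if the command ignores TERM, send KILL 30 seconds later.
  timeout -k 30 "$cap" "$@" > "$log_dir/run.log" 2>&1 || rc=$?
  case "$rc" in
    0)   echo completed ;;  # still needs a metric check before keep or discard
    124) echo timeout ;;    # GNU timeout reports 124 when the cap is hit
    *)   echo crash ;;      # includes 137, which usually means a SIGKILL
  esac
}

# Append one TSV row; tabs and newlines inside a field become spaces
# so a long command or description cannot break the column layout.
append_row() {
  local tsv="$1" line="" sep="" field tab
  tab="$(printf '\t')"
  shift
  for field in "$@"; do
    field="$(printf '%s' "$field" | tr '\t\n' '  ')"
    line="${line}${sep}${field}"
    sep="$tab"
  done
  printf '%s\n' "$line" >> "$tsv"
}

# Example calls (placeholders kept from the text):
#   outcome="$(run_with_cap "$LOG_DIR" 3600 uv run <entrypoint>)"
#   append_row "$TSV" "$index" "$branch" "$outcome" "$LOG_DIR/run.log"
```

A `completed` outcome only means the process exited cleanly; the keep or discard decision still depends on the authoritative metric, while `timeout` and `crash` both map to a failed attempt that counts toward the stop rule.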
For count-based stop rules, count attempted ideas, not only successful or fully completed runs.
For campaign time budgets, convert the user limit into an absolute deadline at the start of the campaign and keep checking remaining time.
For per-experiment budgets, enforce a timeout on every run and treat overruns as failures.
Examples:
- "do 50 experiments": stop only after 50 attempted experiment rows exist in the TSV.
- "10h total, 1h each": enforce a 1 hour limit per run and stop when the 10 hour campaign budget is reached, or when there is not enough remaining budget to start another 1 hour run.
- "50 experiments or 10h total, 1h each": monitor all three values, never exceed the per-run cap, and stop only when one campaign-level stop trigger is clearly reached.
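The stop rules above reduce to a small check that can run before and after every experiment. The sketch below is one possible shape, with a hypothetical function name and argument order; times are epoch seconds, and a value of 0 means that rule was not given.

```shell
# Usage: check_stop <attempted> <target_count> <now> <deadline> <per_run_cap>
# Prints a user-facing status line. Returns 0 when the campaign should stop
# and 1 when another run may start.
check_stop() {
  local attempted="$1" target="$2" now="$3" deadline="$4" per_run="$5"
  local progress="${attempted} attempted" remaining=0 time_left=""
  if [ "$target" -gt 0 ]; then
    progress="${attempted}/${target} attempted"
    if [ "$attempted" -ge "$target" ]; then
      echo "stop condition met: ${progress}"
      return 0
    fi
  fi
  if [ "$deadline" -gt 0 ]; then
    remaining=$(( deadline - now ))
    if [ "$remaining" -le 0 ]; then
      echo "stop condition met: campaign deadline reached, ${progress}"
      return 0
    fi
    time_left="$(printf '%dh%02dm' $(( remaining / 3600 )) $(( (remaining % 3600) / 60 )))"
    # Not enough budget left to start another capped run.
    if [ "$per_run" -gt 0 ] && [ "$remaining" -lt "$per_run" ]; then
      echo "stop condition met: ${time_left} remaining cannot fit another run, ${progress}"
      return 0
    fi
    echo "stop condition not yet met: ${progress}, ${time_left} remaining"
    return 1
  fi
  echo "stop condition not yet met: ${progress}"
  return 1
}

# Example: fix the absolute deadline once, at campaign start.
#   deadline=$(( $(date +%s) + 10 * 3600 ))
#   if check_stop "$attempted" 50 "$(date +%s)" "$deadline" 3600; then exit 0; fi
```

Passing the attempted row count from the TSV, rather than the number of successful runs, keeps the check aligned with the count-based rule.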
Priorities
Prefer ideas with high expected objective gain and low complexity cost:
- correctness and backend compatibility
- prompt and rollout formatting
- batch, sequence, and precision layout
- optimizer and scheduler tuning
- reward shaping, clipping, or scaling
- dataset mix or validation changes
- synchronous versus asynchronous execution based on hardware
All else equal, prefer simpler wins and avoid brittle hardware-specific hacks.
Avoid
- Do not conclude a training idea failed from an underpowered smoke run. If a run uses tiny batch sizes, very few optimizer steps, or otherwise non-representative settings, treat it as plumbing validation only; scale to a meaningful batch size and train long enough to test the hypothesis before marking it discard.
- Do not repeatedly pay batch-scheduler setup costs for tight edit-run-debug loops. If Slurm batch jobs have a large startup tax and failures require quick iteration, use the documented interactive Slurm pattern or ask the user before resubmitting more batch jobs.
- Do not let context compaction or follow-up steering questions erase the original campaign goal. Refresh session-memory, reload active skills, and preserve the main objective unless the user explicitly changes it.
Stop
If the user gives explicit stopping conditions, they override the default rule at the end of this section. Do not stop because the search feels sufficient; stop only when the requested count, deadline, budget, or target condition has been clearly met.
During the campaign, explicitly inform the user whether the stop condition has been met. If not, report the remaining count, remaining time, or other remaining threshold in concrete terms.
If the user does not give explicit stopping conditions, run the baseline plus up to three low-risk experiments, then summarize the best result and ask before continuing.
References
- references/git-workflow.md for branch, dirty-worktree, parent-commit, and baseline rules.
- references/exploration-ideas.md for turning symptoms into concrete hypotheses.
- references/experiment-log-template.md for the TSV schema and reproducibility fields.