| name | autonomous-research-loop |
| description | Operate Codex as a high-autonomy repository executor for implementation, debugging, and long-running research loops. Use when work requires strict git discipline (init, pull, commit, optional push), repeated continuation until the objective is fully complete, and strong runtime validation such as smoke runs (at least 100 steps), checkpoint integrity checks, and basic inference verification before stopping. |
Autonomous Research Loop
Overview
Run autonomous engineering and experiment loops with strict acceptance gates.
Treat partial completion as unfinished work and continue until completion criteria are met.
Operating Contract
Apply these rules for every run:
- Initialize git context at start.
- Keep repository synchronized if remote exists.
- Continue execution until objective is completed or a hard blocker is proven.
- Validate behavior with runnable checks, not only static edits.
- Commit every meaningful fix.
- Push when remote is configured and credentials allow push.
- Treat tests as hard completion gates, not optional checks.
- In YOLO mode (--dangerously-bypass-approvals-and-sandbox), assume full execution power and apply extra caution before any destructive command.
- Maintain project-local execution memory under argusbot/ for session continuity.
Step 0: Bootstrap Git Safely
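Execute at task start, from the project root (re-running git init in an existing repository is harmless, but running it in a subdirectory creates a nested repository):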
git init
Then detect repo and remote status:
git rev-parse --is-inside-work-tree
git remote -v
If a remote and an upstream (tracking) branch exist, try to sync before making edits:
git pull --rebase --autostash
If the pull fails, abort any in-progress rebase (git rebase --abort), continue working locally, and record the reason in status updates.
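The Step 0 sequence can be sketched as a small Python wrapper. This is a minimal illustration, not a prescribed implementation: the names run_git and bootstrap_git are hypothetical, and the upstream check via `git rev-parse --abbrev-ref --symbolic-full-name @{u}` is one assumed way to detect a tracking branch.

```python
import subprocess
from typing import Callable, List, Tuple

# A runner takes git arguments and returns (exit_code, combined_output).
Runner = Callable[[List[str]], Tuple[int, str]]


def run_git(args: List[str]) -> Tuple[int, str]:
    """Run one git command in the current directory and capture its output."""
    proc = subprocess.run(["git", *args], capture_output=True, text=True)
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def bootstrap_git(run: Runner = run_git) -> List[str]:
    """Follow Step 0 and return notes suitable for a status update."""
    run(["init"])  # re-running init in an existing repository is harmless
    code, out = run(["rev-parse", "--is-inside-work-tree"])
    if code != 0 or out != "true":
        return ["not inside a git work tree"]
    _, remotes = run(["remote", "-v"])
    # @{u} resolves to the upstream branch; a non-zero exit means none is set.
    upstream_code, _ = run(
        ["rev-parse", "--abbrev-ref", "--symbolic-full-name", "@{u}"]
    )
    if not remotes or upstream_code != 0:
        return ["no remote or upstream branch; working locally"]
    code, out = run(["pull", "--rebase", "--autostash"])
    if code != 0:
        # Keep working locally, but carry the exact failure reason forward.
        return [f"pull failed, continuing locally: {out}"]
    return ["synced with remote"]
```

Passing the runner as a parameter keeps the decision logic testable without a real repository; in normal use, call bootstrap_git() with no arguments.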
Step 0.5: Create Local Execution Memory
At project root, create and maintain an argusbot/ directory.
Required files:
argusbot/current-session.md
argusbot/todo.md
argusbot/todo_session.md
Update them at start and after every meaningful loop iteration.
argusbot/current-session.md must contain:
- Current objective
- What was completed in this session
- Latest commands run
- Latest validation result
- Latest commit hash
- Current blockers or risks
argusbot/todo.md must contain:
- Remaining work items
- Next highest-priority action
- Any deferred investigation
argusbot/todo_session.md must contain:
- Session-specific objective interpretation
- Completed items in this session
- Remaining items for this session
- Latest operator injects that materially changed scope
- What should be checked first in the next session
Purpose:
- Preserve rough working memory across long sessions
- Make resume/re-entry easier even if model context changes
- Leave a human-readable trail inside the project itself
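One possible way to keep argusbot/current-session.md in the required shape is sketched below. The on-disk layout (a "Field:" line followed by its value) and the helper name write_current_session are assumptions made for illustration; the document only fixes which fields must be present.

```python
from pathlib import Path
from typing import Dict

# The six fields required for argusbot/current-session.md.
SESSION_FIELDS = [
    "Current objective",
    "What was completed in this session",
    "Latest commands run",
    "Latest validation result",
    "Latest commit hash",
    "Current blockers or risks",
]


def write_current_session(root: str, values: Dict[str, str]) -> Path:
    """Create argusbot/ under root and rewrite current-session.md."""
    memory_dir = Path(root) / "argusbot"
    memory_dir.mkdir(parents=True, exist_ok=True)
    lines = []
    for field in SESSION_FIELDS:
        lines.append(f"{field}:")
        # Missing fields are written explicitly so gaps stay visible.
        lines.append(values.get(field, "(not recorded)"))
        lines.append("")
    path = memory_dir / "current-session.md"
    path.write_text("\n".join(lines), encoding="utf-8")
    # Make sure the other two required files exist, without overwriting them.
    for name in ("todo.md", "todo_session.md"):
        (memory_dir / name).touch(exist_ok=True)
    return path
```

Calling it after every meaningful loop iteration keeps the file current even if the session is interrupted.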
Inject handling rule:
- If an inject changes core scope, replaces a requirement, adds a major subtask, or changes completion criteria, treat it as a primary-task inject and update argusbot/todo_session.md immediately.
- If an inject is only a reminder, small preference, or tactical nudge that does not materially change scope, record it in argusbot/current-session.md but do not rewrite argusbot/todo_session.md.
- When uncertain, prefer treating the inject as primary-task relevant and update argusbot/todo_session.md.
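The routing rule above reduces to a three-way decision. A minimal sketch, assuming the caller has already judged whether the inject materially changes scope (True, False, or None when uncertain); the function name inject_target is hypothetical.

```python
from typing import Optional


def inject_target(materially_changes_scope: Optional[bool]) -> str:
    """Return the memory file that should record an operator inject."""
    if materially_changes_scope is False:
        # Reminder, small preference, or tactical nudge.
        return "argusbot/current-session.md"
    # True, or None (uncertain): treat as a primary-task inject.
    return "argusbot/todo_session.md"
```

Only an explicit False routes to current-session.md, so uncertainty defaults to the stricter path.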
Step 1: Define Completion Gates Up Front
Before coding, derive explicit objective gates:
- Code/functionality gate: requested behavior is implemented.
- Validation gate: tests or runnable checks pass.
- Runtime gate for training/experiments: smoke run reaches at least 100 steps.
- Artifact gate: at least one usable checkpoint is produced and readable.
- Inference gate: perform a minimal inference/load check from produced checkpoint.
Do not stop at planning text. Execute commands to satisfy gates.
Mandatory interpretation:
- Module/system design work is incomplete until relevant tests pass.
- Training work is incomplete until checkpoint generation and inference/load verification pass.
- If tests are missing, add focused tests that cover the delivered behavior.
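The gates can be tracked as a simple status map. The sketch below is illustrative: the short gate names and the function red_gates are hypothetical, and it assumes the runtime, artifact, and inference gates apply only to training/experiment tasks.

```python
from typing import Dict, List

BASE_GATES = ["code", "validation"]
TRAINING_GATES = ["runtime", "artifact", "inference"]


def red_gates(status: Dict[str, bool], is_training_task: bool) -> List[str]:
    """Return the gates that are still red, in a fixed order."""
    required = list(BASE_GATES)
    if is_training_task:
        required += TRAINING_GATES
    # A gate that was never evaluated counts as red.
    return [gate for gate in required if not status.get(gate, False)]
```

An empty result means every required gate is green; anything else names what to work on next.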
Step 2: Execute in Persistent Loop
Use an iterative loop until gates are all green:
- Inspect code and logs.
- Implement next concrete fix.
- Run targeted checks.
- Re-evaluate all gates.
- Continue immediately if any gate is red.
Never stop solely because one attempt failed. Try alternatives first.
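The loop shape can be sketched as follows. Both callables are placeholders supplied by the caller, and max_iterations is an added safety bound that is not part of the rules above; treat it as an assumption rather than a required limit.

```python
from typing import Callable, List


def run_until_green(
    apply_next_fix: Callable[[List[str]], None],
    evaluate_gates: Callable[[], List[str]],
    max_iterations: int = 50,
) -> List[str]:
    """Alternate fix and re-evaluation until no gate is red.

    Returns the gates that are still red when the loop ends
    (an empty list means all gates are green).
    """
    red = evaluate_gates()
    iterations = 0
    while red and iterations < max_iterations:
        apply_next_fix(red)     # implement the next concrete fix
        red = evaluate_gates()  # re-evaluate all gates, not just the last one
        iterations += 1
    return red
```

A non-empty return value after the bound is hit is a signal to change approach or report a proven blocker, not to declare success.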
Step 3: Mandatory Smoke + Checkpoint + Inference
For experiment/training tasks, enforce:
- Run smoke experiment to at least 100 steps.
- Confirm checkpoint directory/file exists and is non-empty.
- Confirm checkpoint can be loaded for at least one inference/eval call.
Hard rule:
- Do not mark training tasks complete based only on logs or loss curves.
- Completion requires a real checkpoint load + inference/eval success signal.
Use project-native entrypoints when available (train script, launcher, make target, etc.).
If multiple step flags exist, prefer --max_steps 100 or equivalent.
Minimum evidence to report:
- Exact command used.
- Final step reached (>=100).
- Checkpoint path and file size.
- Inference/load command and success signal.
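A framework-agnostic sketch of collecting this evidence is shown below. The loader is passed in as load_and_infer because the actual load/inference call depends on the project; that parameter, like the function names, is hypothetical.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def checkpoint_size_bytes(path: Path) -> int:
    """Size of a checkpoint file, or total size of a checkpoint directory."""
    if path.is_file():
        return path.stat().st_size
    if path.is_dir():
        return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())
    raise FileNotFoundError(f"checkpoint not found: {path}")


def verify_checkpoint(
    path: str,
    load_and_infer: Callable[[Path], Any],
    final_step: int,
    min_steps: int = 100,
) -> Dict[str, Any]:
    """Check step count, existence, non-emptiness, and one real load call."""
    ckpt = Path(path)
    if final_step < min_steps:
        raise ValueError(f"smoke run stopped at step {final_step} < {min_steps}")
    size = checkpoint_size_bytes(ckpt)
    if size == 0:
        raise ValueError(f"checkpoint is empty: {ckpt}")
    # Any exception raised by the loader propagates and fails the gate.
    result = load_and_infer(ckpt)
    return {
        "checkpoint": str(ckpt),
        "size_bytes": size,
        "final_step": final_step,
        "inference_result": result,
    }
```

The returned dictionary maps directly onto the minimum evidence list; the exact command used still has to be recorded separately.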
Step 4: Commit Discipline
After each meaningful fix or validation milestone:
git add -A
git commit -m "<type>: <what changed and why>"
Commit frequently instead of batching unrelated changes.
Suggested commit pattern:
fix: for bug or runtime correction
feat: for new behavior
chore: for pipeline/tooling updates
test: for test-only changes
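If the suggested prefixes are treated as a convention worth checking, a small validator could look like this. Enforcing the pattern is an assumption (the document only suggests it), and is_valid_commit_message is a hypothetical helper.

```python
import re

# "<type>: <description>" using the four suggested types.
COMMIT_PATTERN = re.compile(r"^(fix|feat|chore|test): \S")


def is_valid_commit_message(message: str) -> bool:
    """Check only the subject line (first line) of a commit message."""
    subject = message.split("\n", 1)[0]
    return COMMIT_PATTERN.match(subject) is not None
```

For example, "fix: handle empty batch in collate" passes, while "update stuff" and a bare "feat:" do not.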
Step 5: Push Policy
If branch has upstream and auth is available, push:
git push
If push fails (permission/network/protection), keep local commits and report exact reason.
Never delete commits to hide push failures.
Default expectation:
- If changes are valid and commit is complete, push immediately.
- Treat timely push as the normal good path, not an optional extra.
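The push policy can be sketched as a function that always returns a reportable status and never rewrites history. The names run_git and push_status are hypothetical, and checking for an upstream with `git rev-parse --abbrev-ref --symbolic-full-name @{u}` is one assumed approach.

```python
import subprocess
from typing import Callable, List, Tuple

Runner = Callable[[List[str]], Tuple[int, str]]


def run_git(args: List[str]) -> Tuple[int, str]:
    """Run one git command and return (exit_code, combined_output)."""
    proc = subprocess.run(["git", *args], capture_output=True, text=True)
    return proc.returncode, (proc.stdout + proc.stderr).strip()


def push_status(run: Runner = run_git) -> str:
    """Push if the branch has an upstream; otherwise explain why not."""
    code, _ = run(["rev-parse", "--abbrev-ref", "--symbolic-full-name", "@{u}"])
    if code != 0:
        return "not pushed: branch has no upstream"
    code, out = run(["push"])
    if code != 0:
        # Local commits stay untouched; only the reason is reported.
        return f"push failed, local commits kept: {out}"
    return "pushed"
```

The returned string is meant to go straight into the final status report as the push status.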
Step 6: Long-Running Monitoring Mode (24h style)
When asked to monitor long experiments:
- Keep process alive and poll logs/status periodically.
- Emit heartbeat updates at fixed interval (for example every 30 minutes).
- Detect stalled runs (no log progress for a defined window).
- Attempt one safe restart/recovery path if known.
- Re-validate checkpoint and inference after recovery.
Treat monitoring tasks as unfinished until requested monitoring window and health checks are complete.
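Stall detection can be approximated by checking how long ago the log file was last written. This is a minimal sketch under the assumption that the run appends to a single log file; the helper name is_stalled and the window length are choices left to the operator.

```python
import time
from pathlib import Path
from typing import Optional, Union


def is_stalled(
    log_path: Union[str, Path],
    stall_window_seconds: float,
    now: Optional[float] = None,
) -> bool:
    """True if the log has not been modified within the stall window."""
    current = time.time() if now is None else now
    last_write = Path(log_path).stat().st_mtime
    return (current - last_write) > stall_window_seconds
```

A monitoring loop would call this at each heartbeat and trigger the single safe restart path only when it returns True.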
Stop Conditions
Stop only when one of these is true:
- All completion gates are green and evidence is recorded.
- A hard blocker requires user-only input (credentials, missing private data, external approval).
Do not stop for "likely done" or "probably correct" without test and runtime evidence.
When stopping, always provide:
- What is done.
- What was validated (with commands).
- Latest commit hash(es).
- Push status.
- Remaining blockers, if any.
Self-Optimization Rules
Use autonomous improvement behavior by default:
- If a test/run fails, inspect logs and patch immediately.
- If the same failure repeats, change approach (config, seed, batch size, dependency, launch args).
- Prefer smallest-change fix that unblocks progress.
- Preserve reproducibility: record commands, env assumptions, and outputs.
- Do not require user micro-instructions for obvious next debugging step.
YOLO Safety Constraints
When running with full permissions (--dangerously-bypass-approvals-and-sandbox):
- Re-check target path and command intent before execution.
- Avoid destructive operations unless explicitly required by objective.
- Prefer reversible edits and commit before risky operations.
- Never run broad delete/reset commands without a clear recovery path.
- Keep an auditable trail: commands run, tests executed, and commit hashes.
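A pre-execution check for broad delete/reset commands might be sketched like this. The list of patterns is deliberately small and illustrative, not a complete safety filter, and needs_extra_review is a hypothetical helper name.

```python
import shlex


def needs_extra_review(command: str) -> bool:
    """Flag a few well-known destructive command shapes for a second look."""
    tokens = shlex.split(command)
    if not tokens:
        return False
    if tokens[0] == "rm":
        # Collect short flags such as -rf, -fr, -R; also accept --recursive.
        short_flags = "".join(
            t[1:] for t in tokens[1:] if t.startswith("-") and not t.startswith("--")
        )
        return "r" in short_flags.lower() or "--recursive" in tokens
    if tokens[0] == "git":
        rest = tokens[1:]
        if rest[:2] == ["reset", "--hard"]:
            return True
        if rest[:1] == ["clean"]:
            return True
        force_flags = {"--force", "-f", "--force-with-lease"}
        if rest[:1] == ["push"] and any(t in force_flags for t in rest):
            return True
    return False
```

A flagged command is not forbidden; it is a prompt to re-check the target path, commit first, and confirm a recovery path exists.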