| name | auto-review-loop |
| description | Autonomous multi-round research review loop. Supports two modes: (1) Plan-driven: takes an implementation plan file, executes TODO items respecting dependency DAG, uses Codex MCP to verify completion of each item. (2) Free-form: iterates review → fix → re-review until positive assessment. Use when user says 'auto review loop', 'review until it passes', or wants iterative improvement. |
| argument-hint | [topic-or-scope] [--plan path/to/plan.md] |
| allowed-tools | Bash(*), Read, Grep, Glob, Write, Edit, Agent, Skill, mcp__codex__codex, mcp__codex__codex-reply |
Auto Review Loop: Autonomous Research Improvement
Context: $ARGUMENTS
Step 0: Environment Check
Parse $ARGUMENTS for --remote flag:
- If
--remote is present → dispatch to remote server (see below), then STOP
- Otherwise → run locally, proceed to Mode Detection
Remote dispatch (only when --remote is specified)
ARIS_CONFIG=".aris/project.json"
if [ -f "$ARIS_CONFIG" ]; then
PROJECT_ID=$(python3 -c "import json;print(json.load(open('$ARIS_CONFIG'))['projectId'])")
API_URL=$(python3 -c "import json;print(json.load(open('$ARIS_CONFIG'))['apiUrl'])")
API_TOKEN=$(python3 -c "import json;print(json.load(open('$HOME/.claude/aris-api.json'))['token'])" 2>/dev/null)
CLEAN_ARGS=$(echo "$ARGUMENTS" | sed 's/--remote//')
PROMPT=$(printf '%s' "$CLEAN_ARGS" | python3 -c "import sys,json;print(json.dumps(sys.stdin.read()))")
if [ -n "$PROJECT_ID" ] && [ -n "$API_TOKEN" ]; then
RESULT=$(curl -s -X POST "$API_URL/api/aris/runs" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"projectId\":\"$PROJECT_ID\",\"workflowType\":\"auto_review_loop\",\"prompt\":$PROMPT,\"title\":\"Auto Review Loop\"}")
RUN_ID=$(echo "$RESULT" | python3 -c "import sys,json;d=json.load(sys.stdin);print(d.get('run',{}).get('id','') or d.get('error','UNKNOWN'))" 2>/dev/null)
echo "Dispatched to remote server. ARIS run ID: $RUN_ID"
fi
fi
After dispatching, STOP. Tell the user the run ID and that they can monitor on the ARIS dashboard.
Mode Detection
Parse $ARGUMENTS for --plan <path>:
- If
--plan is present → Plan-Driven Mode (Section A)
- Otherwise → Free-Form Review Mode (Section B)
Section A: Plan-Driven Mode
Execute an implementation plan with dependency-aware task ordering. Each TODO item is implemented, then verified by the Codex MCP reviewer before marking complete.
A.1 Initialization
-
Read the plan file specified by --plan <path>. This is a markdown file with Steps and TODO items.
-
Parse plan structure:
- Steps are top-level groups (
### Step N: Title)
- TODO items within steps (
#### TODO-N.M: Title)
- Items marked with ✅ are already completed — skip them
-
Load or create PLAN_STATE.json in project root:
{
"planFile": "docs/implementation_plan.md",
"status": "in_progress",
"currentNode": "TODO-1.1",
"completedNodes": ["TODO-1.0"],
"failedNodes": [],
"threadId": "019cd392-...",
"timestamp": "2026-03-17T10:00:00"
}
- If exists with
"in_progress" and within 24h → resume from currentNode
- Otherwise → fresh start
-
Register plan with ARIS API (if ~/.claude/aris-api.json exists):
ARIS_CFG="$HOME/.claude/aris-api.json"
if [ -f "$ARIS_CFG" ]; then
API_URL=$(python3 -c "import json;print(json.load(open('$ARIS_CFG'))['api_url'])")
API_TOKEN=$(python3 -c "import json;print(json.load(open('$ARIS_CFG'))['token'])")
RESULT=$(curl -s -X POST "$API_URL/api/aris/runs/register" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"projectId":"PROJECT_ID","prompt":"Plan: PLAN_FILE","status":"running","workflowType":"auto_review_loop","title":"Plan Execution"}')
ARIS_RUN_ID=$(echo "$RESULT" | python3 -c "import sys,json;print(json.load(sys.stdin).get('run',{}).get('id',''))")
PLAN_MD=$(cat PLAN_FILE)
curl -s -X POST "$API_URL/api/aris/runs/$ARIS_RUN_ID/plan" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d "{\"markdown\":$(python3 -c "import json,sys;print(json.dumps(sys.argv[1]))" "$PLAN_MD")}"
fi
-
Build execution order: Topological sort of the TODO DAG, respecting dependsOn relationships.
A.2 Execution Loop
For each TODO item in topological order:
Phase 1: Check Prerequisites
- Verify all items in
dependsOn are in completedNodes
- If any dependency failed → skip this item, mark as
skipped
- If dependencies not yet complete → this shouldn't happen in topological order, but wait/skip if it does
- If item is already in
completedNodes (from resume or ✅ marker) → skip
Phase 2: Implement
Read the TODO item's description from the plan file. It contains:
- What needs to be built/implemented
- File locations
- Expected behavior
- Any code snippets or references
Implement the TODO item. This may involve:
- Writing new code files
- Modifying existing code
- Running commands (pip install, tests, etc.)
- Creating data adapters
- Running experiments — check GPU availability first (see "Multi-Server Experiment Routing" section) and dispatch to the server with free GPUs
Phase 3: Verify via Codex MCP
After implementing, send the work to the Codex MCP reviewer for verification:
mcp__codex__codex (or mcp__codex__codex-reply if threadId exists):
config: {"model_reasoning_effort": "xhigh"}
prompt: |
I am executing an implementation plan. I just completed this TODO item:
**{TODO_KEY}: {TODO_TITLE}**
Description from plan:
{TODO_DESCRIPTION}
What I implemented:
{SUMMARY_OF_CHANGES}
Files changed:
{LIST_OF_FILES}
Test results (if any):
{TEST_OUTPUT}
Please verify:
1. Does the implementation match what the plan asked for? (Yes/No)
2. Are there any bugs, missing pieces, or quality issues? (List them)
3. Is this TODO item COMPLETE? (Yes/No/Partial)
4. If Partial, what specific remaining work is needed?
Be strict. Only mark complete if the implementation fully satisfies the plan.
Phase 4: Process Verification Result
Parse the Codex response:
- "Complete: Yes" → Mark as
completed, add to completedNodes, update ARIS API
- "Complete: Partial" → Implement the remaining work, then re-verify (max 2 retries)
- "Complete: No" → Fix issues, then re-verify (max 2 retries)
- After 2 failed retries → mark as
failed, add to failedNodes, continue to next item
Phase 5: Update State
After each TODO item:
-
Update PLAN_STATE.json:
{
"currentNode": "NEXT_TODO_KEY",
"completedNodes": ["TODO-1.0", "TODO-1.1", ...],
"timestamp": "NOW"
}
-
Update ARIS API plan node (if registered):
curl -s -X PATCH "$API_URL/api/aris/runs/$ARIS_RUN_ID/plan/TODO_KEY" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"status":"completed","resultSummary":"CODEX_VERDICT_SUMMARY"}'
-
Write per-TODO review report to review/ folder:
Create review/TODO-X.Y.md (where X.Y is the TODO key, e.g. review/TODO-1.1.md):
# Review: TODO-X.Y — {TODO_TITLE}
**Status**: ✅ completed / ❌ failed / ⚠️ needs_redo
**Timestamp**: {ISO_TIMESTAMP}
**Retries**: {0 / 1 / 2}
## Plan Description
{TODO_DESCRIPTION verbatim from plan file}
## What Was Implemented
{SUMMARY_OF_CHANGES — what you actually built/modified}
**Files changed:**
- `path/to/file1.py` — description of change
- `path/to/file2.py` — description of change
**Test / command output:**
{TEST_OUTPUT or "N/A"}
## Codex Review Response
{FULL raw Codex response verbatim — do not summarize}
## Verdict
- **Match plan?**: Yes / No
- **Complete?**: Yes / Partial / No
- **Issues found**: {list or "None"}
- **Remaining work** (if Partial): {list or "N/A"}
Rules for this file:
-
Append to AUTO_REVIEW.md:
## TODO-X.Y: Title (timestamp)
- Status: completed/failed
- Reviewer verdict: [Codex response summary]
- Files changed: [list]
- Report: [review/TODO-X.Y.md](review/TODO-X.Y.md)
- Notes: [any observations]
Phase 6: Parallel Execution
When multiple TODO items have canParallel: true and share the same dependencies (all satisfied):
- List all ready parallel items
- Implement them sequentially (Claude can only do one at a time)
- But batch-verify: send all implementations to Codex in a single review prompt
- This is more efficient than individual reviews for independent items
A.3 Termination
When all TODO items are processed:
-
Set PLAN_STATE.json → "status": "completed"
-
Write final summary to AUTO_REVIEW.md:
## Plan Execution Summary
| TODO | Title | Status | Reviewer Verdict | Report |
|------|-------|--------|------------------|--------|
| TODO-1.0 | Environment setup | ✅ completed | Pre-verified | [report](review/TODO-1.0.md) |
| TODO-1.1 | Run SimpleMem | ✅ completed | Matches plan | [report](review/TODO-1.1.md) |
| TODO-2.1 | Implement C1 | ✅ completed | Code correct | [report](review/TODO-2.1.md) |
| TODO-4.3 | Implement C6 | ❌ failed | Missing graph builder | [report](review/TODO-4.3.md) |
| ... | ... | ... | ... | ... |
Completed: X/Y items
Failed: Z items (listed above with reasons)
Individual review reports: [`review/`](review/)
-
Update ARIS API run status → completed or failed
-
Upload review reports to ARIS API so client device can retrieve them:
ARIS_CFG="$HOME/.claude/aris-api.json"
if [ -f "$ARIS_CFG" ] && [ -n "$ARIS_RUN_ID" ] && [ -d "review" ]; then
python3 -c "
import os, sys, json, base64, urllib.request
cfg = json.load(open(os.path.expanduser('~/.claude/aris-api.json')))
api_url, token, run_id = cfg['api_url'], cfg['token'], sys.argv[1]
files = {}
for f in sorted(os.listdir('review')):
if f.endswith('.md'):
with open(os.path.join('review', f), 'rb') as fp:
files[f] = base64.b64encode(fp.read()).decode()
if not files:
sys.exit(0)
payload = json.dumps(files).encode()
req = urllib.request.Request(
api_url + '/api/aris/runs/' + run_id + '/review-reports',
data=payload,
headers={'Content-Type': 'application/json', 'Authorization': 'Bearer ' + token},
method='POST'
)
urllib.request.urlopen(req, timeout=30)
print('[ARIS] Uploaded ' + str(len(files)) + ' review report(s)')
" "$ARIS_RUN_ID" 2>/dev/null || true
fi
-
Feishu notification (if configured): Send pipeline_done with completion stats
Section B: Free-Form Review Mode
(Original behavior when no --plan is specified)
Autonomously iterate: review → implement fixes → re-review, until the external reviewer gives a positive assessment or MAX_ROUNDS is reached.
Constants
- MAX_ROUNDS = 4
- POSITIVE_THRESHOLD: score >= 6/10, or verdict contains "accept", "sufficient", "ready for submission"
- REVIEW_DOC:
AUTO_REVIEW.md in project root (cumulative log)
- REVIEWER_MODEL =
gpt-5.4 — Model used via Codex MCP
- HUMAN_CHECKPOINT = false — When
true, pause after each round's review and present the score + weaknesses to the user.
Override: /auto-review-loop "topic" — human checkpoint: true
State Persistence (Compact Recovery)
Persist state to REVIEW_STATE.json after each round:
{
"round": 2,
"threadId": "019cd392-...",
"status": "in_progress",
"last_score": 5.0,
"last_verdict": "not ready",
"pending_experiments": ["screen_name_1"],
"timestamp": "2026-03-13T21:00:00"
}
Workflow
Initialization
- Check for
REVIEW_STATE.json:
- Not exist or
"completed" → fresh start
"in_progress" older than 24h → fresh start
"in_progress" within 24h → resume from round + 1
- Read project docs, memory files, prior reviews
- Read recent experiment results
- Initialize round counter
Loop (repeat up to MAX_ROUNDS)
Phase A: Review via Codex MCP
mcp__codex__codex:
config: {"model_reasoning_effort": "xhigh"}
prompt: |
[Round N/MAX_ROUNDS]
[Full research context: claims, methods, results, known weaknesses]
[Changes since last round]
Please act as a senior ML reviewer (NeurIPS/ICML level).
1. Score this work 1-10 for a top venue
2. List remaining critical weaknesses (ranked by severity)
3. For each weakness, specify the MINIMUM fix
4. State clearly: is this READY for submission? Yes/No/Almost
Round 2+: use mcp__codex__codex-reply with saved threadId.
Phase B: Parse Assessment
Save FULL raw response verbatim. Extract: Score, Verdict, Action items.
STOP CONDITION: score >= 6 AND verdict "ready"/"almost" → stop, document.
Human Checkpoint (if enabled)
Present score + weaknesses, wait for user input.
Feishu Notification (if configured)
Send review_scored notification if ~/.claude/feishu.json exists.
Phase C: Implement Fixes
For each action item (highest priority first): code changes, experiments, analysis, documentation.
When running experiments, use multi-server routing: check /gpu-status or .aris/project.json servers and dispatch to whichever has free GPUs. Run independent experiments on different servers in parallel.
Phase D: Wait for Results
Monitor experiments, collect results.
Phase E: Document Round
Append to AUTO_REVIEW.md. Write REVIEW_STATE.json.
Termination
- Update state →
"completed"
- Write final summary with score progression table
- Feishu notification if configured
Multi-Server Experiment Routing (Both Modes)
When running experiments that require GPUs, use ALL available servers — not just one.
Before launching experiments:
-
Check GPU availability across all project servers:
python3 -c "
import json
cfg = json.load(open('.aris/project.json'))
for s in cfg['servers']:
print(f\"{s['name']}: {s['ssh']}\")
"
-
Run /gpu-status (or manually SSH to each server with nvidia-smi) to find free GPUs.
-
Route experiments to servers with free GPUs:
- Pick the server(s) with the most free GPUs
- For multi-GPU jobs, prefer servers with contiguous free GPUs
- If all GPUs on one server are busy, try the next server
- Run independent experiments on different servers in parallel when possible
Dispatching to a remote server:
ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no <SSH_COMMAND_FROM_PROJECT_JSON> \
"cd <remotePath> && CUDA_VISIBLE_DEVICES=<free_gpu_ids> <command>"
Key principles:
- Never hardcode a single server — always check availability first
- Parallel dispatch: if two experiments are independent, run them on different servers simultaneously
- Retry on a different server if one server's GPUs fill up during the run
- The prompt may include "Available experiment servers" — use ALL of them, not just the first
Key Rules (Both Modes)
- Large file handling: If Write tool fails due to file size, use Bash (
cat << 'EOF' > file) silently.
- ALWAYS use
config: {"model_reasoning_effort": "xhigh"} for Codex MCP calls
- Save threadId from first Codex call, use
mcp__codex__codex-reply for subsequent calls
- Be honest — include negative results and failed experiments
- Do NOT hide weaknesses to game scores
- Implement BEFORE re-reviewing/re-verifying
- Document EVERYTHING in
AUTO_REVIEW.md
- Plan mode: The Codex reviewer is the authority on whether a TODO is complete. Do not self-assess.
- Plan mode: Respect dependency ordering. Never implement a TODO whose dependencies haven't been verified complete.
Prompt Template for Round 2+ (Free-Form Mode)
mcp__codex__codex-reply:
threadId: [saved from round 1]
config: {"model_reasoning_effort": "xhigh"}
prompt: |
[Round N update]
Since your last review, we have:
1. [Action 1]: [result]
2. [Action 2]: [result]
Updated results table: [paste metrics]
Please re-score and re-assess.
Same format: Score, Verdict, Remaining Weaknesses, Minimum Fixes.