experiment-bridge

Workflow 1.5: Bridge between idea discovery and auto review. Reads EXPERIMENT_PLAN.md, implements experiment code, deploys to GPU, collects initial results. Use when the user says "实现实验" (Chinese: "implement experiments"), "implement experiments", "bridge", "从计划到跑实验" (Chinese: "from plan to running experiments"), "deploy the plan", or has an experiment plan ready to execute.

Install command

npx skills add https://github.com/tqLi99/claude-skills-for-writing --skill experiment-bridge

Copy and paste this command into Claude Code to install the skill

Source

tqLi99/claude-skills-for-writing

Stars: 0

Forks: 0

Updated: April 2, 2026 at 02:21

SKILL.md

Workflow 1.5: Experiment Bridge

Implement and deploy experiments from plan: $ARGUMENTS

Overview

This skill bridges Workflow 1 (idea discovery + method refinement) and Workflow 2 (auto review loop). It takes the experiment plan and turns it into running experiments with initial results.

Workflow 1 output:                    This skill:                                    Workflow 2 input:
refine-logs/EXPERIMENT_PLAN.md   →   implement → GPT-5.4 review → deploy → collect → initial results ready
refine-logs/EXPERIMENT_TRACKER.md     code        (cross-model)    /run-experiment     for /auto-review-loop
refine-logs/FINAL_PROPOSAL.md

Constants

EXECUTION_MODEL = Sonnet — Use for implementation, script writing, deployment prep, and result collection.
DECISION_MODEL = Opus — Use only when the implementation exposes a real method-level conflict with the proposal.
CODE_REVIEW = true — GPT-5.4 xhigh reviews experiment code before deployment. Catches logic bugs before wasting GPU hours. Set false to skip.
AUTO_DEPLOY = true — Automatically deploy experiments after implementation + review. Set false to manually inspect code before deploying.
SANITY_FIRST = true — Run the sanity-stage experiment first (smallest, fastest) before launching the rest. Catches setup bugs early.
MAX_PARALLEL_RUNS = 4 — Maximum number of experiments to deploy in parallel (limited by available GPUs).
BASE_REPO = false — GitHub repo URL to use as base codebase. When set, clone the repo first and implement experiments on top of it. When false (default), write code from scratch or reuse existing project files.
COMPACT = false — When true, (1) read IDEA_CANDIDATES.md instead of full IDEA_REPORT.md if available, (2) append experiment results to EXPERIMENT_LOG.md after collection.

Override: /experiment-bridge "EXPERIMENT_PLAN.md" — compact: true, base repo: https://github.com/org/project

Project Automation Policy

Before acting, resolve automation defaults in this precedence order:

Inline command arguments
PROJECT_AUTOMATION.md in the project root
CLAUDE.md in the project root
The constants in this skill
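
The precedence order above can be sketched as a layered lookup. This is a minimal illustration, not the skill's actual mechanism: it assumes each source has already been parsed into a plain dict (how PROJECT_AUTOMATION.md and CLAUDE.md are parsed is not specified here), and `resolve_settings` is a hypothetical helper name.

```python
from collections import ChainMap

# Skill-level defaults, copied from the Constants section.
SKILL_DEFAULTS = {
    "CODE_REVIEW": True,
    "AUTO_DEPLOY": True,
    "SANITY_FIRST": True,
    "MAX_PARALLEL_RUNS": 4,
    "BASE_REPO": False,
    "COMPACT": False,
}


def resolve_settings(inline_args=None, project_automation=None, claude_md=None):
    """Merge the four sources; earlier layers win over later ones.

    Each argument is assumed to be a dict that was parsed elsewhere
    (hypothetical; the parsing step is outside this sketch).
    """
    layers = [
        inline_args or {},          # 1. inline command arguments
        project_automation or {},   # 2. PROJECT_AUTOMATION.md
        claude_md or {},            # 3. CLAUDE.md
        SKILL_DEFAULTS,             # 4. constants in this skill
    ]
    return dict(ChainMap(*layers))


settings = resolve_settings(
    inline_args={"COMPACT": True},
    project_automation={"AUTO_DEPLOY": False, "MAX_PARALLEL_RUNS": 2},
    claude_md={"MAX_PARALLEL_RUNS": 8},
)
```

In this example `MAX_PARALLEL_RUNS` resolves to 2, because PROJECT_AUTOMATION.md outranks CLAUDE.md, while `CODE_REVIEW` falls through to the skill default.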

For control, robotics, and systems projects, a good default is AUTO_DEPLOY=true only after the first sanity-stage run passes.

Read ../shared-references/model-routing-policy.md before implementing.

Default routing:

Sonnet owns experiment implementation, deployment prep, and result collection
GPT-5.4 reviews correctness before GPU spend
Opus is reserved for proposal-level conflicts or experiment-story changes

Inputs

This skill expects one or more of:

refine-logs/EXPERIMENT_PLAN.md (best) — claim-driven experiment roadmap from /experiment-plan
refine-logs/EXPERIMENT_TRACKER.md — run-by-run execution table
refine-logs/FINAL_PROPOSAL.md — method description for implementation context
IDEA_CANDIDATES.md — compact idea summary (preferred when COMPACT: true)
IDEA_REPORT.md — full brainstorm output (fallback)

If none exist, ask the user what experiments to implement.

Workflow

Phase 1: Parse the Experiment Plan

Read EXPERIMENT_PLAN.md and extract:

Run order and milestones — which experiments run first (sanity → baseline → main → ablation → polish)
For each experiment block:
- Dataset / split / task
- Compared systems and variants
- Metrics to compute
- Setup details (backbone, hyperparameters, seeds)
- Success criterion
- Priority (MUST-RUN vs NICE-TO-HAVE)
Compute budget — total estimated GPU-hours
Method details from FINAL_PROPOSAL.md — what exactly to implement

Present a brief summary:

📋 Experiment plan loaded:
- Milestones: [N] (sanity → baseline → main → ablation)
- Must-run experiments: [N]
- Nice-to-have: [N]
- Estimated GPU-hours: [X]

Proceeding to implementation.
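
One possible in-memory shape for the extracted plan is sketched below. The `ExperimentBlock` fields mirror the items listed in Phase 1, but the class, the field names, and the `summarize` helper are assumptions made for illustration; EXPERIMENT_PLAN.md does not prescribe a schema.

```python
from dataclasses import dataclass, field

# Run order named in Phase 1.
MILESTONE_ORDER = ["sanity", "baseline", "main", "ablation", "polish"]


@dataclass
class ExperimentBlock:
    """Hypothetical container for one experiment block of the plan."""
    name: str
    milestone: str            # one of MILESTONE_ORDER
    dataset: str              # dataset / split / task
    systems: list             # compared systems and variants
    metrics: list             # metrics to compute
    success_criterion: str
    must_run: bool = True     # MUST-RUN vs NICE-TO-HAVE
    gpu_hours: float = 0.0    # estimated compute for this block
    setup: dict = field(default_factory=dict)  # backbone, hyperparameters, seeds


def summarize(blocks):
    """Render the brief plan summary from a list of ExperimentBlock objects."""
    present = [m for m in MILESTONE_ORDER if any(b.milestone == m for b in blocks)]
    must = sum(b.must_run for b in blocks)
    total = sum(b.gpu_hours for b in blocks)
    return "\n".join([
        "Experiment plan loaded:",
        f"- Milestones: {len(present)} ({' → '.join(present)})",
        f"- Must-run experiments: {must}",
        f"- Nice-to-have: {len(blocks) - must}",
        f"- Estimated GPU-hours: {total:g}",
    ])
```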

Phase 2: Implement Experiment Code

This is an execution-heavy phase. Prefer Sonnet unless there is a genuine algorithm-design dispute.

If BASE_REPO is set — clone the repo first:

git clone <BASE_REPO> base_repo/
# Read the repo's README, understand its structure, find entry points
# Implement experiments by modifying/extending this codebase

For each milestone (in order), write the experiment scripts:

Check existing code — scan the project (or cloned base_repo/) for existing experiment scripts, model code, data loaders. Reuse as much as possible.
Implement missing pieces:
- Training scripts with proper argparse (all hyperparameters configurable)
- Evaluation scripts computing the specified metrics
- Data loading / preprocessing if needed
- Baseline implementations if not already present
- Fixed random seeds for reproducibility
- Results saved to JSON/CSV for later analysis
- Proper logging (wandb if configured in CLAUDE.md)
Follow the plan's run order — implement sanity-stage experiments first, then baselines, then main method, then ablations.
Self-review before deploying:
- Are all hyperparameters from EXPERIMENT_PLAN.md reflected in argparse?
- Is the random seed fixed and controllable?
- Are results saved in a parseable format (JSON/CSV)?
- Does the code match FINAL_PROPOSAL.md's method description?
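
The checklist above (argparse for every hyperparameter, a controllable seed, JSON output) could look roughly like the skeleton below. It is a sketch only: the hyperparameter names, the default output path, and the placeholder `train` function are assumptions, and a real script would also seed NumPy, PyTorch, or whichever framework the project uses.

```python
import argparse
import json
import random
from pathlib import Path


def build_parser():
    # Every hyperparameter is a flag so runs can be reproduced from the command line.
    parser = argparse.ArgumentParser(description="Experiment skeleton (illustrative)")
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--out", type=Path, default=Path("results/run.json"))
    return parser


def train(args):
    # Placeholder for the real training loop: the "loss" here is just a
    # seeded random number so the skeleton runs without any ML framework.
    random.seed(args.seed)
    loss = sum(random.random() for _ in range(args.epochs)) / args.epochs
    return {"loss": loss}


def main(argv=None):
    args = build_parser().parse_args(argv)
    metrics = train(args)
    record = {"config": vars(args), "metrics": metrics}
    args.out.parent.mkdir(parents=True, exist_ok=True)
    # default=str converts the Path in the config into a JSON-friendly string.
    args.out.write_text(json.dumps(record, indent=2, default=str))
    return record
```

Calling `main(["--seed", "7", "--out", "results/sanity.json"])` twice should write identical metrics, which is a quick way to confirm the seed is actually controlling the run.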

Independent milestones with disjoint configs or scripts may be implemented in parallel.

When the parent session is not Sonnet, explicitly delegate implementation-heavy milestones to Sonnet workers instead of keeping them in the parent session:

spawn_agent:
  model: sonnet
  reasoning_effort: high
  message: |
    You own milestone [name] only.
    Implement the experiment scripts and configs for this milestone from EXPERIMENT_PLAN.md.
    Reuse existing project code where possible.
    Do not change method claims or redesign the algorithm.
    Return changed files, launch commands, and any blockers.

Use separate workers only when file ownership is disjoint.

Phase 2.5: Cross-Model Code Review (when CODE_REVIEW = true)

Skip this step if CODE_REVIEW is false.

Before deploying, send the experiment code to GPT-5.4 xhigh for review:

mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    Review the following experiment implementation for correctness.

    ## Experiment Plan:
    [paste key sections from EXPERIMENT_PLAN.md]

    ## Method Description:
    [paste from FINAL_PROPOSAL.md]

    ## Implementation:
    [paste the experiment scripts]

    Check for:
    1. Does the code correctly implement the method described in the proposal?
    2. Are all hyperparameters from the plan reflected in the code?
    3. Are there any logic bugs (wrong loss function, incorrect data split, missing eval)?
    4. Is the evaluation metric computed correctly?
    5. **CRITICAL: Does evaluation use the dataset's actual ground truth labels — NOT another model's output as ground truth?** This is a common and severe bug.
    6. Any potential issues (OOM risk, numerical instability, missing seeds)?

    For each issue found, specify: CRITICAL / MAJOR / MINOR and the exact fix.

On review results:

No CRITICAL issues → proceed to Phase 3
CRITICAL issues found → fix them, then re-submit for review (max 2 rounds)
Codex MCP unavailable → skip silently, proceed to Phase 3 (graceful degradation)
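
The three outcomes above amount to a small gate around the reviewer. The sketch below assumes the review has been wrapped in a `review` callable that returns a list of issue dicts (or `None` when the reviewer is unavailable) and that fixes are applied by a `fix` callable; both are hypothetical stand-ins, not part of the Codex MCP interface. The `"unresolved"` status for CRITICAL issues that survive both rounds is also an assumption, since the text only caps the number of rounds.

```python
from typing import Callable, Optional

MAX_REVIEW_ROUNDS = 2  # "max 2 rounds" from the rules above


def review_gate(
    code: str,
    review: Callable[[str], Optional[list]],
    fix: Callable[[str, list], str],
):
    """Return (possibly fixed code, status).

    status is "passed", "skipped" (reviewer unavailable), or "unresolved".
    """
    for _ in range(MAX_REVIEW_ROUNDS):
        issues = review(code)
        if issues is None:
            # Reviewer unavailable: degrade gracefully and continue to Phase 3.
            return code, "skipped"
        critical = [i for i in issues if i.get("severity") == "CRITICAL"]
        if not critical:
            return code, "passed"
        # Fix only the CRITICAL findings, then re-submit on the next round.
        code = fix(code, critical)
    return code, "unresolved"
```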

Phase 3: Sanity Check (if SANITY_FIRST = true)

Before deploying the full experiment suite, run the sanity-stage experiment:

/run-experiment [sanity experiment command]

Wait for completion. Verify:

Training loop runs without errors
Metrics are computed and saved correctly
GPU memory usage is within bounds
Output format matches expectations

If sanity fails → fix the code, re-run. Do not proceed to full deployment with broken code.
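
Part of the verification list can be automated once the sanity run has written its results file. The sketch below assumes a JSON file with a top-level `"metrics"` object and optional memory figures supplied by the caller; that layout, the function name, and the parameters are illustrative assumptions rather than something `/run-experiment` guarantees.

```python
import json
import math
from pathlib import Path


def check_sanity_output(path, required_metrics, peak_mem_gb=None, mem_limit_gb=None):
    """Return a list of problems; an empty list means the checks passed."""
    path = Path(path)
    if not path.exists():
        return [f"missing results file: {path}"]
    try:
        record = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return [f"results file is not valid JSON: {exc}"]

    problems = []
    metrics = record.get("metrics", {})
    for name in required_metrics:
        value = metrics.get(name)
        # A metric must be present, numeric, and finite (no NaN or inf).
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            problems.append(f"metric {name!r} missing or not finite: {value!r}")

    if peak_mem_gb is not None and mem_limit_gb is not None and peak_mem_gb > mem_limit_gb:
        problems.append(f"GPU memory {peak_mem_gb} GB exceeds limit {mem_limit_gb} GB")
    return problems
```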

Phase 4: Deploy Full Experiments

Deploy experiments following the plan's milestone order:

/run-experiment [experiment commands]

For each milestone:

Deploy experiments in parallel (up to MAX_PARALLEL_RUNS)
Use /monitor-experiment to track progress
Collect results as experiments complete
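
The parallelism cap can be pictured as a bounded worker pool. In the skill itself deployment goes through /run-experiment; the sketch below is a local stand-in that launches plain subprocesses, and `deploy_milestone`, `run_command`, and the `commands` mapping are hypothetical names introduced for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_PARALLEL_RUNS = 4  # default from the Constants section


def run_command(cmd):
    """Run one experiment command and return its exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode


def deploy_milestone(commands, max_parallel=MAX_PARALLEL_RUNS, runner=run_command):
    """Run all commands of one milestone, at most max_parallel at a time.

    commands maps a run ID to an argv list; returns {run_id: exit_code}.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(runner, cmd): run_id for run_id, cmd in commands.items()}
        # Collect results as experiments complete, not in submission order.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


# Tiny demonstration with a no-op Python process standing in for a training job.
demo = deploy_milestone({"R001": [sys.executable, "-c", "print('ok')"]})
```

Milestones would be processed one after another by calling `deploy_milestone` once per milestone, so a later stage never starts before the earlier one has returned.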

🚦 Checkpoint (if AUTO_DEPLOY = false):

🔧 Code implementation complete. Ready to deploy:

Milestone 0 (sanity): [status — passed/pending]
Milestone 1 (baseline): [N experiments, ~X GPU-hours]
Milestone 2 (main method): [N experiments, ~X GPU-hours]
Milestone 3 (ablations): [N experiments, ~X GPU-hours]

Total estimated: ~X GPU-hours on [N] GPUs

Deploy now? Or review the code first?

Phase 5: Collect Initial Results

As experiments complete:

Parse output files (JSON/CSV/logs) for key metrics
Training quality check — if W&B data is available (CLAUDE.md has wandb: true and wandb_project), invoke /training-check to detect NaN, loss divergence, plateaus, or overfitting. If W&B is not configured, skip silently.
Update refine-logs/EXPERIMENT_TRACKER.md — fill in Status and Notes columns
Check success criteria from EXPERIMENT_PLAN.md — did each experiment meet its bar?
Write initial results summary to refine-logs/EXPERIMENT_RESULTS.md (the file referenced in the Phase 6 handoff):

# Initial Experiment Results

**Date**: [today]
**Plan**: refine-logs/EXPERIMENT_PLAN.md

## Results by Milestone

### M0: Sanity — PASSED
- [result]

### M1: Baselines
| Run | System | Key Metric | Status |
|-----|--------|-----------|--------|
| R001 | baseline_1 | X.XX | DONE |

### M2: Main Method
| Run | System | Key Metric | Status |
|-----|--------|-----------|--------|
| R003 | our_method | X.XX | DONE |

### M3: Ablations
...

## Summary
- [X/Y] must-run experiments completed
- Main result: [positive/negative/inconclusive]
- Ready for /auto-review-loop: [YES/NO]

## Next Step
→ /auto-review-loop "[topic]"
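
The parsing and success-criterion steps of Phase 5 can be sketched as follows. The sketch assumes one JSON file per run, named after the run ID, with a top-level `"metrics"` object, and it treats every criterion as "metric at or above a threshold"; all of these are simplifying assumptions, since real plans may state criteria differently.

```python
import json
from pathlib import Path


def collect(results_dir, criteria):
    """Gather per-run metrics and check each against its success bar.

    criteria maps run_id -> (metric_name, threshold); higher is assumed better.
    """
    rows = []
    for path in sorted(Path(results_dir).glob("*.json")):
        run_id = path.stem
        metrics = json.loads(path.read_text()).get("metrics", {})
        metric, threshold = criteria.get(run_id, (None, None))
        value = metrics.get(metric) if metric else None
        met = value is not None and value >= threshold
        rows.append({"run": run_id, "metric": metric, "value": value, "met": met})
    return rows


def to_markdown(rows):
    """Render rows as a table in the style of the results summary."""
    lines = [
        "| Run | Key Metric | Value | Met criterion |",
        "|-----|-----------|-------|---------------|",
    ]
    for row in rows:
        verdict = "YES" if row["met"] else "NO"
        lines.append(f"| {row['run']} | {row['metric']} | {row['value']} | {verdict} |")
    return "\n".join(lines)
```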

Phase 5.5: Write Compact Log (when COMPACT = true)

Skip entirely if COMPACT is false.

Append each completed experiment to EXPERIMENT_LOG.md:

## [Run ID] — [timestamp]
- **System**: [method name]
- **Config**: [key hyperparameters]
- **Result**: [primary metric = X.XX]
- **Verdict**: [positive / negative / inconclusive]
- **Reproduce**: `python train.py --config configs/run_id.yaml --seed 42`

This structured log survives session recovery — downstream skills read it instead of parsing screen output.
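
Appending an entry in the format above takes only a few lines. `append_log_entry` and its parameters are hypothetical names for this sketch; the entry layout follows the template, and the file is opened in append mode so earlier entries are never rewritten.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_log_entry(log_path, run_id, system, config, metric_name,
                     metric_value, verdict, reproduce_cmd, timestamp=None):
    """Append one completed run to the compact log and return the entry text."""
    ts = timestamp or datetime.now(timezone.utc).isoformat(timespec="seconds")
    config_str = ", ".join(f"{key}={value}" for key, value in config.items())
    entry = (
        f"## {run_id} \u2014 {ts}\n"  # \u2014 is the dash used in the template heading
        f"- **System**: {system}\n"
        f"- **Config**: {config_str}\n"
        f"- **Result**: {metric_name} = {metric_value:.2f}\n"
        f"- **Verdict**: {verdict}\n"
        f"- **Reproduce**: `{reproduce_cmd}`\n\n"
    )
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry
```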

Phase 5.6: Auto Ablation Planning

After main experiments (M2) complete with positive results, invoke /ablation-planner to design ablation studies:

Read the main results and method description
Generate a claim-driven ablation plan: which components to remove, what to compare, expected outcomes
Append ablation blocks to refine-logs/EXPERIMENT_PLAN.md and refine-logs/EXPERIMENT_TRACKER.md
If main results are negative or inconclusive, skip ablation planning and note in the summary

If /ablation-planner is not available, skip silently — the existing EXPERIMENT_PLAN.md ablation blocks (if any) remain unchanged.

Phase 6: Handoff

Present final status:

🔬 Experiment bridge complete:
- Implemented: [N] experiment scripts
- Deployed: [N] experiments on [M] GPUs
- Completed: [X/Y] must-run, [A/B] nice-to-have
- Main result: [one sentence]

Results: refine-logs/EXPERIMENT_RESULTS.md
Tracker: refine-logs/EXPERIMENT_TRACKER.md

Ready for Workflow 2:
→ /auto-review-loop "[topic]"

Key Rules

Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.
CRITICAL — Evaluation must use dataset ground truth. When writing evaluation scripts, ALWAYS compare model predictions against the dataset's actual ground truth labels/targets — NEVER use another model's output as ground truth. Double-check: (1) ground truth comes from the dataset split, not from a baseline/backbone model, (2) evaluation metrics are computed against the same ground truth for all methods, (3) if the task has official eval scripts, use those.
Follow the plan. Do not invent experiments not in EXPERIMENT_PLAN.md. If you think something is missing, note it but don't add it.
Sanity first. Never deploy a full suite without verifying the sanity stage passes.
Reuse existing code. Scan the project before writing new scripts. Extend, don't duplicate.
Save everything as JSON/CSV. The auto-review-loop needs parseable results, not just terminal output.
Update the tracker. EXPERIMENT_TRACKER.md should reflect real status after each run completes.
Don't wait forever. If an experiment exceeds 2x its estimated time, flag it and move on to the next milestone.
Budget awareness. Track GPU-hours against the plan's budget. Warn if approaching the limit.
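
The last two rules reduce to simple arithmetic, sketched below. The 2x factor comes from the rule itself; the 80% warning threshold is an assumption, since the rule only says to warn when "approaching the limit", and both function names are hypothetical.

```python
def is_overdue(elapsed_hours, estimated_hours, factor=2.0):
    """True once a run has exceeded factor times its estimated duration."""
    return elapsed_hours > factor * estimated_hours


def budget_status(used_gpu_hours, budget_gpu_hours, warn_fraction=0.8):
    """Classify GPU-hour spend as "ok", "warn", or "over".

    warn_fraction is an assumed threshold for "approaching the limit".
    """
    if used_gpu_hours > budget_gpu_hours:
        return "over"
    if used_gpu_hours >= warn_fraction * budget_gpu_hours:
        return "warn"
    return "ok"
```

For example, a run estimated at 3 hours is flagged once it passes 6 hours, and a 100 GPU-hour plan starts warning at 80 hours under the assumed threshold.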

Composing with Other Skills

/idea-discovery "direction"          ← Workflow 1: find + refine + plan
/experiment-bridge                   ← you are here (Workflow 1.5: implement + deploy)
/auto-review-loop "topic"            ← Workflow 2: review + iterate
/paper-writing "NARRATIVE_REPORT.md" ← Workflow 3: write the paper

Or use /research-pipeline for the full end-to-end flow (includes this bridge).

name	experiment-bridge
description	Workflow 1.5: Bridge between idea discovery and auto review. Reads EXPERIMENT_PLAN.md, implements experiment code, deploys to GPU, collects initial results. Use when the user says "实现实验" (Chinese: "implement experiments"), "implement experiments", "bridge", "从计划到跑实验" (Chinese: "from plan to running experiments"), "deploy the plan", or has an experiment plan ready to execute.
argument-hint	["experiment-plan-path-or-topic"]
allowed-tools	Bash(*), Read, Write, Edit, Grep, Glob, Agent, Skill, mcp__codex__codex, mcp__codex__codex-reply
