Name: Experiment
Author: YuriNakayama

updated:May 6, 2026 at 03:08

package.json

"author": "YuriNakayama"

"repository": "YuriNakayama/orbit-wars"

name

experiment

description

Hypothesis-list-driven iteration loop for Orbit Wars experiments under `backend/pipeline/{imitation|rulebase|reinforce}/case<N>/`. The single source of truth is the target case's `hypotheses.md` (created by `experiment-hypothesize`). Asks the user three things via `AskUserQuestion`: (1) the target `hypotheses.md` path, (2) the max iteration count, and (3) the loop cadence. Registers a `schedule` (cron) or `loop` (short interval) job whose body is "call `experiment-plan` (which decides list-consume / deepen / broaden), then `experiment-execution`, then `experiment-analysis`, once". Each tick = one hypothesis = one plan → run → analyze cycle. The loop driver itself does NOT decide deepen / broaden; `experiment-plan` Step 0.5 owns that. The driver only reads the plan's reported mode: a broaden outcome stops the loop and surfaces the `/experiment-hypothesize` suggestion; list-consume / deepen continue the cycle. Repeats until hypotheses are exhausted, max iterations are hit, or the user stops the loop. **This skill is invoked ONLY by explicit user request** (typed `/experiment` or said "仮説リストを回したい / iteration loop で実験したい / hypotheses.md を消化して", i.e. "I want to run through the hypothesis list / run experiments in an iteration loop / consume hypotheses.md"). Don't auto-trigger this skill from natural-language experiment requests; those go to `experiment-hypothesize` (hypothesis enumeration) / `experiment-plan` (single plan) / `experiment-execution` (single run + RunPod) / `experiment-analysis` (single replay post-mortem) instead.

Experiment Skill (Orbit Wars — iteration loop)

Iteration orchestrator that consumes hypotheses.md in priority order. One tick = one hypothesis = one experiment-plan → experiment-execution → experiment-analysis cycle.

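This skill itself does no heavy lifting. It delegates the real work to the child skills (experiment-plan, experiment-execution, and experiment-analysis on every tick; experiment-hypothesize only as a suggested redirect) and does just three things itself:

1. Hearing → confirm the target hypotheses.md and the loop settings
2. Register the loop via schedule / loop
3. On each tick: call the three child skills in order (experiment-plan picks the next hypothesis from hypotheses.md by priority) → let them write the results back → continue on list-consume / deepen, stop on broaden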

When this skill is in charge

The user typed /experiment (explicit trigger), or
The user explicitly asked for a multi-iteration loop: "hypotheses.md を消化して", "仮説 5 個を順に試したい", "iteration loop で実験回して", "止めるまで仮説リストを消化し続けて".
The user did not ask for a single experiment. Single-shot requests go to experiment-execution directly. Hypothesis enumeration goes to experiment-hypothesize.

If the target case has no hypotheses.md, redirect to experiment-hypothesize first. This skill is built around an existing hypotheses.md; it does not start from an empty list.

Skill flow

Phase 0 — User hearing (3 questions)

Run one round of AskUserQuestion (3 questions), each in selection + Other format. The first question's description is 各質問で「Other」を選ぶと自由記述も可能です。 ("Choosing 'Other' on any question lets you answer in free text.")

1. Target hypotheses.md path: run ls docs/experiment/<family>/*/hypotheses.md and surface up to 5 candidates as family/case<N>/{topic} (残 X 仮説), i.e. "X hypotheses remaining". Other accepts a free-text path. If no matching file exists, suggest /experiment-hypothesize で先に作成 ("create it first with /experiment-hypothesize") and terminate this skill (we don't start from an empty list).
2. Max iterations (stop condition):
   - 1 (smoke) / 3 / 5 ⭐推薦 (recommended) / 10 / 仮説リストを使い切るまで (until the hypothesis list is exhausted) / 上限なし (ユーザー停止指示まで) ⚠️ (no limit, until the user says stop) / Other
   - When "上限なし" is selected, surface a one-line RunPod cost-accumulation warning.
3. Loop cadence:
   - loop 60s / loop 5m / schedule */15min / schedule hourly / schedule daily / 逐次 (周期なし、前のティックが完了したら即次) ⭐推薦 (sequential: no interval, the next tick starts as soon as the previous one completes; recommended) / Other
   - Cadence and tick duration must align: a tick that includes imitation training + RunPod evaluation runs from 30 minutes to several hours. A cron interval shorter than that launches ticks concurrently, which doubles GPU spend. When schedule is selected, confirm cadence ≥ expected tick duration. Short cadences are fine when RunPod を使わない ("do not use RunPod") is set in the skip list.
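The cadence-versus-tick-duration rule can be sketched as a small check. This is a minimal illustration, not part of the skill: the function name `cadence_is_safe` and its parameters are hypothetical, and `None` is an assumed stand-in for the 逐次 (sequential) cadence.

```python
from typing import Optional


def cadence_is_safe(cadence_seconds: Optional[int],
                    expected_tick_seconds: int,
                    uses_runpod: bool = True) -> bool:
    """Return True when a new tick cannot start while the previous one still runs.

    cadence_seconds=None is an assumed encoding of the sequential cadence,
    which never overlaps because each tick waits for the previous one.
    """
    if cadence_seconds is None:
        return True
    if not uses_runpod:
        # Short cadences are acceptable when the skip list disables RunPod.
        return True
    return cadence_seconds >= expected_tick_seconds


# A 15-minute cron against a 2-hour RunPod tick would launch ticks concurrently.
print(cadence_is_safe(15 * 60, 2 * 60 * 60))
# The sequential cadence is always safe.
print(cadence_is_safe(None, 2 * 60 * 60))
```

The expected tick duration still has to come from the user or from earlier runs; the check only compares the two numbers.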

Phase 1 — Inspect `hypotheses.md` state

Read the chosen hypotheses.md:

- If the header 状態: (status) is paused / stopped, ask "Resume?" as a single question (Yes → write the status back to in_progress; No → terminate).
- Count the pending (- [ ]) hypotheses in the list. If there are none, tell the user "all consumed; want to enumerate more? → /experiment-hypothesize" and terminate.
- Cache the 実施しない検証 / 評価 section (the skip list: verifications / evaluations not to run) in memory. Each tick passes it through to the child skills unchanged.

Don't use a queue file. hypotheses.md is the single source of truth.
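The Phase 1 inspection might look like the sketch below. The exact layout of hypotheses.md is not specified here, so the sample file is an assumption: a `状態: <value>` header line plus Markdown checkboxes. The helper name `inspect_hypotheses` is hypothetical.

```python
import re


def inspect_hypotheses(text: str) -> dict:
    """Extract the status header and the checkbox counts from hypotheses.md text."""
    state_match = re.search(r"状態:\s*(\w+)", text)
    # Pending rows are unchecked boxes; completed rows are checked boxes.
    pending = len(re.findall(r"^[ \t]*- \[ \]", text, flags=re.MULTILINE))
    done = len(re.findall(r"^[ \t]*- \[[xX]\]", text, flags=re.MULTILINE))
    return {
        "state": state_match.group(1) if state_match else None,
        "pending": pending,
        "done": done,
    }


# Assumed file layout, for illustration only.
sample = """# hypotheses
状態: in_progress

- [x] H1: baseline
- [ ] H2: larger batch
- [ ] H3: reward shaping
"""
print(inspect_hypotheses(sample))
```

Because the file is re-read and re-parsed every time, no queue file or cross-tick state is needed to know how many hypotheses remain.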

Phase 2 — Register the loop

Per the cadence chosen in Phase 0:

`loop` cadence

Call the loop skill as /loop {interval} /experiment. The Phase 3 tick driver re-reads hypotheses.md on every wake-up and decides its own state.

`schedule` cadence

Call the schedule skill to register a cron job whose body is "run a /experiment tick in this worktree". One cron firing = one tick. On stop, cancel the cron.

`逐次` cadence

逐次 (sequential) means no loop / schedule registration. Run a `while not stopped` loop directly in the main session. Each tick runs in the foreground; the next tick starts as soon as the previous one finishes. This cadence has the lowest collision risk when RunPod training is in the loop.
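The three cadence branches above can be summarized as a small dispatcher. This is only a sketch: `registration_for` and the dictionary shape it returns are hypothetical, and the interfaces of the loop and schedule skills are not modeled beyond the `/loop {interval} /experiment` form and the cron body quoted above.

```python
def registration_for(cadence: str) -> dict:
    """Translate a Phase 0 cadence choice into the registration the driver would request."""
    kind, _, arg = cadence.partition(" ")
    if kind == "loop" and arg:
        # The loop skill is called as /loop {interval} /experiment.
        return {"kind": "loop", "command": f"/loop {arg} /experiment"}
    if kind == "schedule" and arg:
        # One cron firing = one tick.
        return {"kind": "schedule", "when": arg,
                "body": "run a /experiment tick in this worktree"}
    if cadence == "逐次":
        # Sequential: nothing is registered; ticks run in the main session.
        return {"kind": "sequential"}
    raise ValueError(f"unknown cadence: {cadence!r}")


print(registration_for("loop 5m"))
print(registration_for("schedule hourly"))
print(registration_for("逐次"))
```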

Phase 3 — Tick driver (1 iteration)

Each tick performs these steps in order (= one hypothesis = one cycle). The driver does not decide deepen / broaden itself — experiment-plan Step 0.5 owns that decision. The driver reads the mode from experiment-plan's report and acts accordingly.

1. Read hypotheses.md (every time; do not hold in-memory state across cron firings).
2. Pre-check stop conditions (only conditions the driver can decide alone):
   - Completed (- [x]) iter count ≥ max iterations → cancel cron / loop, report, terminate.
   - Header 状態: is already stopped (user-set) → exit immediately.
   - 状態: = paused → exit (the previous tick's experiment-plan already set this; surface the broaden suggestion once and stop).

   The "no pending hypothesis" condition is not decided here; let experiment-plan handle it (it may emit deepen instead of stopping).
3. Call experiment-plan with no specific hypothesis: Skill(skill="experiment-plan"). The skill runs Step 0.5 (mode decision) and returns one of three outcomes. The driver reads the mode from the report:

   | Mode reported by experiment-plan | Driver behavior |
   | --- | --- |
   | `list-consume` | `iterN_plan.md` was written for the priority-picked hypothesis. Continue to step 4. |
   | `deepen` | One follow-up hypothesis was appended to `hypotheses.md` and `iterN_plan.md` was written for it. Continue to step 4. |
   | `broaden` | `状態:` was rewritten to `paused`; no plan file. Stop the loop, cancel the cron, surface the "/experiment-hypothesize で再構築推奨" line, and terminate. |

4. Call experiment-execution: skip its Phase 1 (the hypothesis is already pinned by the plan) and run Phases 2–8: reality check → implementation → smoke (mandatory unless the skip list disables it) → tests → push → RunPod → monitoring → auto-recover → evaluation → iterN_result.md → append a provisional row to the Iteration log. Capture run_id and the primary metric.
5. Call experiment-analysis: Phase 1.5 defaults to "直前 iter のみ" ("previous iter only"). When the skip list contains replay 分析を行わない ("do not run replay analysis"), run in skip mode. Phase 4.5 writes the adoption verdict only (checkbox flip + Iteration log row update + 最終更新, the "last updated" field). It does not touch 状態: and does not append new hypotheses.
6. Per-tick report (≤ 3 lines):
   - iter N/{max}: mode (list-consume / deepen) + family/case + hypothesis (one sentence)
   - primary metric + verdict
   - next-tick schedule (schedule → next cron time, loop → interval, 逐次 → "starting next hypothesis immediately"). On the next tick, experiment-plan may emit broaden if rejections accumulated.
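The control flow of one tick can be sketched as follows. The child skills are stood in for by plain callables, so `run_tick`, `pre_check`, and their parameters are hypothetical names for illustration, not the skill's actual interface.

```python
from typing import Callable, Optional


def pre_check(state: str, done: int, max_iterations: Optional[int]) -> Optional[str]:
    """Return a stop reason the driver can decide alone, or None to proceed."""
    if max_iterations is not None and done >= max_iterations:
        return "max iterations reached"
    if state == "stopped":
        return "stopped by user"
    if state == "paused":
        return "paused (broaden suggested)"
    return None


def run_tick(state: str, done: int, max_iterations: Optional[int],
             plan: Callable[[], str],
             execute: Callable[[], None],
             analyze: Callable[[], None]) -> str:
    """Run one tick; the mode comes from the plan step, never from the driver."""
    reason = pre_check(state, done, max_iterations)
    if reason is not None:
        return f"stop: {reason}"
    mode = plan()  # called without a hypothesis so the plan step decides the mode
    if mode == "broaden":
        return "stop: broaden"
    if mode not in ("list-consume", "deepen"):
        raise ValueError(f"unexpected mode: {mode!r}")
    execute()
    analyze()
    return f"continue: {mode}"


# Stub callables stand in for the three child skills.
calls = []
result = run_tick("in_progress", 1, 5,
                  plan=lambda: "deepen",
                  execute=lambda: calls.append("execution"),
                  analyze=lambda: calls.append("analysis"))
print(result, calls)
```

The driver never inspects the hypothesis list to choose a mode; it only reacts to what the plan step reports, which keeps the deepen / broaden decision in one place.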

Phase 4 — Stop and finalize

The loop stops when:

max iterations reached
all hypotheses consumed
genre is paused (3 consecutive rejections) — surface the broaden suggestion
the user says "stop"

A failed tick does not auto-pause the loop. Failures are recorded in the hypotheses.md Iteration log as 失敗 (by experiment-execution); the hypothesis row stays - [ ] so the user can re-queue later. Whether consecutive failures indicate a structural problem is a user judgment call.

On stop:

1. Cancel cron / loop: always remove the schedule job and stop the loop. Tear down any dev/runpod watch you launched. Confirm via dev/runpod ps that no stale cron / pod remains.
2. Update hypotheses.md state: rewrite the header 状態: to completed (all consumed) / paused (consecutive rejections) / stopped (user stop), and append a one-line stop reason to the Iteration log.
3. Aggregate report (5–10 lines, Japanese):
   - completed iter count / max
   - verdict counts (adopted / rejected / inconclusive)
   - highest win-rate iter (family/case + run_id + win-rate)
   - hypotheses.md path + each iter's result.md / analysis.md path
   - recommended next action (e.g. "iter3 で +5pp 採用、dev/runpod promote <run_id> 実行 (user 確認後) → Kaggle submit は別途手動承認" / "ジャンル paused、/experiment-hypothesize で再構築推奨")
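The numeric part of the aggregate report could be computed as in the sketch below. The record fields (`run_id`, `verdict`, `win_rate`) and the function name `summarize` are assumptions for illustration; the real Iteration log format is not shown here, and the sample values are made up.

```python
from collections import Counter
from typing import Optional


def summarize(iterations: list, max_iterations: Optional[int]) -> dict:
    """Aggregate completed count, verdict counts, and the best-win-rate iteration."""
    verdicts = Counter(it["verdict"] for it in iterations)
    best = max(iterations, key=lambda it: it["win_rate"], default=None)
    limit = max_iterations if max_iterations is not None else "unlimited"
    return {
        "completed": f"{len(iterations)}/{limit}",
        "verdicts": {v: verdicts.get(v, 0)
                     for v in ("adopted", "rejected", "inconclusive")},
        "best_run_id": best["run_id"] if best else None,
    }


# Made-up values, for illustration only.
log = [
    {"run_id": "run-a", "verdict": "rejected", "win_rate": 0.41},
    {"run_id": "run-b", "verdict": "adopted", "win_rate": 0.52},
    {"run_id": "run-c", "verdict": "inconclusive", "win_rate": 0.47},
]
print(summarize(log, 5))
```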

Risk gates this skill enforces

Loop is invoked only on explicit request. Single-shot natural-language requests do not trigger it.
Redirect when hypotheses.md is missing. Don't start from an empty list; suggest /experiment-hypothesize.
RunPod GPU cost accumulates. Surface a one-line cost warning when "上限なし" is selected in Phase 0.
Cadence ≥ expected tick duration. Concurrent launches = doubled GPU spend.
Always cancel cron / loop on stop and confirm no pod is left running.
hypotheses.md is the single source of truth. No in-memory state across cron firings. Queue files are abolished.
Mode decision belongs to experiment-plan Step 0.5. The tick driver does not decide list-consume / deepen / broaden itself — it reads the plan's reported mode and stops on broaden.
Adoption write-back belongs to experiment-analysis. The tick driver does not flip checkboxes.
Never auto-spawn experiment-hypothesize on broaden. Surface the suggestion and stop the loop.
Kaggle submission and dev/runpod promote are out of scope. Canonical-weights updates and submissions require explicit user approval.
Pass the hypotheses.md skip list straight through to child skills. Do not override it inside the loop driver.

Common shapes

| User says… | Skill behavior |
| --- | --- |
| `/experiment` | Phase 0 hearing → confirm target `hypotheses.md` → Phase 2 register loop → Phase 3 starts ticks. |
| "imitation/case1 の hypotheses.md を 5 個分回して、夜中に回しておきたい" | Phase 0 confirms path / max=5 / `schedule hourly` (after aligning RunPod tick duration) → register cron → aggregate report in the morning. |
| "止めるまで仮説出し続けたい" | Phase 0 max="上限なし" + RunPod cost warning → recommend `逐次` → report each tick, continue until the user says stop. |
| "仮説 1 個だけ試して" | Not a loop. Redirect to `/experiment-execution`. |
| "今走ってるループ止めて" | Phase 4: cancel cron / loop → set `hypotheses.md` state to `stopped` → aggregate report. |
| "case1 の hypotheses.md がまだ無い" | Redirect: `/experiment-hypothesize で先に作成してください`. |
| "rejected 続きで paused になった" | Surface the broaden suggestion (`/experiment-hypothesize`). No auto-spawn. |

Things to avoid

Pulling single-shot requests into the loop. Without an explicit loop request, call the child skills directly.
Starting the loop without hypotheses.md. Always redirect.
Skipping the cadence-vs-tick-duration alignment check. Concurrent launches double GPU spend.
Forgetting to cancel cron / loop on stop. Stale crons trigger unintended RunPod launches and charges.
Creating a queue file. hypotheses.md is the single store.
Toggling hypothesis-row checkboxes from the tick driver (that is experiment-analysis's job).
Deciding deepen / broaden in the tick driver (that is experiment-plan Step 0.5's job).
Calling experiment-plan with a specific hypothesis from the loop. Always call it without args so Step 0.5 can decide the mode for this tick.
Citing Kaggle publicScore in the verdict.
Auto-running dev/runpod promote or Kaggle submission from the loop.
Auto-spawning experiment-hypothesize after consecutive rejections (suggest only).

Language

Internal reasoning and thinking should be in English
All user-facing output, AskUserQuestion labels/descriptions, and tick reports must be in Japanese (per .claude/CLAUDE.md)
