| name | experiment-hypothesize |
|---|---|
| description | Hypothesis-list builder for Orbit Wars experiments under `backend/pipeline/{imitation|rulebase|reinforce}/case<N>/`. Runs **once** per family / case to enumerate the hypotheses worth testing for that genre via code investigation + (opt-in) web research + a structured hearing, and emits `docs/experiment/{family}/{yyyymmdd}_case{N}_{topic}/hypotheses.md` as a checklist together with a skip list (実施しない検証 / 評価). The downstream `experiment` (loop driver) treats this `hypotheses.md` as the single source of truth. Read-only on code: does NOT implement, NOT push, NOT launch RunPod, NOT write `iterN_plan.md`. Use whenever the user types `/experiment-hypothesize`, or asks to enumerate / brainstorm / organize hypotheses for a case — phrases like "case1 の仮説をリストアップして", "dropout 周りで試すべき仮説を網羅したい", "新しい case の調査方針を整理したい", "hypotheses.md を作りたい", "ジャンル単位でやることを並べて" all count. Don't trigger this skill for single-iteration planning of one hypothesis (use `experiment-plan`), full-pipeline execution (use `experiment-execution`), result interpretation (use `experiment-analysis`), or multi-iteration loop orchestration (use `experiment`). |
# Experiment Hypothesize Skill (Orbit Wars)

Per-case hypothesis-list builder, run once per family / case. The single deliverable is `hypotheses.md`. All downstream skills (`experiment-plan` / `experiment-execution` / `experiment-analysis` / `experiment`) consume this file as input.
## When this skill is in charge

- The user typed `/experiment-hypothesize` (explicit trigger), or
- The user asked, in plain language, for any of:
  - enumerate hypotheses for a given case
  - brainstorm follow-ups around a knob (e.g. dropout)
  - organize an investigation plan before opening a new case
  - draft a `hypotheses.md`
- The user did not ask to plan / implement / run a single hypothesis. Single-shot requests go to `experiment-plan` / `experiment-execution` directly.

If `hypotheses.md` already exists for the target case, ask whether to rebuild, append, or display read-only before proceeding.
## Scope boundaries (what this skill does NOT do)

- Does not write `iterN_plan.md` or any per-hypothesis plan file. That is `experiment-plan`'s job.
- Does not edit code under `backend/pipeline/`.
- Does not commit or push.
- Does not launch RunPod or run any `dev/*` command.
- Does not iterate. The skill writes `hypotheses.md` once and exits.
## Skill flow

5 phases. Phase 1 = code reality check, Phase 2 = opt-in web research, Phase 3 = hypothesis hearing, Phase 4 = skip-list hearing, Phase 5 = write `hypotheses.md`.
### Phase 0 — Confirm the target case

AskUserQuestion, 1 question:

| Q | options |
|---|---|
| 対象 case | imitation/case<既存> / rulebase/case<既存> / reinforce/case<既存> / 新規 case を切る / Other (free-text family/caseN/topic) |

Description on the first question: 各質問で「Other」を選ぶと自由記述も可能です。

For a new case, propose the next free number from `ls backend/pipeline/<family>/`. Resolve `{topic}` (snake_case, 1–3 words) in the same question (Other recommended).

If a `hypotheses.md` already exists at `docs/experiment/<family>/{yyyymmdd}_case{N}_{topic}/hypotheses.md`, follow up with one more AskUserQuestion: 再構築 (back up the existing file and reset) / 追記 (append new hypotheses only) / 読み直すだけ (display current contents and exit).
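The "next free number" rule for a new case can be sketched as follows. This is a minimal sketch: the `case<N>` directory convention comes from this doc, while the helper name is illustrative.

```python
import re
from pathlib import Path

def next_free_case(family_dir: Path) -> int:
    """Smallest unused case number under backend/pipeline/<family>/."""
    used = set()
    for p in family_dir.iterdir():
        m = re.fullmatch(r"case(\d+)", p.name)
        if m and p.is_dir():
            used.add(int(m.group(1)))
    n = 1
    while n in used:
        n += 1
    return n
```

Note that this proposes the smallest gap (e.g. `case3` when `case1`, `case2`, `case4` exist), which is what "next free slot" implies; present the number to the user for confirmation rather than assuming it.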
### Phase 1 — Code investigation (read-only)

Read the relevant code so the upcoming hearing is concrete. No user interaction.

Mandatory reads (skip those that don't apply):

- `ls backend/pipeline/<family>/` — confirm existing case numbers and the next free slot.
- For an existing case: `backend/pipeline/<family>/case<N>/` — `main.py`, `agent.py`, primary policy / feature modules.
- `ls docs/experiment/<family>/` — surface prior experiment directories. Read 1–2 recent `result.md` / `iter*_result.md` files to capture what has already been tried and adopted / rejected.
- `backend/src/dataset/selfplay/agents.py` — `AGENT_REGISTRY` (relevant when registering a new case).

Surface a 3–5 line orientation report to the user (in Japanese):

> pipeline/<family>/case<N>/ の現状: X / Y / Z。過去 iter で採用済み: A、却下済み: B。次の調査余地: C / D。
### Phase 2 — Web research (opt-in)

AskUserQuestion, 1 question:

| Q | options |
|---|---|
| 外部 web 調査 | 不要 — Phase 1 のコード調査で十分 / 必要 — Other で具体的なクエリを指定 / 必要 — トピックを skill 側で提案 |

If the user picks 不要, skip to Phase 3. If 必要, run WebSearch / WebFetch for 2–4 queries (time-box: ≤ 5 minutes). Bias toward authoritative sources (Kaggle writeups, arXiv, official docs, well-cited GitHub). Capture findings as 1–3 bullet points per source — these are folded into Phase 5's References section. If the topic warrants a deeper survey, recommend `/research-retrieval` and stop.
### Phase 3 — Hypothesis hearing

AskUserQuestion runs in at most 2 rounds (max 4 questions per round). Selection format with free-text via "Other".

**Round 1: Draft, edit, prioritize**

Generate 5–10 hypothesis drafts from Phase 1 (code) + Phase 2 (web) findings, present them, then ask:

| Q | options |
|---|---|
| 草案のうち採用するもの (multi-select) | H1: ... / H2: ... / ... / Other (add a new hypothesis as free-text) |
| 削除する仮説 (multi-select、任意) | H3 を消す / H5 を消す / ... / なし |
| 優先順位の付け方 | skill 提案順 (新規性 + 工数低い順) ⭐推薦 / ユーザが手動で並び替え / 期待効果順 / Other |

**Round 2: Shared metadata for the selected hypotheses**

| Q | options |
|---|---|
| 最大 iteration 数 (この hypotheses.md 全体) | 仮説リストを使い切るまで ⭐推薦 / 5 / 10 / Other |
| 主要メトリクス (default) | local self-play win rate vs. {opponent} ⭐推薦 / loss curve のみ / behavior 指標 (例: 序盤 neutral 確保率) / Other |
| 既定 episode 数 | ≥300 ⭐推薦 (memory project_imitation_case1_phase3 由来) / 100 ⚠️ / Other |
| 仮説間の依存関係 | なし / あり (Other で「H4 depends on H1」のように記述) |

Kaggle publicScore is never offered (project rule; memories project_om_finding / project_case5_validation).
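The default prioritization (skill 提案順 = 新規性 + 工数低い順) can be sketched as a sort key. The field names here are hypothetical; only the "novelty first, then low effort" ordering rule comes from this doc.

```python
def propose_order(drafts: list) -> list:
    """Order drafts by the default rule: higher novelty first, then lower effort.

    Each draft is a dict with hypothetical keys "novelty" (higher = newer idea)
    and "effort" (estimated work; lower = cheaper to test).
    """
    return sorted(drafts, key=lambda d: (-d["novelty"], d["effort"]))
```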
### Phase 4 — Skip list hearing (実施しない検証 / 評価)

AskUserQuestion, 1 round, multi-select with Other free-text. Bucketed by category:

**Q1. Analysis skips** (multi-select; if「どれもスキップしない」is selected, ignore the other selections and leave the bucket empty)

- どれもスキップしない (default) — run every analysis item on the skip list (only the default「n<300 で結論を出さない」stays auto ON)
- 300 対戦による評価はしない — 学習中 val_acc / loss curve や 30 ep 挙動確認だけで採否
- replay 分析 (experiment-analysis) は実施しない
- n<300 結果で結論を出さない ⭐default ON
- Other

**Q2. Execution skips**

- どれもスキップしない (default, recommended)
- smoke test (1-episode self-play) を skip
- dev/test-bot を skip
- RunPod GPU を使わない (local CPU only)
- auto-recover loop を使わない (失敗時は手動)
- Other

Conflict checks:

- User selects「Kaggle publicScore を採否に使う」→ reject (project rule; cite the memories). The project defaults「Kaggle publicScore は引用しない」/「skill rating は使わない」are always ON and never asked here.
- User selects「n<300 で結論を出す」→ warn (cite memory project_imitation_case1_phase3) and confirm before allowing.
- No individual confirmation dialog is required for the smoke-test skip / dev/test-bot skip — record the user's choice as-is.

Per-hypothesis exceptions: free-text input via Other. Examples like "Q1 で 300 対戦 skip だが H4 だけは 300 対戦したい" are recorded in Phase 5's 例外条件 section.
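The conflict rules above can be sketched as a small gate. The selection keys are hypothetical; only the reject / warn-then-confirm / record-as-is split mirrors this doc.

```python
FORBIDDEN = {"use_kaggle_public_score"}       # always rejected (project rule)
NEEDS_WARNING = {"conclude_below_300_eps"}    # allowed only after an explicit warning
# Everything else (e.g. smoke-test skip, dev/test-bot skip) is recorded as-is.

def gate_skip_selection(selected: set) -> dict:
    """Split a Phase 4 selection into rejected / warn-first / recorded items."""
    return {
        "reject": selected & FORBIDDEN,
        "warn_then_confirm": selected & NEEDS_WARNING,
        "record_as_is": selected - FORBIDDEN - NEEDS_WARNING,
    }
```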
### Phase 5 — Write hypotheses.md

Output path (per `.claude/rules/docs.md`): `docs/experiment/{family}/{yyyymmdd}_case{N}_{topic}/hypotheses.md`

- `{yyyymmdd}` — the session-injected currentDate (NOT shell `date`)
- `{topic}` — snake_case from Phase 0
- For 追記 (append) mode, write to the same path (Phase 0 has already backed up the existing file).
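The path rules above can be sketched as follows. The helper name is hypothetical; the path shape and the "session currentDate, not shell `date`" rule come from this doc.

```python
def hypotheses_path(family: str, current_date: str, case_n: int, topic: str) -> str:
    """Build docs/experiment/{family}/{yyyymmdd}_case{N}_{topic}/hypotheses.md.

    current_date is the session-injected value (e.g. "2026-01-15");
    never shell out to `date` to obtain it.
    """
    yyyymmdd = current_date.replace("-", "")
    return f"docs/experiment/{family}/{yyyymmdd}_case{case_n}_{topic}/hypotheses.md"
```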
Template:

```markdown
# Hypotheses — {family}/case{N} {topic}

> 作成日: {yyyy-mm-dd}
> 最終更新: {yyyy-mm-dd}
> 状態: in_progress
> 最大 iteration: {N | リスト消化まで}
> 主要メトリクス: {例: ローカル self-play win rate vs. baseline_v5}
> 既定 episode 数: {300 | 100 | ...}

## 実施しない検証 / 評価 (skip list)

### 評価
- {例: ローカル self-play 300 対戦を行わない (学習 loss curve のみで採否)}
- {例: Kaggle publicScore は引用しない (project rule)}

### 分析
- {例: replay 分析は実施しない}
- {例: n<300 結果で結論を出さない}

### 実行
- {例: なし | smoke test skip ⚠️ | RunPod を使わない}

### 例外条件
- {例: H4 のみ「採否が inconclusive なら 300 対戦を追加実施」}

## 仮説リスト (priority 順)
- [ ] (P1) H1: {一文 hypothesis} — {期待効果 / mechanism}
- [ ] (P1) H2: {...}
- [ ] (P2, depends on H1) H3: {...}
- [ ] (P3) H4: {...}
- [ ] (deferred) H5: {...} <!-- 工数大 / 後回し -->

## Iteration log
(各 iter 完了時に experiment-analysis / experiment が追記)

| iter | 開始 | 仮説# | plan path | run_id | 主要メトリクス | 採否 | result path |
|---|---|---|---|---|---|---|---|

## 参考 (References) <!-- Phase 2 を実施した場合のみ -->
- {URL} — {1 行所感}
```
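Downstream skills read the 仮説リスト checklist back out of `hypotheses.md`; a minimal parser sketch for the entry format above (the helper is hypothetical; the line shape comes from this template):

```python
import re

# Matches lines like: - [ ] (P2, depends on H1) H3: {hypothesis} — {mechanism}
ENTRY = re.compile(r"- \[(?P<done>[ x])\] \((?P<prio>[^)]+)\) (?P<hid>H\d+): (?P<body>.+)")

def parse_entry(line: str):
    """Parse one 仮説リスト line; returns None for non-hypothesis lines."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    return {
        "done": m.group("done") == "x",   # checked box = already run
        "priority": m.group("prio"),      # e.g. "P1", "P2, depends on H1", "deferred"
        "id": m.group("hid"),
        "body": m.group("body"),
    }
```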
Write directly via Write, then report 3–5 lines to the user (in Japanese):

- written path
- hypothesis count + priority distribution
- key skip-list entries
- next-step suggestion: `/experiment` to consume the list, or `/experiment-plan` for a single iteration

Do not auto-spawn `experiment-plan` / `experiment`. Wait for the user to invoke them.
## Risk gates this skill enforces

- Never offer Kaggle publicScore as a metric. The skip list keeps it default ON; reject if the user tries to remove it.
- Warn on n<300. When a 100-episode evaluation is selected in Phase 4, surface a one-line warning citing memory `project_imitation_case1_phase3`.
- Record smoke-test / dev/test-bot skips as-is. Per Phase 4, these selections take effect without an extra confirmation dialog.
- Do not edit code. Read-only; only `hypotheses.md` is written.
- Time-box web research to ≤ 5 minutes. Deeper surveys belong to `/research-retrieval`.
- Don't silently overwrite an existing `hypotheses.md`. Phase 0 must offer rebuild / append / read-only.
## Common shapes

| User says… | Skill behavior |
|---|---|
| `/experiment-hypothesize imitation/case1 dropout` | Phase 0 skipped (target given). Phase 1 reads case1. Phase 2 likely 不要. Phases 3–5 produce 5–10 hypotheses + skip list and write `hypotheses.md`. |
| "case4 の harass 周りで仮説を並べて、Lux S3 の解法も参考にして" | Phase 0 confirms case4. Phase 2 runs web research on Kaggle Lux writeups. |
| "case1 の hypotheses.md もう古いから作り直して" | Phase 0 picks 再構築, backs up the existing file, and resets. |
| "実験プランを書いて" (vague) | Belongs to `experiment-plan`. Redirect. |
## Things to avoid

- Running this skill more than once for the same case to grow `hypotheses.md`. Rule: 1 case = 1 `hypotheses.md`.
- Running Phase 2 web research for more than 5 minutes. Deeper surveys belong to `/research-retrieval`.
- Writing `iterN_plan.md` content into `hypotheses.md`. Keep file responsibilities separate.
- Leaving the skip list empty. Project defaults (Kaggle publicScore は引用しない, n<300 で結論を出さない) must always appear.
- Overwriting an existing `hypotheses.md` without user confirmation.
- Including Kaggle publicScore as a success metric on any hypothesis.
- Auto-spawning `experiment-plan` or `experiment` after writing the file. Suggest them by name only.
## Language

- Internal reasoning and thinking should be in English.
- All user-facing output, AskUserQuestion labels/descriptions, and the written `hypotheses.md` body must be in Japanese (per `.claude/CLAUDE.md`).