| name | projectclownfish-cluster-worker |
| description | Use when operating ProjectClownfish cluster workflows in this repo: reviewing GitHub Actions run artifacts, tuning worker/applicator guardrails, importing ghcrawl clusters, dispatching canaries, scaling batches, or deciding whether autonomous write-mode cluster cleanup is safe to ramp. |
ProjectClownfish Cluster Workflow
Use this skill for ProjectClownfish operations in this repo. It is not just a one-job worker skill; it is the fast path for running the whole guarded cluster loop.
Hard Stance
- Execute/autonomous gates default on for live ProjectClownfish work. Set CLOWNFISH_ALLOW_EXECUTE=0 or CLOWNFISH_ALLOW_FIX_PR=0 only when intentionally pausing mutations.
- Do not dispatch a broad write-mode batch if the last canary failed, has unreviewed artifacts, or any stale pre-fix run is still active.
- Treat GitHub Actions repository variables as captured at workflow start. Resetting CLOWNFISH_ALLOW_EXECUTE does not revoke already-started runs.
- If a stale run was started with old guardrails and write mode enabled, cancel it before scaling.
- Codex workers never mutate GitHub directly. They emit JSON; scripts/execute-fix-artifact.mjs owns guarded fix PR execution and scripts/apply-result.mjs owns guarded close/merge replay.
- Only the applicator may record executed. Worker output containing executed is a bug.
- Closed historical refs are evidence only. They must not receive close_* actions.
- Security-sensitive refs do not belong in ProjectClownfish mutation. Quarantine vulnerability, advisory, CVE/GHSA, leaked secret, credential/token/API-key, plaintext secret storage, SSRF/XSS/CSRF/RCE, security-class injection, exploitability, or sensitive-data exposure refs with route_security, and keep unrelated non-security work moving.
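The quarantine rule for security-sensitive refs can be pictured as a coarse keyword screen. The sketch below is illustrative only: the function names and the pattern list are assumptions derived from the categories named above, not the repo's actual routing logic, and a keyword screen like this will both over- and under-match compared with a hydrated review.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when the text looks security-sensitive.
# The pattern list is an assumption based on the categories listed above.
is_security_sensitive() {
  local text pattern
  text=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  # Short acronyms are matched only as whole words so "resource" does not hit "rce".
  pattern='vulnerab|advisory|cve-[0-9]|ghsa-|leaked secret|credential|api[- ]key|plaintext secret|exploit|(^|[^a-z])(ssrf|xss|csrf|rce)([^a-z]|$)'
  printf '%s' "$text" | grep -Eq "$pattern"
}

# Hypothetical helper: prints the routing decision for one ref title.
route_for_ref() {
  if is_security_sensitive "$1"; then
    echo route_security
  else
    echo continue
  fi
}
```

For example, route_for_ref "Possible RCE via plugin loader" prints route_security, while route_for_ref "Improve resource loading" prints continue, so only the flagged ref is quarantined and the rest of the cluster keeps moving.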
Recovery Check
Start every session here:
pwd
git rev-parse --show-toplevel
git branch --show-current
git status --short --branch
df -h .
test -e node_modules && { ls -ld node_modules; test -L node_modules && echo node_modules_symlink=yes || echo node_modules_symlink=no; } || echo node_modules_missing
Then check live workflow state and safety vars:
gh run list --repo openclaw/clownfish --workflow cluster-worker.yml --limit 10 \
--json databaseId,headSha,status,conclusion,createdAt,updatedAt,url \
--jq '.[] | {databaseId,headSha,status,conclusion,createdAt,updatedAt,url}'
gh variable list --repo openclaw/clownfish --json name,value \
--jq 'map(select(.name|test("^CLOWNFISH_"))) | sort_by(.name) | .[] | {name,value}'
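To turn the run listing into a stale-run check, one option is to filter for runs that are still active but were started on a different SHA. This is a minimal sketch: the function name stale_active_runs and the three-column TSV input are assumptions, and it treats any status other than completed as active.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Reads TSV rows "databaseId<TAB>headSha<TAB>status" on stdin
# and prints the IDs of runs that are not completed and not on the given SHA.
# One way to produce the rows (assumed invocation):
#   gh run list --repo openclaw/clownfish --workflow cluster-worker.yml \
#     --json databaseId,headSha,status \
#     --jq '.[] | [.databaseId, .headSha, .status] | @tsv'
stale_active_runs() {
  local current_sha="$1"
  awk -F'\t' -v sha="$current_sha" '
    $3 != "completed" && $2 != sha { print $1 }
  '
}
```

Each printed ID is a candidate for gh run cancel RUN_ID --repo openclaw/clownfish before any scaling.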
Review Results
For every completed run that matters:
rm -rf /tmp/clownfish-check-RUN_ID
mkdir -p /tmp/clownfish-check-RUN_ID
gh run download RUN_ID --repo openclaw/clownfish --dir /tmp/clownfish-check-RUN_ID
npm run review-results -- /tmp/clownfish-check-RUN_ID
Summarize artifacts:
find /tmp/clownfish-check-RUN_ID -name result.json -print -quit |
xargs jq '{status,summary,actions_total:(.actions|length),action_counts:(.actions|group_by(.action)|map({action:.[0].action,count:length}))}'
find /tmp/clownfish-check-RUN_ID -name apply-report.json -print -quit |
xargs jq '{totals:{executed:([.actions[]? | select(.status=="executed")]|length),blocked:([.actions[]? | select(.status=="blocked")]|length),skipped:([.actions[]? | select(.status=="skipped")]|length),planned:([.actions[]? | select(.status=="planned")]|length)}}'
find /tmp/clownfish-check-RUN_ID -name fix-execution-report.json -print -quit |
xargs jq '{status,actions}'
If review fails, inspect the failure class before doing anything else:
- executed in worker result: tighten schema, prompts, and scripts/review-results.mjs.
- close action targets closed item: tune prompts and planner so closed context refs are evidence-only; use keep_closed.
- long Run worker: reduce prompt size by hydrating canonical + open candidates only; add/verify CLOWNFISH_CODEX_TIMEOUT_MS.
- applicator blocked because target changed: rerun the job against fresh state, do not force apply.
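For the first failure class, a quick text-level screen can confirm whether a downloaded worker result mentions executed at all before you dig into prompts. This is a sketch under assumptions: the function name is hypothetical, it matches the quoted string anywhere in the file, and it does not replace the deterministic gate in scripts/review-results.mjs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the worker result file contains the
# quoted string "executed" anywhere, which the rules above treat as a bug.
worker_result_claims_executed() {
  grep -q '"executed"' "$1"
}

# Example use against a downloaded artifact (path is a placeholder):
#   if worker_result_claims_executed /tmp/clownfish-check-RUN_ID/.../result.json; then
#     echo "worker emitted executed: tighten schema and prompts"
#   fi
```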
Tune Engine
Use repo scripts and prompts as the control plane:
- schemas/codex-result.schema.json: what Codex may emit.
- prompts/worker-system.md, prompts/autonomous.md, prompts/execute.md, prompts/plan-only.md: worker behavior.
- instructions/low-signal-prs.md: opt-in manual backlog cleanup policy for random docs churn, blank-template PRs, test-only spam, third-party capability PRs that belong on ClawHub, risky infra drive-bys, and dirty branches.
- scripts/review-results.mjs: deterministic artifact gate.
- scripts/plan-cluster.mjs: what gets hydrated into the prompt.
- scripts/execute-fix-artifact.mjs: deterministic branch repair/replacement PR gate.
- scripts/apply-result.mjs: deterministic mutation gate.
- scripts/post-flight.mjs: deterministic post-execution finalizer for ProjectClownfish fix PRs and post-merge closeouts.
- scripts/import-ghcrawl-low-signal-prs.mjs: local ghcrawl open-PR scanner for opt-in low-signal cleanup jobs.
- .github/workflows/cluster-worker.yml: runner behavior and env capture.
Current autonomy posture:
- Hydrate comments and PR review comments by default before model execution.
- Hydrate cluster refs and bounded first-hop linked refs so closed representative drift can often be resolved without human review.
- Treat failing checks as a merge/fixed-by-candidate blocker, not a reason to stop classifying the whole cluster.
- Treat security-sensitive refs as scoped quarantine. Emit/expect route_security for that ref only; keep processing unrelated non-security duplicates, bugs, provider gaps, and fix artifacts.
- Treat missing merge_preflight as a hard merge blocker. Merge preflight must prove security clearance, resolved human comments, resolved review-bot comments, passed Codex /review, addressed findings, and validation commands.
- For openclaw/openclaw fix artifacts, validation commands must use repo-native pnpm lanes such as pnpm test:serial <path-or-filter...>, pnpm -s vitest run <files>, and pnpm check:changed; npm run validate is not a valid target-repo gate.
- Let execute-fix-artifact run the agentic merge-prep loop for fix PRs: edit, validate, Codex /review, address findings, revalidate, then resolve review threads when CLOWNFISH_RESOLVE_REVIEW_THREADS=1.
- Prepare the target repo toolchain before the agentic edit/review loop. For OpenClaw this means Node 22+, Corepack, the target packageManager, and pnpm install --frozen-lockfile; only set CLOWNFISH_INSTALL_TARGET_DEPS=0 for debugging failed setup.
- Review failed worker artifacts before requeueing. The workflow must upload worker artifacts even when review-results fails; a missing artifact after a failed review is a workflow bug, not an acceptable blind retry.
- Replacement fix PR execution must use the recoverable target branch clownfish/<cluster-id>. If that branch already exists, resume it instead of starting from scratch. After agent edits and review-fix edits, commit and push checkpoint commits to that branch before expensive validation/review gates so a timed-out run can be requeued without losing the patch. Do not open the PR until validation and Codex /review pass.
- Resumed replacement branches may be rebased and narrowly refactored onto current origin/main. If the rebase conflicts, let the executor run the Codex rebase-repair loop, resolve conflict markers, continue the rebase, then proceed through the normal validation/review gate. Tune CLOWNFISH_REBASE_REPAIR_ATTEMPTS instead of disabling the rebase gate.
- Useful but uneditable or unsafe source PRs are replacement candidates, not human blockers. When a canonical PR is draft, stale, unmergeable, has maintainer_can_modify=false, or has broad unrelated churn, emit or execute replace_uneditable_branch with full source PR credit instead of waiting for a maintainer decision.
- Fix execution should provide Codex actual repo-discovery context before editing; repeated "no target repo changes" means tune scripts/execute-fix-artifact.mjs before replaying more jobs. GitHub Actions may block Codex bwrap write/review sandboxes, so write-mode and review execution default to danger-full-access there after tokens are stripped from the Codex environment. A Codex write preflight must fail fast before the expensive repair loop if sandbox/auth/write access is broken; do not wait through multi-attempt edits to discover startup failures. Keep canary execution bounded: default worker timeout is 30 minutes, build-PR step timeout is 30 minutes, fix Codex edit budget is 20 minutes with reserve for artifact writing, preflight timeout is 2 minutes, Codex model is gpt-5.5, and Codex reasoning effort is medium. Worker timeout/failure and exhausted /review attempts must write blocked artifacts and keep the workflow reporting path alive. Fix executor runs must copy Codex debug logs into the run artifact so timeout failures are inspectable.
- Match OpenClaw's CI fast lane for fix validation. Use blacksmith-4vcpu-ubuntu-2404 for cluster planning/review and blacksmith-16vcpu-ubuntu-2404 for fix/apply execution. The executor sets OPENCLAW_LOCAL_CHECK=0 and treats pnpm check:changed plus diff checks as the default hard gate. It normalizes target validation commands to pnpm check:changed unless CLOWNFISH_TARGET_VALIDATION_MODE=strict or CLOWNFISH_STRICT_TARGET_VALIDATION=1 is explicitly set, so unrelated flaky main CI and broad suites do not block narrow ProjectClownfish fixes.
- After fix execution, run post-flight finalization before the final closeout replay. Post-flight may merge only ProjectClownfish-opened/pushed fix PRs, only after merge preflight, security clearance, resolved review threads, and non-ignored checks are clean. Default ignored checks are auto-response, Labeler, and Stale; configure CLOWNFISH_POST_FLIGHT_IGNORE_CHECKS rather than broadening the hard gate in code.
- Prefer keep_related, keep_independent, keep_closed, fix_needed, route_security, and subcluster notes over blanket needs_human.
- Use needs_human only for the exact maintainer decision still unresolved after hydrated evidence is reviewed.
- Worker results must use one action per issue/PR ref. Never emit comma-separated action targets; related follow-up subclusters should be one keep_related action per ref or one cluster-scoped fix_needed action.
- Close-action canonical, duplicate_of, and candidate_fix refs must come from hydrated preflight items. If a PR is only mentioned in comments or previous ProjectClownfish notes, keep it as evidence/fix-artifact context until a refreshed plan hydrates it.
- Broad feature/config/docs rewrites are not autonomous executor work. If a fix artifact crosses many implementation, config/schema, docs, and test surfaces, split it into narrower follow-up jobs or let execute-fix-artifact block it. Override only with CLOWNFISH_ALLOW_BROAD_FIX_ARTIFACTS=1.
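The validation-normalization rule in the posture list above (fall back to pnpm check:changed unless strict mode is explicitly set) can be sketched as a small function. The function name normalize_validation_command is hypothetical and the real executor may handle more cases; only the two environment variables and the default command come from this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the validation command that would actually run.
# Strict mode passes the requested command through; otherwise the default
# hard gate "pnpm check:changed" replaces it.
normalize_validation_command() {
  local requested="$1"
  if [ "${CLOWNFISH_TARGET_VALIDATION_MODE:-}" = "strict" ] ||
     [ "${CLOWNFISH_STRICT_TARGET_VALIDATION:-0}" = "1" ]; then
    printf '%s\n' "$requested"
  else
    printf '%s\n' "pnpm check:changed"
  fi
}
```

With neither variable set, a requested pnpm test:serial lane is replaced by pnpm check:changed; setting CLOWNFISH_STRICT_TARGET_VALIDATION=1 keeps the requested command as written.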
After tuning, run:
node --check scripts/plan-cluster.mjs
node --check scripts/import-ghcrawl-clusters.mjs
node --check scripts/run-worker.mjs
node --check scripts/post-flight.mjs
npm run validate
git diff --check
Do a narrow planner smoke before committing hydration changes:
rm -rf /tmp/clownfish-plan-check
node scripts/plan-cluster.mjs jobs/openclaw/ghcrawl-143793-autonomous-smoke.md \
--offline --run-dir /tmp/clownfish-plan-check
jq '{items:(.items|length),seed_refs:(.scope.seed_refs|length),context_refs:(.scope.context_refs|length),hydrate_cluster_refs:.scope.hydrate_cluster_refs}' \
/tmp/clownfish-plan-check/cluster-plan.json
For a needs-human reduction smoke, verify the artifact includes real comment
and review-comment excerpts:
jq '{items:(.items|length), comment_items:([.items[] | select(.comments_hydrated > 0)] | length), review_comment_prs:([.items[] | select(.pull_request.review_comments_hydrated > 0)] | length)}' \
/tmp/clownfish-plan-check/cluster-plan.json
Generate Batch Jobs
Use ghcrawl read-only inspection first:
ghcrawl doctor --json
ghcrawl configure --json
ghcrawl clusters openclaw/openclaw --min-size 2 --limit 80 --sort size --json |
jq -r '.clusters[] | select(.isClosed == false) | [.clusterId,.totalCount,.issueCount,.pullRequestCount,.latestUpdatedAt,.displayTitle] | @tsv'
rg -o 'ghcrawl-[0-9]+' jobs/openclaw -g '*.md' |
sed -E 's/.*ghcrawl-([0-9]+).*/\1/' | sort -n | uniq | tr '\n' ' '
Pick the largest active clusters not already imported, then generate autonomous job files:
node scripts/import-ghcrawl-clusters.mjs --from-ghcrawl --limit 40 \
--repo openclaw/openclaw \
--mode autonomous \
--suffix autonomous-smoke \
--allow-instant-close \
--allow-merge \
--allow-fix-pr \
--allow-post-merge-close
The importer skips existing ghcrawl IDs and fully security-sensitive clusters by default. Mixed clusters are allowed so the worker can route security refs and continue ordinary bug/dedupe work.
Validate before committing:
npm run validate
Commit engine changes separately from generated job batches when practical:
git add prompts schemas scripts .github
git commit -m "fix: scope autonomous cluster workflow"
git add jobs/openclaw/ghcrawl-*-autonomous-smoke.md
git commit -m "chore: add next autonomous cluster jobs"
git push origin main
Dispatch Policy
Do not jump straight to 20 write-mode jobs. Sequence:
- Ensure no stale active runs on old SHAs.
- Ensure CLOWNFISH_ALLOW_EXECUTE=1 and CLOWNFISH_ALLOW_FIX_PR=1 unless the operator intentionally paused live work.
- Dispatch 2-3 canaries on the latest pushed SHA.
- Review artifacts and applicator reports.
- Only then dispatch a wider batch.
Canary dispatch:
npm run dispatch -- \
jobs/openclaw/ghcrawl-ID1-autonomous-smoke.md \
jobs/openclaw/ghcrawl-ID2-autonomous-smoke.md \
--mode autonomous \
--runner blacksmith-4vcpu-ubuntu-2404 \
--execution-runner blacksmith-16vcpu-ubuntu-2404
Important: after dispatch, already-started runs keep the write gate they captured. If a new bug is found, cancel those runs.
Single-job requeue after calibration:
npm run requeue -- 24947178021
npm run requeue -- 24947178021 --execute --open-execute-window \
--runner blacksmith-4vcpu-ubuntu-2404 \
--execution-runner blacksmith-16vcpu-ubuntu-2404
Use a run id when you want to replay the same source job from an artifact, or a
job path when you already know the file. The script opens both mutation gates for
live execute/autonomous requeues and closes them after the queued run starts.
For plan-only scaling, keep the write gate off and dispatch with --mode plan or --dry-run where appropriate.
Low-Signal PR Sweeps
Use this only for manual backlog cleanup and random drive-by PR triage. It is not dedupe and it must stay separate from duplicate/superseded/fixed-by-candidate closeouts.
Generate staged jobs from local ghcrawl data:
npm run import-low-signal -- --limit 20 --batch-size 5 --mode autonomous --sort stale
Generated jobs set triage_policy: low_signal_prs and allow_low_signal_pr_close: true. The worker may emit close_low_signal only for open pull requests that pass instructions/low-signal-prs.md.
Before live dispatch:
- inspect the generated job candidates;
- commit and push the jobs so Actions can read them;
- dispatch 1-2 canaries first;
- review artifacts before scaling the next batches.
Self-Heal Failed Jobs
Use self-heal after reviewing the failed artifacts and tuning obvious deterministic guardrail issues.
Dry-run candidate selection:
npm run self-heal
This selects only the latest failed run per source job, skips jobs that have a later success, and skips jobs already retried in results/self-heal.json.
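The selection rule can be sketched in awk to make the three conditions concrete. Everything here is an assumption for illustration: the function name, the TSV input shape (job path, run id, conclusion), the idea that larger run ids are later runs, and a ledger reduced to one retried run id per line. The real ledger is results/self-heal.json, whose format is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper. stdin: TSV rows "job<TAB>run_id<TAB>conclusion".
# $1: file listing already-retried run ids, one per line (stand-in for the ledger).
# Prints "job<TAB>run_id" for the latest failed run of each job, unless a later
# success exists or that run was already retried.
select_self_heal_candidates() {
  local retried_file="$1"
  awk -F'\t' -v retried_file="$retried_file" '
    BEGIN { while ((getline line < retried_file) > 0) retried[line] = 1 }
    $3 == "failure" && $2 + 0 > fail[$1] + 0 { fail[$1] = $2 }
    $3 == "success" && $2 + 0 > ok[$1] + 0   { ok[$1] = $2 }
    END {
      for (job in fail)
        if (fail[job] + 0 > ok[job] + 0 && !(fail[job] in retried))
          print job "\t" fail[job]
    }
  ' | sort
}
```

A job whose failure was followed by a success drops out, as does a job whose latest failed run already appears in the retried list.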
Live one-attempt retry:
npm run self-heal -- --execute --open-execute-window --max-jobs 5 \
--runner blacksmith-4vcpu-ubuntu-2404 \
--execution-runner blacksmith-16vcpu-ubuntu-2404
The local live path temporarily opens gates only when needed, dispatches the retry jobs, waits until the new runs have started, records the ledger, and restores the prior gate values.
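The open-then-restore behavior described above follows a common save, set, run, restore pattern. The sketch below shows that pattern with a file-backed stand-in for the gate store so it runs without GitHub access; get_gate, set_gate, and with_open_gates are hypothetical names, and a real version would presumably write the variables with something like gh variable set NAME --body VALUE --repo openclaw/clownfish.

```shell
#!/usr/bin/env bash
# Stand-in gate store (assumption): one file per variable in a temp directory.
GATE_DIR=$(mktemp -d)
get_gate() { cat "$GATE_DIR/$1" 2>/dev/null || echo 0; }
set_gate() { printf '%s\n' "$2" > "$GATE_DIR/$1"; }

# Runs the given command with both mutation gates open, then restores the
# prior values even if the command fails, and returns the command's status.
with_open_gates() {
  local prior_execute prior_fix_pr status=0
  prior_execute=$(get_gate CLOWNFISH_ALLOW_EXECUTE)
  prior_fix_pr=$(get_gate CLOWNFISH_ALLOW_FIX_PR)
  set_gate CLOWNFISH_ALLOW_EXECUTE 1
  set_gate CLOWNFISH_ALLOW_FIX_PR 1
  "$@" || status=$?
  set_gate CLOWNFISH_ALLOW_EXECUTE "$prior_execute"
  set_gate CLOWNFISH_ALLOW_FIX_PR "$prior_fix_pr"
  return "$status"
}
```

Because already-started runs keep the gate values they captured, restoring the prior values after the new runs have started does not affect them.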
If using the manual self-heal failed clusters workflow, keep it dry-run by default. For execute mode, open the execution gate before triggering the workflow; otherwise it should fail before dispatching write-mode jobs.
Ramp Decision
Say "safe to ramp" only when all are true:
- latest canaries run on the current SHA;
- no worker result uses executed;
- no close action targets a closed item;
- applicator executed only planned duplicate/superseded/fixed-by-candidate close actions or guarded clean merge actions;
- every merge action had passing merge_preflight, and live GitHub review threads were resolved before merge;
- useful contributor PRs were either repaired when maintainer-editable or have a replacement fix artifact with source PR credit before superseded closeout;
- CLOWNFISH_ALLOW_EXECUTE and CLOWNFISH_ALLOW_FIX_PR are back to their intended default values;
- active runs are expected and on the intended SHA;
- artifacts are downloaded or easy to retrieve by run URL.
If not, say exactly what blocked the ramp and patch that first.
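The all-must-be-true rule, with exact blockers reported, can be expressed as a small checklist function. This is only a sketch: ramp_decision and the name=yes/no argument convention are hypothetical, and the operator still has to establish each answer from the reviewed artifacts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: takes checks as "name=yes" or "name=no" arguments.
# Prints "safe to ramp" only if every check is yes; otherwise prints one
# "blocked: <name>" line per failing check and returns non-zero.
ramp_decision() {
  local blocked=0 check
  for check in "$@"; do
    case "$check" in
      *=yes) ;;
      *) echo "blocked: ${check%%=*}"; blocked=1 ;;
    esac
  done
  if [ "$blocked" -eq 0 ]; then
    echo "safe to ramp"
  fi
  return "$blocked"
}
```

For instance, ramp_decision canaries_on_current_sha=yes no_worker_executed=no prints blocked: no_worker_executed, which names the item to patch first.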