senpai-status-check

Name: Senpai Status Check
Author: wandb

Produce a fresh status report for the senpai ML experiment fleet. Use when the user asks for an experiment status, final status, PR/W&B/pod health check, stale student triage, training shutdown harvest, advisor-state audit, or a "what is really happening right now?" report. The report must prioritize paper-facing test metrics over validation metrics and compare test results to dataset benchmarks or targets.


stars:13

forks:3

updated: 25 April 2026, 12:15

SKILL.md


related-skills.json

same repository

bootstrap-target.md

from "wandb/senpai"

Create or improve Senpai target-repository onboarding files: program.md plus instructions/prompt-advisor.md and instructions/prompt-student.md. Use this skill whenever the user wants to point Senpai at a fresh ML or research target repository, define the research objective, primary metric, benchmark contract, allowed edit boundaries, W&B reporting contract, advisor/student prompts, or prepare a repo for autonomous advisor/student experiment loops.

2026-05-21 (stars: 13)

senpai-gh.md

from "wandb/senpai"

GitHub CLI primitives for the senpai research workflow — label swaps, send-back, close, mark-review, issue checks, PR queries. Use this skill whenever you need to manipulate PR labels, send a PR back to a student, close a dead-end experiment, mark a PR for review, or query the current state of PRs and issues. Also triggers for: "swap labels", "send back to student", "close this PR", "mark for review", "check human issues", "list review-ready PRs", "idle students".

2026-05-14 (stars: 13)

survey-prs.md

from "wandb/senpai"

Survey all experiment PRs on a branch and return a structured status report: which students are idle, which PRs await review, which are WIP. This is the heartbeat query — use it to understand the current state of the research track. Triggers for: "survey state", "check PR status", "who's idle", "any PRs ready for review", "what's the current state".

2026-05-14 (stars: 13)

merge-winner.md

from "wandb/senpai"

Squash-merge a winning experiment PR and update the baseline. Handles the merge, BASELINE.md update, commit, push, and branch pull. Also handles merge conflicts by sending the PR back for rebase. Use this skill to: merge a winning PR, update baseline, squash merge experiment. Triggers for: "merge winner", "merge this PR", "update baseline after merge", "squash merge".

2026-05-04 (stars: 13)

submit-experiment-results.md

from "wandb/senpai"

Submit experiment results for advisor review. Commits changes, pushes the branch, marks the PR as ready, and swaps the status label from wip to review. Use this skill when you've finished running experiments and posted your results comment. Triggers for: "submit for review", "mark PR ready", "send results to advisor", "submit experiment results".

2026-05-04 (stars: 13)

wandb-primary.md

from "wandb/senpai"

Comprehensive primary skill for agents working with Weights & Biases. Covers both the W&B SDK (training runs, metrics, artifacts, sweeps) and the Weave SDK (GenAI traces, evaluations, scorers). Includes helper libraries, gotcha tables, and data analysis patterns. Use this skill whenever the user asks about W&B runs, Weave traces, evaluations, training metrics, loss curves, model comparisons, or any Weights & Biases data — even if they don't say "W&B" explicitly. Also trigger on training-curve diagnostics questions — run health, divergence, overfit/convergence/plateau, spikes, LR-schedule/grad-norm/grad-histogram reading, dead layers, step-axis choice, and run comparisons.

2026-04-28 (stars: 13)

package.json

"author": "wandb"

"repository": "wandb/senpai"



Experiment Status Check

Create a timestamped analysis/STATUS_*.md report that answers:

What are the best paper-facing test results, and how do they compare to benchmark targets?
Is the advisor/student system actually doing useful work, or is it stalled, mislabeled, sleeping, or optimizing the wrong thing?

The core distinction is "is the fleet alive?" versus "is useful science happening?" Keep those separate throughout the report.

Why this skill exists

This is an intervention tool, not a newsletter. It should help decide whether to keep training, harvest results, repair operations, or restart with new instructions. It guards against:

Validation drift: agents find plausible validation wins and accidentally report them like paper wins. The paper needs test metrics.
Fleet illusion: pods can be Running while student sessions are idle, sleeping, mis-assigned, or done.
Assignment invisibility: a PR without the right student:<name> and branch labels can look assigned to the advisor while being invisible to the student pod.
Stale advisor state: CURRENT_RESEARCH_STATE.md can lag behind GitHub, W&B, and pod reality.
Experiment sprawl: many PRs can be active while only a few have a credible path to improving paper-facing test results.

Non-negotiables

Put test metrics before validation metrics. Validation is a steering signal, not the paper result. This prevents the report from overstating progress.
Always pair a test metric with its benchmark target/reference and a gap/read: "beats", "misses by X", "internal benchmark only", or "no external scalar".
Include source and caveat columns for metrics: W&B run ID, PR number, full-eval vs batch-limited, best-checkpoint vs final epoch, truncated vs full.
Do not trust advisor state files as source of truth until cross-checked against GitHub labels, W&B, and pod/process state.
Do not treat a Running pod as active training. Check for real train.py processes and recent W&B runs.
Surface broken labels, pods, polling, or advisor state in the executive read.
Do not report "SOTA" from validation unless the benchmark contract is validation-based. For this programme, most paper claims are test-facing.
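The pairing rule above (test metric, target, gap/read) can be sketched as a small helper. This is a minimal illustration, not part of the skill: the function name `gap_read`, its parameters, and the example numbers are hypothetical, and the "beats by X" wording adds a margin to the plain "beats" read.

```python
from typing import Optional


def gap_read(value: float, target: Optional[float],
             lower_is_better: bool = True, external: bool = True) -> str:
    """Return a short gap/read string for one test metric (hypothetical helper)."""
    if target is None:
        # Nothing to compare against.
        return "no external scalar"
    if not external:
        # A target exists but it is only an internal reference.
        return "internal benchmark only"
    # Positive gap means the result is worse than the target.
    gap = value - target if lower_is_better else target - value
    if gap < 0:
        return f"beats by {abs(gap):.4g}"
    return f"misses by {gap:.4g}"


# Illustrative numbers only; they are not results from any real run.
print(gap_read(4.2, 5.0))
print(gap_read(5.5, 5.0))
print(gap_read(1.0, None))
```

A tie is reported as "misses by 0" here, which keeps the read conservative.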

Evidence order and why

Metrics: W&B summary/history first, then PR comments with run IDs, then advisor state, then old status files. W&B is closest to the run; old status files are continuity, not truth.
Operations: Kubernetes process state first, then raw .claude JSONL logs, then GitHub labels/timestamps, then advisor state. Pod/process truth shows what is actually happening now.
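One way to apply the two evidence orders is to take the value from the highest-priority source that has one and list any lower-priority sources that disagree. The sketch below assumes observations arrive as a plain dict; the source keys (`wandb`, `pr_comment`, and so on) and the function name are hypothetical labels chosen for this example.

```python
from typing import Any, Dict, List, Optional, Tuple

# Priority orders taken from the text above; the key names are hypothetical.
METRIC_SOURCES = ["wandb", "pr_comment", "advisor_state", "old_status"]
OPS_SOURCES = ["k8s_process", "claude_jsonl", "github_labels", "advisor_state"]


def pick_evidence(observations: Dict[str, Any],
                  order: List[str]) -> Tuple[Optional[str], Any, List[str]]:
    """Return (source, value, conflicting_sources) using the given priority order."""
    available = [(s, observations[s]) for s in order
                 if observations.get(s) is not None]
    if not available:
        return None, None, []
    top_source, top_value = available[0]
    # Lower-priority sources that contradict the trusted value are worth reporting.
    conflicts = [s for s, v in available[1:] if v != top_value]
    return top_source, top_value, conflicts


# Illustrative values only.
obs = {"advisor_state": 0.12, "wandb": 0.15}
print(pick_evidence(obs, METRIC_SOURCES))
```

The conflict list is what feeds an "advisor state risks" style finding: the advisor file says one thing, the run says another.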

Questions to answer

| Area | Ask | Why |
| --- | --- | --- |
| Scope | Which branch, repo, W&B project, k8s context, and prior statuses are in scope? What was checked live? | Prevents stale or cross-branch evidence from driving decisions. |
| Science | What is the best verified test result per dataset, the exact metric contract, the benchmark target, and the gap? Which results are val-only or truncated? | Separates paper-ready wins from internal steering signals. |
| Harvest | Which live or recent runs deserve full-test harvest before shutdown? Which negative results stop whole families? | Spends remaining GPU on paper-facing upside. |
| PR queue | How many PRs are WIP/review/draft/stale? Are WIP PRs labeled with branch + exactly one `student:*`? | Finds invisible work before pods sit idle. |
| Fleet | Which pods are Ready, which have real `train.py`, and which are Running but not training? | Avoids mistaking infrastructure health for useful experiments. |
| Raw logs | Do student/advisor logs show active training, no-PR loops, submitted/done states, setup failures, or sleep/monitor loops? | Explains why work is or is not moving. |
| Advisor | What does `CURRENT_RESEARCH_STATE.md` claim, and what do GitHub/W&B/pods contradict? | Stops stale or validation-heavy steering from consuming the budget. |
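The Fleet question separates three states: not ready, training, and Running but not training. A minimal sketch of that split is shown here, assuming the pod phase, readiness, and the result of a `train.py` process check have already been collected; the `PodObservation` class, its fields, and the pod names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PodObservation:
    """One pod as seen during a status pass (hypothetical structure)."""
    name: str
    phase: str               # Kubernetes pod phase, e.g. "Running" or "Pending"
    ready: bool              # whether the pod reports Ready
    has_train_process: bool  # whether a real train.py process was found


def classify_pod(pod: PodObservation) -> str:
    """Separate infrastructure health from useful work."""
    if pod.phase != "Running" or not pod.ready:
        return "not ready"
    if pod.has_train_process:
        return "training"
    return "running but not training"


def fleet_summary(pods: List[PodObservation]) -> Dict[str, List[str]]:
    """Group pod names by classification."""
    summary: Dict[str, List[str]] = {
        "not ready": [], "training": [], "running but not training": []}
    for pod in pods:
        summary[classify_pod(pod)].append(pod.name)
    return summary


# Made-up pods for illustration.
pods = [
    PodObservation("student-a", "Running", True, True),
    PodObservation("student-b", "Running", True, False),
    PodObservation("student-c", "Pending", False, False),
]
print(fleet_summary(pods))
```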

Command patterns

Adjust repo, branch, context, and project names to the request. These are patterns, not mandatory copy-paste blocks.

# Status history and headings
rg --files -g 'STATUS*.md' -g 'analysis/STATUS*.md' | sort
rg -n '^#{1,3} ' analysis/STATUS*.md

# PR queue
gh pr list --repo wandb/senpai --base radford --state open --limit 200 \
  --json number,title,state,createdAt,updatedAt,isDraft,headRefName,labels,url

# Detect WIP PRs missing student/branch labels
gh pr list --repo wandb/senpai --base radford --state open --limit 200 \
  --json number,headRefName,labels,createdAt > /tmp/open_prs.json
uv run python - <<'PY'
import json, datetime
prs = json.load(open('/tmp/open_prs.json'))
now = datetime.datetime.now(datetime.timezone.utc)
for p in sorted(prs, key=lambda x: x["number"]):
    labels = sorted(l["name"] for l in p["labels"])
    bad = "status:wip" in labels and (
        "radford" not in labels or not any(x.startswith("student:") for x in labels)
    )
    if bad:
        age = (now - datetime.datetime.fromisoformat(p["createdAt"].replace("Z","+00:00"))).total_seconds()/3600
        intended = p["headRefName"].split("/", 1)[0]
        print(f"#{p['number']} age={age:.1f}h intended={intended} labels={labels} head={p['headRefName']}")
PY

# Fleet overview and real training process sweep
kubectl --context pai-2 get pods -l app=senpai -o wide
for podref in $(kubectl --context pai-2 get pods -l app=senpai,role=student -o name | sort); do
  pod=${podref#pod/}
  student=${pod#senpai-}; student=${student%-*}; student=${student%-*}
  out=$(kubectl --context pai-2 exec "$pod" -- sh -lc "ps -eo pid,etime,comm,args | awk '\$3 ~ /python/ && /train.py/ {print; exit}'" 2>/dev/null || true)
  if [ -n "$out" ]; then
    printf "%-10s PYTRAIN %s\n" "$student" "$(printf "%s" "$out" | sed 's/^[[:space:]]*//; s/[[:space:]][[:space:]]*/ /g' | cut -c1-180)"
  else
    printf "%-10s NO_PYTRAIN\n" "$student"
  fi
done

# Advisor state and raw log locations
gh api 'repos/wandb/senpai/contents/research/CURRENT_RESEARCH_STATE.md?ref=radford' \
  --jq .content | base64 --decode | sed -n '1,220p'
kubectl --context pai-2 exec deploy/senpai-advisor -- sh -lc \
  "find /root/.claude -type f -name '*.jsonl' -printf '%T@ %p %s\n' | sort -nr | head -10"

# Student raw log locations and latest tails
for podref in $(kubectl --context pai-2 get pods -l app=senpai,role=student -o name | sort); do
  pod=${podref#pod/}
  echo "### $pod"
  kubectl --context pai-2 exec "$pod" -- sh -lc \
    "latest=\$(find /root/.claude -type f -name '*.jsonl' -printf '%T@ %p\n' | sort -nr | awk 'NR==1{sub(/^[^ ]+ /,\"\"); print}'); [ -n \"\$latest\" ] && tail -40 \"\$latest\""
done
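The PR JSON saved by the `gh pr list` pattern can also be reduced to queue counts. The sketch below is one possible shape, under assumptions: the `status:review` label name is inferred rather than confirmed, the 6-hour staleness threshold and the `radford` branch label are taken as defaults, and the function name and sample PRs are hypothetical.

```python
import datetime as dt
from typing import Any, Dict, List


def queue_counts(prs: List[Dict[str, Any]], now: dt.datetime,
                 branch: str = "radford", stale_hours: float = 6.0) -> Dict[str, int]:
    """Count queue states from `gh pr list --json ...` output (hypothetical helper)."""
    counts = {"open": len(prs), "wip": 0, "review": 0, "draft": 0,
              "stale": 0, "student_label_problem": 0, "missing_branch": 0}
    for pr in prs:
        labels = {label["name"] for label in pr["labels"]}
        created = dt.datetime.fromisoformat(pr["createdAt"].replace("Z", "+00:00"))
        if (now - created).total_seconds() / 3600 > stale_hours:
            counts["stale"] += 1
        if pr.get("isDraft"):
            counts["draft"] += 1
        if "status:review" in labels:  # assumed label name
            counts["review"] += 1
        if "status:wip" in labels:
            counts["wip"] += 1
            students = [name for name in labels if name.startswith("student:")]
            if len(students) != 1:  # a WIP PR needs exactly one student label
                counts["student_label_problem"] += 1
            if branch not in labels:
                counts["missing_branch"] += 1
    return counts


# Made-up PRs for illustration.
now = dt.datetime(2026, 4, 25, 12, 0, tzinfo=dt.timezone.utc)
sample = [
    {"number": 1, "isDraft": False, "createdAt": "2026-04-25T10:00:00Z",
     "labels": [{"name": "status:wip"}, {"name": "radford"}, {"name": "student:alice"}]},
    {"number": 2, "isDraft": False, "createdAt": "2026-04-25T02:00:00Z",
     "labels": [{"name": "status:wip"}]},
]
print(queue_counts(sample, now))
```

Passing `now` explicitly keeps the age calculation reproducible when the same JSON is re-examined later.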

W&B metric scan pattern

Use the wandb-primary skill. Directly fetch known run IDs from PRs, then scan recent runs only as needed. Keep dataset classification conservative; do not mix TFP surface_mse with AirfRANS surface_mse.

Minimum metric keys:

DrivAerML: test_primary/surface_rel_l2_pct.
TandemFoil Paper: best_test_primary/field_mse, test_primary/field_mse.
AirfRANS: full_test/surface_mse, full_test/volume_mse, best_full_test/surface_mse, best_full_test/volume_mse.
TandemFoil: best_test_primary/surface_pressure_mae, test_primary/surface_pressure_mae.

Always record W&B run name/ID, PR if known, run state, metric key, and caveat.
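A possible way to turn a run summary into the records described above is sketched here. The metric keys come from the list above; everything else (function names, the record layout, the example run ID and value) is hypothetical. `fetch_run` assumes the W&B public API (`wandb.Api().run("entity/project/run_id")`) and is not exercised in the example, so the extraction logic can be read and run without network access.

```python
from typing import Any, Dict, List, Optional

# Minimum metric keys per dataset, copied from the list above.
METRIC_KEYS: Dict[str, List[str]] = {
    "DrivAerML": ["test_primary/surface_rel_l2_pct"],
    "TandemFoil Paper": ["best_test_primary/field_mse", "test_primary/field_mse"],
    "AirfRANS": ["full_test/surface_mse", "full_test/volume_mse",
                 "best_full_test/surface_mse", "best_full_test/volume_mse"],
    "TandemFoil": ["best_test_primary/surface_pressure_mae",
                   "test_primary/surface_pressure_mae"],
}


def fetch_run(run_path: str) -> Dict[str, Any]:
    """Fetch one run via the W&B public API (assumed usage; needs credentials)."""
    import wandb  # imported lazily so the rest of the sketch runs without wandb
    run = wandb.Api().run(run_path)  # path format: "entity/project/run_id"
    return {"name": run.name, "id": run.id, "state": run.state,
            "summary": run.summary}


def metric_records(dataset: str, summary: Any, run_name: str, run_id: str,
                   run_state: str, pr: Optional[int] = None,
                   caveat: str = "") -> List[Dict[str, Any]]:
    """Build one record per expected key; absent keys are kept and flagged."""
    records = []
    for key in METRIC_KEYS[dataset]:
        value = summary.get(key)
        records.append({
            "dataset": dataset, "metric_key": key, "value": value,
            "run_name": run_name, "run_id": run_id, "run_state": run_state,
            "pr": pr,
            "caveat": caveat if value is not None else "key missing from summary",
        })
    return records


# Made-up summary for illustration.
for rec in metric_records("AirfRANS", {"full_test/surface_mse": 0.01},
                          run_name="example-run", run_id="abc123",
                          run_state="finished", caveat="full-eval"):
    print(rec["metric_key"], rec["value"], rec["caveat"])
```

Looking up keys per dataset, rather than scanning for any key ending in `surface_mse`, is what keeps metrics from different datasets from being mixed.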

Status file template

Write to:

analysis/STATUS_<YYYY-MM-DD-HHMM>_<branch>_<short_topic>.md
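The path pattern can be filled in mechanically. In this sketch the function name is hypothetical and the slug rule for the topic (lowercase, non-alphanumerics collapsed to hyphens) is an assumption; the pattern itself only fixes the timestamp layout and the field order.

```python
import datetime as dt
import re


def status_path(now: dt.datetime, branch: str, topic: str) -> str:
    """Build analysis/STATUS_<YYYY-MM-DD-HHMM>_<branch>_<short_topic>.md."""
    stamp = now.strftime("%Y-%m-%d-%H%M")
    # Assumed slug rule: lowercase and hyphenate anything that is not a-z or 0-9.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"analysis/STATUS_{stamp}_{branch}_{slug}.md"


print(status_path(dt.datetime(2026, 4, 25, 12, 15), "radford", "Shutdown harvest"))
# analysis/STATUS_2026-04-25-1215_radford_shutdown-harvest.md
```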

Use this compact wireframe:

# STATUS <YYYY-MM-DD HH:MM TZ> - <branch> <short title>

Collected at `<UTC timestamp>`.

Sources checked: GitHub `<repo>/<branch>`, W&B `<entity/project>`, k8s
`<context>`, raw logs `<yes/no>`, prior statuses `<paths>`.
Changes made during this pass: `<none, or exact labels/pods/branches touched>`.

## Executive read

<5-10 bullets. Lead with test frontier and operational stalls.>

## Test metric frontier

| Dataset | Contract | Best verified test | Target/reference | Gap/read | Source | Caveat |
| --- | --- | ---: | ---: | --- | --- | --- |
| DrivAerML | `test_primary/surface_rel_l2_pct` |  |  |  | W&B `<id>`, PR `#` | full-eval / truncated / final |
| TandemFoil Paper | `test_primary/field_mse` |  |  |  | W&B `<id>`, PR `#` | best-checkpoint / final |
| AirfRANS | `full_test/surface_mse`, `full_test/volume_mse` |  |  |  | W&B `<id>`, PR `#` | pair metric |
| TandemFoil | `test_primary/surface_pressure_mae` |  |  |  | W&B `<id>`, PR `#` | val-selected / final |

## PR queue and label audit

Counts: open `<n>`, WIP `<n>`, review `<n>`, draft `<n>`, older than 6h `<n>`,
missing `student:*` `<n>`, missing branch label `<n>`.

| PR | Age | Student/read | Labels | Risk | Fix |
| --- | ---: | --- | --- | --- | --- |

## Fleet and raw-log health

| Signal | Count/examples | Evidence | Read |
| --- | ---: | --- | --- |
| Pods Running/Ready |  | `kubectl` |  |
| Real `train.py` |  | process sweep |  |
| Running but not training |  | process sweep + `.claude` tail |  |
| Recent W&B runs |  | W&B scan |  |

## Advisor state risks

State what `CURRENT_RESEARCH_STATE.md` says, what GitHub/W&B/pods contradict,
whether validation is over-weighted, and what intervention is needed.

## Harvest / keep / kill

| Priority | Run/PR | Current signal | Missing test metric | Why it matters | Action |
| --- | --- | --- | --- | --- | --- |

## Bottom line

<Short decision: are we scientifically improving, operationally healthy, both,
or neither? Name the next 1-3 moves.>

Reporting guidance

For final-hours checks, add a "shutdown harvest" section and rank only runs likely to improve paper-facing test.
For stale-pod checks, put pod/process truth above PR commentary.
For label failures, include the deterministic repair mapping but do not apply labels unless the user explicitly asks.
For advisor drift, quote the state file only briefly and paraphrase the mismatch with evidence.
For best metrics, include both "paper-safe" and "literal lowest observed" rows when selection protocol differs, and label the caveat clearly.
Prefer small tables over long prose. The reader is deciding what to do next with expensive GPUs.
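The "paper-safe" versus "literal lowest observed" distinction can be illustrated with per-checkpoint metrics. This sketch assumes that "paper-safe" means the test value at the checkpoint selected on validation, and that lower is better for both metrics; the function name, the dict layout, and the numbers are hypothetical.

```python
from typing import Any, Dict, List


def frontier_rows(checkpoints: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a paper-safe row, plus a literal-lowest row when the two differ."""
    val_selected = min(checkpoints, key=lambda c: c["val"])
    lowest_test = min(checkpoints, key=lambda c: c["test"])
    rows = [{
        "row": "paper-safe",
        "test": val_selected["test"],
        "caveat": f"val-selected (epoch {val_selected['epoch']})",
    }]
    if lowest_test is not val_selected:
        # Picking a checkpoint by its test score leaks the test set into selection.
        rows.append({
            "row": "literal lowest observed",
            "test": lowest_test["test"],
            "caveat": f"selected on test (epoch {lowest_test['epoch']}); not paper-safe",
        })
    return rows


# Made-up checkpoint metrics for illustration.
ckpts = [
    {"epoch": 1, "val": 0.30, "test": 0.33},
    {"epoch": 2, "val": 0.25, "test": 0.31},
    {"epoch": 3, "val": 0.27, "test": 0.29},
]
for row in frontier_rows(ckpts):
    print(row)
```

When the validation-selected checkpoint is also the lowest on test, only one row is emitted, so the table stays small.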

Repeat this at the end

Before finishing any status check, explicitly answer these five questions:

Best test results: What is the best verified paper-facing test metric for each dataset, what benchmark or target is it compared to, and what is the gap?
Useful work: Which pods are actually training or harvesting full-test results right now, and which are merely Running?
Queue health: Are any WIP PRs invisible because labels, branch labels, draft state, or stale assignment state are wrong?
High-upside actions: Which 1-5 runs or PRs deserve the remaining GPU budget because they could improve a test result, not just validation?
Decision: Should the system keep running, harvest tests, relabel/restart stalled students, close low-value work, or restart with new instructions?

If the answer does not make the next operational decision obvious, tighten the report before handing it back.
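The closing check can be made mechanical with a small gate that lists which of the five questions still lack an answer. This is a sketch only: the function name and the idea of storing answers in a dict keyed by the question labels are assumptions.

```python
from typing import Dict, List

# The five closing questions, keyed by the labels used above.
FINAL_QUESTIONS = [
    "Best test results",
    "Useful work",
    "Queue health",
    "High-upside actions",
    "Decision",
]


def unanswered(answers: Dict[str, str]) -> List[str]:
    """Return the closing questions that have no non-empty answer yet."""
    return [q for q in FINAL_QUESTIONS if not answers.get(q, "").strip()]


# Placeholder answers for illustration.
draft = {"Best test results": "see frontier table", "Decision": "  "}
print(unanswered(draft))
# ['Useful work', 'Queue health', 'High-upside actions', 'Decision']
```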
