os-evolution-verifier

Updated June 7, 2026 at 17:46

Verifies that os-architect actually causes evolution — not just words. Dispatches os-architect in single-shot simulation mode for a given test scenario, then checks for real artifact presence (new files, HANDOFF_BLOCK, plan files). Reports PASS / FAIL with grep evidence. Accumulates results into a test report. Use after any changes to os-architect, os-evolution-planner, or improvement-intake-agent.

Source

richfrem

richfrem/agent-plugins-skills

name	os-evolution-verifier
plugin	agent-agentic-os
description	Verifies that os-architect actually causes evolution — not just words. Dispatches os-architect in single-shot simulation mode for a given test scenario, then checks for real artifact presence (new files, HANDOFF_BLOCK, plan files). Reports PASS / FAIL with grep evidence. Accumulates results into a test report. Use after any changes to os-architect, os-evolution-planner, or improvement-intake-agent.
argument-hint	[test-scenario-file | all]
tools	["Bash","Read","Write"]

Overview

After evolving os-architect or its downstream agents, you need proof that the changes actually work. This skill dispatches os-architect in single-shot simulation mode for each test scenario and verifies artifact presence — not by reading the transcript, but by checking that expected files exist or expected content appears in output.

Evolution is verified by artifact presence, not by transcript review.

Artifact Verification Table

| Evolution Type | What to Check |
| --- | --- |
| Path C (Gap Fill) | `SKILL.md` present at expected path |
| Path B (Update) | `tasks/todo/<slug>-plan.md` AND `tasks/todo/copilot_prompt_<slug>.md` written |
| Path A+ (No-op) | No new files written; HANDOFF_BLOCK contains `STATUS: complete` |
| Category 3 (Lab Setup) | `improvement/run-config.json` written AND HANDOFF_BLOCK emitted |
| HANDOFF_BLOCK integrity | All 7 fields present: INTENT, TARGET, PATH, DISPATCH, STATUS, OUTPUTS, NEXT_ACTION |
| Confidence model | Low confidence prompt → clarifying question appears before Phase 2 audit |

Phase 1 — Resolve Test Inputs

If invoked with all, find test scenarios:

ls temp/os-evolution-verifier/scenarios/*.json 2>/dev/null | sort

If invoked with a specific file, verify it exists and is valid JSON with required fields:

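python3 -c '
import json, sys

# Scenario path arrives as argv[1], so quotes in the file name cannot break the script.
try:
    with open(sys.argv[1], encoding="utf-8") as fh:
        d = json.load(fh)
except (OSError, json.JSONDecodeError) as exc:
    print(f"SCHEMA ERROR: cannot load scenario file: {exc}"); sys.exit(1)
required = ["id", "name", "path", "prompt", "expected_artifact", "artifact_check"]
missing = [f for f in required if f not in d]
if missing:
    print(f"SCHEMA ERROR: missing fields: {missing}"); sys.exit(1)
print("Scenario: {} - {}".format(d["id"], d["name"]))
' "$SCENARIO_FILE"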

If no scenarios are found and no file is given, report:

"No test scenarios found. Create scenario JSON files in temp/os-evolution-verifier/scenarios/ or run the red-team-bundler to generate them from os-architect-agent.md."

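For the `all` case, discovery and schema validation can be combined in a single pass. The following is a minimal sketch, assuming each scenario is a flat JSON object; `validate_scenario` and `load_scenarios` are hypothetical helper names, not part of the skill.

```python
import json
from pathlib import Path

# Required fields, as listed in the Phase 1 schema check.
REQUIRED_FIELDS = ["id", "name", "path", "prompt", "expected_artifact", "artifact_check"]


def validate_scenario(data: dict) -> list[str]:
    """Return the required fields that are absent from one scenario object."""
    return [field for field in REQUIRED_FIELDS if field not in data]


def load_scenarios(directory: str) -> tuple[list[dict], list[str]]:
    """Load every *.json scenario in `directory`, sorted by file name.

    Returns (valid_scenarios, error_messages). Invalid files are reported
    instead of raising, so one bad scenario does not abort the whole run.
    """
    valid: list[dict] = []
    errors: list[str] = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            errors.append(f"{path.name}: cannot load ({exc})")
            continue
        if not isinstance(data, dict):
            errors.append(f"{path.name}: top-level value is not a JSON object")
            continue
        missing = validate_scenario(data)
        if missing:
            errors.append(f"{path.name}: missing fields: {missing}")
        else:
            valid.append(data)
    return valid, errors
```

An empty `valid` list with an empty `errors` list corresponds to the "No test scenarios found" case.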
Phase 2 — Dispatch os-architect (Single-Shot Simulation)

For each scenario, dispatch os-architect via Copilot CLI in simulation mode. The system prompt is the full content of plugins/agent-agentic-os/agents/os-architect-agent.md. The user turn is the scenario prompt.

# 1. Heartbeat (cheapest model — always first)
python3 plugins/cli-agents/skills/copilot-cli-agent/scripts/run_agent.py \
  /dev/null /dev/null temp/os-evolution-verifier/heartbeat.md \
  "HEARTBEAT CHECK: Respond HEARTBEAT_OK only."

# Confirm heartbeat before dispatching
grep -q "HEARTBEAT_OK" temp/os-evolution-verifier/heartbeat.md || \
  { echo "HEARTBEAT FAILED — aborting test run"; exit 1; }

# 2. Dispatch os-architect in single-shot simulation mode
OUTPUT_FILE="temp/os-evolution-verifier/output_${SCENARIO_ID}.md"

python3 plugins/cli-agents/skills/copilot-cli-agent/scripts/run_agent.py \
  plugins/agent-agentic-os/agents/os-architect-agent.md \
  /dev/null \
  "$OUTPUT_FILE" \
  "$SCENARIO_PROMPT" \
  claude-sonnet-4.6

Wait for completion, then check that the output file is non-empty (expect 100+ lines for a real run):

wc -l "$OUTPUT_FILE"
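The line-count check can also be turned into a small triage step. This is only a sketch: the 100-line and 50-line thresholds come from this document, while the `borderline` label for the range in between is an assumption, and both function names are hypothetical.

```python
from pathlib import Path


def count_lines(path: str) -> int:
    """Count newline characters, which is what `wc -l` reports."""
    return Path(path).read_bytes().count(b"\n")


def triage_line_count(lines: int) -> str:
    """Map a line count onto the thresholds used in this document."""
    if lines >= 100:
        return "expected"    # a real run is expected to produce 100+ lines
    if lines < 50:
        return "suspect"     # likely a refusal, oversized prompt, or skipped heartbeat
    return "borderline"      # assumption: 50-99 lines is not classified by the source
```

A `suspect` result is a reason to stop before running artifact checks.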

Phase 3 — Artifact Verification

Run the artifact check specified in the scenario's artifact_check field.

HANDOFF_BLOCK integrity check

# All 7 required fields must appear in output
FIELDS=("INTENT:" "TARGET:" "PATH:" "DISPATCH:" "STATUS:" "OUTPUTS:" "NEXT_ACTION:")
MISSING=()
for field in "${FIELDS[@]}"; do
  grep -q "$field" "$OUTPUT_FILE" || MISSING+=("$field")
done

if [ ${#MISSING[@]} -eq 0 ]; then
  echo "PASS: HANDOFF_BLOCK has all 7 required fields"
else
  echo "FAIL: HANDOFF_BLOCK missing: ${MISSING[*]}"
fi

File existence check (Path B/C)

# Check for expected artifact files written by os-evolution-planner
EXPECTED_FILE="$ARTIFACT_PATH"
if [ -f "$EXPECTED_FILE" ]; then
  echo "PASS: Artifact found at $EXPECTED_FILE"
  wc -l "$EXPECTED_FILE"
else
  echo "FAIL: Expected artifact not found: $EXPECTED_FILE"
fi

No-op check (Path A+)

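# Verify STATUS: complete in HANDOFF_BLOCK and no new plan files created.
# Assumption: RUN_MARKER is a file touched just before the Phase 2 dispatch (touch "$RUN_MARKER").
# Comparing against $OUTPUT_FILE alone can miss files written while the agent was still running,
# so it is used only as a fallback when RUN_MARKER is unset.
grep -q "STATUS: complete" "$OUTPUT_FILE" && echo "PASS: Status is complete" || echo "FAIL: Status not complete"
PLAN_COUNT=$(find tasks/todo -name "*.md" -newer "${RUN_MARKER:-$OUTPUT_FILE}" 2>/dev/null | wc -l)
[ "$PLAN_COUNT" -eq 0 ] && echo "PASS: No new task files written" || echo "FAIL: $PLAN_COUNT unexpected task files created"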

Confidence model check

# Low confidence prompt must produce a clarifying question before Phase 2
grep -q "Confidence: Low" "$OUTPUT_FILE" && echo "PASS: Confidence: Low detected" || echo "FAIL: Confidence field not Low"
# Compare the line of the first question mark with the line of the first audit marker.
# A missing clarifying question is a FAIL even when no audit language appears.
CLARIFICATION_LINE=$(grep -n -F "?" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
AUDIT_LINE=$(grep -n -E "Checking existing|audit|Phase 2" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
if [ -z "$CLARIFICATION_LINE" ]; then
  echo "FAIL: No clarifying question found"
elif [ -z "$AUDIT_LINE" ] || [ "$CLARIFICATION_LINE" -lt "$AUDIT_LINE" ]; then
  echo "PASS: Clarifying question appeared before audit"
else
  echo "FAIL: Audit started before clarifying question"
fi

Phase 4 — Record Result

Append to temp/os-evolution-verifier/test-report.md:

## $SCENARIO_ID — $SCENARIO_NAME

**Status**: [PASS | FAIL]
**Path**: [A / A+ / B / C]
**Prompt**: `$SCENARIO_PROMPT`
**Artifact check**: $ARTIFACT_CHECK_COMMAND
**Evidence**:

[grep or file-exists output]

**Failure mode tested**: $FAILURE_MODE
**Time**: $ELAPSED seconds
---
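Because the entry template already uses `$NAME` placeholders, it maps directly onto Python's `string.Template`. The sketch below is one possible way to render and append an entry; `append_entry` is a hypothetical helper, and `$STATUS`, `$PATH`, and `$EVIDENCE` are assumed names for the bracketed placeholders.

```python
from pathlib import Path
from string import Template

# Mirrors the Phase 4 entry layout; \u2014 is the dash used in the heading.
ENTRY = Template(
    "## $SCENARIO_ID \u2014 $SCENARIO_NAME\n\n"
    "**Status**: $STATUS\n"
    "**Path**: $PATH\n"
    "**Prompt**: `$SCENARIO_PROMPT`\n"
    "**Artifact check**: $ARTIFACT_CHECK_COMMAND\n"
    "**Evidence**:\n\n"
    "$EVIDENCE\n\n"
    "**Failure mode tested**: $FAILURE_MODE\n"
    "**Time**: $ELAPSED seconds\n"
    "---\n"
)


def append_entry(report_path: str, fields: dict) -> str:
    """Render one entry and append it to the report; returns the rendered text.

    Template.substitute raises KeyError when a placeholder has no value,
    so an incomplete result is never written silently.
    """
    entry = ENTRY.substitute(fields)
    path = Path(report_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry + "\n")
    return entry
```

Opening the file in append mode keeps earlier scenario results intact when several scenarios run in sequence.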

Phase 5 — Summary Report

After all scenarios run, write summary to temp/os-evolution-verifier/test-report.md.

Each scenario result uses the structured EVOLUTION_VERIFICATION block:

## EVOLUTION_VERIFICATION
SESSION_ID: [from HANDOFF_BLOCK TARGET field or scenario id]
SESSION_COMPLETE: [true | false — false means session still in Phase 1/2, no HANDOFF_BLOCK expected]
STATUS: [complete | intentional_pause | crashed]
PATH: [A | A+ | B | C | pending]
OUTPUTS_DECLARED: [N — count of files mentioned in HANDOFF_BLOCK OUTPUTS field]
OUTPUTS_VERIFIED: [N — count that passed artifact check]
OUTPUTS_MISSING: [list of missing file paths, or "none"]
HANDOFF_BLOCK_VALID: [true | false | N/A — N/A when SESSION_COMPLETE: false]
SCAFFOLD_VALID: [true | false | N/A]
PLAN_WRITTEN: [true | false | N/A]
DISPATCH_RAN: [true | false | N/A]
VERDICT: [PASS | PARTIAL | FAIL]
NOTES: [any file-level anomalies or ordering violations]

STATUS field values — required, disambiguates SESSION_COMPLETE: false:

| STATUS | When to use | VERDICT |
| --- | --- | --- |
| `complete` | SESSION_COMPLETE: true; HANDOFF_BLOCK present and valid | PASS or PARTIAL |
| `intentional_pause` | SESSION_COMPLETE: false; agent asked a clarifying question or hit a documented HARD-GATE; output > 50 lines | PASS (gate behavior is correct) |
| `crashed` | SESSION_COMPLETE: false; output < 50 lines, no clarifying question, no HANDOFF_BLOCK, or run_agent.py returned non-zero | FAIL |

When SESSION_COMPLETE: false and STATUS: intentional_pause, HANDOFF_BLOCK_VALID must be N/A — a missing HANDOFF_BLOCK is expected behavior, not a schema violation.

When SESSION_COMPLETE: false and STATUS: crashed, VERDICT must be FAIL regardless of any other fields — a silent crash must never be reported as PARTIAL or PASS.

Use PARTIAL when some outputs are present but not all — it pinpoints exactly which workstream failed rather than collapsing everything into a binary pass/fail.
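The STATUS and VERDICT rules above can be read as a small decision procedure. The sketch below is one interpretation, not the skill's implementation: `RunFacts` and `classify` are hypothetical names, and several boundary cases are assumptions (a non-zero exit code always counts as crashed, a malformed HANDOFF_BLOCK yields FAIL, and an output of exactly 50 lines is treated conservatively as crashed).

```python
from dataclasses import dataclass


@dataclass
class RunFacts:
    line_count: int                  # lines in the agent output file
    exit_code: int                   # return code of run_agent.py
    has_handoff_block: bool
    handoff_block_valid: bool        # all 7 fields present
    asked_clarifying_question: bool  # or hit a documented HARD-GATE
    outputs_declared: int
    outputs_verified: int


def classify(facts: RunFacts) -> tuple[str, str]:
    """Return (STATUS, VERDICT) for one run."""
    if facts.exit_code != 0:
        return ("crashed", "FAIL")        # assumption: non-zero exit wins over everything
    if facts.has_handoff_block:
        if not facts.handoff_block_valid:
            return ("complete", "FAIL")   # assumption: malformed block is a failure
        if facts.outputs_verified >= facts.outputs_declared:
            return ("complete", "PASS")
        if facts.outputs_verified > 0:
            return ("complete", "PARTIAL")  # some, but not all, outputs present
        return ("complete", "FAIL")
    if facts.asked_clarifying_question and facts.line_count > 50:
        return ("intentional_pause", "PASS")  # gate behavior is correct
    return ("crashed", "FAIL")            # a silent crash is never PARTIAL or PASS
```

Keeping the crash checks first and last means no combination of the other fields can turn a crashed run into PARTIAL or PASS.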

Binary PASS/FAIL Contract

A run PASSES only if ALL of the following are true:

- At least 1 artifact is present at a declared OUTPUTS path
- HANDOFF_BLOCK contains all 7 required fields
- STATUS is not crashed
- EVOLUTION_VERIFICATION VERDICT is PASS (PARTIAL counts as FAIL for gating: it is logged but does not unblock the pipeline)

A run FAILS if any condition above is not met, OR if VERDICT is PARTIAL. PARTIAL means outputs are incomplete — this is a FAIL for any gating decision, even though it is logged separately for diagnostic purposes.
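As a sketch, the gating contract reduces to a single boolean expression. `run_passes` and its parameter names are hypothetical; the four conditions are taken directly from the list above.

```python
# The 7 HANDOFF_BLOCK fields named in the Artifact Verification Table.
REQUIRED_FIELDS = frozenset(
    {"INTENT", "TARGET", "PATH", "DISPATCH", "STATUS", "OUTPUTS", "NEXT_ACTION"}
)


def run_passes(artifacts_present: int, handoff_fields: set[str], status: str, verdict: str) -> bool:
    """Binary gate: True only when every condition of the contract holds."""
    return (
        artifacts_present >= 1                     # at least 1 declared artifact exists
        and REQUIRED_FIELDS <= set(handoff_fields)  # all 7 fields present
        and status != "crashed"
        and verdict == "PASS"                      # PARTIAL is treated as FAIL for gating
    )
```

For example, a run with a complete HANDOFF_BLOCK and a PARTIAL verdict returns `False` here even though the report still records it as PARTIAL.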

Adversarial threshold: When running WS-N failure injection scenarios (N-01 through N-06), the verifier must produce FAIL verdicts on at least 4 of 6 adversarial inputs. A verifier that passes all adversarial inputs is not operational — it is only checking the happy path.

Critical scenario requirement: N-04 (malformed run-config), N-05 (truncated plan), and N-06 (bad evals schema) MUST ALL produce FAIL verdicts. These test structural failures, not just crashes. A verifier that catches crashes (N-01/N-02/N-03) but misses structural failures (N-04/N-05/N-06) has a ceiling of 3/6 and is not detecting the important failure modes.
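Both adversarial requirements can be checked together once the verdicts for N-01 through N-06 are known. This is a minimal sketch with a hypothetical `adversarial_check` helper; it assumes verdicts are keyed by scenario id and counts only literal FAIL verdicts.

```python
ADVERSARIAL = ("N-01", "N-02", "N-03", "N-04", "N-05", "N-06")
STRUCTURAL = ("N-04", "N-05", "N-06")  # malformed run-config, truncated plan, bad evals schema


def adversarial_check(verdicts: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (operational, problems) for a set of adversarial verdicts."""
    problems: list[str] = []
    fail_count = sum(1 for sid in ADVERSARIAL if verdicts.get(sid) == "FAIL")
    if fail_count < 4:
        problems.append(
            f"only {fail_count} of 6 adversarial scenarios produced FAIL (need at least 4)"
        )
    missed = [sid for sid in STRUCTURAL if verdicts.get(sid) != "FAIL"]
    if missed:
        problems.append(f"structural scenarios not failed: {', '.join(missed)}")
    return (not problems, problems)
```

A verifier that fails only N-01 through N-03 is reported with both problems, which matches the 3/6 ceiling described above.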

Follow with the aggregate summary:

## Run Summary

Total: N scenarios
PASS: X
PARTIAL: Y
FAIL: Z

### Failed / Partial Tests
- TEST-N: <name> — <what specifically failed>

### Evolution Gaps Found
[For each FAIL/PARTIAL: classify as spec fix / new skill needed / new eval case]

### Recommended Actions
1. [Priority: Critical] Fix <gap> in os-architect-agent.md
2. [Priority: High] Add new eval case for <scenario>
3. [Priority: Medium] Create new skill <skill-name> for <capability>
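The counts and the failed/partial list in the summary can be derived mechanically from the per-scenario results. The sketch below covers only those two parts; `render_summary` and the tuple layout `(test_id, name, verdict, detail)` are assumptions for illustration, and the gap classification and recommended actions still need judgment.

```python
from collections import Counter


def render_summary(results: list[tuple[str, str, str, str]]) -> str:
    """Render the count header and the Failed / Partial list.

    Each result is (test_id, name, verdict, detail), where verdict is
    PASS, PARTIAL, or FAIL and detail says what specifically failed.
    """
    counts = Counter(verdict for _, _, verdict, _ in results)
    lines = [
        "## Run Summary",
        "",
        f"Total: {len(results)} scenarios",
        f"PASS: {counts['PASS']}",
        f"PARTIAL: {counts['PARTIAL']}",
        f"FAIL: {counts['FAIL']}",
        "",
        "### Failed / Partial Tests",
    ]
    for test_id, name, verdict, detail in results:
        if verdict != "PASS":
            # \u2014 matches the dash used in the summary template
            lines.append(f"- {test_id}: {name} \u2014 {detail}")
    return "\n".join(lines) + "\n"
```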

Phase 6 — Persist to Experiment Log

After Phase 5 summary is written, always call os-experiment-log to persist the run:

python3 scripts/experiment_log.py append \
  --report temp/os-evolution-verifier/test-report.md \
  --triggered-by os-evolution-verifier

This is not optional. temp/ is ephemeral — if the log is not appended immediately after the run, the results are lost when the shell restarts. The experiment log is the durable record.

Scenario File Format

Test scenarios live in temp/os-evolution-verifier/scenarios/:

{
  "id": "TEST-1",
  "name": "Path C — monitoring agent gap fill",
  "category": 4,
  "path": "C",
  "prompt": "There's no skill for automatically monitoring plugin health and flagging stale evals — I want to create one.",
  "expected_artifact": "tasks/todo/copilot_prompt_",
  "artifact_check": "file_prefix",
  "expected_behavior": "os-architect classifies as Cat4 Gap Fill, runs audit, proposes Path C, dispatches os-evolution-planner to write task plan + copilot_prompt file",
  "failure_mode": "agent routes to wrong category or fails to dispatch os-evolution-planner"
}
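The example uses `"artifact_check": "file_prefix"` with a path prefix as `expected_artifact`, but the semantics of that check are not spelled out here. A plausible reading, sketched below under that assumption, is "at least one path starts with the prefix". Both helper names are hypothetical: one looks on disk, the other looks for proposed paths in the agent output, which is the relevant case when simulation mode cannot write files.

```python
import glob
import os
import re


def files_with_prefix(prefix: str, root: str = ".") -> list[str]:
    """Files under `root` whose relative path starts with `prefix`."""
    pattern = os.path.join(glob.escape(root), glob.escape(prefix) + "*")
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


def paths_proposed_in_output(prefix: str, output_text: str) -> list[str]:
    """Paths starting with `prefix` that are mentioned in the agent output."""
    # Assumption: the rest of the file name uses word characters, dots, and hyphens.
    return re.findall(re.escape(prefix) + r"[\w.\-]+", output_text)
```

Under this reading the check passes when the chosen helper returns a non-empty list.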

Smoke Tests

Three fast verification cases to confirm the skill itself is working:

Smoke 1 — Heartbeat check: Run heartbeat only, confirm HEARTBEAT_OK in output. Expected: heartbeat.md non-empty, contains HEARTBEAT_OK. Time: <30s.

Smoke 2 — Single scenario dry run: Run TEST-1 (Path C gap fill). Confirm output file is >100 lines. Time: <3 min.

Smoke 3 — HANDOFF_BLOCK field scan: On an existing output file, run the 7-field grep. Confirm all 7 fields found. Time: <5s.

Gotchas

Output files must be >100 lines: A Copilot CLI call that returns <50 lines usually means the model hit a refusal, the system prompt was too long, or the heartbeat was skipped. Always heartbeat first and always check wc -l before running artifact checks.
Single-shot simulation ≠ real dispatch: os-architect in simulation mode cannot write files to disk (no Write tool access during simulation). Artifact checks for Path B/C test whether the agent PROPOSES the correct files in its output, not whether they exist on disk. Real file-existence checks only apply when os-architect is run with full tool access.
HANDOFF_BLOCK field names need the trailing colon in grep: Use grep -q "FIELD:", not grep -q "FIELD"; otherwise partial matches on word fragments will produce false positives.
Confidence model check is order-sensitive: The clarifying question must appear BEFORE any audit output. Line-number comparison is required; simple grep -q is insufficient.
temp/ files are ephemeral — distinguish shell restart from crash: If a run was interrupted by a shell restart and temp/copilot_output_*.md is missing, set STATUS: intentional_pause, VERDICT: PARTIAL (inconclusive) — the run never completed. If the file is present but < 50 lines AND run_agent.py returned non-zero, set STATUS: crashed, VERDICT: FAIL — the agent halted unexpectedly. Never report a silent crash as PARTIAL.
OUTPUTS field path normalization: HANDOFF_BLOCK OUTPUTS lists paths relative to project root. Normalize before checking (strip leading ./, resolve ~). A path mismatch between declared and actual is a schema drift signal, not a file-missing signal.
Category 5 tests produce two sequential dispatches: When verifying Category 5 output, check that two separate PATH / TARGET pairs appear in HANDOFF_BLOCK, not one.
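The OUTPUTS path normalization described in the gotchas above (strip a leading `./`, resolve `~`, interpret paths relative to the project root) can be sketched with the standard library. The function names are hypothetical, and `project_root` is assumed to be supplied by the caller.

```python
import os


def normalize_declared_path(raw: str, project_root: str) -> str:
    """Normalize one path from the HANDOFF_BLOCK OUTPUTS field.

    Expands ~, anchors relative paths at the project root, and collapses
    a leading ./ or redundant separators.
    """
    expanded = os.path.expanduser(raw.strip())
    # os.path.join keeps `expanded` unchanged when it is already absolute.
    return os.path.normpath(os.path.join(project_root, expanded))


def same_artifact(declared: str, actual: str, project_root: str) -> bool:
    """True when two spellings refer to the same normalized path."""
    return normalize_declared_path(declared, project_root) == normalize_declared_path(
        actual, project_root
    )
```

Comparing normalized paths first makes it easier to tell schema drift (different spelling, same file) from a file that is actually missing.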