| name | os-evolution-verifier |
| plugin | agent-agentic-os |
| description | Verifies that os-architect actually causes evolution, not just words. Dispatches os-architect in single-shot simulation mode for a given test scenario, then checks for real artifact presence (new files, HANDOFF_BLOCK, plan files). Reports PASS / FAIL with grep evidence. Accumulates results into a test report. Use after any changes to os-architect, os-evolution-planner, or improvement-intake-agent. |
| argument-hint | [test-scenario-file | all] |
| tools | ["Bash","Read","Write"] |
Overview
After evolving os-architect or its downstream agents, you need proof that the changes
actually work. This skill dispatches os-architect in single-shot simulation mode for
each test scenario and verifies artifact presence — not by reading the transcript, but
by checking that expected files exist or expected content appears in output.
Evolution is verified by artifact presence, not by transcript review.
Artifact Verification Table
| Evolution Type | What to Check |
|---|---|
| Path C (Gap Fill) | SKILL.md present at expected path |
| Path B (Update) | tasks/todo/<slug>-plan.md AND tasks/todo/copilot_prompt_<slug>.md written |
| Path A+ (No-op) | No new files written; HANDOFF_BLOCK contains STATUS: complete |
| Category 3 (Lab Setup) | improvement/run-config.json written AND HANDOFF_BLOCK emitted |
| HANDOFF_BLOCK integrity | All 7 fields present: INTENT, TARGET, PATH, DISPATCH, STATUS, OUTPUTS, NEXT_ACTION |
| Confidence model | Low confidence prompt → clarifying question appears before Phase 2 audit |
Phase 1 — Resolve Test Inputs
If invoked with all, find test scenarios:
ls temp/os-evolution-verifier/scenarios/*.json 2>/dev/null | sort
If invoked with a specific file, verify it exists and is valid JSON with required fields:
python3 -c "
import json, sys
d = json.load(open('$SCENARIO_FILE'))
required = ['id', 'name', 'path', 'prompt', 'expected_artifact', 'artifact_check']
missing = [f for f in required if f not in d]
if missing:
    print(f'SCHEMA ERROR: missing fields: {missing}'); sys.exit(1)
print(f'Scenario: {d[\"id\"]} — {d[\"name\"]}')
"
If no scenarios found and no file given, report:
"No test scenarios found. Create scenario JSON files in temp/os-evolution-verifier/scenarios/
or run the red-team-bundler to generate them from os-architect-agent.md."
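When the skill is invoked with all, the same schema check has to run over every scenario file. The following is a minimal sketch of that loop, not the skill's actual implementation: the function names `missing_fields` and `load_scenarios` are hypothetical, and collecting errors instead of aborting on the first bad file is an assumption about the desired behavior.

```python
import json
from pathlib import Path

# Required fields, copied from the single-file check above.
REQUIRED_FIELDS = ["id", "name", "path", "prompt", "expected_artifact", "artifact_check"]


def missing_fields(scenario: dict) -> list[str]:
    """Return the required fields that are absent from one scenario."""
    return [field for field in REQUIRED_FIELDS if field not in scenario]


def load_scenarios(directory: str) -> tuple[list[dict], list[str]]:
    """Load every *.json file in sorted order; return (valid scenarios, error messages)."""
    valid, errors = [], []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            scenario = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            errors.append(f"{path.name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(scenario, dict):
            errors.append(f"{path.name}: top-level value is not an object")
            continue
        missing = missing_fields(scenario)
        if missing:
            errors.append(f"{path.name}: missing fields: {missing}")
        else:
            valid.append(scenario)
    return valid, errors
```

Calling `load_scenarios("temp/os-evolution-verifier/scenarios")` returns an empty valid list when the directory is missing or empty, which is the case that should trigger the "No test scenarios found" message.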
Phase 2 — Dispatch os-architect (Single-Shot Simulation)
For each scenario, dispatch os-architect via Copilot CLI in simulation mode.
The system prompt is the full content of plugins/agent-agentic-os/agents/os-architect-agent.md.
The user turn is the scenario prompt.
python3 plugins/cli-agents/skills/copilot-cli-agent/scripts/run_agent.py \
/dev/null /dev/null temp/os-evolution-verifier/heartbeat.md \
"HEARTBEAT CHECK: Respond HEARTBEAT_OK only."
grep -q "HEARTBEAT_OK" temp/os-evolution-verifier/heartbeat.md || \
{ echo "HEARTBEAT FAILED — aborting test run"; exit 1; }
OUTPUT_FILE="temp/os-evolution-verifier/output_${SCENARIO_ID}.md"
python3 plugins/cli-agents/skills/copilot-cli-agent/scripts/run_agent.py \
plugins/agent-agentic-os/agents/os-architect-agent.md \
/dev/null \
"$OUTPUT_FILE" \
"$SCENARIO_PROMPT" \
claude-sonnet-4.6
Wait for completion. Check output file is non-empty (expect 100+ lines for a real run):
wc -l "$OUTPUT_FILE"
Phase 3 — Artifact Verification
Run the artifact check specified in the scenario's artifact_check field.
HANDOFF_BLOCK integrity check
FIELDS=("INTENT:" "TARGET:" "PATH:" "DISPATCH:" "STATUS:" "OUTPUTS:" "NEXT_ACTION:")
MISSING=()
for field in "${FIELDS[@]}"; do
  grep -q "$field" "$OUTPUT_FILE" || MISSING+=("$field")
done
if [ ${#MISSING[@]} -eq 0 ]; then
  echo "PASS: HANDOFF_BLOCK has all 7 required fields"
else
  echo "FAIL: HANDOFF_BLOCK missing: ${MISSING[*]}"
fi
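The grep scan matches a field name anywhere on a line, so a string such as `SEARCH_PATH:` would satisfy the `PATH:` check. A stricter variant can be sketched in Python as below. It assumes each HANDOFF_BLOCK field starts its own line (optionally indented), which the source does not confirm; `missing_handoff_fields` is a hypothetical helper name.

```python
import re

# The seven required fields listed in the Artifact Verification Table.
HANDOFF_FIELDS = ["INTENT", "TARGET", "PATH", "DISPATCH", "STATUS", "OUTPUTS", "NEXT_ACTION"]


def missing_handoff_fields(output_text: str) -> list[str]:
    """Return required fields that never appear as 'FIELD:' at the start of a line."""
    missing = []
    for field in HANDOFF_FIELDS:
        # Assumption: a field opens its line, with optional leading spaces or tabs.
        pattern = rf"^[ \t]*{re.escape(field)}:"
        if not re.search(pattern, output_text, flags=re.MULTILINE):
            missing.append(field)
    return missing
```

An empty return value corresponds to the PASS branch of the shell check. If the agent emits fields in another layout (for example as list items), the pattern would need to be loosened accordingly.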
File existence check (Path B/C)
EXPECTED_FILE="$ARTIFACT_PATH"
if [ -f "$EXPECTED_FILE" ]; then
  echo "PASS: Artifact found at $EXPECTED_FILE"
  wc -l "$EXPECTED_FILE"
else
  echo "FAIL: Expected artifact not found: $EXPECTED_FILE"
fi
No-op check (Path A+)
grep -q "STATUS: complete" "$OUTPUT_FILE" && echo "PASS: Status is complete" || echo "FAIL: Status not complete"
PLAN_COUNT=$(find tasks/todo -name "*.md" -newer "$OUTPUT_FILE" 2>/dev/null | wc -l)
[ "$PLAN_COUNT" -eq 0 ] && echo "PASS: No new task files written" || echo "FAIL: $PLAN_COUNT unexpected task files created"
Confidence model check
grep -q "Confidence: Low" "$OUTPUT_FILE" && echo "PASS: Confidence: Low detected" || echo "FAIL: Confidence field not Low"
CLARIFICATION_LINE=$(grep -n "?" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
AUDIT_LINE=$(grep -n "Checking existing\|audit\|Phase 2" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
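# Pass only if a question exists and it precedes the first audit line (or no audit ran).
# An empty CLARIFICATION_LINE must fail rather than reach the numeric comparison.
if [ -n "$CLARIFICATION_LINE" ] && { [ -z "$AUDIT_LINE" ] || [ "$CLARIFICATION_LINE" -lt "$AUDIT_LINE" ]; }; then
  echo "PASS: Clarifying question appeared before audit"
else
  echo "FAIL: No clarifying question appeared before audit"
fi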
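The ordering rule can also be expressed as a small Python function, which makes the "no question at all" case explicit. This is a sketch under the same heuristics as the shell check (any line containing `?` counts as a clarifying question, and the audit markers are the three grep patterns); the function names are hypothetical.

```python
import re

# Same audit markers as the grep pattern in the shell check.
AUDIT_PATTERN = re.compile(r"Checking existing|audit|Phase 2")


def first_line_matching(lines, predicate):
    """Return the 1-based number of the first line satisfying predicate, or None."""
    for number, line in enumerate(lines, start=1):
        if predicate(line):
            return number
    return None


def clarification_precedes_audit(output_text: str) -> bool:
    """True only if a question exists and comes before the first audit marker."""
    lines = output_text.splitlines()
    question = first_line_matching(lines, lambda line: "?" in line)
    audit = first_line_matching(lines, lambda line: AUDIT_PATTERN.search(line) is not None)
    if question is None:
        return False  # no clarifying question at all is a failure
    return audit is None or question < audit
```

A line that contains both a question mark and an audit marker is treated as a failure here, because the question does not strictly precede the audit output.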
Phase 4 — Record Result
Append to temp/os-evolution-verifier/test-report.md:
## $SCENARIO_ID — $SCENARIO_NAME
**Status**: [PASS | FAIL]
**Path**: [A / A+ / B / C]
**Prompt**: `$SCENARIO_PROMPT`
**Artifact check**: $ARTIFACT_CHECK_COMMAND
**Evidence**:
[grep or file-exists output]
**Failure mode tested**: $FAILURE_MODE
**Time**: $ELAPSED seconds
---
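One way to fill in the template above programmatically is sketched below. It assumes the scenario dict uses the field names from the scenario file format and that `failure_mode` may be absent; `format_result_entry` and `append_result` are hypothetical helper names, not part of the skill.

```python
def format_result_entry(scenario: dict, status: str, evidence: str, elapsed_seconds: float) -> str:
    """Render one report entry in the layout of the Phase 4 template."""
    return "\n".join([
        # \u2014 is the dash character used in the template heading.
        f"## {scenario['id']} \u2014 {scenario['name']}",
        f"**Status**: {status}",
        f"**Path**: {scenario['path']}",
        f"**Prompt**: `{scenario['prompt']}`",
        f"**Artifact check**: {scenario['artifact_check']}",
        "**Evidence**:",
        evidence,
        f"**Failure mode tested**: {scenario.get('failure_mode', 'unspecified')}",
        f"**Time**: {elapsed_seconds:.0f} seconds",
        "---",
        "",
    ])


def append_result(report_path: str, entry: str) -> None:
    """Append an entry to the report file, creating the file if it does not exist."""
    with open(report_path, "a", encoding="utf-8") as handle:
        handle.write(entry)
```

Opening the report in append mode keeps earlier scenario results intact when several scenarios run in sequence.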
Phase 5 — Summary Report
After all scenarios run, write summary to temp/os-evolution-verifier/test-report.md.
Each scenario result uses the structured EVOLUTION_VERIFICATION block:
## EVOLUTION_VERIFICATION
SESSION_ID: [from HANDOFF_BLOCK TARGET field or scenario id]
SESSION_COMPLETE: [true | false — false means session still in Phase 1/2, no HANDOFF_BLOCK expected]
STATUS: [complete | intentional_pause | crashed]
PATH: [A | A+ | B | C | pending]
OUTPUTS_DECLARED: [N — count of files mentioned in HANDOFF_BLOCK OUTPUTS field]
OUTPUTS_VERIFIED: [N — count that passed artifact check]
OUTPUTS_MISSING: [list of missing file paths, or "none"]
HANDOFF_BLOCK_VALID: [true | false | N/A — N/A when SESSION_COMPLETE: false]
SCAFFOLD_VALID: [true | false | N/A]
PLAN_WRITTEN: [true | false | N/A]
DISPATCH_RAN: [true | false | N/A]
VERDICT: [PASS | PARTIAL | FAIL]
NOTES: [any file-level anomalies or ordering violations]
STATUS field values — required, disambiguates SESSION_COMPLETE: false:
| STATUS | When to use | VERDICT |
|---|---|---|
| complete | SESSION_COMPLETE: true; HANDOFF_BLOCK present and valid | PASS or PARTIAL |
| intentional_pause | SESSION_COMPLETE: false; agent asked a clarifying question or hit a documented HARD-GATE; output > 50 lines | PASS (gate behavior is correct) |
| crashed | SESSION_COMPLETE: false; output < 50 lines, no clarifying question, no HANDOFF_BLOCK, or run_agent.py returned non-zero | FAIL |
When SESSION_COMPLETE: false and STATUS: intentional_pause, HANDOFF_BLOCK_VALID must be N/A —
a missing HANDOFF_BLOCK is expected behavior, not a schema violation.
When SESSION_COMPLETE: false and STATUS: crashed, VERDICT must be FAIL regardless of
any other fields — a silent crash must never be reported as PARTIAL or PASS.
Use PARTIAL when some outputs are present but not all — it pinpoints exactly which
workstream failed rather than collapsing everything into a binary pass/fail.
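The STATUS and VERDICT rules above can be read as a small decision procedure. The sketch below is one possible reading, with several assumptions: a non-zero exit code always wins, an output of exactly 50 lines is treated as too short, and a completed session with zero verified outputs out of a non-zero declared count is FAIL instead of PARTIAL. The function names and parameters are hypothetical.

```python
def classify_session(line_count: int, has_handoff_block: bool,
                     asked_clarifying_question: bool, exit_code: int,
                     hit_hard_gate: bool = False) -> str:
    """Map observations about one run to complete / intentional_pause / crashed."""
    if exit_code != 0:
        return "crashed"  # run_agent.py returned non-zero
    if has_handoff_block:
        return "complete"
    if (asked_clarifying_question or hit_hard_gate) and line_count > 50:
        return "intentional_pause"
    return "crashed"  # short output, no question, no HANDOFF_BLOCK


def verdict_for(status: str, outputs_declared: int, outputs_verified: int) -> str:
    """Derive the VERDICT from STATUS and the declared/verified output counts."""
    if status == "crashed":
        return "FAIL"  # a crash is never PARTIAL or PASS
    if status == "intentional_pause":
        return "PASS"  # gate behavior is correct
    if outputs_verified == outputs_declared:
        return "PASS"
    # Assumption: PARTIAL needs at least one verified output.
    return "PARTIAL" if outputs_verified > 0 else "FAIL"
```

Keeping classification and verdict separate mirrors the report block, where STATUS and VERDICT are distinct fields.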
Binary PASS/FAIL Contract
A run PASSES only if ALL of the following are true:
- At least 1 artifact is present at a declared OUTPUTS path
- HANDOFF_BLOCK contains all 7 required fields
- STATUS is not crashed
- EVOLUTION_VERIFICATION VERDICT is PASS
(PARTIAL counts as FAIL for gating — logged but does not unblock pipeline)
A run FAILS if any condition above is not met, OR if VERDICT is PARTIAL.
PARTIAL means outputs are incomplete — this is a FAIL for any gating decision,
even though it is logged separately for diagnostic purposes.
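Expressed as code, the contract is a conjunction of the four conditions, with PARTIAL excluded by the last one. The following is an illustrative sketch only; `run_passes` and its parameter names are hypothetical.

```python
def run_passes(artifacts_present: int, missing_fields: list[str],
               status: str, verdict: str) -> bool:
    """Apply the binary gate: every condition must hold, and PARTIAL does not pass."""
    return (
        artifacts_present >= 1      # at least one artifact at a declared OUTPUTS path
        and not missing_fields      # all 7 HANDOFF_BLOCK fields were found
        and status != "crashed"
        and verdict == "PASS"       # PARTIAL is logged but treated as FAIL here
    )
```

For example, `run_passes(1, [], "complete", "PARTIAL")` evaluates to `False`, which is the gating behavior described for incomplete outputs.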
Adversarial threshold: When running WS-N failure injection scenarios (N-01 through N-06),
the verifier must produce FAIL verdicts on at least 4 of 6 adversarial inputs. A verifier
that passes all adversarial inputs is not operational — it is only checking the happy path.
Critical scenario requirement: N-04 (malformed run-config), N-05 (truncated plan), and
N-06 (bad evals schema) MUST ALL produce FAIL verdicts. These test structural failures, not
just crashes. A verifier that catches crashes (N-01/N-02/N-03) but misses structural failures
(N-04/N-05/N-06) has a ceiling of 3/6 and is not detecting the important failure modes.
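The two adversarial rules (at least 4 of 6 FAIL verdicts, and N-04 through N-06 all FAIL) can be checked together. The sketch below assumes verdicts are supplied as a dict keyed by scenario id and counts only literal FAIL verdicts, not PARTIAL; `check_adversarial_run` is a hypothetical name.

```python
# Structural-failure scenarios that must all be caught.
CRITICAL_SCENARIOS = ("N-04", "N-05", "N-06")
MIN_FAIL_VERDICTS = 4


def check_adversarial_run(verdicts: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (operational, problems) for one WS-N failure-injection run."""
    problems = []
    fail_count = sum(1 for verdict in verdicts.values() if verdict == "FAIL")
    if fail_count < MIN_FAIL_VERDICTS:
        problems.append(f"only {fail_count} FAIL verdicts; need at least {MIN_FAIL_VERDICTS}")
    for scenario_id in CRITICAL_SCENARIOS:
        actual = verdicts.get(scenario_id, "missing")
        if actual != "FAIL":
            problems.append(f"{scenario_id} must produce FAIL, got {actual}")
    return (not problems, problems)
```

A verifier that only catches the crash scenarios N-01 to N-03 yields four problems here: one for the threshold and one for each missed structural scenario.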
Follow with the aggregate summary:
## Run Summary
Total: N scenarios
PASS: X
PARTIAL: Y
FAIL: Z
### Failed / Partial Tests
- TEST-N: <name> — <what specifically failed>
### Evolution Gaps Found
[For each FAIL/PARTIAL: classify as spec fix / new skill needed / new eval case]
### Recommended Actions
1. [Priority: Critical] Fix <gap> in os-architect-agent.md
2. [Priority: High] Add new eval case for <scenario>
3. [Priority: Medium] Create new skill <skill-name> for <capability>
Phase 6 — Persist to Experiment Log
After Phase 5 summary is written, always call os-experiment-log to persist the run:
python3 scripts/experiment_log.py append \
--report temp/os-evolution-verifier/test-report.md \
--triggered-by os-evolution-verifier
This is not optional. temp/ is ephemeral — if the log is not appended immediately after
the run, the results are lost when the shell restarts. The experiment log is the durable record.
Scenario File Format
Test scenarios live in temp/os-evolution-verifier/scenarios/:
{
"id": "TEST-1",
"name": "Path C — monitoring agent gap fill",
"category": 4,
"path": "C",
"prompt": "There's no skill for automatically monitoring plugin health and flagging stale evals — I want to create one.",
"expected_artifact": "tasks/todo/copilot_prompt_",
"artifact_check": "file_prefix",
"expected_behavior": "os-architect classifies as Cat4 Gap Fill, runs audit, proposes Path C, dispatches os-evolution-planner to write task plan + copilot_prompt file",
"failure_mode": "agent routes to wrong category or fails to dispatch os-evolution-planner"
}
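The example uses `"artifact_check": "file_prefix"` with a path prefix as `expected_artifact`, but the source does not define how that check is evaluated. One plausible interpretation, offered purely as an assumption, is sketched below: in simulation mode the prefix must appear in the agent's output text, and with full tool access at least one file on disk must start with it. `check_file_prefix` is a hypothetical helper.

```python
import glob


def check_file_prefix(prefix: str, output_text: str, simulation: bool = True) -> bool:
    """Evaluate a 'file_prefix' artifact check (assumed semantics, see lead-in)."""
    if simulation:
        # Simulation mode cannot write files, so look for a proposed path in the output.
        return prefix in output_text
    # Real dispatch: any file whose path starts with the prefix counts as present.
    return len(glob.glob(glob.escape(prefix) + "*")) > 0
```

With the TEST-1 scenario, `check_file_prefix("tasks/todo/copilot_prompt_", output_text)` passes as soon as the output proposes any copilot_prompt file under tasks/todo/.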
Smoke Tests
Three fast verification cases to confirm the skill itself is working:
Smoke 1 — Heartbeat check: Run heartbeat only, confirm HEARTBEAT_OK in output.
Expected: heartbeat.md non-empty, contains HEARTBEAT_OK. Time: <30s.
Smoke 2 — Single scenario dry run: Run TEST-1 (Path C gap fill). Confirm output file
is >100 lines. Time: <3 min.
Smoke 3 — HANDOFF_BLOCK field scan: On an existing output file, run the 7-field grep.
Confirm all 7 fields found. Time: <5s.
Gotchas
- Output files must be >100 lines: A Copilot CLI call that returns <50 lines usually means
  the model hit a refusal, the system prompt was too long, or the heartbeat was skipped.
  Always heartbeat first and always check wc -l before running artifact checks.
- Single-shot simulation ≠ real dispatch: os-architect in simulation mode cannot write
  files to disk (no Write tool access during simulation). Artifact checks for Path B/C test
  whether the agent PROPOSES the correct files in its output, not whether they exist on disk.
  Real file-existence checks only apply when os-architect is run with full tool access.
- HANDOFF_BLOCK field names need the trailing colon for grep: Use grep -q "FIELD:" not
  grep -q "FIELD"; otherwise partial matches on word fragments will produce false positives.
- Confidence model check is order-sensitive: The clarifying question must appear BEFORE any
  audit output. Line-number comparison is required; simple grep -q is insufficient.
- temp/ files are ephemeral, so distinguish a shell restart from a crash: If a run was
  interrupted by a shell restart and temp/copilot_output_*.md is missing, set
  STATUS: intentional_pause, VERDICT: PARTIAL (inconclusive), because the run never completed.
  If the file is present but < 50 lines AND run_agent.py returned non-zero, set
  STATUS: crashed, VERDICT: FAIL, because the agent halted unexpectedly. Never report a
  silent crash as PARTIAL.
- OUTPUTS field path normalization: HANDOFF_BLOCK OUTPUTS lists paths relative to project
  root. Normalize before checking (strip leading ./, resolve ~). A path mismatch between
  declared and actual is a schema drift signal, not a file-missing signal.
- Category 5 tests produce two sequential dispatches: When verifying Category 5 output,
  check that two separate PATH / TARGET pairs appear in HANDOFF_BLOCK, not one.
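The path normalization gotcha above can be sketched as follows. This assumes declared paths are either relative to the project root, prefixed with `./`, or start with `~`; `normalize_output_path` and `same_artifact` are hypothetical helper names.

```python
import os


def normalize_output_path(declared: str, project_root: str) -> str:
    """Resolve ~, anchor relative paths at the project root, and drop './' segments."""
    path = os.path.expanduser(declared.strip())
    if not os.path.isabs(path):
        path = os.path.join(project_root, path)
    return os.path.normpath(path)


def same_artifact(declared: str, actual: str, project_root: str) -> bool:
    """Compare a declared OUTPUTS path with an actual path after normalization."""
    return normalize_output_path(declared, project_root) == normalize_output_path(actual, project_root)
```

Comparing normalized forms means `./tasks/todo/x.md` and `tasks/todo/x.md` are treated as the same artifact, so only a genuine mismatch is reported as schema drift.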