agentic-testing
// Human-first runtime testing for Script Kit GPUI: operate the real app through visible user paths to surface UX/UI interaction bugs, then back findings with receipts, screenshots, exact targets, and cleanup.
| name | agentic-testing |
| description | Human-first runtime testing for Script Kit GPUI: operate the real app through visible user paths to surface UX/UI interaction bugs, then back findings with receipts, screenshots, exact targets, and cleanup. |
This canonical repo-local skill owns human-first runtime testing and session cleanup for Script Kit GPUI. It combines the current .agents/skills routing policy with the full operational recipe previously kept under .codex/skills / .claude/skills.
The core purpose is to act like a real user would with the actual app so interaction, UX, and UI bugs surface before code is declared correct. Receipts, state snapshots, screenshots, and exact targets exist to make that user-like exploration trustworthy; they are not a substitute for reproducing the real user path.
This skill is not a receipt factory, a contract checklist, or a way to avoid looking at the app. Its foundation is interaction-driving testing: open the real product, exercise the real surface, perform the action a user would perform, and surface the UX/UI bug as an observable interaction failure. If a user reports something visual, behavioral, confusing, broken, clipped, hidden, misfocused, misrouted, unclickable, unreadable, or otherwise wrong in the UI, the default response is to drive the app like a user until the bug is reproduced, disproved, fixed, or explicitly blocked. A state receipt without user-path interaction is not enough.
For user-reported UX/UI bugs, start by recreating the user's visible workflow against the real product surface. Open the app when the report is about what a user sees or does, drive the same entry path, interact with the visible controls, and capture evidence that another person can inspect.
State-first receipts are still required, but they support the user-path proof. A fail-closed receipt contract is useful only after the skill has identified the actual surface, entry path, visible invariant, and interaction sequence the user cares about.
Never let "state-first" become "user-last." State is the witness, not the experience. The experience is the real app being driven through the user's path so bugs actually reveal themselves.
Use this .agents/skills/agentic-testing file as the source of truth for Codex routing and runtime proof. Legacy .codex/skills/agentic-testing and .claude/skills/agentic-testing files are compatibility snapshots only; mine them only when auditing history.
Primary paths and concepts:

- scripts/agentic/, .test-output/, .test-screenshots/
- getState, getElements, inspectAutomationWindow, waitFor, batch, ACP state/probe APIs, and recipe receipts

Start with these sources before editing or proving behavior:

- .agents/subagents/agentic-testing-reader.md for broad or high-risk investigation.
- AGENTS.md, the owning skill, and current source context before editing.

Do not use this skill as the primary owner for test authoring or product ownership decisions; load $testing-quality-gates, $protocol-automation, $script-kit-devtools, or the relevant domain skill when those surfaces own the behavior.
Most verification runs should complete in seconds, not minutes, but speed must not erase the user's visible workflow. Default to the smallest proof tier that answers the question without stealing focus or blocking the user.

- getElements, getState, waitFor, batch, and exact automation targets. This backs the visible workflow and is the default for routing, selection, focus, popup ownership, and protocol bugs.

Rules:

- /tmp via the session wrapper. Runtime captureWindow screenshots must go in project .test-screenshots/ / test-screenshots/ or ~/.scriptkit/screenshots.

Before every runtime proof, assert the allowed and blocked capabilities explicitly. Default to --local-fixture-only, --dry-run-only, --no-network, --no-external-services, --no-system-pasteboard, --no-native-picker, --no-quick-look, --no-system-settings, --no-tcc-mutation, --no-security-prompts, --no-native-input, and --no-native-pointer unless the proof documents a protocol-only exception.
Receipts for AFK-safe runs should include an afkSafe: true equivalent plus blockedCapabilities and usedCapabilities lists. If the app or harness cannot prove that an unsafe capability stayed blocked, fail closed and file the missing receipt.
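The fail-closed rule above can be sketched as a small receipt check. This is a minimal illustration, not the harness's real validator: only `afkSafe`, `blockedCapabilities`, and `usedCapabilities` come from the text, while the `AfkReceipt` type, the `checkAfkReceipt` helper, and the capability strings (such as `"network"`) are hypothetical names chosen for the example.

```typescript
// Hypothetical receipt shape. Only afkSafe, blockedCapabilities, and
// usedCapabilities are named in the text; the string values are assumptions.
interface AfkReceipt {
  afkSafe?: boolean;
  blockedCapabilities?: string[];
  usedCapabilities?: string[];
}

type AfkVerdict = { ok: true } | { ok: false; reason: string };

// Fail closed: a missing field, an unproven block, or a used capability that
// should have stayed blocked is a failure, never an implicit pass.
function checkAfkReceipt(receipt: AfkReceipt, mustStayBlocked: string[]): AfkVerdict {
  const blocked = receipt.blockedCapabilities;
  const used = receipt.usedCapabilities;
  if (receipt.afkSafe !== true) {
    return { ok: false, reason: "afkSafe is not true" };
  }
  if (blocked === undefined || used === undefined) {
    return { ok: false, reason: "capability lists are missing" };
  }
  const unproven = mustStayBlocked.filter((cap) => blocked.indexOf(cap) === -1);
  if (unproven.length > 0) {
    return { ok: false, reason: "not proven blocked: " + unproven.join(", ") };
  }
  const leaked = used.filter((cap) => mustStayBlocked.indexOf(cap) !== -1);
  if (leaked.length > 0) {
    return { ok: false, reason: "blocked capability was used: " + leaked.join(", ") };
  }
  return { ok: true };
}
```

The check treats absence of evidence as failure, which matches the instruction to file a missing receipt instead of assuming the capability stayed blocked.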
Turn a user-filed UI/UX bug into a proof plan before choosing or adding a recipe:
After two failed proof attempts caused by missing receipts, stale targets, unavailable handles, capture mismatch, or unsafe capability refusal, stop broadening the test. Classify the blocker as product bug, instrumentation gap, unsafe-to-prove, or test harness bug. Do not add sleeps, screenshots, focus hacks, native input, or larger recipes as a workaround.
Permission denied, security prompt pending, setup required, network unavailable, pasteboard blocked, native picker unavailable, and destructive confirmation states must be represented by fixture/protocol state only. Receipts must prove no real OS prompt, settings write, install flow, API call, pasteboard mutation, or destructive command occurred.
For user-filed bug smoke tests, report a result that is more specific than generic PASS/FAIL: reproduced, not-reproduced, fixed, blocked-by-missing-receipt, blocked-by-unsafe-operation, or inconclusive. Include exact receipts/screenshots used, cleanup status, and remaining user-visible risk.
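One way to keep smoke-test results from collapsing back into PASS/FAIL is to type the result labels. The six labels below are taken from the paragraph above; the `SmokeReport` shape, its field names, and both helper functions are illustrative assumptions, not an existing harness API.

```typescript
// The six result labels come from the text above.
const SMOKE_RESULTS = [
  "reproduced",
  "not-reproduced",
  "fixed",
  "blocked-by-missing-receipt",
  "blocked-by-unsafe-operation",
  "inconclusive",
] as const;
type SmokeResult = (typeof SMOKE_RESULTS)[number];

// Hypothetical report shape; the field names are illustrative.
interface SmokeReport {
  result: SmokeResult;
  evidence: string[]; // exact receipt and screenshot paths used
  cleanupStatus: string;
  remainingRisk: string;
}

// Rejects generic labels such as "PASS" or "FAIL".
function isSmokeResult(value: string): value is SmokeResult {
  return (SMOKE_RESULTS as readonly string[]).indexOf(value) !== -1;
}

function formatSmokeReport(report: SmokeReport): string {
  return [
    "result: " + report.result,
    "evidence: " + (report.evidence.length > 0 ? report.evidence.join(", ") : "none"),
    "cleanup: " + report.cleanupStatus,
    "remaining risk: " + report.remainingRisk,
  ].join("\n");
}
```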
AcpChatView in isolation is not sufficient proof. Default to the real ACP entry path (triggerBuiltin tab-ai, detached chat window routing, or another production runtime path) before using any synthetic ACP harness.
Visual proof must connect what the user sees to structured layout and semantic receipts. Use visible text, layout measurement, and screenshot-to-semantics diagnostics before trusting a screenshot as UX proof.

- screenshot-semantics-visual-consistency-stress for pass-now visual consistency. It checks strict capture identity, non-blank content audit, getState, getElements, selected row, focus receipt, footer actions, popup crop bounds, and semantic visible text labels.
- visibleTextMode:"semanticElements" means the harness found visible text from automation element labels. It is not OCR and not clipping proof.
- getElements receipts.
- visible-text-clipping-overlap-stress for clipping and overlap audits; it opens the real main window and combines getElements, getLayoutInfo, and AppKit text measurement to report text bounds, measured width, available width, clipping state, truncation intent, tooltip or accessible full text, overlap pairs, and cleanup.
- layout-measurement-regression-stress; it opens the real main window, records getLayoutInfo component bounds plus bounded getElements semantics, then drives setFilter and reset to prove filter-churn layout fingerprints stay stable. Treat non-main surface warnings as remaining coverage gaps, not proof.
- main-menu-dynamic-choice-resize-stress; it opens real small and large choice prompts in one app session, compares visibleChoiceCount against the fixture counts, measures getWindowBounds before/after, requires height growth with stable width, and cleans up through Escape.
- notes-window-resize-stress; it opens the real Notes window against a sandbox notes DB, drives targeted Notes batch.setInput through the editor path, measures Notes window bounds before tall content, after tall content, and after short content, and requires height growth, height shrink, stable width, and cleanup.
- div or container is scroll-safe from a screenshot alone. Use div-container-scroll-overflow-stress; it opens a real DivPrompt, requires a DivContent layout component instead of launcher ScriptList/PreviewPanel components, estimates fixture content height against the viewport, proves scrolling is required, checks cleanup, and warns until div scroll position is exposed as a first-class receipt.
- visual-contrast-readable-state-stress; it opens the real main window for visible semantics and collects AGENTIC_THEME_CONTRAST_RECEIPT foreground/background token samples with contrast ratios, minimum ratios, pass flags, theme fingerprints, and cleanup.
- bun scripts/agentic/user-story-audit.ts --limit 100 --max-ms 60000. It converts existing stress recipes into user-shaped stories, skips stories already exercised in the current thread by default, writes .test-output/agentic-100-user-story-audit-*.json, and separates pass, fail_closed, blocked_precondition, runtime_failure, and timeout. Treat fail-closed results as missing proof/backlog, never as a UI pass.

When a user flow spans stacked modals, cross-surface export, or app restart recovery, require one receipt that proves ownership boundaries before sending input.

- actions-command-discoverability-noop-stress to drive real Cmd-K action popups and prove visible action row ids, labels, sections, shortcuts, destructive/enabled flags, context identity, safe Escape cleanup, and no accidental execution. Treat disabled/no-op state warnings as coverage gaps until fixtures expose those row kinds.
- notes-window-resize-stress to prove a real Notes window opened through openNotes, targeted Notes editor input changed through batch.setInput, a sandbox notes DB isolated user data, tall content grew the window, short content shrank it, width stayed stable, Notes stayed visible, and the launched session was stopped.

Every verification follows the same core loop:
cargo build 2>&1 | tail -5
Only rebuild when the touched files can invalidate the binary or helper you need to exercise.
cargo build.
If you do build, it must complete with Finished. If it fails, fix the build error first.
# First look for a healthy reusable session
bun scripts/agentic/session-state.ts --list
bash scripts/agentic/session.sh status default
# Start or resume a named session (works from any shell)
# session.sh waits for the APP_READY log marker instead of sleeping
SESSION_JSON="$(bash scripts/agentic/session.sh start default 2>/dev/null)"
APP_PID="$(printf '%s' "$SESSION_JSON" | jq -r '.pid')"
PIPE="$(printf '%s' "$SESSION_JSON" | jq -r '.pipe')"
LOG="$(printf '%s' "$SESSION_JSON" | jq -r '.log')"
READY="$(printf '%s' "$SESSION_JSON" | jq -r '.ready // false')"
READY_WAIT_MS="$(printf '%s' "$SESSION_JSON" | jq -r '.readyWaitMs // 0')"
# Fallback only if readiness marker was not observed.
if [ "$READY" != "true" ]; then
sleep 0.5
fi
The session wrapper manages the named pipe, forwarder process, and PID tracking.
Sessions are reusable across shells; no exec 3> / fd 3 trick is required.
session.sh start means the app is stdin-ready, not necessarily capture-ready.
Prefer resume over cold start. A warm session plus state-only receipts should be the default path.
Session commands:
bash scripts/agentic/session.sh start [NAME] # Create or resume (default: "default")
bash scripts/agentic/session.sh send NAME CMD # Send JSON command
bash scripts/agentic/session.sh status [NAME] # Check session state (JSON)
bash scripts/agentic/session.sh stop [NAME] # Stop and clean up
bun scripts/agentic/session-state.ts --session NAME # Detailed state report
bun scripts/agentic/session-state.ts --list # List all sessions
All commands emit stable JSON envelopes on stdout (schemaVersion, status, payload).
Diagnostics go to stderr.
start is idempotent: re-running it resumes an existing healthy session.
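A caller that consumes these envelopes from a script might validate them before reading any field. The sketch below assumes only the three envelope keys named above (`schemaVersion`, `status`, `payload`); the `SessionEnvelope` type and `parseEnvelope` helper are hypothetical, and the status values and payload contents are not specified by this document.

```typescript
// Envelope keys named in the text: schemaVersion, status, payload.
// Status values and payload contents are assumptions left untyped here.
interface SessionEnvelope {
  schemaVersion: number;
  status: string;
  payload: unknown; // undefined when the command emits no payload
}

// Parses stdout from a session command and rejects anything that does not
// look like an envelope, so callers never read fields from a malformed line.
function parseEnvelope(stdout: string): SessionEnvelope {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("envelope is not a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const schemaVersion = record["schemaVersion"];
  const status = record["status"];
  if (typeof schemaVersion !== "number") {
    throw new Error("envelope is missing a numeric schemaVersion");
  }
  if (typeof status !== "string") {
    throw new Error("envelope is missing a string status");
  }
  return { schemaVersion, status, payload: record["payload"] };
}
```

Because diagnostics go to stderr, only stdout should be passed to a parser like this.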
Alternative (legacy, single-shell only):
PIPE=$(mktemp -u)
mkfifo "$PIPE"
export SCRIPT_KIT_AI_LOG=1
./target/debug/script-kit-gpui < "$PIPE" > /tmp/sk-test.log 2>&1 &
APP_PID=$!
exec 3>"$PIPE"
sleep 3
# Session-based (any shell)
bash scripts/agentic/session.sh send default '{"type":"show"}'
sleep 1.5
The app starts hidden. State-only proofs should usually skip this step entirely.
Show the window only for screenshots, native input, or other proofs that require the real visible surface.
Send commands via the session. Common commands:
S="bash scripts/agentic/session.sh send default"
# Set filter text
$S '{"type":"setFilter","text":"search term"}'
# Read current state without touching focus
bash scripts/agentic/session.sh rpc default '{"type":"getState","requestId":"s1"}' --expect stateResult
# Discover visible elements (returns semantic IDs)
bash scripts/agentic/session.sh rpc default '{"type":"getElements","requestId":"e1"}' --expect elementsResult
# Discover an attached popup or detached surface directly by target
bash scripts/agentic/session.sh rpc default '{"type":"getElements","requestId":"e2","target":{"type":"kind","kind":"actionsDialog","index":0}}' --expect elementsResult
# Select element by semantic ID (from getElements response)
bash scripts/agentic/session.sh rpc default '{"type":"batch","requestId":"b1","commands":[{"type":"selectBySemanticId","semanticId":"choice:0:apple","submit":true}]}' --expect batchResult
# When supported, mutate popup state directly instead of typing through native focus
bash scripts/agentic/session.sh rpc default '{"type":"batch","requestId":"b2","target":{"type":"kind","kind":"actionsDialog","index":0},"commands":[{"type":"setInput","text":"alias"}]}' --expect batchResult
# Trigger a built-in view
$S '{"type":"triggerBuiltin","name":"clipboard"}'
$S '{"type":"triggerBuiltin","name":"tab-ai"}'
$S '{"type":"triggerBuiltin","name":"emoji"}'
$S '{"type":"triggerBuiltin","name":"apps"}'
$S '{"type":"triggerBuiltin","name":"file-search"}'
# Simulate keys (dispatches to current view; not suitable for interceptor bugs)
$S '{"type":"simulateKey","key":"enter","modifiers":[]}'
$S '{"type":"simulateKey","key":"escape","modifiers":[]}'
$S '{"type":"simulateKey","key":"k","modifiers":["cmd"]}'
$S '{"type":"simulateKey","key":"w","modifiers":["cmd"]}'
# Prefer GPUI event dispatch over simulateKey when you need the real key pipeline
bash scripts/agentic/session.sh rpc default '{"type":"simulateGpuiEvent","requestId":"g1","target":{"type":"main"},"event":{"type":"keyDown","key":"down","modifiers":[]}}' --expect simulateGpuiEventResult
# Type individual characters (for views with text input)
$S '{"type":"simulateKey","key":"h","modifiers":[]}'
# Query ACP state (returns input, cursor, picker, accepted item, thread status)
bash scripts/agentic/session.sh rpc default '{"type":"getAcpState","requestId":"acp1"}' --expect acpStateResult
mkdir -p .test-screenshots
bash scripts/agentic/session.sh send default '{"type":"captureWindow","title":"","path":"'"$(pwd)"'/.test-screenshots/step-01.png"}'
sleep 1

- title is substring match. "" matches any window.
- title: "" or the resolver-driven verify-shot.ts / window.ts flow. Do not assume the title contains ACP Chat.
- $(pwd)/ prefix.
- captureWindow does not allow arbitrary /tmp/*.png output paths.
- sleep 1 after capture for file write.

grep -i "keyword" /tmp/sk-test.log | head -20
Log format: TIMESTAMP|LEVEL|CATEGORY|cid=CORRELATION_ID message
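When grep output needs to be filtered by correlation ID or level, the log format above can be split into fields. This is a minimal sketch based only on that one-line format description; the `LogEntry` type and `parseLogLine` helper are hypothetical, and the timestamp is kept as an opaque string because its exact format is not stated here.

```typescript
interface LogEntry {
  timestamp: string; // kept opaque; the exact timestamp format is not assumed
  level: string;
  category: string;
  cid: string;
  message: string;
}

// Parses "TIMESTAMP|LEVEL|CATEGORY|cid=CORRELATION_ID message".
// Returns null for lines that do not follow the format.
function parseLogLine(line: string): LogEntry | null {
  // The first three fields may not contain "|"; the message may.
  const match = /^([^|]*)\|([^|]*)\|([^|]*)\|cid=(\S+) ?(.*)$/.exec(line);
  if (match === null) {
    return null;
  }
  const [, timestamp = "", level = "", category = "", cid = "", message = ""] = match;
  return { timestamp, level, category, cid, message };
}
```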
# Session-based (preferred)
bash scripts/agentic/session.sh stop default
# Verify the session is actually gone before reporting success
bash scripts/agentic/session.sh status default
# Legacy fd 3 cleanup (single-shell only)
# exec 3>&-
# rm -f "$PIPE"
# kill $APP_PID 2>/dev/null || true
# wait $APP_PID 2>/dev/null || true
Cleanup is mandatory, even after failures or interrupted runs.

- session.sh, run session.sh stop NAME and verify the session is no longer alive.
- wait for it.
- script-kit-gpui processes behind from agentic testing.

| Action | Wait strategy |
|---|---|
| App startup | session.sh start readiness wait; fallback 0.5s only if ready=false |
| Warm session reuse | Prefer 0-1s status / resume over creating a fresh process |
| State-only proof | Aim for 3-10s total; no screenshot or OS focus |
| show window | 0.3s macOS focus-settling delay |
| setFilter | 1s sleep or waitFor stateMatch |
| triggerBuiltin (opens new view) | waitFor appropriate condition |
| simulateKey (view transition) | 1.5s sleep |
| simulateKey (text input) | 0.1s sleep |
| captureWindow | 1s sleep (file write) |
| ACP context bootstrap | waitFor(acpReady, timeout=8000) |
| ACP picker open | waitFor(acpPickerOpen, timeout=3000) |
| ACP picker accept | waitFor(acpItemAccepted, timeout=3000) |
| ACP response streaming | 10-20s or waitFor(acpStatus) |
Rule: Use waitFor for all ACP state transitions. Only use fixed sleeps
for macOS focus-settling (0.3s) and file I/O (1s screenshot write).
Rule: Do not add a fixed sleep 3 after session.sh start. The session
wrapper is responsible for readiness. Only use the 0.5s fallback when ready=false.
Rule: If a non-visual proof is trending beyond this budget, stop and redesign around getElements, getState, waitFor, batch, exact targets, or session reuse before escalating.
Use scripts/agentic/session.sh instead of hand-rolling mkfifo + exec 3> in ad hoc shells.
Why: The exec 3>"$PIPE" pattern ties the pipe to a single shell process. When a coding agent
spawns a new shell (e.g., follow-up verification step), fd 3 does not exist and the session is lost.
The session wrapper uses a background forwarder process so any shell can send commands via
session.sh send.
Rules:

- session.sh start instead of manual mkfifo + exec 3> for new verification flows
- session.sh send for fire-and-forget stdin commands like show, triggerBuiltin, setFilter, and captureWindow
- session.sh rpc for protocol requests that expect a typed response like getAcpState, getElements, waitFor, batch, and inspectAutomationWindow
- session.sh status or session-state.ts before sending commands
- session.sh stop when done; do not leave orphan processes

These are practical lessons from real ACP verification runs in this repo.

- If session.sh start reports a dead session even though the log reached STARTUP_READY, inspect the log before assuming the app crashed. In some debug runs the wrapper/forwarder dies while script-kit-gpui is still healthy. When that happens, switch to the legacy single-shell FIFO fallback so you can keep stdin open yourself.
- window.ts and macos-input.ts --ensure-focus may fail against the debug binary because the process name is script-kit-gpui, not the bundled Script Kit app identity. If focus/capture helpers cannot find the app, use System Events targeting process "script-kit-gpui" directly.
- screencapture -R<x,y,w,h> can be more reliable than the bundle-oriented window resolver. Use runtime automation bounds or System Events window position/size to compute the region.
- getAcpState after each step so you do not misread a correct first delete as a failure.

Use verify-shot.ts for automated screenshot + state verification. It enforces the correct ACP verification order: state receipt first, screenshot second.
# Basic: capture screenshot with ACP state assertions
bun scripts/agentic/verify-shot.ts --session default \
--label step-name \
--acp-status idle \
--acp-picker-closed \
--acp-context-ready
# Assert picker is open after typing @
bun scripts/agentic/verify-shot.ts --session default \
--label picker-open \
--acp-picker-open
# Assert item was accepted after Enter/Tab
bun scripts/agentic/verify-shot.ts --session default \
--label item-accepted \
--acp-picker-closed \
--acp-item-accepted
# State-only (skip screenshot)
bun scripts/agentic/verify-shot.ts --session default \
--label quick-check \
--skip-screenshot \
--acp-input-contains "@context"
# Screenshot-only (skip state query)
bun scripts/agentic/verify-shot.ts --session default \
--label visual-check \
--skip-state
Available assertions:
| Flag | Checks |
|---|---|
| --acp-status STATUS | ACP status equals value (idle, streaming, etc.) |
| --acp-picker-open | Picker overlay is visible |
| --acp-picker-closed | Picker overlay is closed |
| --acp-input-contains STR | Input text contains substring |
| --acp-input-match STR | Input text matches exactly |
| --acp-cursor-at N | Cursor at character index N |
| --acp-item-accepted | A picker item was accepted (lastAcceptedItem non-null) |
| --acp-accepted-label STR | lastAcceptedItem.label equals STR |
| --acp-accepted-trigger STR | lastAcceptedItem.trigger equals STR (@ or /) |
| --acp-accepted-via KEY | Probe confirms acceptance via enter or tab |
| --acp-cursor-after-accepted N | Probe confirms cursor landed at index N after acceptance |
| --acp-context-ready | Context bootstrap complete |
| --acp-no-selection | No text selection active (hasSelection is false) |
| --acp-has-selection | Text selection is active (hasSelection is true) |
| --acp-no-permission | No pending permission (hasPendingPermission is false) |
| --acp-has-permission | Pending permission present (hasPendingPermission is true) |
| --acp-visible-start N | inputLayout.visibleStart equals N (first visible char index) |
| --acp-visible-end N | inputLayout.visibleEnd equals N (last visible char index) |
| --acp-cursor-in-window N | inputLayout.cursorInWindow equals N (cursor position in viewport) |
Proof bundle fields: The receipt includes stable top-level fields for machine consumption:
state (ACP snapshot), probe (test probe snapshot), screenshot (path + capture metadata),
captureTarget (requested vs actual window ID for identity proof),
visionCrops (structured image check entries). These are the canonical fields for automated parsing.
Capture identity threading: Detached ACP screenshots use the inspected
native osWindowId, not the automation window ID. When --target-json is
present, verify-shot.ts auto-lifts inspection.osWindowId into the screenshot
step. An explicit --capture-window-id is only an override and must match the
inspected osWindowId. The receipt exposes captureTarget.requestedWindowId,
captureTarget.actualWindowId, captureRouting, requestedAutomationWindowId,
and inspectionOsWindowId.
Exit codes: 0 = pass, 1 = assertion failure, 2 = infrastructure error.
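A wrapper that shells out to verify-shot.ts can keep assertion failures separate from infrastructure errors. The mapping below uses only the three exit codes stated above; the `VerifyOutcome` labels and both helper functions are illustrative names, and any other exit code is treated as unknown rather than guessed at.

```typescript
type VerifyOutcome = "pass" | "assertion-failure" | "infrastructure-error" | "unknown";

// Maps the documented exit codes to an outcome label.
function classifyExitCode(code: number): VerifyOutcome {
  switch (code) {
    case 0:
      return "pass";
    case 1:
      return "assertion-failure";
    case 2:
      return "infrastructure-error";
    default:
      return "unknown";
  }
}

// Only a pass or an assertion failure says something about the product.
// An infrastructure error means the proof did not run to completion.
function isProductSignal(outcome: VerifyOutcome): boolean {
  return outcome === "pass" || outcome === "assertion-failure";
}
```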
Use visible-text-window assertions to verify single-line input rendering and cursor tracking without a screenshot:
bun scripts/agentic/verify-shot.ts --session default \
--label input-stability \
--skip-screenshot \
--acp-visible-start 12 \
--acp-visible-end 52 \
--acp-cursor-in-window 39
This proves the cursor is within the visible window and the viewport bounds are stable, which catches scroll jumps, layout shifts, and cursor-out-of-view regressions.
Strict capture: When ACP assertions are present, verify-shot.ts requires
window.ts quartz capture with frontmost confirmation and the exact inspected
native window ID. If focus drifts, the inspected osWindowId is missing, or the
captured windowId differs from the requested ID, the run fails instead of
silently falling back to a full-screen screenshot.
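The strict-capture rule can be expressed as a check over the receipt's identity fields. This is a sketch under stated assumptions: the field names `requestedWindowId` and `actualWindowId` come from the receipt description above, but typing the window IDs as numbers is an assumption, and `checkCaptureIdentity` is a hypothetical helper, not part of verify-shot.ts.

```typescript
// Field names follow the receipt fields listed above.
// Representing window IDs as numbers is an assumption.
interface CaptureTarget {
  requestedWindowId?: number;
  actualWindowId?: number;
}

// Returns a list of identity problems; an empty list means the capture
// provably targeted the inspected native window.
function checkCaptureIdentity(target: CaptureTarget, inspectionOsWindowId?: number): string[] {
  const problems: string[] = [];
  const requested = target.requestedWindowId;
  const actual = target.actualWindowId;
  if (inspectionOsWindowId === undefined) {
    problems.push("inspected osWindowId is missing");
  }
  if (requested === undefined) {
    problems.push("requestedWindowId is missing");
  } else if (inspectionOsWindowId !== undefined && requested !== inspectionOsWindowId) {
    problems.push("requested window ID does not match inspected osWindowId");
  }
  if (actual === undefined) {
    problems.push("actualWindowId is missing");
  } else if (requested !== undefined && actual !== requested) {
    problems.push("captured window ID differs from requested ID");
  }
  return problems;
}
```

Any non-empty result should fail the run; it is not a cue to fall back to a full-screen screenshot.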
Rule: The recipe must fail when ACP state contradicts expected picker/caret outcome, even if the screenshot capture itself succeeds. State receipt is the primary proof; screenshot is secondary visual confirmation.
Always prefer the canonical CLI over ad hoc shell sequences. The orchestrator encodes the correct verification order, focus enforcement, probe resets, and checkpoint strategy so agents do not need to reconstruct these from scratch.
Use surface-proof as the default seconds-first proof command for an already-open product surface. For main-hosted surfaces, enter through the real runtime command and keep the proof state-first.
bash scripts/agentic/session.sh start default
bash scripts/agentic/session.sh send default '{"type":"triggerBuiltin","name":"clipboard"}' --await-parse
bun scripts/agentic/index.ts surface-proof --session default --kind main
bash scripts/agentic/session.sh stop default
bash scripts/agentic/session.sh status default
# Advanced exact-target proofs when a popup or detached surface already exists:
bun scripts/agentic/index.ts surface-proof --session default --kind promptPopup --index 0
bun scripts/agentic/index.ts surface-proof --session default --kind acpDetached --index 0
This path reuses a warm session, promotes the target through automation-window.ts inspect, samples getState and getElements, returns a machine-readable proof bundle, and does not call show, native input, or screenshot capture unless the proof explicitly needs that escalation.
Sample output shape:
{
"schemaVersion": 1,
"recipe": "surface-proof",
"status": "pass",
"summary": "State-first main proof succeeded for main",
"proofBundle": {
"schemaVersion": 2,
"scenario": "main-window-exact-id",
"surfaceClass": "main",
"resolvedTarget": {
"windowId": "main",
"windowKind": "Main"
},
"targetIdentity": { "stable": true },
"usage": {
"stateFirst": true,
"usedGetState": true,
"usedGetElements": true,
"usedScreenshot": false,
"usedNativeInput": false,
"usedShow": false,
"usedFixedSleepMs": 0
},
"capabilities": {
"state": true,
"elements": true,
"nativeInputRequired": false,
"screenshotRequired": false
},
"state": { "type": "stateResult" },
"elements": { "type": "elementsResult" },
"warnings": []
}
}
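A consumer of this bundle can confirm that a proof stayed state-first by inspecting the `usage` block. The field names below mirror the sample output; the `ProofUsage` type and `listEscalations` helper are hypothetical and only illustrate how the flags might be read.

```typescript
// Mirrors the "usage" object in the sample output above.
interface ProofUsage {
  stateFirst: boolean;
  usedGetState: boolean;
  usedGetElements: boolean;
  usedScreenshot: boolean;
  usedNativeInput: boolean;
  usedShow: boolean;
  usedFixedSleepMs: number;
}

// Lists every way a proof escalated beyond state-only verification.
// An empty list means the run used no show, native input, screenshot, or fixed sleep.
function listEscalations(usage: ProofUsage): string[] {
  const found: string[] = [];
  if (!usage.stateFirst) found.push("not state-first");
  if (usage.usedScreenshot) found.push("screenshot");
  if (usage.usedNativeInput) found.push("native input");
  if (usage.usedShow) found.push("show");
  if (usage.usedFixedSleepMs > 0) found.push("fixed sleep " + usage.usedFixedSleepMs + "ms");
  return found;
}
```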
# Full ACP picker accept: choose key with --key enter|tab
bun scripts/agentic/index.ts acp-accept --session default --key enter
bun scripts/agentic/index.ts acp-accept --session default --key tab --vision
# Target a specific ACP window (detached/popup): resolve exact identity first
RESOLVED="$(bun scripts/agentic/automation-window.ts resolve --session default --kind acpDetached --index 0)"
TARGET="$(printf '%s' "$RESOLVED" | jq -c '.targetJson')"
SURFACE_ID="$(printf '%s' "$RESOLVED" | jq -r '.surfaceId')"
bun scripts/agentic/index.ts acp-accept --session default --key enter \
--target-json "$TARGET" --surface "$SURFACE_ID" --vision
When verifying a detached or popup ACP window, resolve one target once and reuse it for every RPC and native input step in the entire run.
Canonical rule:

- bun scripts/agentic/window.ts list).
- --target-json object (e.g., {"type":"kind","kind":"acpDetached","index":0}).
- getAcpState, getAcpTestProbe, resetAcpTestProbe, waitFor, and batch.
- --surface value to native input so focus and proof stay on the same window.

The --target-json flag threads through index.ts -> verify-shot.ts -> every RPC command, and the --surface flag threads through index.ts -> macos-input.ts -> window.ts for focus enforcement.
When --target-json is omitted, RPCs default to the main ACP view (existing behavior).
What acp-accept guarantees:

- macos-input.ts --ensure-focus for native typing and acceptance
- acpAcceptedViaKey (key-specific proof, not generic acpItemAccepted)
- --vision is requested
- --vision is used, surfaces the full proof bundle (with state, probe, screenshot, visionCrops) as proofBundle in the recipe receipt

# Check all prerequisites
bun scripts/agentic/index.ts preflight --session default
# Open ACP and verify ready state (state-only, no screenshot)
bun scripts/agentic/index.ts acp-open --session default
# Compatibility aliases (same as --key enter / --key tab)
bun scripts/agentic/index.ts acp-enter-accept --session default
bun scripts/agentic/index.ts acp-tab-accept --session default
# Hard-scenario recipes
bun scripts/agentic/index.ts acp-detached-target-threading-stress \
--session default --kind acpDetached --index 0 --min-targets 2 --key enter --vision --json
bun scripts/agentic/index.ts acp-prompt-popup-parity \
--session default --families mention,model-selector,local-history --json
bun scripts/agentic/index.ts notes-acp-delayed-action-origin-stress \
--session default --drift generation --json
bun scripts/agentic/index.ts file-portal-origin-roundtrip \
--session default --origin acp --portal file-search --selection file --query AGENTS.md --json
bun scripts/agentic/index.ts permission-privacy-preflight \
--session default --kinds accessibility,screen-recording,microphone --json
bun scripts/agentic/index.ts shortcut-recorder-focus-capture \
--session default --surface shortcuts --action test-agentic-shortcut --chord cmd+shift+7 --sandbox-config --json
bun scripts/agentic/index.ts template-prompt-automation-parity-stress \
--session default --template 'Hello {{name}}' --field name --value Ada --forced-value forced-template-result --json
bun scripts/agentic/index.ts current-app-commands-frontmost-stress \
--session default --alias 'Do in Current Command' --query 'close tab' --json
bun scripts/agentic/index.ts actions-captured-subject-frame-stress \
--session default --source root-file --action quick-look --mutation filter-selection-cache-frame --json
bun scripts/agentic/index.ts drop-prompt-native-drop-privacy-stress \
--session default --file-name agentic-drop.txt --size 12 --json
bun scripts/agentic/index.ts path-prompt-filesystem-edge-stress \
--session default --json
bun scripts/agentic/index.ts screenshot-identity-acp-context-stress \
--session default --source tab-ai-screenshot --json
bun scripts/agentic/index.ts clipboard-history-portal-range-stress \
--session default --portal-id 'kit://clipboard-history?id=agentic' --range composer:0..0 --json
bun scripts/agentic/index.ts browser-tabs-cache-identity-stress \
--session default --source browser-tabs --json
bun scripts/agentic/index.ts scroll-selection-reanchor-stress \
--session default --kinds clipboard,browser-history,current-app-commands,file-search --json
bun scripts/agentic/index.ts accessibility-tree-semantic-parity-stress \
--session default --surfaces main,actionsDialog,promptPopup --json
bun scripts/agentic/index.ts rtl-bidi-emoji-text-rendering-stress \
--session default --surface acp-composer --text 'abc שלום 👩🏽‍💻 é مرحبا 123' --json
bun scripts/agentic/index.ts high-volume-virtualized-list-stability-stress \
--session default --surface clipboard-history --fixture-count 5000 --filter-cycles 8 --scroll-cycles 12 --json
bun scripts/agentic/index.ts input-modality-transition-ownership-stress \
--session default --surface main --interleave pointer-hover,keyboard-nav,trackpad-scroll,wheel-scroll,shortcut --cycles 8 --json
bun scripts/agentic/index.ts multi-context-attachment-dedupe-provenance-stress \
--session default --origins file,screenshot,selected-text,mcp-resource,clipboard-snippet --destinations acp-composer,notes --reorder-cycles 3 --json
bun scripts/agentic/index.ts visual-contrast-readable-state-stress \
--session default --surfaces main,actionsDialog,promptPopup,acp-composer,notes --themes light,dark --scale-factors 1,1.25,1.5 --states active,inactive,disabled,focused,error,loading --json
bun scripts/agentic/index.ts empty-error-retry-state-ux-stress \
--session default --surfaces main,clipboard-history,emoji-picker,file-search --query 'agentic-loop-eighteen-no-results-zzzz' --json
bun scripts/agentic/index.ts form-validation-inline-recovery-stress \
--session default --surface fields-prompt --fields email,required-text,number --invalid email:not-an-email,required-text:,number:not-a-number --valid email:ada@example.com,required-text:Ada,number:42 --json
bun scripts/agentic/index.ts navigation-back-stack-history-stress \
--session default --origin main --surfaces clipboard-history,emoji-picker,file-search,actionsDialog --transitions triggerBuiltin,cmd-k,escape,back --json
| Checkpoint | Screenshot? | Probe? | Why |
|---|---|---|---|
| ACP ready | No | No | waitFor(acpReady) is sufficient proof; screenshot is waste |
| Picker open | No | No | waitFor(acpPickerOpen) is sufficient proof |
| Final accepted | Yes | Yes | The only checkpoint that needs visual + probe evidence |
Rule: Intermediate checkpoints use state-only verification (--skip-screenshot --skip-probe).
Only the final acceptance step captures a screenshot and queries the probe.
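The rule above can be expressed as a small helper that chooses verification flags per checkpoint. This is a minimal sketch: the `Checkpoint` type and `verificationFlags` function are hypothetical names for illustration, and only the `--skip-screenshot` and `--skip-probe` flags come from this document.

```typescript
// Hypothetical helper; only the two --skip-* flags are taken from the document.
type Checkpoint = { name: string; final: boolean };

function verificationFlags(checkpoint: Checkpoint): string[] {
  // Intermediate checkpoints use state-only verification;
  // the final acceptance step keeps both screenshot and probe.
  return checkpoint.final ? [] : ["--skip-screenshot", "--skip-probe"];
}

const checkpoints: Checkpoint[] = [
  { name: "ACP ready", final: false },
  { name: "Picker open", final: false },
  { name: "Final accepted", final: true },
];

for (const c of checkpoints) {
  const flags = verificationFlags(c).join(" ") || "(screenshot + probe)";
  console.log(`${c.name}: ${flags}`);
}
```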
Each recipe returns a machine-readable JSON receipt:
{
"schemaVersion": 1,
"recipe": "acp-enter-accept",
"status": "pass",
"steps": [
{ "name": "acp-open", "status": "pass" },
{ "name": "reset-probe", "status": "pass" },
{ "name": "type-at-trigger", "status": "pass" },
{ "name": "wait-accepted-via-key", "status": "pass" },
{ "name": "verify-accepted", "status": "pass" }
]
}
When --vision is used, a proofBundle field is added containing the verify-shot receipt
with state, probe, screenshot, and visionCrops for direct machine consumption.
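A consumer can gate on the receipt without looking at any screenshot. The sketch below assumes only the fields shown in the example receipt; `failedSteps` and `receiptPassed` are hypothetical helper names, and `proofBundle` is left opaque because its exact shape is not specified here.

```typescript
// Shape mirrors the example receipt; proofBundle is intentionally opaque.
interface StepReceipt {
  name: string;
  status: string;
}

interface RecipeReceipt {
  schemaVersion: number;
  recipe: string;
  status: string;
  steps: StepReceipt[];
  proofBundle?: unknown;
}

// Names of the steps that did not pass, in receipt order.
function failedSteps(receipt: RecipeReceipt): string[] {
  return receipt.steps.filter((s) => s.status !== "pass").map((s) => s.name);
}

// Hypothetical gate: the top-level status and every step must be "pass".
function receiptPassed(receipt: RecipeReceipt): boolean {
  return receipt.status === "pass" && failedSteps(receipt).length === 0;
}

const sample: RecipeReceipt = JSON.parse(`{
  "schemaVersion": 1,
  "recipe": "acp-enter-accept",
  "status": "pass",
  "steps": [
    { "name": "acp-open", "status": "pass" },
    { "name": "verify-accepted", "status": "pass" }
  ]
}`);

console.log(receiptPassed(sample), failedSteps(sample));
```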
The wrapper does not replace the lower-level commands; use session.sh,
macos-input.ts, window.ts, and verify-shot.ts directly when you need
finer control.
The mandatory verification flow for any ACP interaction testing.
Prefer the canonical CLI (bun scripts/agentic/index.ts acp-accept) over
reconstructing the manual steps below.
bash scripts/agentic/session.sh start default
bun scripts/agentic/index.ts acp-accept --session default --key enter --vision
# The recipe returns a machine-readable JSON receipt with proofBundle.
# Parse proofBundle.state, proofBundle.probe, proofBundle.screenshot, proofBundle.visionCrops
# to verify ACP behavior programmatically, then read the written PNG for final visual confirmation.
bash scripts/agentic/session.sh stop default
The scenario recipe resolves one exact detached ACP target once, reuses
the exact targetJson for every subsequent step, and emits a structured
proof bundle recording windowId, surfaceId, and ordered step receipts.
bash scripts/agentic/session.sh start default
bun scripts/agentic/index.ts scenario \
--session default \
--scenario detached-acp-exact-id \
--index 0
bash scripts/agentic/session.sh stop default
The proof bundle is the regression substrate: every step records the exact
target used, the full request/response pair, and a timestamp. Exit code 0
means all steps succeeded; exit code 1 means some steps produced warnings.
For finer-grained control (e.g., picker acceptance flows), resolve one exact
ACP target once and reuse both the target and surfaceId for the full run.
Do not use loose family-level --surface acp; use the exact surfaceId
from the resolver.
bash scripts/agentic/session.sh start default
# Resolve exact target and surface identity once
RESOLVED="$(bun scripts/agentic/automation-window.ts resolve --session default --kind acpDetached --index 0)"
TARGET="$(printf '%s' "$RESOLVED" | jq -c '.targetJson')"
SURFACE_ID="$(printf '%s' "$RESOLVED" | jq -r '.surfaceId')"
bun scripts/agentic/index.ts acp-accept --session default --key enter \
--target-json "$TARGET" --surface "$SURFACE_ID" --vision
INSPECTED="$(bun scripts/agentic/automation-window.ts inspect --session default --id "$(printf '%s' "$RESOLVED" | jq -r '.automationWindowId')")"
WINDOW_ID="$(printf '%s' "$INSPECTED" | jq -r '.osWindowId')"
bun scripts/agentic/index.ts acp-accept --session default --key enter \
--target-json "$TARGET" --surface "$SURFACE_ID" --vision
bun scripts/agentic/verify-shot.ts --session default --label detached-proof \
--target-json "$TARGET" --capture-window-id "$WINDOW_ID"
# Confirm proofBundle.state.resolvedTarget.windowKind == "acpDetached"
# Confirm captureTarget.requestedWindowId == captureTarget.actualWindowId
bash scripts/agentic/session.sh stop default
The --vision flag makes the recipe return a proofBundle containing all
machine-readable proof. The golden path is complete when the exit code is 0
and the proofBundle.state and proofBundle.probe fields confirm the expected
ACP state. Screenshot files are still written for archival but are not the
primary verification mechanism.
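The completion criteria and the two confirmation comments in the detached flow can be checked mechanically. The sketch below is a minimal illustration: the field paths `proofBundle.state.resolvedTarget.windowKind`, `captureTarget.requestedWindowId`, and `captureTarget.actualWindowId` come from this document, but gathering them into one `DetachedProof` object, and the `detachedProofProblems` helper itself, are assumptions made for the example.

```typescript
// Hypothetical container that gathers the values the flow asks you to confirm.
interface DetachedProof {
  exitCode: number;
  proofBundle: { state: { resolvedTarget: { windowKind: string } } };
  captureTarget: { requestedWindowId: number; actualWindowId: number };
}

// Returns a list of human-readable problems; an empty list means the proof holds.
function detachedProofProblems(proof: DetachedProof): string[] {
  const problems: string[] = [];
  if (proof.exitCode !== 0) {
    problems.push(`exit code was ${proof.exitCode}`);
  }
  const kind = proof.proofBundle.state.resolvedTarget.windowKind;
  if (kind !== "acpDetached") {
    problems.push(`windowKind was ${kind}`);
  }
  const { requestedWindowId, actualWindowId } = proof.captureTarget;
  if (requestedWindowId !== actualWindowId) {
    problems.push(`captured window ${actualWindowId}, requested ${requestedWindowId}`);
  }
  return problems;
}

const goodProof: DetachedProof = {
  exitCode: 0,
  proofBundle: { state: { resolvedTarget: { windowKind: "acpDetached" } } },
  captureTarget: { requestedWindowId: 7, actualWindowId: 7 },
};

console.log(detachedProofProblems(goodProof));
```

Returning a list rather than a boolean keeps every mismatch visible in one run, which is useful when a wrong window was both resolved and captured.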
1. session start → session alive
2. show → window visible
3. triggerBuiltin tab-ai → ACP opens
4. waitFor(acpReady, timeout=8000) → context bootstrapped (deterministic)
5. focus window → frontmost confirmed
6. native type @ (macos-input.ts --ensure-focus) → open picker
7. waitFor(acpPickerOpen, timeout=3000) → picker open (deterministic)
8. native Enter or Tab (macos-input.ts --ensure-focus) → accept picker item
9. waitFor(acpAcceptedViaKey, timeout=3000) → key-specific acceptance (deterministic)
10. verify-shot.ts with --acp-accepted-via → state + probe + screenshot proof
Key tools in the golden path:
| Tool | Role |
|---|---|
| session.sh | Cross-shell session management, RPC, lifecycle |
| macos-input.ts | Native macOS keyboard/mouse with --ensure-focus |
| window.ts | Window discovery, focus, activation, quartz capture |
| verify-shot.ts | State + probe + screenshot bundle with strict capture |
| automation-window.ts | Exact ACP target-to-surface resolver |
| scenario.ts | Replayable proof-bundle scenarios for cross-window regression |
| index.ts | Orchestrator that composes all of the above correctly |
waitFor replaces fixed sleeps. Each waitFor polls at 25ms intervals
and returns a waitForResult receipt with success, elapsed, and an
optional trace (included automatically on failure when trace: "onFailure").
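To make the polling contract concrete, here is a client-side sketch of the same idea. It is not the app's implementation: the function signature, the string-based trace entries, and the in-process `condition` callback are assumptions for illustration. Only the 25ms interval and the `success`, `elapsed`, and `trace` result fields come from the description above.

```typescript
interface WaitForResult {
  success: boolean;
  elapsed: number;  // milliseconds since polling started
  trace?: string[]; // only attached on failure, mirroring trace: "onFailure"
}

// Polls `condition` every `intervalMs` until it holds or `timeoutMs` has elapsed.
async function waitFor(
  condition: () => boolean,
  timeoutMs: number,
  intervalMs = 25,
): Promise<WaitForResult> {
  const started = Date.now();
  const trace: string[] = [];
  for (;;) {
    const elapsed = Date.now() - started;
    if (condition()) return { success: true, elapsed };
    trace.push(`poll at ${elapsed}ms: condition not satisfied`);
    if (elapsed >= timeoutMs) return { success: false, elapsed, trace };
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage: a flag standing in for picker state flips after a short delay.
let pickerOpen = false;
setTimeout(() => {
  pickerOpen = true;
}, 60);
waitFor(() => pickerOpen, 3000).then((result) => {
  console.log(result.success, result.elapsed);
});
```

Unlike a fixed sleep, the call returns as soon as the condition holds, and a timeout yields a receipt that explains what was observed instead of failing silently.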
State receipt before screenshot is non-negotiable. If the state says the picker is still open but the screenshot looks fine, the test must FAIL.
Any remaining sleeps in the recipes are brief macOS focus-settling delays (~300ms) with explicit comments. They are not proof of ACP state.
See references/recipes.md for named verification patterns.
Other hard-scenario recipes:
bun scripts/agentic/index.ts long-text-wrap-resize-surface-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog --widths mini,narrow,full --fixtures long-name,long-path,long-description,multiline-snippet --json
bun scripts/agentic/index.ts actions-command-discoverability-noop-stress --session default --hosts main,clipboard-history,emoji-picker,file-search,app-launcher --states actionable,disabled,no-op --json
bun scripts/agentic/index.ts dense-list-detail-preview-readability-stress --session default --surfaces file-search,sdk-reference,script-template-catalog --query agentic-loop-nineteen-preview --filter-cycles 4 --selection-cycles 8 --resize-cycles 3 --json
bun scripts/agentic/index.ts toast-notification-queue-lifecycle-stress --session default --surface main --fixtures success,duplicate,persistent,dismiss,autohide --cycles 3 --json
bun scripts/agentic/index.ts destructive-confirm-modal-safety-stress --session default --host main --fixture agentic-destructive-dry-run --paths cancel,confirm,stale-confirm --dry-run-only --json
bun scripts/agentic/index.ts loading-skeleton-progress-restoration-stress --session default --surfaces sdk-reference,script-template-catalog --fixture delayed-local --cycles 4 --json
bun scripts/agentic/index.ts icon-image-fallback-redaction-stress --session default --surfaces app-launcher,file-search,clipboard-history --fixtures missing-file,corrupt-png,private-local-path,data-uri-redacted --json
bun scripts/agentic/index.ts footer-status-persistence-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog --transitions filter,selection,cmd-k,escape,clear-filter --json
bun scripts/agentic/index.ts keyboard-hint-label-parity-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog,menuSyntaxTriggerPopup --families footer,row-accessory,tooltip,action-catalog --json
bun scripts/agentic/index.ts row-state-parity-without-pointer-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog --states selected,focused,hovered,selected-hovered --json
bun scripts/agentic/index.ts quiet-chrome-card-nesting-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog,promptPopup --chrome quiet --json
bun scripts/agentic/index.ts scroll-shadow-sticky-header-density-stress --session default --surfaces clipboard-history,emoji-picker,file-search,app-launcher,actionsDialog --scroll-positions top,middle,bottom --density compact,default --json
bun scripts/agentic/index.ts popup-focus-keycap-visual-semantics-stress --session default --surfaces actionsDialog,menuSyntaxTriggerPopup,confirmPrompt --json
bun scripts/agentic/index.ts reduced-motion-animation-disable-stress --session default --surfaces main,actionsDialog,menuSyntaxTriggerPopup --fixture reduced-motion --json
bun scripts/agentic/index.ts command-search-highlighting-accessory-badges-stress --session default --hosts main,actionsDialog,app-launcher,menuSyntaxTriggerPopup --query agentic-loop-twenty-three --json
bun scripts/agentic/index.ts clipboard-copy-visual-feedback-stress --session default --hosts file-search,actionsDialog,app-launcher --fixture agentic-copy-preview --pasteboard-scope fixture --no-system-pasteboard --json
bun scripts/agentic/index.ts portal-cancel-return-state-restoration-stress --session default --origins acp-composer,notes --portal file-search --query AGENTS.md --cancel-methods escape,back --fixture repo-file --no-native-picker --json
bun scripts/agentic/index.ts tooltip-hover-focus-affordance-stress --session default --surfaces main,actionsDialog,app-launcher --targets truncated-row,disabled-action,footer-button --fixture agentic-tooltips --input-modes protocol-hover,keyboard-focus --no-native-pointer --json
bun scripts/agentic/index.ts shortcut-recorder-cancel-layering-stress --session default --surface shortcuts --action test-agentic-shortcut --cancel-methods escape,cmd-w,backdrop,parent-click --input-modes protocol-key,protocol-click --sandbox-config --no-config-write --json
bun scripts/agentic/index.ts inline-popover-anchor-resize-stress --session default --families acp-slash,acp-mention,menu-syntax-colon --widths mini,narrow,full --fixture agentic-inline-popover --input-modes protocol-key,protocol-resize --no-native-input --json
bun scripts/agentic/index.ts disabled-footer-hit-target-refusal-stress --session default --surfaces drop-prompt,fields-prompt,path-prompt --fixtures empty-drop,invalid-fields,missing-path --input-modes enter,footer-shortcut,protocol-footer-click --no-native-pointer --dry-run-only --json
bun scripts/agentic/index.ts mini-full-transition-layout-continuity-stress --session default --surfaces main,mini-prompt,fields-prompt,actionsDialog --transitions mini-to-full,full-to-mini,hide-show,return-to-origin --fixture agentic-mini-full-layout --input-modes protocol-key,protocol-resize --no-native-input --no-native-pointer --no-system-pasteboard --local-fixture-only --json
bun scripts/agentic/index.ts filter-input-decoration-chip-layout-stress --session default --surfaces main --queries 'f: AGENTS.md,c: agentic,~/script,:actions,;note,!command,literal\\:chip' --widths mini,narrow,full --scale-factors 1,1.25,1.5 --fixture agentic-filter-input-decorations --input-modes protocol-set-filter,protocol-resize --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --local-fixture-only --json
bun scripts/agentic/index.ts focus-ring-viewport-integrity-stress --session default --surfaces main,actionsDialog,fields-prompt,path-prompt --fixture agentic-focus-rings --input-modes protocol-key,simulate-gpui-event --steps tab,shift-tab,up,down,escape --no-native-input --no-native-pointer --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts warning-banner-action-dismiss-semantics-stress --session default --surface main --fixtures warning,actionable,dismissible,error --input-modes protocol-hover,protocol-click,protocol-key --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --local-fixture-only --json
bun scripts/agentic/index.ts select-prompt-multiselect-keyboard-state-stress --session default --surface select-prompt --fixture agentic-multiselect --choices 24 --selection-steps space,cmd-a,filter-preserve,clear-filter,range-toggle,escape-restore --input-modes protocol-key,batch --no-native-input --no-native-pointer --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts file-search-preview-sanitization-stress --session default --surface file-search --fixture agentic-safe-preview --preview-fixtures text,binary,large-text,missing-file,private-path,unsupported-kind --selection-cycles 8 --filter-cycles 4 --input-modes protocol-set-filter,protocol-key,batch --no-native-input --no-native-pointer --no-native-picker --no-quick-look --no-system-pasteboard --local-fixture-only --json
bun scripts/agentic/index.ts hotkey-prompt-transient-capture-cancel-stress --session default --surface hotkey-prompt --fixture agentic-transient-hotkey --chords cmd+shift+7,ctrl+space --cancel-methods escape,cmd-w --input-modes protocol-key,simulate-key --no-native-input --no-native-pointer --no-config-write --no-global-hotkey-registration --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts process-manager-sort-detail-panel-stability-stress --session default --surface process-manager --fixture agentic-process-table --sort-keys name,cpu,memory,pid --selection-cycles 8 --filter-cycles 4 --input-modes protocol-click,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-process-kill --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts env-prompt-redacted-status-error-recovery-stress --session default --surface env-prompt --fixture agentic-env-status --status-fixtures missing-secret,parse-error,masked-existing,valid-edit --input-modes protocol-set-input,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --no-secret-write --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts command-palette-breadcrumb-route-stack-stress --session default --host main --fixture agentic-actions-breadcrumbs --drill-path parent-action,child-action --filter 'switch' --back-methods escape,breadcrumb-click --input-modes protocol-key,protocol-click,batch --no-native-input --no-native-pointer --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts root-source-chip-action-semantics-stress --session default --queries 'f: AGENTS.md,c: agentic,n: welcome,-c: noise' --actions remove-chip,clear-all,toggle-exclude,open-chip-actions --input-modes protocol-click,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts recent-history-dedupe-root-grouping-stress --session default --fixture agentic-root-recents --sources files,notes,clipboard,dictation,acp-history --query agentic-loop-29-dupe --cycles 6 --input-modes protocol-set-filter,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts inline-attachment-preview-chip-stability-stress --session default --hosts acp-composer,notes --fixture agentic-inline-attachments --origins local-file,fixture-image,fixture-text,script-resource --chip-actions focus,preview,remove,reorder,overflow --input-modes protocol-set-input,protocol-click,batch --no-native-input --no-native-pointer --no-native-picker --no-screen-capture --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts window-title-status-semantics-stress --session default --surfaces main,acp-composer,actionsDialog,promptPopup,notes --states idle,busy,error,dirty,ready --transitions triggerBuiltin,cmd-k,escape,hide-show --input-modes protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts menu-syntax-capture-validation-chip-stress --session default --fixture agentic-capture-validation --cases missing-body-date,missing-date,ready,malformed-url,unresolved-date,dynamic-schema --input-modes protocol-set-filter,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts acp-footer-activity-indicator-stress --session default --hosts acp-composer,notes --fixture agentic-acp-footer-activity --activity-fixtures context-capture,tool-call,plan-update,permission-wait,cancelled,idle-recovered --input-modes protocol-state,batch --agent-fixture scripted-local --no-native-input --no-native-pointer --no-security-prompts --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts acp-model-history-popover-visual-state-stress --session default --families model-selector,local-history --fixture agentic-acp-popover-visual-state --states idle,filtered,empty,loading,current-selection,error-recovered --selection-cycles 8 --filter-cycles 4 --input-modes protocol-set-input,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts acp-context-insertion-preview-parity-stress --session default --sources file-search,browser-history,dictation-history,notes --destination acp-composer --fixture agentic-context-preview-parity --selection-cycles 6 --filter-cycles 4 --insert-modes protocol-accept,batch --input-modes protocol-set-filter,protocol-key,batch --no-native-input --no-native-pointer --no-native-picker --no-quick-look --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
Use adjacent skills when the work crosses boundaries:
- $testing-quality-gates for choosing narrow build/test gates.
- $protocol-automation when stdin JSON, receipts, target identity, waitFor, or batch are the behavior owner.
- $acp-chat-core, $actions-popups, $keyboard-focus-routing, $launcher-surface-contracts, or $window-resizing.

- simulateKey does NOT go through GPUI's intercept_keystrokes(). Use triggerBuiltin for ACP Chat entry, not simulateKey Tab.
- AcpChatView accepts single-char simulateKey for typing, enter for submit, w+cmd for close.
- ActionsDialog and PromptPopup can expose targeted state snapshots even when they do not expose an independent GPUI key handle. Read state from the popup target first; only escalate to parent-window or native input when you must drive real key delivery.
- simulateGpuiEvent is better than simulateKey for interceptor bugs, but handle_unavailable means the target has no usable runtime handle for that path. Treat that as a proof-design problem, not a cue to spam retries.
- captureWindow filters out windows under 100x100 (tray icons).
- unset ANTHROPIC_API_KEY.
- Use native input (macos-input.ts --ensure-focus) instead of simulateKey; synthetic keys bypass GPUI's native key interception and do not faithfully exercise picker selection behavior.
- Use getAcpState to verify picker acceptance, cursor landing, and input content; do not rely solely on screenshots for ACP state verification.
- Use waitFor commands via session.sh rpc for deterministic ACP state transitions; do not use fixed sleeps as proof of ACP state.