agentic-testing

Name: Agentic Testing
Author: johnlindquist

// Human-first runtime testing for Script Kit GPUI: operate the real app through visible user paths to surface UX/UI interaction bugs, then back findings with receipts, screenshots, exact targets, and cleanup.

stars:27

forks:6

updated:May 19, 2026 at 04:08

package.json

"author": "johnlindquist"

"repository": "johnlindquist/script-kit-next"


name: agentic-testing
description: Human-first runtime testing for Script Kit GPUI: operate the real app through visible user paths to surface UX/UI interaction bugs, then back findings with receipts, screenshots, exact targets, and cleanup.

Agentic Testing

This canonical repo-local skill owns human-first runtime testing and session cleanup for Script Kit GPUI. It combines the current .agents/skills routing policy with the full operational recipe previously kept under .codex/skills / .claude/skills.

The core purpose is to use the actual app the way a real user would, so that interaction, UX, and UI bugs surface before code is declared correct. Receipts, state snapshots, screenshots, and exact targets exist to make that user-like exploration trustworthy; they are not a substitute for reproducing the real user path.

This skill is not a receipt factory, a contract checklist, or a way to avoid looking at the app. Its foundation is interaction-driving testing: open the real product, exercise the real surface, perform the action a user would perform, and surface the UX/UI bug as an observable interaction failure. If a user reports something visual, behavioral, confusing, broken, clipped, hidden, misfocused, misrouted, unclickable, unreadable, or otherwise wrong in the UI, the default response is to drive the app like a user until the bug is reproduced, disproved, fixed, or explicitly blocked. A state receipt without user-path interaction is not enough.

Human-First Purpose

For user-reported UX/UI bugs, start by recreating the user's visible workflow against the real product surface. Open the app when the report is about what a user sees or does, drive the same entry path, interact with the visible controls, and capture evidence that another person can inspect.

State-first receipts are still required, but they support the user-path proof. A fail-closed receipt contract is useful only after the skill has identified the actual surface, entry path, visible invariant, and interaction sequence the user cares about.

Never let "state-first" become "user-last." State is the witness, not the experience. The experience is the real app being driven through the user's path so bugs actually reveal themselves.

Canonical Ownership

Use this .agents/skills/agentic-testing file as the source of truth for Codex routing and runtime proof. Legacy .codex/skills/agentic-testing and .claude/skills/agentic-testing files are compatibility snapshots only; mine them only when auditing history.

Primary paths and concepts:

scripts/agentic/, .test-output/, .test-screenshots/
Human-like runtime exploration, exact automation targets, screenshots, native input escalation, and cleanup
State-first receipts from getState, getElements, inspectAutomationWindow, waitFor, batch, ACP state/probe APIs, and recipe receipts

First Reads

Start with these sources before editing or proving behavior:

.agents/subagents/agentic-testing-reader.md for broad or high-risk investigation.

Workflow

Review AGENTS.md, the owning skill, and current source context before editing.
Identify the behavior owner before editing shared files. Path ownership is a hint; the user-visible behavior and documented contract decide the owner.
Check adjacent-skill boundaries before changing shared code.
Make the narrowest change that preserves the domain invariant.
For user-visible UX/UI reports, reproduce the real visible workflow first; then verify with the smallest receipt-backed proof that can fail if the behavior regresses.
Report changed files, proof tier, exact commands or receipts, adjacent skills consulted, cleanup status, and remaining risk.

Do not use this skill as the primary owner for test authoring or product ownership decisions; load $testing-quality-gates, $protocol-automation, $script-kit-devtools, or the relevant domain skill when those surfaces own the behavior.

When to Use

After implementing any UI, protocol, or behavior change
For user-reported interaction, UX, or UI bugs where the app should be opened and exercised like a user would
For routine UI and behavior work, this is the default smoke test
When Oracle's autonomous verification says "Run the agentic-testing skill"
Before marking a task as complete
Especially after changes to: prompts, views, keyboard handlers, ACP chat, actions dialog

Human-First, Receipts-Backed Default

Most verification runs should complete in seconds, not minutes, but speed must not erase the user's visible workflow. Default to the smallest proof tier that answers the question without stealing focus or blocking the user.

No-runtime proof: docs, skills, source audits, or focused tests only. Do not launch the app if runtime evidence is unnecessary.
Visible user-path proof: for user-reported UX/UI/interaction bugs, open or reveal the real app, enter through the same product path, perform the user action visibly, and capture at least one screenshot or equivalent visual receipt when the bug is about what the user sees.
State-first runtime proof: reuse a warm session and prove behavior with getElements, getState, waitFor, batch, and exact automation targets. This backs the visible workflow and is the default for routing, selection, focus, popup ownership, and protocol bugs.
Visual diagnostics proof: capture screenshots when layout, styling, visibility, animation, or real-shell composition is part of the acceptance criteria, and tie those pixels back to semantic receipts.
Native input and focus enforcement: use when protocol-level and GPUI-level paths cannot exercise the real user bug or when the report is specifically about real keyboard, pointer, AppKit focus, or OS delivery.

Rules:

If the proof does not drive the actual app surface through a user-recognizable interaction path, it has not tested a UX/UI bug. It has only inspected implementation state.
If the user report is about visible UI or interaction and your plan never opens/reveals the real app, stop and redesign the proof around the user's path.
If your plan starts with cold start -> show -> screenshot -> log scrape, add state/elements receipts so the screenshot is grounded in the correct surface.
Reuse an existing healthy session before starting a new one.
Avoid stealing OS focus unless the bug specifically involves native focus or the proof requires real keyboard or mouse delivery.
Prefer exact target threading and targeted receipts over reading generic global state.
If a non-visual proof is taking longer than about 10 seconds, redesign the proof before continuing.

Safety Rules (MANDATORY)

NEVER delete files, directories, or data
NEVER modify databases, user data, or production state
NEVER run destructive commands (rm -rf, DROP, git push --force, git reset --hard)
NEVER send requests to external services, APIs, or webhooks
NEVER modify files outside the project directory
NEVER commit, push, or modify git history
ALL verification is read-only: build, launch, screenshot, grep, read logs
Temp pipes/logs may live under /tmp via the session wrapper. Runtime captureWindow screenshots must go in the project's .test-screenshots/ or test-screenshots/ directory, or in ~/.scriptkit/screenshots
The app runs locally only — never connect to production
Every verification run MUST stop every Script Kit process/session it started before reporting results

AFK-Safe Proof Gate

Before every runtime proof, assert the allowed and blocked capabilities explicitly. Default to --local-fixture-only, --dry-run-only, --no-network, --no-external-services, --no-system-pasteboard, --no-native-picker, --no-quick-look, --no-system-settings, --no-tcc-mutation, --no-security-prompts, --no-native-input, and --no-native-pointer unless the proof documents a protocol-only exception.

Receipts for AFK-safe runs should include an afkSafe: true equivalent plus blockedCapabilities and usedCapabilities lists. If the app or harness cannot prove that an unsafe capability stayed blocked, fail closed and file the missing receipt.
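The fail-closed check on an AFK-safe receipt could look roughly like the TypeScript sketch below. Only afkSafe, blockedCapabilities, and usedCapabilities come from this skill; the AfkReceipt type, the capability strings (derived here from the default --no-* flags), and the checkAfkReceipt helper are hypothetical illustrations, not the harness's actual API.

```typescript
// Hypothetical receipt shape; only the three field names come from the skill text.
interface AfkReceipt {
  afkSafe: boolean;
  blockedCapabilities: string[];
  usedCapabilities: string[];
}

// Assumed capability names, derived from the default --no-* flags.
const DEFAULT_BLOCKED: string[] = [
  "network",
  "external-services",
  "system-pasteboard",
  "native-picker",
  "quick-look",
  "system-settings",
  "tcc-mutation",
  "security-prompts",
  "native-input",
  "native-pointer",
];

// Fail closed: return a list of problems; an empty list means the receipt passes.
function checkAfkReceipt(
  receipt: Partial<AfkReceipt> | undefined,
  requiredBlocked: string[] = DEFAULT_BLOCKED,
): string[] {
  if (!receipt) return ["missing receipt"];
  const problems: string[] = [];
  if (receipt.afkSafe !== true) problems.push("afkSafe is not true");
  const blocked = new Set(receipt.blockedCapabilities ?? []);
  const used = new Set(receipt.usedCapabilities ?? []);
  for (const cap of requiredBlocked) {
    // A capability that is not listed as blocked counts as unproven, not as safe.
    if (!blocked.has(cap)) problems.push(`no proof that ${cap} stayed blocked`);
    if (used.has(cap)) problems.push(`unsafe capability used: ${cap}`);
  }
  return problems;
}
```

A missing or partial receipt yields problems rather than a pass, which mirrors the "fail closed and file the missing receipt" rule.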

User Bug Triage

Turn a user-filed UI/UX bug into a proof plan before choosing or adding a recipe:

Surface and real entry path.
Visible user workflow, including the app actions a person would take.
Observable invariant the user would notice.
Minimal fixture and relevant state generation.
Unsafe operations to simulate rather than perform.
Screenshot, visual receipt, or explicit reason visual capture is unnecessary.
Expected receipt fields and cleanup proof.
Pass/fail oracle, including what counts as reproduced, fixed, blocked, or inconclusive.

Do not answer a user-filed UX/UI bug with only "missing receipt", "blocked by instrumentation", or "state proof unavailable" unless you first attempted or explicitly designed the visible user interaction that would expose the bug. Missing instrumentation is a blocker for final proof, not permission to skip user-like exploration.
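One way to keep the eight triage items honest is to record them as a typed plan and refuse to proceed while any are empty. The sketch below is illustrative only: the ProofPlan field names and the missingTriageItems helper are hypothetical and simply mirror the list above.

```typescript
// Hypothetical field names mirroring the eight triage items.
interface ProofPlan {
  surfaceAndEntryPath: string;
  visibleWorkflow: string[]; // app actions a person would take, in order
  observableInvariant: string;
  fixture: string;
  simulatedUnsafeOperations: string[]; // may be empty when nothing unsafe is involved
  visualEvidence: { screenshot: true } | { skippedBecause: string };
  expectedReceiptFields: string[];
  oracle: { reproduced: string; fixed: string; blocked: string; inconclusive: string };
}

// Return the names of triage items that are still unfilled.
function missingTriageItems(plan: Partial<ProofPlan>): string[] {
  const missing: string[] = [];
  if (!plan.surfaceAndEntryPath) missing.push("surfaceAndEntryPath");
  if (!plan.visibleWorkflow || plan.visibleWorkflow.length === 0) missing.push("visibleWorkflow");
  if (!plan.observableInvariant) missing.push("observableInvariant");
  if (!plan.fixture) missing.push("fixture");
  if (!plan.simulatedUnsafeOperations) missing.push("simulatedUnsafeOperations");
  if (!plan.visualEvidence) missing.push("visualEvidence");
  if (!plan.expectedReceiptFields || plan.expectedReceiptFields.length === 0) {
    missing.push("expectedReceiptFields");
  }
  if (!plan.oracle) missing.push("oracle");
  return missing;
}
```

A plan that lists receipts but has an empty visibleWorkflow is flagged, which matches the rule that missing instrumentation does not excuse skipping the user-like interaction.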

Stop Rule

After two failed proof attempts caused by missing receipts, stale targets, unavailable handles, capture mismatch, or unsafe capability refusal, stop broadening the test. Classify the blocker as product bug, instrumentation gap, unsafe-to-prove, or test harness bug. Do not add sleeps, screenshots, focus hacks, native input, or larger recipes as a workaround.
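The two-failure stop rule can be sketched as a small guard. The failure causes and blocker classes below are taken from the paragraph above, but the type names, the Attempt shape, and the shouldStop helper are hypothetical.

```typescript
// Failure causes that count toward the stop rule (from the text above).
type FailureCause =
  | "missing-receipt"
  | "stale-target"
  | "unavailable-handle"
  | "capture-mismatch"
  | "unsafe-capability-refusal";

// Classifications to choose from once the rule triggers.
type BlockerClass = "product-bug" | "instrumentation-gap" | "unsafe-to-prove" | "test-harness-bug";

// Hypothetical record of one proof attempt.
interface Attempt {
  ok: boolean;
  cause?: FailureCause;
}

// True once two attempts have failed for one of the listed causes.
function shouldStop(attempts: Attempt[]): boolean {
  const failures = attempts.filter((a) => !a.ok && a.cause !== undefined);
  return failures.length >= 2;
}

// Example: after the second qualifying failure, classify instead of retrying.
const history: Attempt[] = [
  { ok: false, cause: "stale-target" },
  { ok: false, cause: "capture-mismatch" },
];
const blocker: BlockerClass | undefined = shouldStop(history) ? "instrumentation-gap" : undefined;
```

The classification itself remains a human judgment; the guard only enforces that it happens before a third attempt.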

Forbidden-State Simulation

Permission denied, security prompt pending, setup required, network unavailable, pasteboard blocked, native picker unavailable, and destructive confirmation states must be represented by fixture/protocol state only. Receipts must prove no real OS prompt, settings write, install flow, API call, pasteboard mutation, or destructive command occurred.

Bug Result Schema

For user-filed bug smoke tests, report a result that is more specific than generic PASS/FAIL: reproduced, not-reproduced, fixed, blocked-by-missing-receipt, blocked-by-unsafe-operation, or inconclusive. Include exact receipts/screenshots used, cleanup status, and remaining user-visible risk.
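As a minimal sketch, the result could be modeled as a closed set of statuses plus the evidence fields named above. The six status strings come from this section; the BugResult field names and the isBugStatus guard are assumptions made for illustration.

```typescript
// The six outcomes named in the schema.
type BugStatus =
  | "reproduced"
  | "not-reproduced"
  | "fixed"
  | "blocked-by-missing-receipt"
  | "blocked-by-unsafe-operation"
  | "inconclusive";

const BUG_STATUSES: readonly BugStatus[] = [
  "reproduced",
  "not-reproduced",
  "fixed",
  "blocked-by-missing-receipt",
  "blocked-by-unsafe-operation",
  "inconclusive",
];

// Hypothetical report shape; field names are illustrative.
interface BugResult {
  status: BugStatus;
  receipts: string[]; // exact receipts used
  screenshots: string[]; // exact screenshots used
  cleanupStatus: string;
  remainingRisk: string; // remaining user-visible risk
}

// Reject generic PASS/FAIL strings at the boundary.
function isBugStatus(value: string): value is BugStatus {
  return (BUG_STATUSES as readonly string[]).indexOf(value) !== -1;
}
```

With this guard, a harness that emits "PASS" or "FAIL" is rejected instead of being silently mapped to one of the specific outcomes.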

Surface Identity Rules (MANDATORY)

Always verify the real user-facing surface through its real runtime entry path first.
For Script Kit UI, prefer stdin JSON commands, built-in routing, and real app windows over ad hoc component harnesses.
Never treat an isolated GPUI entity, temporary debug window, story, off-screen render, or synthetic wrapper as proof of a real product surface unless the user explicitly asks for component-level verification.
Before trusting a screenshot, confirm the captured surface matches the intended product surface:
- same entry path
- same window/shell
- same wrapper/root chrome
- same footer, sizing, and layout structure
If the screenshot does not clearly match the real surface, stop and re-route verification to the real surface instead of iterating on the fake one.
For ACP specifically, AcpChatView in isolation is not sufficient proof. Default to the real ACP entry path (triggerBuiltin tab-ai, detached chat window routing, or another production runtime path) before using any synthetic ACP harness.
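The four screenshot checks above (entry path, window/shell, wrapper/root chrome, and footer/sizing/layout structure) can be expressed as a field-by-field comparison. This is a sketch under assumptions: SurfaceDescriptor, its string fields, and surfaceMismatches are hypothetical, and a real harness would derive these values from state and element receipts.

```typescript
// Hypothetical descriptor for a product surface; each field is an opaque identity string.
interface SurfaceDescriptor {
  entryPath: string;
  windowShell: string;
  rootChrome: string;
  layoutStructure: string; // footer, sizing, and layout fingerprint
}

// Return the fields on which the captured surface differs from the intended one.
function surfaceMismatches(
  expected: SurfaceDescriptor,
  captured: SurfaceDescriptor,
): Array<keyof SurfaceDescriptor> {
  const keys: Array<keyof SurfaceDescriptor> = [
    "entryPath",
    "windowShell",
    "rootChrome",
    "layoutStructure",
  ];
  return keys.filter((k) => expected[k] !== captured[k]);
}
```

Any non-empty result means the screenshot should not be trusted and verification should be re-routed to the real surface.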

Visual Diagnostics

Visual proof must connect what the user sees to structured layout and semantic receipts. Use visible text, layout measurement, and screenshot-to-semantics diagnostics before trusting a screenshot as UX proof.

Use screenshot-semantics-visual-consistency-stress for pass-now visual consistency. It checks strict capture identity, non-blank content audit, getState, getElements, selected row, focus receipt, footer actions, popup crop bounds, and semantic visible text labels.
visibleTextMode:"semanticElements" means the harness found visible text from automation element labels. It is not OCR and not clipping proof.
For visible text, require text bounds, rendered text bounds, measured width, available width, glyph/container bounds, overlap pairs, and truncation metadata. Clipped or ellipsized text is acceptable only when the receipt proves intentional truncation plus tooltip or accessible full text.
For layout measurement, require rem/font/scale metrics, window/content/container/scroll/input/footer bounds, footer/input ownership, and before/after layout-shift fingerprints for filtering or resizing.
For screenshot-to-semantics checks, require the screenshot crop target to match the exact automation window and semantic surface, then cross-check selected row, focus ring, footer actions, and visible text against getElements receipts.
Do not treat pixels alone as proof. A PNG can show that something rendered, but it does not prove the selected row, focus target, visible text, or footer actions are the correct semantic objects unless the receipt ties pixels back to the same target window.
Do not claim text fits from a screenshot alone. Use visible-text-clipping-overlap-stress for clipping and overlap audits; it opens the real main window and combines getElements, getLayoutInfo, and AppKit text measurement to report text bounds, measured width, available width, clipping state, truncation intent, tooltip or accessible full text, overlap pairs, and cleanup.
Do not claim rem/layout correctness from window bounds alone. Use layout-measurement-regression-stress; it opens the real main window, records getLayoutInfo component bounds plus bounded getElements semantics, then drives setFilter and reset to prove filter-churn layout fingerprints stay stable. Treat non-main surface warnings as remaining coverage gaps, not proof.
Do not claim dynamic choice sizing works from source constants alone. Use main-menu-dynamic-choice-resize-stress; it opens real small and large choice prompts in one app session, compares visibleChoiceCount against the fixture counts, measures getWindowBounds before/after, requires height growth with stable width, and cleans up through Escape.
Do not claim Notes auto-resize works from source constants alone. Use notes-window-resize-stress; it opens the real Notes window against a sandbox notes DB, drives targeted Notes batch.setInput through the editor path, measures Notes window bounds before tall content, after tall content, and after short content, and requires height growth, height shrink, stable width, and cleanup.
Do not claim a tall div or container is scroll-safe from a screenshot alone. Use div-container-scroll-overflow-stress; it opens a real DivPrompt, requires a DivContent layout component instead of launcher ScriptList/PreviewPanel components, estimates fixture content height against the viewport, proves scrolling is required, checks cleanup, and warns until div scroll position is exposed as a first-class receipt.
Do not claim contrast/readability from screenshots alone. Use visual-contrast-readable-state-stress; it opens the real main window for visible semantics and collects AGENTIC_THEME_CONTRAST_RECEIPT foreground/background token samples with contrast ratios, minimum ratios, pass flags, theme fingerprints, and cleanup.
For broad exploratory UX coverage, use bun scripts/agentic/user-story-audit.ts --limit 100 --max-ms 60000. It converts existing stress recipes into user-shaped stories, skips stories already exercised in the current thread by default, writes .test-output/agentic-100-user-story-audit-*.json, and separates pass, fail_closed, blocked_precondition, runtime_failure, and timeout. Treat fail-closed results as missing proof/backlog, never as a UI pass.
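The visible-text rule in this list (clipped or ellipsized text is acceptable only with proven intentional truncation plus a tooltip or accessible full text) reduces to a small decision function. The sketch below assumes a simplified receipt; TextReceipt, its field names, and judgeText are hypothetical and are not the shape reported by visible-text-clipping-overlap-stress.

```typescript
// Hypothetical, simplified text receipt.
interface TextReceipt {
  measuredWidth: number; // width the full text needs
  availableWidth: number; // width the container offers
  intentionalTruncation: boolean;
  hasTooltipOrAccessibleFullText: boolean;
}

type TextVerdict = "fits" | "acceptable-truncation" | "clipped";

// Apply the rule: overflow is acceptable only when truncation is intentional
// and the full text is still reachable.
function judgeText(r: TextReceipt): TextVerdict {
  if (r.measuredWidth <= r.availableWidth) return "fits";
  if (r.intentionalTruncation && r.hasTooltipOrAccessibleFullText) {
    return "acceptable-truncation";
  }
  return "clipped";
}
```

Note that the verdict depends on measured widths from a receipt, never on how the screenshot looks.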

Hard Interaction Boundaries

When a user flow spans stacked modals, cross-surface export, or app restart recovery, require one receipt that proves ownership boundaries before sending input.

For a modal stack, prove the topmost owner before each Escape, Cmd-W, or Enter action, then prove the child closed or executed without mutating the parent selection/focus unless that parent was the target.
For cross-surface export provenance, prove the payload origin surface, generation, selected semantic id, redacted preview, destination identity, stale-source rejection, and cleanup. Clipboard or drag side effects alone are not proof.
Restart/recovery recipes must gate every promoted target with a session epoch. If the epoch changes, the harness must refuse native, batch, and GPUI input before delivery, then re-resolve the exact target.
For stale-target recovery, prove stale window targets are rejected, exact targets are re-resolved after restart or id churn, no stale input is delivered, and session cleanup ran. Never retry by kind without an identity receipt.
For menu syntax ambiguity, prove tolerant diagnostics, skipped malformed fragments, selected command identity, and no accidental execution before submitting any command.
For IME composition, prove composition start/update/commit boundaries, no premature submit/actions, and final committed text semantics. Plain key events are not enough.
For selected-text fallback, prove permission denial/staleness, redaction, fallback source, and safe action disablement. Never trust stale frontmost-app context or raw selected text logs.
For display migration visual bounds, prove source/target display identity, scale/rem metrics, focus/selection preservation, visible text bounds, screenshot-to-semantics alignment, wrong-display capture rejection, stale migration rejection, and no popup/main clobbering.
For native picker or external app return, prove origin surface identity, handoff request id, picker/external window identity, restored focus/selection/cursor, stale or foreign window event rejection, and no submit or selection mutation during handoff.
For drag cancellation, prove drag session identity, scoped payload fingerprint, redacted preview, hover/drop target cleanup, origin focus/selection restoration, no clipboard/file/attachment/prompt side effects, stale drag rejection, and foreign drop rejection.
For runtime appearance churn while focused input is active, prove surface/window identity, focus, text, visible text, cursor/selection, rem/font/scale/layout metrics, theme and renderer token generations, stale token repaint rejection, and wrong-surface mutation rejection.
For power resume recovery, prove pre-sleep target identity, post-wake target generation, stale target refusal before native/batch/GPUI/screenshot delivery, exact re-resolution, fresh state/elements/screenshot receipts, focus/selection preservation, and cleanup.
For menu/tray/notification interruption, prove active modal/prompt identity, interruption identity, wrong-surface action rejection, topmost modal preservation, no focus steal, no selection/input/cursor mutation, no prompt submit, and focus restoration.
For streaming progress cancellation, prove stream run identity, monotonic progress samples, visible progress text, cancellation request/ack ordering, stale post-cancel chunk rejection, no stale repaint, screenshot-to-state revalidation, focus/cursor restoration, no accidental submit, and cleanup.
For dictation/media permission readiness churn, prove passive microphone/model setup, permission and model readiness generations, churn event ordering, target identity, transcript generation identity, wrong-target delivery rejection, no auto-submit, no System Settings or TCC mutation, focus/cursor preservation, and cleanup.
For animation frame capture determinism, prove app animation frame ids, capture sequence ids, state/elements/screenshot receipts per sampled frame, visible text/layout fingerprints, motion occlusion pairs, stable frame ordering, stale-frame rejection, wrong-window rejection, blank-frame rejection, and cleanup.
For accessibility tree semantic parity, prove visible controls, automation elements, and AX nodes share roles, labels, focus order, tab order, disabled states, safe keyboard activation semantics, hit targets, screenshot-to-semantics alignment, stale-tree rejection, wrong-window rejection, and cleanup.
For RTL/bidirectional/emoji text rendering, prove direction runs, bidi levels, grapheme clusters, emoji ZWJ and combining mark sequences, cursor visual positions, selection rectangles, visible/rendered text bounds, truncation intent, search/filter semantics, stale layout rejection, wrong-surface mutation rejection, no accidental submit, and cleanup.
For high-volume virtualized list stability, prove fixture identity, row identity, virtualization generations, selected-row reanchor, selection reanchor, scroll anchor preservation, rapid filter ordering, stale result rejection, duplicate-key rejection, blank-row rejection, footer-safe selection, screenshot-to-semantics consistency, and cleanup.
For input-device modality transitions, prove hover/focus/selection affordances, pointer hover, keyboard focus, selection, trackpad/wheel scroll anchors, shortcut ownership, activation ownership, stale modality event rejection, wrong-surface input rejection, no accidental submit, screenshot-to-state revalidation, and cleanup.
For multi-context attachment dedupe/provenance, prove file, screenshot, selected-text, MCP resource, script resource, and clipboard snippet origins across ACP Composer and Notes, with attachment provenance, destination generations, dedupe keys, provenance fingerprints, redacted preview, remove/reorder receipts, stale provenance rejection, duplicate-id rejection, no privacy leaks, and cleanup.
For visual contrast readable state checks, prove active, inactive, disabled, focused, error, and loading states across themes, scale factors, and surfaces with theme token fingerprints, rem/scale metrics, text/color/bounds receipts, contrast ratios, non-color state cue coverage, screenshot-to-state revalidation, stale theme token rejection, wrong-surface rejection, and cleanup.
For empty/error/retry state UX, prove empty, loading, error, retry, and recovered states with visible text, semantic retry identity, footer-safe actions, stable selection, and no stale error after recovery.
For form validation and inline error recovery, prove invalid-submit prevention, focus on the first invalid field, preserved user input, inline error identity, error clearing on valid edits, final submit recovery, accidental-submit prevention, and no cross-field error leakage.
For navigation/back-stack history, prove transition generations, route stack depth, actions discoverability, disabled/no-op affordances, Escape/back/Cmd-K close behavior, and that return-to-origin restores selection, filter, scroll, footer, and focus without stale surface state.
For long text wrapping/resizing UX stress, prove fixture identity, width mode, resize generation, full text, visible text, text/rendered/element bounds, available width, measured width, wrap line count, truncation intent, tooltip or accessible full text, overlap pairs, footer/input collision, focus and selection preservation, stale resize rejection, wrong-surface rejection, and cleanup.
For actions/command discoverability UX, use actions-command-discoverability-noop-stress to drive real Cmd-K action popups and prove visible action row ids, labels, sections, shortcuts, destructive/enabled flags, context identity, safe Escape cleanup, and no accidental execution. Treat disabled/no-op state warnings as coverage gaps until fixtures expose those row kinds.
For Notes window resizing UX, use notes-window-resize-stress to prove a real Notes window opened through openNotes, targeted Notes editor input changed through batch.setInput, a sandbox notes DB isolated user data, tall content grew the window, short content shrank it, width stayed stable, Notes stayed visible, and the launched session was stopped.
For dense list/detail preview readability, prove selected row identity, preview source identity, preview title/body bounds, metadata chip readability, footer action readability, filter generations, selection generations, resize generations, stale-preview rejection, row reanchor, focus preservation, no column/footer overlap, and cleanup.
For transient toast/notification feedback, prove queue generation, bridge generation, visible text, duplicate collapse, autohide/dismiss ordering, bounds/overlap, footer/input non-blocking, stale rejection, and no action execution from toast UI.
For destructive confirmation, prove dry-run-only fixture identity, confirm prompt identity, focused button, Enter/Escape resolution, no mutation before confirm, no mutation after cancel, no real system command request, stale/wrong-surface rejection, and parent focus/selection/filter restoration.
For loading skeleton/progress restoration, prove request/result generations, skeleton rows, progress text/percent monotonicity, activation blocking while loading, stale loading/progress/result rejection, skeleton cleanup after results, and selection/focus/filter/scroll restoration.
For icon/image fallback redaction, prove requested image source kind, redacted source fingerprint, fallback icon kind, fallback reason, image load generation, no raw path/URL/content leakage, stale image rejection, accessible label preservation, and cleanup.
For footer/status persistence, prove owner, native footer surface id, rendered buttons, shortcut labels, status generation, persistence across filter/selection/actions transitions, duplicate-footer rejection, stale-status rejection, wrong-surface rejection, and cleanup.
For keyboard hint label parity, prove footer, row accessory, tooltip, action catalog, normalized shortcut tokens, platform glyphs, disabled-state parity, activation owner, no accidental execution, stale-hint rejection, wrong-surface rejection, and cleanup.
For row state parity without native pointer input, prove selected, focused, hovered, and selected-hovered row states through semantic row ids, state/elements receipts, modality receipts, tokenized fill/focus/text/icon states, stale-row rejection, wrong-surface rejection, no accidental execution, and cleanup.
For quiet chrome/card nesting regressions, prove shell/content/row/popup/footer chrome layers, border/fill/shadow tokens, card depth, inset/gap/radius, duplicate-border rejection, opaque-fill rejection, stale-token rejection, wrong-surface rejection, and cleanup.
For scroll shadows, sticky headers, and density drift, prove scroll position, viewport/content bounds, sticky header bounds/z-index, scroll shadow opacity tokens, row/header/input/footer heights, rem/scale metrics, footer-safe viewport, selected-row visibility, stale-scroll rejection, wrong-surface rejection, and cleanup.
For popup focus/keycap visual semantics, prove popup owner identity, focused button/keycap parity, normalized shortcut glyphs, danger semantics on labels rather than keycaps, parent focus/selection preservation, stale focus rejection, wrong-surface rejection, no accidental execution, AFK-safe flags, and cleanup.
For reduced-motion animation disable behavior, prove fixture-only reduced-motion policy, animation/transition generations, stable opacity/transform/frame receipts, disabled shimmer/spinner/pulse motion, focus/selection/cursor preservation, stale motion rejection, wrong-surface rejection, no System Settings or TCC mutation, AFK-safe flags, and cleanup.
For command search highlighting/accessory badges, prove query/search generations, highlighted ranges, command row identity, accessory badge order/kinds/tooltips, disabled/no-op/loading reasons, action-catalog parity, stale highlight/badge rejection, wrong-host rejection, no accidental execution, AFK-safe flags, and cleanup.
For clipboard copy visual feedback, prove fixture-scoped pasteboard isolation, visible copied state, copied-state duration, copy toast receipt, redacted preview, payload fingerprint, unchanged system pasteboard fingerprints, stale copy rejection, wrong-host rejection, no accidental paste, AFK-safe flags, and cleanup.
For portal cancel/back return restoration, prove origin generation, draft/cursor/selection/filter/scroll before the portal, portal session identity, Escape/back cancel receipts, return target identity, restored focus/draft/cursor/selection/filter/scroll, no context insertion, no prompt submit, stale/foreign/wrong-origin rejection, AFK-safe flags, and cleanup.
For tooltip hover/focus affordances, prove protocol-hover and keyboard-focus triggers, target identity, tooltip generation, text/kind/anchor/bounds/placement, hover delay, accessible description parity, Escape/scroll/focus-loss dismissal, no focus steal, target focus preservation, no target/footer/popup-owner coverage, stale/wrong-surface rejection, AFK-safe flags, and cleanup.
For shortcut recorder cancel/layering UX, prove compact modal layering, parent/recorder bounds, placeholder and cancel affordance, Escape/Cmd-W/backdrop/parent-click cancellation, no chord capture, unchanged config fingerprints, no global hotkey registration, parent focus/selection restoration, stale/wrong-parent rejection, AFK-safe flags, and cleanup.
For inline popover anchor/resize UX, prove family/origin/popup identity, trigger range, anchor and popup bounds before/after resize, selected row visibility and identity, synopsis/footer bounds, no parent clipping or viewport overflow, z-order above parent, no focus steal, keyboard fallback, strict capture target, blank screenshot rejection, stale resize rejection, AFK-safe flags, and cleanup.
For disabled footer hit-target refusal, prove active footer/native footer identity, disabled primary action label/reason/visual/accessibility state, Enter/shortcut/protocol-click refusal, Cmd-K action availability, no submit receipt, unchanged side-effect counts and state fingerprints, focus/selection/filter preservation, stale/wrong-surface rejection, AFK-safe flags, and cleanup.
For mini/full transition layout continuity, prove Mini/Full/hide-show/return-to-origin transitions preserve rem/scale, window/content/input/list/footer bounds, native footer identity, focus ring bounds, selected row visibility above footer, no input/footer overlap, no clipping, no popup/main clobbering, screenshot-to-semantics alignment, strict capture target, blank screenshot rejection, stale mode rejection, AFK-safe flags, and cleanup.
For filter input decoration chip layout, prove rendered input text, stripped search text, chip ranges/roles/bounds, cursor and placeholder bounds, measured/available width, visible text, decoration/input generations, stale decoration clearing, no chip text/cursor/placeholder/footer overlap, no horizontal clipping, accessible full text, screenshot-to-semantics alignment, strict capture target, stale generation rejection, AFK-safe flags, and cleanup.
For focus ring viewport integrity, prove focused semantic id and owner, focus ring/focused element/viewport/content/footer/popup bounds, ring visibility, no clipping, ring within viewport and above footer, no footer/popup occlusion, stable tab order, preserved selection/scroll anchor, focus restoration after Escape, no activation/submit, stale focus rejection, AFK-safe flags, and cleanup.
For warning banner action/dismiss semantics, prove banner identity, visible text/bounds/text bounds, action and dismiss semantic ids, hover/focus state, action-vs-dismiss click receipts, non-color state cue, contrast ratio, no footer/input obstruction, stale banner rejection, AFK-safe flags, and cleanup.
For SelectPrompt keyboard multi-selection state parity, prove choice count, focused/selected/checked/visible row ids, selection count and footer labels, filter and selection generations, Cmd-A/space/range-toggle receipts, filter preservation, checked rows matching state/elements, no submit/activation, stale selection rejection, AFK-safe flags, and cleanup.
For File Search safe preview sanitization, prove selected row/file identity, preview generation/source/render kind/title/visible text/bounds/text bounds, byte limit/truncation, binary/missing/unsupported fallbacks, private path redaction, no raw path leak, no network/external service, no Quick Look/native picker/pasteboard mutation, stale preview rejection, and cleanup.
For HotkeyPrompt transient capture/cancel UX, prove prompt type and surface identity, capture panel/input semantics, placeholder text, captured chord tokens and HotkeyInfo, simulateKey capture, Escape/Cmd-W cancellation, null submit on cancel, unchanged config fingerprints, no global hotkey registration, no shortcut recorder route, parent focus restoration, stale/wrong-surface rejection, AFK-safe flags, and cleanup.
For Process Manager sort/header/detail panel stability, prove fixture identity, table header semantic ids, sort key/direction/generation, non-selectable section headers, selected process/pid identity, detail panel generation/source/title/metrics, CPU/memory/PID parity, row reanchor after sort, header aria-sort labels, disabled kill action, no process signal, stale sort/detail rejection, AFK-safe flags, and cleanup.
For EnvPrompt redacted status/error recovery, prove prompt/fixture identity, status generation/kind/text, inline error and first-invalid-field semantics, masked value visibility, secret redaction and fingerprinting, no raw secret leak, no secret/config writes, valid edit clearing errors, disabled submit reason, focus preservation, stale/wrong-field rejection, AFK-safe flags, and cleanup.
For command palette breadcrumb route-stack UX, prove host/topmost owner identity, route stack depth, breadcrumb trail labels and semantic ids, active route id, parent/child snapshots, drill-down push receipt, breadcrumb and Escape back receipts, preserved search text, restored selection/scroll anchor, no pre-drill onSelect, no accidental execution, stale route rejection, AFK-safe flags, and cleanup.
For root source-chip action semantics, prove rendered input text, stripped search text, source filter set, source chip semantic ids and roles, remove/clear-all/toggle-exclude/action-menu receipts, decoration generation, preflight indicators, non-selectable status chips, grouped rows suppress disallowed sources, blocked history recall, selection preservation, no status-as-action subject, stale chip rejection, AFK-safe flags, and cleanup.
For recent/history dedupe root grouping stability, prove fixture snapshot and source catalog generations, query/root/passive frame keys, visible result roles, group order, contiguous Files section, stable Search Files continuation, dedupe keys and rejected collisions, metadata-only history rows, no transcript/body leaks, stable selection key, row fingerprints across cycles, fallback suppression, stale passive publish rejection, AFK-safe flags, and cleanup.
For inline attachment preview chip stability, prove fixture attachment set identity, host surface/composer generation, chip semantic ids/kinds/labels/bounds, redacted preview fingerprints, overflow/focus/remove/reorder receipts, cursor and selection preservation, no raw path/content leak, no native picker/screen capture/pasteboard/network, stale attachment rejection, AFK-safe flags, and cleanup.
For window title/status semantics, prove resolved target identity, automation/native/semantic titles, visible status text, title/status generations, transition receipts, detached window parity, attached popup parent title preservation, error recovery, stale title/status rejection, no focus steal, AFK-safe flags, and cleanup.
For menu syntax capture validation chips, prove fixture catalog identity, filter input text, main hint snapshot, capture validation status, status chip labels, missing/malformed/unresolved field labels, fragment preview rows, priority choices row, Enter prevention when invalid, no payload write, no handler spawn, stale validation rejection, AFK-safe flags, and cleanup.
For ACP footer activity indicators, prove fixture agent event stream identity, host/footer owner, native footer surface id, GPUI/native footer dot status, activity status transitions, context-capture/tool-call/plan-update/permission-wait/cancelled/idle states, footer repaint generation, stable pulse tokens, preserved model label, no global AI footer button, no agent process spawn, no security prompt, stale/wrong-host rejection, AFK-safe flags, and cleanup.
For ACP model/history popover visual state, prove popup catalog identity, family/automation id/kind, anchor and popup bounds, selected/focused row ids, row visual-state tokens, current-model and history-recency badges, redacted history previews, empty/loading/error-recovered states, synopsis bounds, filter selection preservation, stale/wrong-popup rejection, AFK-safe flags, and cleanup.
For ACP context insertion preview parity, prove source/destination identities, portal session id, selection and preview generations, selected row id/title/kind/fingerprint, accepted context URI, inserted token alias and preview fingerprint, composer generation and replacement range, row preview matching inserted context, selection preservation, stale/selection-drift/wrong-destination rejection, no raw content leak, no picker/Quick Look/pasteboard/network/submit, AFK-safe flags, and cleanup.
For ACP slash/mention provider visibility, prove provider hint catalog identity, popup family, trigger/query text, readiness generation, provider visibility rows, hint text, unavailable/loading/error-recovered/filtered-empty states, hidden-until-resource-available behavior, provider-specific rows for slash and mention, selected/focused row ids, disabled-provider refusal, stale generation rejection, no native picker/Quick Look/network/submit, AFK-safe flags, and cleanup.
For ACP composer token keyboard edit parity, prove host surface identity, composer generation, token semantic ids/kinds/aliases/bounds, cursor-before/after-token positions, atomic Backspace/Delete removal, range removal, move-left/move-right receipts, token order before/after, pending context preservation, slash skill context preservation, pasted metadata preservation, cursor selection preservation, no partial token text leak, stale/duplicate/wrong-host rejection, no system pasteboard/native input/submit, AFK-safe flags, and cleanup.
For ACP transcript stream retry virtualization, prove fixture transcript/thread generations, virtualized message window, visible/message row ids, stream run/chunk sequence, monotonic chunk append, scroll anchors, bottom stickiness, user-scrolled-away preservation, assistant error identity/text, retry button identity, retry draft/request/recovery generations, no stale error after recovery, stale chunk and wrong-message retry rejection, stable virtualized row identity, no blank rows, no transcript body leaks, no agent process spawn/security prompt/network/submit, AFK-safe flags, and cleanup.
For ACP plugin skill entry thread affinity, prove fixture skill catalog identity, entry path, host surface identity, resolved ACP target, target thread id, embedded/detached thread reuse, selected skill id and file fingerprint, slash token text/range, pending skill context part URI, context binding to the target thread, composer generation, return origin snapshot, stale launcher and detached thread rejection, no auto-submit/agent process/security prompt/network, AFK-safe flags, and cleanup.
For Notes cart ACP handoff dedupe, prove sandbox notes store identity, fixture note ids, active note id, cart snapshot generation, cart item ids/dedupe keys, rejected duplicates, handoff session id, destination host identity/generation, staged context URIs, inline aliases, redacted preview fingerprints, dry-run consume generation, cancel restoration, switch-note cleanup, wrong-note and stale-cart rejection, no raw note body leak, no user notes mutation/network/agent process, AFK-safe flags, and cleanup.
For root Files source-filter pagination footer UX, prove fixture file provider identity, source filters, rendered and stripped input text, root frame key, provider/page generations, page size, visible file row ids/fingerprints, Search Files continuation row, selected stable key before/after, selected row visible above the footer, main list scroll metrics, near-bottom page request, page append and delayed provider publish not replacing the frame, duplicate key rejection, fallback suppression, non-selectable status chips, Quick Look/native picker/pasteboard/network/submit refusal, stale page rejection, AFK-safe flags, and cleanup.
For File Search directory breadcrumb restoration, prove fixture directory tree identity, redacted breadcrumb segments, only-in-filter chip identity, rendered and stripped search text, visible file rows, directory rows before/after navigation, selected file before/after, breadcrumb click reanchor, filter preservation, back/forward stack depth, scroll anchor restoration, preview generation, no raw path leak, native picker and Quick Look refusal, stale directory rejection, wrong-origin rejection, AFK-safe flags, and cleanup.
For Emoji Picker skin-tone/category UX, prove fixture emoji catalog identity, category tabs, selected category, sticky header bounds, skin-tone palette identity/bounds, variant ids, selected skin-tone token, row ids, ZWJ sequence ids, grapheme fingerprints, search generation, highlighted ranges, accessible label parity, preview glyph bounds, palette dismissal, category-switch selection preservation, no pasteboard mutation or emoji insertion, stale palette rejection, wrong-category rejection, AFK-safe flags, and cleanup.
For root Windows source-filter activation refusal, prove fixture window provider identity, source filters, rendered and stripped input text, root frame key, window snapshot and z-order generations, visible window row ids/fingerprints, selected stable key before/after, selected row visibility, actions subject stable key, dry-run activation receipt, Enter activation refusal, no native window activation or focus steal, duplicate key rejection, stale snapshot rejection, non-selectable status chips, AFK-safe flags, and cleanup.
For Notes markdown preview scroll sync, prove sandbox notes store identity, fixture notes, active note before/after, markdown fixture ids, editor and preview generations, rendered markdown block ids, preview fingerprints, cursor and selection ranges, editor and preview scroll anchors, sync delta, split pane bounds, preview-toggle and switch-note cleanup, focus restoration, no user notes mutation or raw note body leak, stale preview and wrong-note rejection, no pasteboard/network/external service/native input, AFK-safe flags, and cleanup.
For Quick Terminal ANSI scrollback search readability, prove fixture terminal transcript identity, terminal surface id, transcript generation, ANSI/SGR runs, wide-cell graphemes, combining marks, hyperlink spans with redacted fingerprints, stderr blocks, prompt continuation rows, viewport rows/range, search generation, search hit ids, highlighted cell ranges, selected hit visibility, wrap markers, cursor/prompt/footer bounds, stale transcript rejection, no shell command spawn, no raw hyperlink leak, no pasteboard/network/external service/native input, AFK-safe flags, and cleanup.
For script output inspector folding recovery, prove fixture script run id, output fixtures, stream generation, stdout/stderr blocks, ANSI stack frames, JSON lines, progress rewrite generation, exit badge kind/bounds, filter text and highlight ranges, stderr fold before/after, stack expanded state, clear-filter restoration, dry-run retry receipt, no handler spawn or process kill, selection scroll anchor restoration, no interleave drift, stale output and wrong-run rejection, no pasteboard/network/external service/native input, AFK-safe flags, and cleanup.
For App Launcher icon-grid keyboard navigation, prove fixture app catalog identity, grid generation, visible app ids, icon bounds/fingerprints, selected app before/after, selected cell bounds/visibility, keyboard neighbor map, row/column count, filter generation, rendered/stripped query, empty state, preview panel, truncated-name tooltip, no text overlap or footer collision, Enter launch refusal, stale catalog and wrong-app rejection, no app launch/pasteboard/network/native input, AFK-safe flags, and cleanup.
For Browser History time-grouped privacy, prove fixture history provider identity, history generation, time bucket ids, sticky header bounds, visible visit ids/fingerprints, favicon fallback ids, redacted URL fingerprints, rendered/stripped query, selection before/after, duplicate collapse, no raw private URL leak, no favicon network request, browser activation refusal, stale history and wrong-visit rejection, no pasteboard/network/native input, AFK-safe flags, and cleanup.
For Settings preferences search/reset preview, prove sandbox config identity, preference section ids, visible preference ids, control bounds and accessible names, values before/preview, dirty preference ids, rendered/stripped query, search highlights, reset preview/cancel restoration, disabled-control refusal, no config write or secret leak, stale preference and wrong-preference rejection, no pasteboard/network/native input, AFK-safe flags, and cleanup.
For Settings read-only detail panel navigation, prove fixture catalog id, settings surface id, selected section before/after filtering, detail panel generation, visible row labels and bounds, text/detail/footer bounds, empty-state copy, disabled Apply/Save reason, unchanged config fingerprint, no setup/security prompt, stale detail and wrong-section rejection, no System Settings/TCC/config write/native input, AFK-safe flags, and cleanup.
For Design Picker preview restore visuals, prove fixture design catalog id, active design before preview, preview id/generation, theme token fingerprints before/preview/restore, visible picker rows/labels/bounds/text bounds, selected preview row visibility, screenshot-to-semantics target identity when capture is enabled, Escape/Cmd-W restore preview-only state, unchanged persisted design/config fingerprint, stale preview and wrong-surface rejection, no design/config write/native input, AFK-safe flags, and cleanup.
For Dictation History transcript preview redaction, prove fixture dictation store id, transcript row ids, transcript/query generations, selected transcript before/after filters, preview generation/source/render kind, visible preview text bounds, redacted transcript fingerprint, missing-audio fallback copy, emoji/grapheme bounds, footer/input non-overlap, no raw transcript/audio path leak, no microphone/media permission request, stale transcript and wrong-row rejection, no System Settings/TCC/native input, AFK-safe flags, and cleanup.

The Pattern

Every verification follows the same core loop:

1. Build Only What the Change Can Break

cargo build 2>&1 | tail -5

Only rebuild when the touched files can invalidate the binary or helper you need to exercise.

Docs, skills, notes, or source-audit-only changes: skip build.
Bun or shell harness changes with no Rust protocol changes: reuse the current debug binary if it already exists and a healthy session can start.
Rust or runtime changes: run cargo build.

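If you do build, the output must end with a Finished line. Because the command pipes through tail, the pipeline's exit status is tail's rather than cargo's, so check the text and not $?. If the build fails, fix the build error first.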
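The build rules above can be sketched as a small classifier over the changed paths. This is only an illustration: the function name build_decision and the extension mapping (.rs and Cargo files mean build, .ts and .sh mean reuse, .md and .txt mean skip, anything unrecognized conservatively means build) are assumptions, not something the repo defines.

```shell
# Hypothetical helper: decide how much rebuilding a set of changed paths needs.
# Prints one of: build, reuse, skip. The strongest requirement wins.
build_decision() {
  local decision="skip" path
  for path in "$@"; do
    case "$path" in
      *.rs|*Cargo.toml|*Cargo.lock)
        decision="build" ;;                 # Rust or runtime change
      *.ts|*.sh)
        # Harness change: reuse the debug binary unless a build is already required
        if [ "$decision" = "skip" ]; then decision="reuse"; fi ;;
      *.md|*.txt)
        ;;                                  # docs, skills, notes: nothing to rebuild
      *)
        decision="build" ;;                 # unknown file type: be conservative
    esac
  done
  printf '%s\n' "$decision"
}
```

One way to feed it is build_decision $(git diff --name-only), which breaks on paths containing spaces. Treat a reuse result as valid only when a debug binary already exists and a healthy session can start.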

2. Reuse or Start a Session

# First look for a healthy reusable session
bun scripts/agentic/session-state.ts --list
bash scripts/agentic/session.sh status default

# Start or resume a named session — works from any shell
# session.sh waits for the APP_READY log marker instead of sleeping
SESSION_JSON="$(bash scripts/agentic/session.sh start default 2>/dev/null)"
APP_PID="$(printf '%s' "$SESSION_JSON" | jq -r '.pid')"
PIPE="$(printf '%s' "$SESSION_JSON" | jq -r '.pipe')"
LOG="$(printf '%s' "$SESSION_JSON" | jq -r '.log')"
READY="$(printf '%s' "$SESSION_JSON" | jq -r '.ready // false')"
READY_WAIT_MS="$(printf '%s' "$SESSION_JSON" | jq -r '.readyWaitMs // 0')"

# Fallback only if readiness marker was not observed.
if [ "$READY" != "true" ]; then
  sleep 0.5
fi

The session wrapper manages the named pipe, forwarder process, and PID tracking. Sessions are reusable across shells — no exec 3> / fd 3 trick required. session.sh start means the app is stdin-ready, not necessarily capture-ready. Prefer resume over cold start. A warm session plus state-only receipts should be the default path.

Session commands:

bash scripts/agentic/session.sh start [NAME]    # Create or resume (default: "default")
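bash scripts/agentic/session.sh send NAME CMD    # Send JSON command (fire-and-forget)
bash scripts/agentic/session.sh rpc NAME CMD --expect TYPE  # Send a request and wait for a typed response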
bash scripts/agentic/session.sh status [NAME]    # Check session state (JSON)
bash scripts/agentic/session.sh stop [NAME]      # Stop and clean up
bun scripts/agentic/session-state.ts --session NAME  # Detailed state report
bun scripts/agentic/session-state.ts --list          # List all sessions

All commands emit stable JSON envelopes on stdout (schemaVersion, status, payload). Diagnostics go to stderr.

start is idempotent — re-running it resumes an existing healthy session.

Alternative (legacy, single-shell only):

PIPE=$(mktemp -u)
mkfifo "$PIPE"
export SCRIPT_KIT_AI_LOG=1
./target/debug/script-kit-gpui < "$PIPE" > /tmp/sk-test.log 2>&1 &
APP_PID=$!
exec 3>"$PIPE"
sleep 3

3. Show the Window Only When Needed

# Session-based (any shell)
bash scripts/agentic/session.sh send default '{"type":"show"}'
sleep 0.3  # macOS focus-settling delay

The app starts hidden. State-only proofs should usually skip this step entirely.

Show the window only for screenshots, native input, or other proofs that require the real visible surface.

4. Interact

Send commands via the session. Common commands:

# Relies on bash word splitting of $S; under zsh, write ${=S} '...' instead
S="bash scripts/agentic/session.sh send default"

# Set filter text
$S '{"type":"setFilter","text":"search term"}'

# Read current state without touching focus
bash scripts/agentic/session.sh rpc default '{"type":"getState","requestId":"s1"}' --expect stateResult

# Discover visible elements (returns semantic IDs)
bash scripts/agentic/session.sh rpc default '{"type":"getElements","requestId":"e1"}' --expect elementsResult

# Discover an attached popup or detached surface directly by target
bash scripts/agentic/session.sh rpc default '{"type":"getElements","requestId":"e2","target":{"type":"kind","kind":"actionsDialog","index":0}}' --expect elementsResult

# Select element by semantic ID (from getElements response)
bash scripts/agentic/session.sh rpc default '{"type":"batch","requestId":"b1","commands":[{"type":"selectBySemanticId","semanticId":"choice:0:apple","submit":true}]}' --expect batchResult

# When supported, mutate popup state directly instead of typing through native focus
bash scripts/agentic/session.sh rpc default '{"type":"batch","requestId":"b2","target":{"type":"kind","kind":"actionsDialog","index":0},"commands":[{"type":"setInput","text":"alias"}]}' --expect batchResult

# Trigger a built-in view
$S '{"type":"triggerBuiltin","name":"clipboard"}'
$S '{"type":"triggerBuiltin","name":"tab-ai"}'
$S '{"type":"triggerBuiltin","name":"emoji"}'
$S '{"type":"triggerBuiltin","name":"apps"}'
$S '{"type":"triggerBuiltin","name":"file-search"}'

# Simulate keys (dispatches to current view; not suitable for interceptor bugs)
$S '{"type":"simulateKey","key":"enter","modifiers":[]}'
$S '{"type":"simulateKey","key":"escape","modifiers":[]}'
$S '{"type":"simulateKey","key":"k","modifiers":["cmd"]}'
$S '{"type":"simulateKey","key":"w","modifiers":["cmd"]}'

# Prefer GPUI event dispatch over simulateKey when you need the real key pipeline
bash scripts/agentic/session.sh rpc default '{"type":"simulateGpuiEvent","requestId":"g1","target":{"type":"main"},"event":{"type":"keyDown","key":"down","modifiers":[]}}' --expect simulateGpuiEventResult

# Type individual characters (for views with text input)
$S '{"type":"simulateKey","key":"h","modifiers":[]}'

# Query ACP state (returns input, cursor, picker, accepted item, thread status)
bash scripts/agentic/session.sh rpc default '{"type":"getAcpState","requestId":"acp1"}' --expect acpStateResult

5. Capture Screenshots

mkdir -p .test-screenshots
bash scripts/agentic/session.sh send default '{"type":"captureWindow","title":"","path":"'"$(pwd)"'/.test-screenshots/step-01.png"}'
sleep 1

title is a substring match; "" matches any window.
For embedded ACP in the main Script Kit window, use title: "" or the resolver-driven verify-shot.ts / window.ts flow. Do not assume the title contains ACP Chat.
Path must be absolute — use $(pwd)/ prefix.
Runtime captureWindow does not allow arbitrary /tmp/*.png output paths.
Always sleep 1 after capture for file write.
The screenshot must come from the real runtime surface you are verifying, not a synthetic component window.
Read the PNG to visually verify. Never assume correctness without checking.
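The path rules above can be enforced before the command is sent. The sketch below is an assumption-laden illustration: capture_command is a hypothetical name, the /tmp refusal is a conservative client-side guard rather than the runtime's own check, and the JSON is assembled with printf, so it only suits directories and labels without quotes or backslashes.

```shell
# Hypothetical helper: print a captureWindow command for a step label.
# Usage: capture_command LABEL [DIR]   (DIR defaults to $(pwd)/.test-screenshots)
capture_command() {
  local label="$1" dir="${2:-$(pwd)/.test-screenshots}"
  case "$dir" in
    /tmp|/tmp/*)
      echo "refusing /tmp output path: $dir" >&2; return 1 ;;
    /*)
      ;;                                   # absolute path: accepted
    *)
      echo "path must be absolute: $dir" >&2; return 1 ;;
  esac
  # Empty title matches any window, as described above
  printf '{"type":"captureWindow","title":"","path":"%s/%s.png"}\n' "$dir" "$label"
}
```

The printed line can be passed to session.sh send, for example bash scripts/agentic/session.sh send default "$(capture_command step-01)", followed by the 1s sleep for the file write.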

6. Read Logs

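# Session-based: $LOG comes from the session.sh start output
grep -i "keyword" "$LOG" | head -20

# Legacy single-shell flow logs to /tmp/sk-test.log
grep -i "keyword" /tmp/sk-test.log | head -20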

Log format: TIMESTAMP|LEVEL|CATEGORY|cid=CORRELATION_ID message
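Because the format is pipe-delimited, awk can pull out the correlation ID and message more precisely than grep. The following is a minimal sketch, assuming every matching line has the four fields shown above; log_select is a hypothetical name, and the level and category values you pass in must match whatever the app actually logs.

```shell
# Hypothetical helper: read log lines on stdin and print "cid<TAB>message"
# for lines whose LEVEL and CATEGORY equal the given values.
log_select() {
  local level="$1" category="$2"
  awk -F'|' -v want_level="$level" -v want_cat="$category" '
    NF >= 4 && $2 == want_level && $3 == want_cat {
      # Rejoin fields 4..NF in case the message itself contains "|"
      rest = $4
      for (i = 5; i <= NF; i++) rest = rest "|" $i
      cid = ""
      if (rest ~ /^cid=/) {
        sp = index(rest, " ")
        if (sp == 0) { cid = substr(rest, 5); rest = "" }
        else { cid = substr(rest, 5, sp - 5); rest = substr(rest, sp + 1) }
      }
      printf "%s\t%s\n", cid, rest
    }'
}
```

Usage looks like log_select LEVEL CATEGORY < "$LOG" | head -20, which keeps the evidence for a report down to the lines that share one category.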

7. Cleanup

# Session-based (preferred)
bash scripts/agentic/session.sh stop default

# Verify the session is actually gone before reporting success
bash scripts/agentic/session.sh status default

# Legacy fd 3 cleanup (single-shell only)
# exec 3>&-
# rm -f "$PIPE"
# kill $APP_PID 2>/dev/null || true
# wait $APP_PID 2>/dev/null || true

Cleanup is mandatory, even after failures or interrupted runs.

Do not report PASS or FAIL until the session you started has been stopped.
If you launched Script Kit via session.sh, run session.sh stop NAME and verify the session is no longer alive.
If you launched Script Kit directly, kill that specific PID and wait for it.
Do not leave orphan script-kit-gpui processes behind from agentic testing.
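For the directly launched case, the kill-then-verify rule can be sketched as below. stop_pid is a hypothetical helper, not part of scripts/agentic, and the retry count and 0.1s poll interval are arbitrary assumptions. Session-based runs should still go through session.sh stop and session.sh status.

```shell
# Hypothetical helper: terminate one specific PID and confirm it is gone.
# Returns 0 once the process no longer exists, 1 if it survives the retries.
stop_pid() {
  local pid="$1" tries="${2:-20}"
  kill "$pid" 2>/dev/null || true
  # wait only reaps children of this shell; ignore the error for other PIDs
  wait "$pid" 2>/dev/null || true
  while kill -0 "$pid" 2>/dev/null; do
    tries=$((tries - 1))
    if [ "$tries" -le 0 ]; then
      echo "process $pid is still alive" >&2
      return 1
    fi
    sleep 0.1
  done
  return 0
}
```

Calling stop_pid "$APP_PID" before writing the report means a non-zero status can block the PASS or FAIL verdict until the orphan is dealt with.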

8. Report

PASS: build succeeded (when a build was required) + expected state receipts and screenshots match + expected log output + cleanup confirmed
FAIL: describe what went wrong with evidence (screenshot, log line), then still clean up the launched process/session

Timing Guidelines

| Action | Wait strategy |
| --- | --- |
| App startup | `session.sh start` readiness wait; fallback 0.5s only if `ready=false` |
| Warm session reuse | Prefer 0-1s `status` / resume over creating a fresh process |
| State-only proof | Aim for 3-10s total; no screenshot or OS focus |
| `show` window | 0.3s macOS focus-settling delay |
| `setFilter` | 1s sleep or `waitFor` stateMatch |
| `triggerBuiltin` (opens new view) | `waitFor` appropriate condition |
| `simulateKey` (view transition) | 1.5s sleep |
| `simulateKey` (text input) | 0.1s sleep |
| `captureWindow` | 1s sleep (file write) |
| ACP context bootstrap | `waitFor(acpReady, timeout=8000)` |
| ACP picker open | `waitFor(acpPickerOpen, timeout=3000)` |
| ACP picker accept | `waitFor(acpItemAccepted, timeout=3000)` |
| ACP response streaming | 10-20s or `waitFor(acpStatus)` |

Rule: Use waitFor for all ACP state transitions. Only use fixed sleeps for macOS focus-settling (0.3s) and file I/O (1s screenshot write).

Rule: Do not add a fixed sleep 3 after session.sh start. The session wrapper is responsible for readiness. Only use the 0.5s fallback when ready=false.

Rule: If a non-visual proof is trending beyond the 3-10s state-only budget, stop and redesign around getElements, getState, waitFor, batch, exact targets, or session reuse before escalating to show, native input, or screenshots.
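One way to notice a proof drifting past its budget is to time it. The wrapper below is a rough sketch under stated assumptions: run_with_budget is a hypothetical name, timing uses whole seconds from date +%s, and exit status 3 for an over-budget run is an arbitrary choice that could collide with a command's own status 3.

```shell
# Hypothetical helper: run a command and flag it when it exceeds a time budget.
# Usage: run_with_budget SECONDS COMMAND [ARGS...]
# Returns the command's own status on failure, 3 if it passed but ran too long.
run_with_budget() {
  local budget="$1" start status=0 elapsed
  shift
  start="$(date +%s)"
  "$@" || status=$?
  elapsed=$(( $(date +%s) - start ))
  if [ "$status" -ne 0 ]; then
    return "$status"
  fi
  if [ "$elapsed" -gt "$budget" ]; then
    echo "proof took ${elapsed}s against a ${budget}s budget" >&2
    return 3
  fi
  return 0
}
```

For a state-only check this might wrap a single rpc call with a budget of 10, so a slow run becomes a visible signal to redesign the proof rather than a silent cost.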

Session Management

Use scripts/agentic/session.sh instead of hand-rolling mkfifo + exec 3> in ad hoc shells.

Why: The exec 3>"$PIPE" pattern ties the pipe to a single shell process. When a coding agent spawns a new shell (e.g., follow-up verification step), fd 3 does not exist and the session is lost. The session wrapper uses a background forwarder process so any shell can send commands via session.sh send.

Rules:

Always use session.sh start instead of manual mkfifo + exec 3> for new verification flows
Use session.sh send for fire-and-forget stdin commands like show, triggerBuiltin, setFilter, and captureWindow
Use session.sh rpc for protocol requests that expect a typed response like getAcpState, getElements, waitFor, batch, and inspectAutomationWindow
Check session health with session.sh status or session-state.ts before sending commands
Stop sessions with session.sh stop when done — do not leave orphan processes
Treat cleanup as part of the test itself: a run is incomplete until the session is stopped and verified dead
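The send-versus-rpc split in these rules can be written down as a lookup. This sketch covers only the command types named in this document, and transport_for is a hypothetical helper; any type not listed is reported as unknown instead of guessed.

```shell
# Hypothetical helper: print which session.sh subcommand suits a command type.
transport_for() {
  case "$1" in
    show|triggerBuiltin|setFilter|captureWindow|simulateKey)
      echo send ;;   # fire-and-forget stdin commands
    getState|getElements|getAcpState|waitFor|batch|inspectAutomationWindow|simulateGpuiEvent)
      echo rpc ;;    # requests that expect a typed response
    *)
      echo "unknown command type: $1" >&2
      return 1 ;;
  esac
}
```

A wrapper script could call transport_for on the "type" field before dispatching, so a request that needs a receipt is never sent fire-and-forget by mistake.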

Field Notes

These are practical lessons from real ACP verification runs in this repo.

If session.sh start reports a dead session even though the log reached STARTUP_READY, inspect the log before assuming the app crashed. In some debug runs the wrapper/forwarder dies while script-kit-gpui is still healthy. When that happens, switch to the legacy single-shell FIFO fallback so you can keep stdin open yourself.
window.ts and macos-input.ts --ensure-focus may fail against the debug binary because the process name is script-kit-gpui, not the bundled Script Kit app identity. If focus/capture helpers cannot find the app, use System Events targeting process "script-kit-gpui" directly.
For debug-only window capture, direct region screenshots via screencapture -R<x,y,w,h> can be more reliable than the bundle-oriented window resolver. Use runtime automation bounds or System Events window position/size to compute the region.
ACP pasted-text verification needs two deletes when the cursor is immediately after a newly inserted token: the first backspace removes the trailing space, the second removes the token atomically. Query getAcpState after each step so you do not misread a correct first delete as a failure.
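For the region-screenshot fallback in these notes, the -R argument can be assembled from window bounds. A minimal sketch, assuming the bounds are already in screen coordinates and rounding fractional values to whole pixels; region_flag is a hypothetical name.

```shell
# Hypothetical helper: turn window bounds into a screencapture -R argument.
# Usage: region_flag X Y WIDTH HEIGHT
region_flag() {
  local v
  if [ "$#" -ne 4 ]; then
    echo "usage: region_flag X Y WIDTH HEIGHT" >&2
    return 2
  fi
  for v in "$@"; do
    # Allow an optional leading minus (displays left of or above the origin)
    case "${v#-}" in
      ''|*[!0-9.]*) echo "not a number: $v" >&2; return 2 ;;
    esac
  done
  # %.0f rounds fractional bounds; LC_ALL=C keeps "." as the decimal separator
  LC_ALL=C printf '%s%.0f,%.0f,%.0f,%.0f\n' '-R' "$1" "$2" "$3" "$4"
}
```

The result plugs straight into the capture, for example screencapture $(region_flag 120 80 750 480) "$(pwd)/.test-screenshots/region.png", where the four numbers stand in for bounds read from runtime automation or System Events.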

Screenshot Assertion (verify-shot.ts)

Use verify-shot.ts for automated screenshot + state verification. It enforces the correct ACP verification order: state receipt first, screenshot second.

# Basic: capture screenshot with ACP state assertions
bun scripts/agentic/verify-shot.ts --session default \
  --label step-name \
  --acp-status idle \
  --acp-picker-closed \
  --acp-context-ready

# Assert picker is open after typing @
bun scripts/agentic/verify-shot.ts --session default \
  --label picker-open \
  --acp-picker-open

# Assert item was accepted after Enter/Tab
bun scripts/agentic/verify-shot.ts --session default \
  --label item-accepted \
  --acp-picker-closed \
  --acp-item-accepted

# State-only (skip screenshot)
bun scripts/agentic/verify-shot.ts --session default \
  --label quick-check \
  --skip-screenshot \
  --acp-input-contains "@context"

# Screenshot-only (skip state query)
bun scripts/agentic/verify-shot.ts --session default \
  --label visual-check \
  --skip-state

Available assertions:

| Flag | Checks |
| --- | --- |
| `--acp-status STATUS` | ACP status equals value (idle, streaming, etc.) |
| `--acp-picker-open` | Picker overlay is visible |
| `--acp-picker-closed` | Picker overlay is closed |
| `--acp-input-contains STR` | Input text contains substring |
| `--acp-input-match STR` | Input text matches exactly |
| `--acp-cursor-at N` | Cursor at character index N |
| `--acp-item-accepted` | A picker item was accepted (lastAcceptedItem non-null) |
| `--acp-accepted-label STR` | lastAcceptedItem.label equals STR |
| `--acp-accepted-trigger STR` | lastAcceptedItem.trigger equals STR (@ or /) |
| `--acp-accepted-via KEY` | Probe confirms acceptance via enter or tab |
| `--acp-cursor-after-accepted N` | Probe confirms cursor landed at index N after acceptance |
| `--acp-context-ready` | Context bootstrap complete |
| `--acp-no-selection` | No text selection active (hasSelection is false) |
| `--acp-has-selection` | Text selection is active (hasSelection is true) |
| `--acp-no-permission` | No pending permission (hasPendingPermission is false) |
| `--acp-has-permission` | Pending permission present (hasPendingPermission is true) |
| `--acp-visible-start N` | inputLayout.visibleStart equals N (first visible char index) |
| `--acp-visible-end N` | inputLayout.visibleEnd equals N (last visible char index) |
| `--acp-cursor-in-window N` | inputLayout.cursorInWindow equals N (cursor position in viewport) |

Proof bundle fields: The receipt includes stable top-level fields for machine consumption: state (ACP snapshot), probe (test probe snapshot), screenshot (path + capture metadata), captureTarget (requested vs actual window ID for identity proof), visionCrops (structured image check entries). These are the canonical fields for automated parsing.

Capture identity threading: Detached ACP screenshots use the inspected native osWindowId, not the automation window ID. When --target-json is present, verify-shot.ts auto-lifts inspection.osWindowId into the screenshot step. An explicit --capture-window-id is only an override and must match the inspected osWindowId. The receipt exposes captureTarget.requestedWindowId, captureTarget.actualWindowId, captureRouting, requestedAutomationWindowId, and inspectionOsWindowId.

Exit codes: 0 = pass, 1 = assertion failure, 2 = infrastructure error.
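A caller can branch on these exit codes instead of parsing output. The sketch below only names the three documented codes; verify_shot_outcome and its label strings are hypothetical.

```shell
# Hypothetical helper: map a verify-shot.ts exit code to a readable outcome.
verify_shot_outcome() {
  case "$1" in
    0) echo "pass" ;;
    1) echo "assertion-failure" ;;      # the app state contradicted an assertion
    2) echo "infrastructure-error" ;;   # session, capture, or tooling problem
    *) echo "unexpected-exit-$1" ;;
  esac
}

# Example wiring (commented out because it needs a live session):
# bun scripts/agentic/verify-shot.ts --session default --label quick-check \
#   --skip-screenshot --acp-status idle
# outcome="$(verify_shot_outcome "$?")"
```

Keeping the two failure kinds apart matters for the report: an assertion failure is evidence about the product, while an infrastructure error says the proof never really ran.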

Canonical input-stability proof

Use visible-text-window assertions to verify single-line input rendering and cursor tracking without a screenshot:

bun scripts/agentic/verify-shot.ts --session default \
  --label input-stability \
  --skip-screenshot \
  --acp-visible-start 12 \
  --acp-visible-end 52 \
  --acp-cursor-in-window 39

This proves the cursor is within the visible window and the viewport bounds are stable, which catches scroll jumps, layout shifts, and cursor-out-of-view regressions.

Strict capture: When ACP assertions are present, verify-shot.ts requires window.ts quartz capture with frontmost confirmation and the exact inspected native window ID. If focus drifts, the inspected osWindowId is missing, or the captured windowId differs from the requested ID, the run fails instead of silently falling back to a full-screen screenshot.
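The identity rule can also be re-checked from a saved receipt. This is a sketch, assuming the receipt is a single JSON object with captureTarget at the top level as described under proof bundle fields; capture_identity_ok is a hypothetical name and the check requires jq.

```shell
# Hypothetical helper: read a verify-shot receipt on stdin and succeed only
# when both window IDs are present and equal.
capture_identity_ok() {
  jq -e '
    .captureTarget as $t
    | ($t.requestedWindowId != null)
      and ($t.actualWindowId != null)
      and ($t.requestedWindowId == $t.actualWindowId)
  ' >/dev/null
}
```

With a receipt saved to a file (receipt.json is a placeholder name), capture_identity_ok < receipt.json exits non-zero for both a missing ID and a mismatch, which mirrors the fail-instead-of-fallback behavior described above.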

Rule: The recipe must fail when ACP state contradicts expected picker/caret outcome, even if the screenshot capture itself succeeds. State receipt is the primary proof; screenshot is secondary visual confirmation.

Recipe Orchestrator (index.ts) — Preferred ACP Verification

Always prefer the canonical CLI over ad hoc shell sequences. The orchestrator encodes the correct verification order, focus enforcement, probe resets, and checkpoint strategy so agents do not need to reconstruct these from scratch.

Default Surface Proof (Preferred)

Use surface-proof as the default seconds-first proof command for an already-open product surface. For main-hosted surfaces, enter through the real runtime command and keep the proof state-first.

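# Start the session
bash scripts/agentic/session.sh start default
# Enter the surface through the real runtime command
bash scripts/agentic/session.sh send default '{"type":"triggerBuiltin","name":"clipboard"}' --await-parse
# Run the state-first proof against the main-hosted surface
bun scripts/agentic/index.ts surface-proof --session default --kind main
# Stop the session, then check its status as cleanup proof
bash scripts/agentic/session.sh stop default
bash scripts/agentic/session.sh status default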

# Advanced exact-target proofs when a popup or detached surface already exists:
bun scripts/agentic/index.ts surface-proof --session default --kind promptPopup --index 0
bun scripts/agentic/index.ts surface-proof --session default --kind acpDetached --index 0

This path reuses a warm session, promotes the target through automation-window.ts inspect, samples getState and getElements, returns a machine-readable proof bundle, and does not call show, native input, or screenshot capture unless the proof explicitly needs that escalation.

Sample output shape:

{
  "schemaVersion": 1,
  "recipe": "surface-proof",
  "status": "pass",
  "summary": "State-first main proof succeeded for main",
  "proofBundle": {
    "schemaVersion": 2,
    "scenario": "main-window-exact-id",
    "surfaceClass": "main",
    "resolvedTarget": {
      "windowId": "main",
      "windowKind": "Main"
    },
    "targetIdentity": { "stable": true },
    "usage": {
      "stateFirst": true,
      "usedGetState": true,
      "usedGetElements": true,
      "usedScreenshot": false,
      "usedNativeInput": false,
      "usedShow": false,
      "usedFixedSleepMs": 0
    },
    "capabilities": {
      "state": true,
      "elements": true,
      "nativeInputRequired": false,
      "screenshotRequired": false
    },
    "state": { "type": "stateResult" },
    "elements": { "type": "elementsResult" },
    "warnings": []
  }
}
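
A minimal sketch of how a consumer might confirm that a proof stayed state-first, using the usage object from the sample above. The ProofUsage type mirrors the sample fields; escalationsUsed is a hypothetical helper.

```typescript
// Mirrors proofBundle.usage from the sample output shape.
interface ProofUsage {
  stateFirst: boolean;
  usedGetState: boolean;
  usedGetElements: boolean;
  usedScreenshot: boolean;
  usedNativeInput: boolean;
  usedShow: boolean;
  usedFixedSleepMs: number;
}

// Lists the escalations a proof used beyond state and element sampling.
function escalationsUsed(u: ProofUsage): string[] {
  const out: string[] = [];
  if (u.usedScreenshot) out.push("screenshot");
  if (u.usedNativeInput) out.push("nativeInput");
  if (u.usedShow) out.push("show");
  if (u.usedFixedSleepMs > 0) out.push("fixedSleep");
  return out;
}

const sampleUsage: ProofUsage = {
  stateFirst: true,
  usedGetState: true,
  usedGetElements: true,
  usedScreenshot: false,
  usedNativeInput: false,
  usedShow: false,
  usedFixedSleepMs: 0,
};
console.log(escalationsUsed(sampleUsage)); // []
```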

Canonical ACP proof commands

# Full ACP picker accept — choose key with --key enter|tab
bun scripts/agentic/index.ts acp-accept --session default --key enter
bun scripts/agentic/index.ts acp-accept --session default --key tab --vision

# Target a specific ACP window (detached/popup) — resolve exact identity first
RESOLVED="$(bun scripts/agentic/automation-window.ts resolve --session default --kind acpDetached --index 0)"
TARGET="$(printf '%s' "$RESOLVED" | jq -c '.targetJson')"
SURFACE_ID="$(printf '%s' "$RESOLVED" | jq -r '.surfaceId')"
bun scripts/agentic/index.ts acp-accept --session default --key enter \
  --target-json "$TARGET" --surface "$SURFACE_ID" --vision

Target threading (non-negotiable for multi-window ACP)

When verifying a detached or popup ACP window, resolve one target once and reuse it for every RPC and native input step in the entire run.

Canonical rule:

Discover the surface (e.g., bun scripts/agentic/window.ts list).
Pick one --target-json object (e.g., {"type":"kind","kind":"acpDetached","index":0}).
Pass that same target to every ACP RPC: getAcpState, getAcpTestProbe, resetAcpTestProbe, waitFor, and batch.
Pass the matching --surface value to native input so focus and proof stay on the same window.
Never mix focused-window ACP RPCs with surface-targeted native input in the same verification run. Mixing them produces a cross-window false proof: you drive one ACP surface and verify another.

The --target-json flag threads through index.ts → verify-shot.ts → every RPC command, and the --surface flag threads through index.ts → macos-input.ts → window.ts for focus enforcement.

When --target-json is omitted, RPCs default to the main ACP view (existing behavior).
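
One way to keep the threading honest in a wrapper script is to build the flag pair once and reuse it. The sketch below assumes the resolver output carries targetJson and surfaceId as used in the commands above. The ResolvedTarget type, threadedFlags helper, and the surfaceId value are hypothetical.

```typescript
// Subset of the resolver output used by the commands above.
interface ResolvedTarget {
  targetJson: unknown;
  surfaceId: string;
}

// Builds the flag pair once so every step reuses identical values.
function threadedFlags(resolved: ResolvedTarget): string[] {
  return [
    "--target-json",
    JSON.stringify(resolved.targetJson),
    "--surface",
    resolved.surfaceId,
  ];
}

const resolvedExample: ResolvedTarget = {
  targetJson: { type: "kind", kind: "acpDetached", index: 0 },
  surfaceId: "example-surface-id", // placeholder, not a real surface ID
};
const flags = threadedFlags(resolvedExample);
console.log(flags[1]); // {"type":"kind","kind":"acpDetached","index":0}
```

Computing the flags in one place makes it harder for a later step to fall back to the focused window by accident.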

What acp-accept guarantees:

Resets ACP test probe before native interaction (no stale accepted items)
Uses macos-input.ts --ensure-focus for native typing and acceptance
Uses state-only checks for ACP-ready and picker-open (no intermediate screenshots)
Waits for acpAcceptedViaKey (key-specific proof, not generic acpItemAccepted)
Keeps exactly one final screenshot as visual proof
Emits vision crops only when --vision is requested
When --vision is used, surfaces the full proof bundle (with state, probe, screenshot, visionCrops) as proofBundle in the recipe receipt

Other recipes

# Check all prerequisites
bun scripts/agentic/index.ts preflight --session default

# Open ACP and verify ready state (state-only, no screenshot)
bun scripts/agentic/index.ts acp-open --session default

# Compatibility aliases (same as --key enter / --key tab)
bun scripts/agentic/index.ts acp-enter-accept --session default
bun scripts/agentic/index.ts acp-tab-accept --session default

# Hard-scenario recipes
bun scripts/agentic/index.ts acp-detached-target-threading-stress \
  --session default --kind acpDetached --index 0 --min-targets 2 --key enter --vision --json
bun scripts/agentic/index.ts acp-prompt-popup-parity \
  --session default --families mention,model-selector,local-history --json
bun scripts/agentic/index.ts notes-acp-delayed-action-origin-stress \
  --session default --drift generation --json
bun scripts/agentic/index.ts file-portal-origin-roundtrip \
  --session default --origin acp --portal file-search --selection file --query AGENTS.md --json
bun scripts/agentic/index.ts permission-privacy-preflight \
  --session default --kinds accessibility,screen-recording,microphone --json
bun scripts/agentic/index.ts shortcut-recorder-focus-capture \
  --session default --surface shortcuts --action test-agentic-shortcut --chord cmd+shift+7 --sandbox-config --json
bun scripts/agentic/index.ts template-prompt-automation-parity-stress \
  --session default --template 'Hello {{name}}' --field name --value Ada --forced-value forced-template-result --json
bun scripts/agentic/index.ts current-app-commands-frontmost-stress \
  --session default --alias 'Do in Current Command' --query 'close tab' --json
bun scripts/agentic/index.ts actions-captured-subject-frame-stress \
  --session default --source root-file --action quick-look --mutation filter-selection-cache-frame --json
bun scripts/agentic/index.ts drop-prompt-native-drop-privacy-stress \
  --session default --file-name agentic-drop.txt --size 12 --json
bun scripts/agentic/index.ts path-prompt-filesystem-edge-stress \
  --session default --json
bun scripts/agentic/index.ts screenshot-identity-acp-context-stress \
  --session default --source tab-ai-screenshot --json
bun scripts/agentic/index.ts clipboard-history-portal-range-stress \
  --session default --portal-id 'kit://clipboard-history?id=agentic' --range composer:0..0 --json
bun scripts/agentic/index.ts browser-tabs-cache-identity-stress \
  --session default --source browser-tabs --json
bun scripts/agentic/index.ts scroll-selection-reanchor-stress \
  --session default --kinds clipboard,browser-history,current-app-commands,file-search --json
bun scripts/agentic/index.ts accessibility-tree-semantic-parity-stress \
  --session default --surfaces main,actionsDialog,promptPopup --json
bun scripts/agentic/index.ts rtl-bidi-emoji-text-rendering-stress \
  --session default --surface acp-composer --text 'abc שלום 👩🏽‍💻 é مرحبا 123' --json
bun scripts/agentic/index.ts high-volume-virtualized-list-stability-stress \
  --session default --surface clipboard-history --fixture-count 5000 --filter-cycles 8 --scroll-cycles 12 --json
bun scripts/agentic/index.ts input-modality-transition-ownership-stress \
  --session default --surface main --interleave pointer-hover,keyboard-nav,trackpad-scroll,wheel-scroll,shortcut --cycles 8 --json
bun scripts/agentic/index.ts multi-context-attachment-dedupe-provenance-stress \
  --session default --origins file,screenshot,selected-text,mcp-resource,clipboard-snippet --destinations acp-composer,notes --reorder-cycles 3 --json
bun scripts/agentic/index.ts visual-contrast-readable-state-stress \
  --session default --surfaces main,actionsDialog,promptPopup,acp-composer,notes --themes light,dark --scale-factors 1,1.25,1.5 --states active,inactive,disabled,focused,error,loading --json
bun scripts/agentic/index.ts empty-error-retry-state-ux-stress \
  --session default --surfaces main,clipboard-history,emoji-picker,file-search --query 'agentic-loop-eighteen-no-results-zzzz' --json
bun scripts/agentic/index.ts form-validation-inline-recovery-stress \
  --session default --surface fields-prompt --fields email,required-text,number --invalid email:not-an-email,required-text:,number:not-a-number --valid email:ada@example.com,required-text:Ada,number:42 --json
bun scripts/agentic/index.ts navigation-back-stack-history-stress \
  --session default --origin main --surfaces clipboard-history,emoji-picker,file-search,actionsDialog --transitions triggerBuiltin,cmd-k,escape,back --json

State-only vs screenshot checkpoints

Checkpoint	Screenshot?	Probe?	Why
ACP ready	No	No	`waitFor(acpReady)` is sufficient proof; screenshot is waste
Picker open	No	No	`waitFor(acpPickerOpen)` is sufficient proof
Final accepted	Yes	Yes	The only checkpoint that needs visual + probe evidence

Rule: Intermediate checkpoints use state-only verification (--skip-screenshot --skip-probe). Only the final acceptance step captures a screenshot and queries the probe.
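
The table and rule above reduce to a small lookup. In this sketch the Checkpoint names and the checkpointFlags helper are hypothetical; the two flags are the ones named in the rule.

```typescript
type Checkpoint = "acpReady" | "pickerOpen" | "finalAccepted";

// Flags to add to a verification step for each checkpoint: intermediate
// checkpoints skip screenshot and probe, the final one keeps both.
function checkpointFlags(c: Checkpoint): string[] {
  return c === "finalAccepted" ? [] : ["--skip-screenshot", "--skip-probe"];
}

console.log(checkpointFlags("pickerOpen").join(" ")); // --skip-screenshot --skip-probe
```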

Receipt shape

Each recipe returns a machine-readable JSON receipt:

{
  "schemaVersion": 1,
  "recipe": "acp-enter-accept",
  "status": "pass",
  "steps": [
    { "name": "acp-open", "status": "pass" },
    { "name": "reset-probe", "status": "pass" },
    { "name": "type-at-trigger", "status": "pass" },
    { "name": "wait-accepted-via-key", "status": "pass" },
    { "name": "verify-accepted", "status": "pass" }
  ]
}

When --vision is used, a proofBundle field is added containing the verify-shot receipt with state, probe, screenshot, and visionCrops for direct machine consumption.
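
A minimal sketch of consuming the receipt shape shown above. The types mirror the sample fields, and firstFailingStep is a hypothetical helper. Status values other than "pass" are not documented here, so the helper treats anything else as a failure.

```typescript
interface StepReceipt {
  name: string;
  status: string;
}

interface RecipeReceipt {
  schemaVersion: number;
  recipe: string;
  status: string;
  steps: StepReceipt[];
}

// Returns the name of the first step whose status is not "pass",
// or null when every step passed.
function firstFailingStep(r: RecipeReceipt): string | null {
  for (let i = 0; i < r.steps.length; i++) {
    if (r.steps[i].status !== "pass") return r.steps[i].name;
  }
  return null;
}

const sampleReceipt: RecipeReceipt = {
  schemaVersion: 1,
  recipe: "acp-enter-accept",
  status: "pass",
  steps: [
    { name: "acp-open", status: "pass" },
    { name: "reset-probe", status: "pass" },
    { name: "type-at-trigger", status: "pass" },
    { name: "wait-accepted-via-key", status: "pass" },
    { name: "verify-accepted", status: "pass" },
  ],
};
console.log(firstFailingStep(sampleReceipt)); // null
```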

The wrapper does not replace the lower-level commands — use session.sh, macos-input.ts, window.ts, and verify-shot.ts directly when you need finer control.

ACP Golden Path (Critical)

This is the mandatory verification flow for any ACP interaction testing. Prefer the canonical CLI (bun scripts/agentic/index.ts acp-accept) over reconstructing the manual steps below.

Canonical (one command, fully non-interactive)

bash scripts/agentic/session.sh start default
bun scripts/agentic/index.ts acp-accept --session default --key enter --vision
# The recipe returns a machine-readable JSON receipt with proofBundle.
# Parse proofBundle.state, proofBundle.probe, proofBundle.screenshot, proofBundle.visionCrops
# to verify ACP behavior programmatically, then read the written PNG for final visual confirmation.
bash scripts/agentic/session.sh stop default

Exact detached ACP proof (preferred)

The scenario recipe resolves one exact detached ACP target once, reuses the exact targetJson for every subsequent step, and emits a structured proof bundle recording windowId, surfaceId, and ordered step receipts.

bash scripts/agentic/session.sh start default
bun scripts/agentic/index.ts scenario \
  --session default \
  --scenario detached-acp-exact-id \
  --index 0
bash scripts/agentic/session.sh stop default

The proof bundle is the regression substrate — every step records the exact target used, the full request/response pair, and a timestamp. Exit code 0 means all steps succeeded; exit code 1 means some steps produced warnings.

Canonical with target threading (detached/popup ACP)

For finer-grained control (e.g., picker acceptance flows), resolve one exact ACP target once and reuse both the target and surfaceId for the full run. Do not use loose family-level --surface acp — use the exact surfaceId from the resolver.

bash scripts/agentic/session.sh start default

# Resolve exact target and surface identity once
RESOLVED="$(bun scripts/agentic/automation-window.ts resolve --session default --kind acpDetached --index 0)"
TARGET="$(printf '%s' "$RESOLVED" | jq -c '.targetJson')"
SURFACE_ID="$(printf '%s' "$RESOLVED" | jq -r '.surfaceId')"

# Inspect the resolved window once to get the native osWindowId for capture
INSPECTED="$(bun scripts/agentic/automation-window.ts inspect --session default --id "$(printf '%s' "$RESOLVED" | jq -r '.automationWindowId')")"
WINDOW_ID="$(printf '%s' "$INSPECTED" | jq -r '.osWindowId')"

# Accept the picker item once, against the exact target and surface
bun scripts/agentic/index.ts acp-accept --session default --key enter \
  --target-json "$TARGET" --surface "$SURFACE_ID" --vision
bun scripts/agentic/verify-shot.ts --session default --label detached-proof \
  --target-json "$TARGET" --capture-window-id "$WINDOW_ID"
# Confirm proofBundle.state.resolvedTarget.windowKind == "acpDetached"
# Confirm captureTarget.requestedWindowId == captureTarget.actualWindowId
bash scripts/agentic/session.sh stop default

The --vision flag makes the recipe return a proofBundle containing all machine-readable proof. The golden path is complete when the exit code is 0 and the proofBundle.state and proofBundle.probe fields confirm the expected ACP state. Screenshot files are still written for archival but are not the primary verification mechanism.

Manual (when you need finer control)

1. session start                               → session alive
2. show                                        → window visible
3. triggerBuiltin tab-ai                       → ACP opens
4. waitFor(acpReady, timeout=8000)             → context bootstrapped (deterministic)
5. focus window                                → frontmost confirmed
6. native type @ (macos-input.ts --ensure-focus) → open picker
7. waitFor(acpPickerOpen, timeout=3000)        → picker open (deterministic)
8. native Enter or Tab (macos-input.ts --ensure-focus) → accept picker item
9. waitFor(acpAcceptedViaKey, timeout=3000)    → key-specific acceptance (deterministic)
10. verify-shot.ts with --acp-accepted-via     → state + probe + screenshot proof
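
The three deterministic waits in the manual flow bound how long the run can block. The sketch below adds up the timeouts from steps 4, 7, and 9. The WaitStep type and worstCaseWaitMs helper are hypothetical, and the total ignores native input and focus-settling time.

```typescript
interface WaitStep {
  condition: string;
  timeoutMs: number;
}

// The three waitFor steps from the manual flow above.
const manualWaits: WaitStep[] = [
  { condition: "acpReady", timeoutMs: 8000 },
  { condition: "acpPickerOpen", timeoutMs: 3000 },
  { condition: "acpAcceptedViaKey", timeoutMs: 3000 },
];

// Worst-case time spent waiting if every wait runs to its timeout.
function worstCaseWaitMs(steps: WaitStep[]): number {
  let total = 0;
  for (const s of steps) total += s.timeoutMs;
  return total;
}

console.log(worstCaseWaitMs(manualWaits)); // 14000
```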

Key tools in the golden path:

Tool	Role
`session.sh`	Cross-shell session management, RPC, lifecycle
`macos-input.ts`	Native macOS keyboard/mouse with `--ensure-focus`
`window.ts`	Window discovery, focus, activation, quartz capture
`verify-shot.ts`	State + probe + screenshot bundle with strict capture
`automation-window.ts`	Exact ACP target-to-surface resolver
`scenario.ts`	Replayable proof-bundle scenarios for cross-window regression
`index.ts`	Orchestrator that composes all of the above correctly

waitFor replaces fixed sleeps. Each waitFor polls at 25ms intervals and returns a waitForResult receipt with success, elapsed, and an optional trace, which is included automatically on failure when the request sets trace: "onFailure".
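
The polling idea can be illustrated with a synchronous stand-in. This is not the app's waitFor RPC: pollUntil, Clock, and fakeClock are hypothetical, the trace here is just the list of poll times, and a fake clock is injected so the example runs instantly.

```typescript
interface Clock {
  now(): number;
  sleep(ms: number): void;
}

interface WaitForResult {
  success: boolean;
  elapsed: number;
  trace?: number[];
}

// Polls `condition` every `intervalMs` until it holds or `timeoutMs` passes.
// On failure the result carries a trace of the elapsed time at each poll.
function pollUntil(
  condition: () => boolean,
  timeoutMs: number,
  clock: Clock,
  intervalMs: number = 25
): WaitForResult {
  const start = clock.now();
  const trace: number[] = [];
  while (true) {
    const elapsed = clock.now() - start;
    trace.push(elapsed);
    if (condition()) return { success: true, elapsed: elapsed };
    if (elapsed >= timeoutMs) return { success: false, elapsed: elapsed, trace: trace };
    clock.sleep(intervalMs);
  }
}

// Fake clock: sleeping just advances a counter.
function fakeClock(): Clock {
  let t = 0;
  return {
    now: () => t,
    sleep: (ms: number) => {
      t += ms;
    },
  };
}

const demoClock = fakeClock();
const ready = pollUntil(() => demoClock.now() >= 60, 3000, demoClock);
console.log(ready.success, ready.elapsed); // true 75
```

The condition becomes true at 60ms, but the result reports 75ms because the check only runs on 25ms poll boundaries.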

State receipt before screenshot is non-negotiable. If the state says the picker is still open but the screenshot looks fine, the test must FAIL.

Any remaining sleeps in the recipes are brief macOS focus-settling delays (~300ms) with explicit comments. They are not proof of ACP state.

Verification Recipes

See references/recipes.md for named verification patterns.

Other hard-scenario recipes:

bun scripts/agentic/index.ts long-text-wrap-resize-surface-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog --widths mini,narrow,full --fixtures long-name,long-path,long-description,multiline-snippet --json
bun scripts/agentic/index.ts actions-command-discoverability-noop-stress --session default --hosts main,clipboard-history,emoji-picker,file-search,app-launcher --states actionable,disabled,no-op --json
bun scripts/agentic/index.ts dense-list-detail-preview-readability-stress --session default --surfaces file-search,sdk-reference,script-template-catalog --query agentic-loop-nineteen-preview --filter-cycles 4 --selection-cycles 8 --resize-cycles 3 --json
bun scripts/agentic/index.ts toast-notification-queue-lifecycle-stress --session default --surface main --fixtures success,duplicate,persistent,dismiss,autohide --cycles 3 --json
bun scripts/agentic/index.ts destructive-confirm-modal-safety-stress --session default --host main --fixture agentic-destructive-dry-run --paths cancel,confirm,stale-confirm --dry-run-only --json
bun scripts/agentic/index.ts loading-skeleton-progress-restoration-stress --session default --surfaces sdk-reference,script-template-catalog --fixture delayed-local --cycles 4 --json
bun scripts/agentic/index.ts icon-image-fallback-redaction-stress --session default --surfaces app-launcher,file-search,clipboard-history --fixtures missing-file,corrupt-png,private-local-path,data-uri-redacted --json
bun scripts/agentic/index.ts footer-status-persistence-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog --transitions filter,selection,cmd-k,escape,clear-filter --json
bun scripts/agentic/index.ts keyboard-hint-label-parity-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog,menuSyntaxTriggerPopup --families footer,row-accessory,tooltip,action-catalog --json
bun scripts/agentic/index.ts row-state-parity-without-pointer-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog --states selected,focused,hovered,selected-hovered --json
bun scripts/agentic/index.ts quiet-chrome-card-nesting-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog,promptPopup --chrome quiet --json
bun scripts/agentic/index.ts scroll-shadow-sticky-header-density-stress --session default --surfaces clipboard-history,emoji-picker,file-search,app-launcher,actionsDialog --scroll-positions top,middle,bottom --density compact,default --json
bun scripts/agentic/index.ts popup-focus-keycap-visual-semantics-stress --session default --surfaces actionsDialog,menuSyntaxTriggerPopup,confirmPrompt --json
bun scripts/agentic/index.ts reduced-motion-animation-disable-stress --session default --surfaces main,actionsDialog,menuSyntaxTriggerPopup --fixture reduced-motion --json
bun scripts/agentic/index.ts command-search-highlighting-accessory-badges-stress --session default --hosts main,actionsDialog,app-launcher,menuSyntaxTriggerPopup --query agentic-loop-twenty-three --json
bun scripts/agentic/index.ts clipboard-copy-visual-feedback-stress --session default --hosts file-search,actionsDialog,app-launcher --fixture agentic-copy-preview --pasteboard-scope fixture --no-system-pasteboard --json
bun scripts/agentic/index.ts portal-cancel-return-state-restoration-stress --session default --origins acp-composer,notes --portal file-search --query AGENTS.md --cancel-methods escape,back --fixture repo-file --no-native-picker --json
bun scripts/agentic/index.ts tooltip-hover-focus-affordance-stress --session default --surfaces main,actionsDialog,app-launcher --targets truncated-row,disabled-action,footer-button --fixture agentic-tooltips --input-modes protocol-hover,keyboard-focus --no-native-pointer --json
bun scripts/agentic/index.ts shortcut-recorder-cancel-layering-stress --session default --surface shortcuts --action test-agentic-shortcut --cancel-methods escape,cmd-w,backdrop,parent-click --input-modes protocol-key,protocol-click --sandbox-config --no-config-write --json
bun scripts/agentic/index.ts inline-popover-anchor-resize-stress --session default --families acp-slash,acp-mention,menu-syntax-colon --widths mini,narrow,full --fixture agentic-inline-popover --input-modes protocol-key,protocol-resize --no-native-input --json
bun scripts/agentic/index.ts disabled-footer-hit-target-refusal-stress --session default --surfaces drop-prompt,fields-prompt,path-prompt --fixtures empty-drop,invalid-fields,missing-path --input-modes enter,footer-shortcut,protocol-footer-click --no-native-pointer --dry-run-only --json
bun scripts/agentic/index.ts mini-full-transition-layout-continuity-stress --session default --surfaces main,mini-prompt,fields-prompt,actionsDialog --transitions mini-to-full,full-to-mini,hide-show,return-to-origin --fixture agentic-mini-full-layout --input-modes protocol-key,protocol-resize --no-native-input --no-native-pointer --no-system-pasteboard --local-fixture-only --json
bun scripts/agentic/index.ts filter-input-decoration-chip-layout-stress --session default --surfaces main --queries 'f: AGENTS.md,c: agentic,~/script,:actions,;note,!command,literal\\:chip' --widths mini,narrow,full --scale-factors 1,1.25,1.5 --fixture agentic-filter-input-decorations --input-modes protocol-set-filter,protocol-resize --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --local-fixture-only --json
bun scripts/agentic/index.ts focus-ring-viewport-integrity-stress --session default --surfaces main,actionsDialog,fields-prompt,path-prompt --fixture agentic-focus-rings --input-modes protocol-key,simulate-gpui-event --steps tab,shift-tab,up,down,escape --no-native-input --no-native-pointer --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts warning-banner-action-dismiss-semantics-stress --session default --surface main --fixtures warning,actionable,dismissible,error --input-modes protocol-hover,protocol-click,protocol-key --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --local-fixture-only --json
bun scripts/agentic/index.ts select-prompt-multiselect-keyboard-state-stress --session default --surface select-prompt --fixture agentic-multiselect --choices 24 --selection-steps space,cmd-a,filter-preserve,clear-filter,range-toggle,escape-restore --input-modes protocol-key,batch --no-native-input --no-native-pointer --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts file-search-preview-sanitization-stress --session default --surface file-search --fixture agentic-safe-preview --preview-fixtures text,binary,large-text,missing-file,private-path,unsupported-kind --selection-cycles 8 --filter-cycles 4 --input-modes protocol-set-filter,protocol-key,batch --no-native-input --no-native-pointer --no-native-picker --no-quick-look --no-system-pasteboard --local-fixture-only --json
bun scripts/agentic/index.ts hotkey-prompt-transient-capture-cancel-stress --session default --surface hotkey-prompt --fixture agentic-transient-hotkey --chords cmd+shift+7,ctrl+space --cancel-methods escape,cmd-w --input-modes protocol-key,simulate-key --no-native-input --no-native-pointer --no-config-write --no-global-hotkey-registration --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts process-manager-sort-detail-panel-stability-stress --session default --surface process-manager --fixture agentic-process-table --sort-keys name,cpu,memory,pid --selection-cycles 8 --filter-cycles 4 --input-modes protocol-click,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-process-kill --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts env-prompt-redacted-status-error-recovery-stress --session default --surface env-prompt --fixture agentic-env-status --status-fixtures missing-secret,parse-error,masked-existing,valid-edit --input-modes protocol-set-input,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --no-secret-write --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts command-palette-breadcrumb-route-stack-stress --session default --host main --fixture agentic-actions-breadcrumbs --drill-path parent-action,child-action --filter 'switch' --back-methods escape,breadcrumb-click --input-modes protocol-key,protocol-click,batch --no-native-input --no-native-pointer --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts root-source-chip-action-semantics-stress --session default --queries 'f: AGENTS.md,c: agentic,n: welcome,-c: noise' --actions remove-chip,clear-all,toggle-exclude,open-chip-actions --input-modes protocol-click,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts recent-history-dedupe-root-grouping-stress --session default --fixture agentic-root-recents --sources files,notes,clipboard,dictation,acp-history --query agentic-loop-29-dupe --cycles 6 --input-modes protocol-set-filter,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts inline-attachment-preview-chip-stability-stress --session default --hosts acp-composer,notes --fixture agentic-inline-attachments --origins local-file,fixture-image,fixture-text,script-resource --chip-actions focus,preview,remove,reorder,overflow --input-modes protocol-set-input,protocol-click,batch --no-native-input --no-native-pointer --no-native-picker --no-screen-capture --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts window-title-status-semantics-stress --session default --surfaces main,acp-composer,actionsDialog,promptPopup,notes --states idle,busy,error,dirty,ready --transitions triggerBuiltin,cmd-k,escape,hide-show --input-modes protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts menu-syntax-capture-validation-chip-stress --session default --fixture agentic-capture-validation --cases missing-body-date,missing-date,ready,malformed-url,unresolved-date,dynamic-schema --input-modes protocol-set-filter,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts acp-footer-activity-indicator-stress --session default --hosts acp-composer,notes --fixture agentic-acp-footer-activity --activity-fixtures context-capture,tool-call,plan-update,permission-wait,cancelled,idle-recovered --input-modes protocol-state,batch --agent-fixture scripted-local --no-native-input --no-native-pointer --no-security-prompts --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts acp-model-history-popover-visual-state-stress --session default --families model-selector,local-history --fixture agentic-acp-popover-visual-state --states idle,filtered,empty,loading,current-selection,error-recovered --selection-cycles 8 --filter-cycles 4 --input-modes protocol-set-input,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts acp-context-insertion-preview-parity-stress --session default --sources file-search,browser-history,dictation-history,notes --destination acp-composer --fixture agentic-context-preview-parity --selection-cycles 6 --filter-cycles 4 --insert-modes protocol-accept,batch --input-modes protocol-set-filter,protocol-key,batch --no-native-input --no-native-pointer --no-native-picker --no-quick-look --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json

Adjacent Skills

Use adjacent skills when the work crosses boundaries:

$testing-quality-gates for choosing narrow build/test gates.
$protocol-automation when stdin JSON, receipts, target identity, waitFor, or batch are the behavior owner.
The domain skill for the active surface, such as $acp-chat-core, $actions-popups, $keyboard-focus-routing, $launcher-surface-contracts, or $window-resizing.

Migration Notes

Key Gotchas

simulateKey does NOT go through GPUI's intercept_keystrokes(). Use triggerBuiltin for ACP Chat entry, not simulateKey Tab.
AcpChatView accepts single-character simulateKey for typing, enter to submit, and w with the cmd modifier to close.
Attached popups like ActionsDialog and PromptPopup can expose targeted state snapshots even when they do not expose an independent GPUI key handle. Read state from the popup target first; only escalate to parent-window or native input when you must drive real key delivery.
simulateGpuiEvent is better than simulateKey for interceptor bugs, but handle_unavailable means the target has no usable runtime handle for that path. Treat that as a proof-design problem, not a cue to spam retries.
The app window auto-hides when focus is lost. If captures fail with "Window not found", the window was dismissed.
captureWindow filters out windows under 100x100 (tray icons).
Always unset API keys if you need the setup card: unset ANTHROPIC_API_KEY.
For ACP picker testing, use native macOS input (macos-input.ts --ensure-focus) instead of simulateKey — synthetic keys bypass GPUI's native key interception and do not faithfully exercise picker selection behavior.
Use getAcpState to verify picker acceptance, cursor landing, and input content — do not rely solely on screenshots for ACP state verification.
Use waitFor commands via session.sh rpc for deterministic ACP state transitions — do not use fixed sleeps as proof of ACP state.

name	agentic-testing
description	Human-first runtime testing for Script Kit GPUI: operate the real app through visible user paths to surface UX/UI interaction bugs, then back findings with receipts, screenshots, exact targets, and cleanup.

Agentic Testing

Human-First Purpose

Never let "state-first" become "user-last." State is the witness, not the experience. The experience is the real app being driven through the user's path so bugs actually reveal themselves.

Canonical Ownership

Primary paths and concepts:

scripts/agentic/, .test-output/, .test-screenshots/
Human-like runtime exploration, exact automation targets, screenshots, native input escalation, and cleanup
State-first receipts from getState, getElements, inspectAutomationWindow, waitFor, batch, ACP state/probe APIs, and recipe receipts

First Reads

Start with these sources before editing or proving behavior:

.agents/subagents/agentic-testing-reader.md for broad or high-risk investigation.

Workflow

Review AGENTS.md, the owning skill, and current source context before editing.
Identify the behavior owner before editing shared files. Path ownership is a hint; the user-visible behavior and documented contract decide the owner.
Check adjacent-skill boundaries before changing shared code.
Make the narrowest change that preserves the domain invariant.
For user-visible UX/UI reports, reproduce the real visible workflow first; then verify with the smallest receipt-backed proof that can fail if the behavior regresses.
Report changed files, proof tier, exact commands or receipts, adjacent skills consulted, cleanup status, and remaining risk.

When to Use

After implementing any UI, protocol, or behavior change
For user-reported interaction, UX, or UI bugs where the app should be opened and exercised like a user would
For routine UI and behavior work, this is the default smoke test
When Oracle's autonomous verification says "Run the agentic-testing skill"
Before marking a task as complete
Especially after changes to: prompts, views, keyboard handlers, ACP chat, actions dialog

Human-First, Receipts-Backed Default

No-runtime proof: docs, skills, source audits, or focused tests only. Do not launch the app if runtime evidence is unnecessary.
Visible user-path proof: for user-reported UX/UI/interaction bugs, open or reveal the real app, enter through the same product path, perform the user action visibly, and capture at least one screenshot or equivalent visual receipt when the bug is about what the user sees.
State-first runtime proof: reuse a warm session and prove behavior with getElements, getState, waitFor, batch, and exact automation targets. This backs the visible workflow and is the default for routing, selection, focus, popup ownership, and protocol bugs.
Visual diagnostics proof: capture screenshots when layout, styling, visibility, animation, or real-shell composition is part of the acceptance criteria, and tie those pixels back to semantic receipts.
Native input and focus enforcement: use when protocol-level and GPUI-level paths cannot exercise the real user bug or when the report is specifically about real keyboard, pointer, AppKit focus, or OS delivery.

Rules:

If the proof does not drive the actual app surface through a user-recognizable interaction path, it has not tested a UX/UI bug. It has only inspected implementation state.
If the user report is about visible UI or interaction and your plan never opens/reveals the real app, stop and redesign the proof around the user's path.
If your plan starts with cold start -> show -> screenshot -> log scrape, add state/elements receipts so the screenshot is grounded in the correct surface.
Reuse an existing healthy session before starting a new one.
Avoid stealing OS focus unless the bug specifically involves native focus or the proof requires real keyboard or mouse delivery.
Prefer exact target threading and targeted receipts over reading generic global state.
If a non-visual proof is taking longer than about 10 seconds, redesign the proof before continuing.
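
A minimal sketch of how two of the rules above could be checked against a draft proof plan before any session is started. The `ProofStep` vocabulary, the `ProofPlanDraft` shape, and `lintProofPlan` are hypothetical names introduced for illustration and are not part of the harness. The screenshot rule is generalized here to any screenshot without state/elements receipts, which is an assumption.

```typescript
// Hypothetical step vocabulary for a draft proof plan (illustration only).
type ProofStep =
  | "coldStart"
  | "reuseSession"
  | "openRealApp"
  | "userAction"
  | "getState"
  | "getElements"
  | "screenshot"
  | "logScrape";

interface ProofPlanDraft {
  // True when the user report is about visible UI or interaction.
  reportIsVisibleUi: boolean;
  steps: ProofStep[];
}

// Returns the rule violations found in a draft plan; an empty list means
// neither of the two modeled rules fired.
function lintProofPlan(plan: ProofPlanDraft): string[] {
  const problems: string[] = [];
  const has = (step: ProofStep): boolean => plan.steps.indexOf(step) !== -1;

  // Rule: a visible-UI report must open or reveal the real app.
  if (plan.reportIsVisibleUi && !has("openRealApp")) {
    problems.push("visible-UI report but the plan never opens or reveals the real app");
  }

  // Rule (generalized, assumption): a screenshot needs state/elements receipts.
  if (has("screenshot") && !has("getState") && !has("getElements")) {
    problems.push("screenshot is not grounded by state/elements receipts");
  }

  return problems;
}
```

A plan of `coldStart`, `screenshot`, `logScrape` for a visible-UI report trips both checks, which is the signal to stop and redesign the proof around the user's path.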

Safety Rules (MANDATORY)

NEVER delete files, directories, or data
NEVER modify databases, user data, or production state
NEVER run destructive commands (rm -rf, DROP, git push --force, git reset --hard)
NEVER send requests to external services, APIs, or webhooks
NEVER modify files outside the project directory
NEVER commit, push, or modify git history
ALL verification is read-only: build, launch, screenshot, grep, read logs
Temp pipes/logs may live under /tmp via the session wrapper. Runtime captureWindow screenshots must go in the project's .test-screenshots/ or test-screenshots/ directory, or in ~/.scriptkit/screenshots
The app runs locally only — never connect to production
Every verification run MUST stop every Script Kit process/session it started before reporting results
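
One way to make the last rule hard to forget is to keep a ledger of sessions the run itself launched and refuse to report while any are still running. This is a sketch under assumptions: `SessionLedger` and its methods are hypothetical, and the real session wrapper is not shown in this document.

```typescript
// Hypothetical ledger of sessions started by the current verification run.
class SessionLedger {
  private started: string[] = [];

  // Record a session this run launched. Reused warm sessions are not recorded,
  // because the rule only covers sessions the run started.
  recordStart(sessionId: string): void {
    if (this.started.indexOf(sessionId) === -1) {
      this.started.push(sessionId);
    }
  }

  // Record that a previously started session was stopped.
  recordStop(sessionId: string): void {
    this.started = this.started.filter((id) => id !== sessionId);
  }

  // Sessions this run started and has not yet stopped.
  stillRunning(): string[] {
    return this.started.slice();
  }

  // Call before reporting results; throws if cleanup is incomplete.
  assertClean(): void {
    const leftover = this.stillRunning();
    if (leftover.length > 0) {
      throw new Error("sessions not stopped: " + leftover.join(", "));
    }
  }
}
```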

AFK-Safe Proof Gate

User Bug Triage

Turn a user-filed UI/UX bug into a proof plan before choosing or adding a recipe:

Surface and real entry path.
Visible user workflow, including the app actions a person would take.
Observable invariant the user would notice.
Minimal fixture and relevant state generation.
Unsafe operations to simulate rather than perform.
Screenshot, visual receipt, or explicit reason visual capture is unnecessary.
Expected receipt fields and cleanup proof.
Pass/fail oracle, including what counts as reproduced, fixed, blocked, or inconclusive.
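
The eight triage items above can be captured as a typed plan so that a missing item is caught before a recipe is chosen. This is an illustrative sketch: the `ProofPlan` field names and `missingPlanParts` are assumptions, not the skill's actual schema.

```typescript
type Verdict = "reproduced" | "fixed" | "blocked" | "inconclusive";

// Hypothetical shape mirroring the triage items; field names are assumptions.
interface ProofPlan {
  surface: string;
  entryPath: string;
  userWorkflow: string[];
  observableInvariant: string;
  fixture: string;
  simulatedUnsafeOps: string[]; // may legitimately be empty
  visualReceipt: { kind: "screenshot" } | { kind: "notNeeded"; reason: string };
  expectedReceiptFields: string[];
  cleanupProof: string;
  oracle: Record<Verdict, string>;
}

// Lists the triage items that are still blank in a plan.
function missingPlanParts(plan: ProofPlan): string[] {
  const missing: string[] = [];
  const blank = (s: string): boolean => s.trim().length === 0;

  if (blank(plan.surface)) missing.push("surface");
  if (blank(plan.entryPath)) missing.push("entryPath");
  if (plan.userWorkflow.length === 0) missing.push("userWorkflow");
  if (blank(plan.observableInvariant)) missing.push("observableInvariant");
  if (blank(plan.fixture)) missing.push("fixture");
  // Skipping visual capture is allowed only with an explicit reason.
  if (plan.visualReceipt.kind === "notNeeded" && blank(plan.visualReceipt.reason)) {
    missing.push("visualReceipt.reason");
  }
  if (plan.expectedReceiptFields.length === 0) missing.push("expectedReceiptFields");
  if (blank(plan.cleanupProof)) missing.push("cleanupProof");

  const verdicts: Verdict[] = ["reproduced", "fixed", "blocked", "inconclusive"];
  for (const v of verdicts) {
    if (blank(plan.oracle[v])) missing.push("oracle." + v);
  }
  return missing;
}
```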

Stop Rule

Forbidden-State Simulation

Bug Result Schema

Surface Identity Rules (MANDATORY)

Always verify the real user-facing surface through its real runtime entry path first.
For Script Kit UI, prefer stdin JSON commands, built-in routing, and real app windows over ad hoc component harnesses.
Never treat an isolated GPUI entity, temporary debug window, story, off-screen render, or synthetic wrapper as proof of a real product surface unless the user explicitly asks for component-level verification.
Before trusting a screenshot, confirm the captured surface matches the intended product surface:
- same entry path
- same window/shell
- same wrapper/root chrome
- same footer, sizing, and layout structure
If the screenshot does not clearly match the real surface, stop and re-route verification to the real surface instead of iterating on the fake one.
For ACP specifically, AcpChatView in isolation is not sufficient proof. Default to the real ACP entry path (triggerBuiltin tab-ai, detached chat window routing, or another production runtime path) before using any synthetic ACP harness.
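
The four-point screenshot check above amounts to comparing a fingerprint of the intended surface with a fingerprint of what was actually captured. The sketch below assumes each point can be reduced to a comparable string; `SurfaceFingerprint` and `surfaceMismatches` are hypothetical names, not harness APIs.

```typescript
// Hypothetical fingerprint covering the four identity points listed above.
interface SurfaceFingerprint {
  entryPath: string;       // how the surface was reached
  windowShell: string;     // window/shell identity
  rootChrome: string;      // wrapper/root chrome identity
  layoutStructure: string; // footer, sizing, and layout structure summary
}

// Returns the identity points on which the captured surface differs from the
// intended one. A non-empty result means: stop and re-route to the real surface.
function surfaceMismatches(
  intended: SurfaceFingerprint,
  captured: SurfaceFingerprint
): Array<keyof SurfaceFingerprint> {
  const keys: Array<keyof SurfaceFingerprint> = [
    "entryPath",
    "windowShell",
    "rootChrome",
    "layoutStructure",
  ];
  return keys.filter((key) => intended[key] !== captured[key]);
}
```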

Visual Diagnostics

Use screenshot-semantics-visual-consistency-stress for pass-now visual consistency. It checks strict capture identity, non-blank content audit, getState, getElements, selected row, focus receipt, footer actions, popup crop bounds, and semantic visible text labels.
visibleTextMode:"semanticElements" means the harness found visible text from automation element labels. It is not OCR and not clipping proof.
For visible text, require text bounds, rendered text bounds, measured width, available width, glyph/container bounds, overlap pairs, and truncation metadata. Clipped or ellipsized text is acceptable only when the receipt proves intentional truncation plus tooltip or accessible full text.
For layout measurement, require rem/font/scale metrics, window/content/container/scroll/input/footer bounds, footer/input ownership, and before/after layout-shift fingerprints for filtering or resizing.
For screenshot-to-semantics checks, require the screenshot crop target to match the exact automation window and semantic surface, then cross-check selected row, focus ring, footer actions, and visible text against getElements receipts.
Do not treat pixels alone as proof. A PNG can show that something rendered, but it does not prove the selected row, focus target, visible text, or footer actions are the correct semantic objects unless the receipt ties pixels back to the same target window.
Do not claim text fits from a screenshot alone. Use visible-text-clipping-overlap-stress for clipping and overlap audits; it opens the real main window and combines getElements, getLayoutInfo, and AppKit text measurement to report text bounds, measured width, available width, clipping state, truncation intent, tooltip or accessible full text, overlap pairs, and cleanup.
Do not claim rem/layout correctness from window bounds alone. Use layout-measurement-regression-stress; it opens the real main window, records getLayoutInfo component bounds plus bounded getElements semantics, then drives setFilter and reset to prove filter-churn layout fingerprints stay stable. Treat non-main surface warnings as remaining coverage gaps, not proof.
Do not claim dynamic choice sizing works from source constants alone. Use main-menu-dynamic-choice-resize-stress; it opens real small and large choice prompts in one app session, compares visibleChoiceCount against the fixture counts, measures getWindowBounds before/after, requires height growth with stable width, and cleans up through Escape.
Do not claim Notes auto-resize works from source constants alone. Use notes-window-resize-stress; it opens the real Notes window against a sandbox notes DB, drives targeted Notes batch.setInput through the editor path, measures Notes window bounds before tall content, after tall content, and after short content, and requires height growth, height shrink, stable width, and cleanup.
Do not claim a tall div or container is scroll-safe from a screenshot alone. Use div-container-scroll-overflow-stress; it opens a real DivPrompt, requires a DivContent layout component instead of launcher ScriptList/PreviewPanel components, estimates fixture content height against the viewport, proves scrolling is required, checks cleanup, and warns until div scroll position is exposed as a first-class receipt.
Do not claim contrast/readability from screenshots alone. Use visual-contrast-readable-state-stress; it opens the real main window for visible semantics and collects AGENTIC_THEME_CONTRAST_RECEIPT foreground/background token samples with contrast ratios, minimum ratios, pass flags, theme fingerprints, and cleanup.
For broad exploratory UX coverage, use bun scripts/agentic/user-story-audit.ts --limit 100 --max-ms 60000. It converts existing stress recipes into user-shaped stories, skips stories already exercised in the current thread by default, writes .test-output/agentic-100-user-story-audit-*.json, and separates pass, fail_closed, blocked_precondition, runtime_failure, and timeout. Treat fail-closed results as missing proof/backlog, never as a UI pass.
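
When reading the audit output, the key point is that only `pass` counts as a UI pass and `fail_closed` goes to the backlog. The sketch below shows that bucketing. The `StoryResult` shape is an assumption for illustration; the actual layout of the `.test-output/agentic-100-user-story-audit-*.json` files is not documented here.

```typescript
type StoryOutcome =
  | "pass"
  | "fail_closed"
  | "blocked_precondition"
  | "runtime_failure"
  | "timeout";

// Hypothetical per-story record; the real JSON shape may differ.
interface StoryResult {
  story: string;
  outcome: StoryOutcome;
}

interface AuditSummary {
  counts: Record<StoryOutcome, number>;
  uiPasses: string[];     // only stories whose outcome is exactly "pass"
  missingProof: string[]; // fail_closed stories: backlog, never a pass
}

function summarizeAudit(results: StoryResult[]): AuditSummary {
  const counts: Record<StoryOutcome, number> = {
    pass: 0,
    fail_closed: 0,
    blocked_precondition: 0,
    runtime_failure: 0,
    timeout: 0,
  };
  const uiPasses: string[] = [];
  const missingProof: string[] = [];

  for (const r of results) {
    counts[r.outcome] += 1;
    if (r.outcome === "pass") {
      uiPasses.push(r.story);
    } else if (r.outcome === "fail_closed") {
      missingProof.push(r.story);
    }
  }
  return { counts, uiPasses, missingProof };
}
```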

Hard Interaction Boundaries

When a user flow spans stacked modals, cross-surface export, or app restart recovery, require one receipt that proves ownership boundaries before sending input.

For a modal stack, prove the topmost owner before each Escape, Cmd-W, or Enter action, then prove the child closed or executed without mutating the parent selection/focus unless that parent was the target.
For cross-surface export provenance, prove the payload origin surface, generation, selected semantic id, redacted preview, destination identity, stale-source rejection, and cleanup. Clipboard or drag side effects alone are not proof.
Restart/recovery recipes must gate every promoted target with a session epoch. If the epoch changes, the harness must refuse native, batch, and GPUI input before delivery, then re-resolve the exact target.
For stale-target recovery, prove stale window targets are rejected, exact targets are re-resolved after restart or id churn, no stale input is delivered, and session cleanup ran. Never retry by kind without an identity receipt.
For menu syntax ambiguity, prove tolerant diagnostics, skipped malformed fragments, selected command identity, and no accidental execution before submitting any command.
For IME composition, prove composition start/update/commit boundaries, no premature submit/actions, and final committed text semantics. Plain key events are not enough.
For selected-text fallback, prove permission denial/staleness, redaction, fallback source, and safe action disablement. Never trust stale frontmost-app context or raw selected text logs.
For display migration visual bounds, prove source/target display identity, scale/rem metrics, focus/selection preservation, visible text bounds, screenshot-to-semantics alignment, wrong-display capture rejection, stale migration rejection, and no popup/main clobbering.
For native picker or external app return, prove origin surface identity, handoff request id, picker/external window identity, restored focus/selection/cursor, stale or foreign window event rejection, and no submit or selection mutation during handoff.
For drag cancellation, prove drag session identity, scoped payload fingerprint, redacted preview, hover/drop target cleanup, origin focus/selection restoration, no clipboard/file/attachment/prompt side effects, stale drag rejection, and foreign drop rejection.
For runtime appearance churn while focused input is active, prove surface/window identity, focus, text, visible text, cursor/selection, rem/font/scale/layout metrics, theme and renderer token generations, stale token repaint rejection, and wrong-surface mutation rejection.
For power resume recovery, prove pre-sleep target identity, post-wake target generation, stale target refusal before native/batch/GPUI/screenshot delivery, exact re-resolution, fresh state/elements/screenshot receipts, focus/selection preservation, and cleanup.
For menu/tray/notification interruption, prove active modal/prompt identity, interruption identity, wrong-surface action rejection, topmost modal preservation, no focus steal, no selection/input/cursor mutation, no prompt submit, and focus restoration.
For streaming progress cancellation, prove stream run identity, monotonic progress samples, visible progress text, cancellation request/ack ordering, stale post-cancel chunk rejection, no stale repaint, screenshot-to-state revalidation, focus/cursor restoration, no accidental submit, and cleanup.
For dictation/media permission readiness churn, prove passive microphone/model setup, permission and model readiness generations, churn event ordering, target identity, transcript generation identity, wrong-target delivery rejection, no auto-submit, no System Settings or TCC mutation, focus/cursor preservation, and cleanup.
For animation frame capture determinism, prove app animation frame ids, capture sequence ids, state/elements/screenshot receipts per sampled frame, visible text/layout fingerprints, motion occlusion pairs, stable frame ordering, stale-frame rejection, wrong-window rejection, blank-frame rejection, and cleanup.
For accessibility tree semantic parity, prove visible controls, automation elements, and AX nodes share roles, labels, focus order, tab order, disabled states, safe keyboard activation semantics, hit targets, screenshot-to-semantics alignment, stale-tree rejection, wrong-window rejection, and cleanup.
For RTL/bidirectional/emoji text rendering, prove direction runs, bidi levels, grapheme clusters, emoji ZWJ and combining mark sequences, cursor visual positions, selection rectangles, visible/rendered text bounds, truncation intent, search/filter semantics, stale layout rejection, wrong-surface mutation rejection, no accidental submit, and cleanup.
For high-volume virtualized list stability, prove fixture identity, row identity, virtualization generations, selected-row reanchor, selection reanchor, scroll anchor preservation, rapid filter ordering, stale result rejection, duplicate-key rejection, blank-row rejection, footer-safe selection, screenshot-to-semantics consistency, and cleanup.
For input-device modality transitions, prove hover/focus/selection affordances, pointer hover, keyboard focus, selection, trackpad/wheel scroll anchors, shortcut ownership, activation ownership, stale modality event rejection, wrong-surface input rejection, no accidental submit, screenshot-to-state revalidation, and cleanup.
For multi-context attachment dedupe/provenance, prove file, screenshot, selected-text, MCP resource, script resource, and clipboard snippet origins across ACP Composer and Notes, with attachment provenance, destination generations, dedupe keys, provenance fingerprints, redacted preview, remove/reorder receipts, stale provenance rejection, duplicate-id rejection, no privacy leaks, and cleanup.
For visual contrast readable state checks, prove active, inactive, disabled, focused, error, and loading states across themes, scale factors, and surfaces with theme token fingerprints, rem/scale metrics, text/color/bounds receipts, contrast ratios, non-color state cue coverage, screenshot-to-state revalidation, stale theme token rejection, wrong-surface rejection, and cleanup.
For empty/error/retry state UX, prove empty, loading, error, retry, and recovered states with visible text, semantic retry identity, footer-safe actions, stable selection, and no stale error after recovery.
For form validation and inline error recovery, prove invalid submit prevention, focus on the first invalid field, preserved user input, inline error identity, errors cleared on valid edits, final submit recovery, accidental submit prevention, and no cross-field error leakage.
For navigation/back-stack history, prove transition generations, route stack depth, actions discoverability, disabled/no-op affordances, Escape/back/Cmd-K close behavior, and that return-to-origin restores selection, filter, scroll, footer, and focus without stale surface state.
For long text wrapping/resizing UX stress, prove fixture identity, width mode, resize generation, full text, visible text, text/rendered/element bounds, available width, measured width, wrap line count, truncation intent, tooltip or accessible full text, overlap pairs, footer/input collision, focus and selection preservation, stale resize rejection, wrong-surface rejection, and cleanup.
For actions/command discoverability UX, use actions-command-discoverability-noop-stress to drive real Cmd-K action popups and prove visible action row ids, labels, sections, shortcuts, destructive/enabled flags, context identity, safe Escape cleanup, and no accidental execution. Treat disabled/no-op state warnings as coverage gaps until fixtures expose those row kinds.
For Notes window resizing UX, use notes-window-resize-stress to prove a real Notes window opened through openNotes, targeted Notes editor input changed through batch.setInput, a sandbox notes DB isolated user data, tall content grew the window, short content shrank it, width stayed stable, Notes stayed visible, and the launched session was stopped.
For dense list/detail preview readability, prove selected row identity, preview source identity, preview title/body bounds, metadata chip readability, footer action readability, filter generations, selection generations, resize generations, stale-preview rejection, row reanchor, focus preservation, no column/footer overlap, and cleanup.
For transient toast/notification feedback, prove queue generation, bridge generation, visible text, duplicate collapse, autohide/dismiss ordering, bounds/overlap, footer/input non-blocking, stale rejection, and no action execution from toast UI.
For destructive confirmation, prove dry-run-only fixture identity, confirm prompt identity, focused button, Enter/Escape resolution, no mutation before confirm, no mutation after cancel, no real system command request, stale/wrong-surface rejection, and parent focus/selection/filter restoration.
For loading skeleton/progress restoration, prove request/result generations, skeleton rows, progress text/percent monotonicity, activation blocking while loading, stale loading/progress/result rejection, skeleton cleanup after results, and selection/focus/filter/scroll restoration.
For icon/image fallback redaction, prove requested image source kind, redacted source fingerprint, fallback icon kind, fallback reason, image load generation, no raw path/URL/content leakage, stale image rejection, accessible label preservation, and cleanup.
For footer/status persistence, prove owner, native footer surface id, rendered buttons, shortcut labels, status generation, persistence across filter/selection/actions transitions, duplicate-footer rejection, stale-status rejection, wrong-surface rejection, and cleanup.
For keyboard hint label parity, prove footer, row accessory, tooltip, action catalog, normalized shortcut tokens, platform glyphs, disabled-state parity, activation owner, no accidental execution, stale-hint rejection, wrong-surface rejection, and cleanup.
For row state parity without native pointer input, prove selected, focused, hovered, and selected-hovered row states through semantic row ids, state/elements receipts, modality receipts, tokenized fill/focus/text/icon states, stale-row rejection, wrong-surface rejection, no accidental execution, and cleanup.
For quiet chrome/card nesting regressions, prove shell/content/row/popup/footer chrome layers, border/fill/shadow tokens, card depth, inset/gap/radius, duplicate-border rejection, opaque-fill rejection, stale-token rejection, wrong-surface rejection, and cleanup.
For scroll shadows, sticky headers, and density drift, prove scroll position, viewport/content bounds, sticky header bounds/z-index, scroll shadow opacity tokens, row/header/input/footer heights, rem/scale metrics, footer-safe viewport, selected-row visibility, stale-scroll rejection, wrong-surface rejection, and cleanup.
For popup focus/keycap visual semantics, prove popup owner identity, focused button/keycap parity, normalized shortcut glyphs, danger semantics on labels rather than keycaps, parent focus/selection preservation, stale focus rejection, wrong-surface rejection, no accidental execution, AFK-safe flags, and cleanup.
For reduced-motion animation disable behavior, prove fixture-only reduced-motion policy, animation/transition generations, stable opacity/transform/frame receipts, disabled shimmer/spinner/pulse motion, focus/selection/cursor preservation, stale motion rejection, wrong-surface rejection, no System Settings or TCC mutation, AFK-safe flags, and cleanup.
For command search highlighting/accessory badges, prove query/search generations, highlighted ranges, command row identity, accessory badge order/kinds/tooltips, disabled/no-op/loading reasons, action-catalog parity, stale highlight/badge rejection, wrong-host rejection, no accidental execution, AFK-safe flags, and cleanup.
For clipboard copy visual feedback, prove fixture-scoped pasteboard isolation, visible copied state, copied-state duration, copy toast receipt, redacted preview, payload fingerprint, unchanged system pasteboard fingerprints, stale copy rejection, wrong-host rejection, no accidental paste, AFK-safe flags, and cleanup.
For portal cancel/back return restoration, prove origin generation, draft/cursor/selection/filter/scroll before the portal, portal session identity, Escape/back cancel receipts, return target identity, restored focus/draft/cursor/selection/filter/scroll, no context insertion, no prompt submit, stale/foreign/wrong-origin rejection, AFK-safe flags, and cleanup.
For tooltip hover/focus affordances, prove protocol-hover and keyboard-focus triggers, target identity, tooltip generation, text/kind/anchor/bounds/placement, hover delay, accessible description parity, Escape/scroll/focus-loss dismissal, no focus steal, target focus preservation, no target/footer/popup-owner coverage, stale/wrong-surface rejection, AFK-safe flags, and cleanup.
For shortcut recorder cancel/layering UX, prove compact modal layering, parent/recorder bounds, placeholder and cancel affordance, Escape/Cmd-W/backdrop/parent-click cancellation, no chord capture, unchanged config fingerprints, no global hotkey registration, parent focus/selection restoration, stale/wrong-parent rejection, AFK-safe flags, and cleanup.
For inline popover anchor/resize UX, prove family/origin/popup identity, trigger range, anchor and popup bounds before/after resize, selected row visibility and identity, synopsis/footer bounds, no parent clipping or viewport overflow, z-order above parent, no focus steal, keyboard fallback, strict capture target, blank screenshot rejection, stale resize rejection, AFK-safe flags, and cleanup.
For disabled footer hit-target refusal, prove active footer/native footer identity, disabled primary action label/reason/visual/accessibility state, Enter/shortcut/protocol-click refusal, Cmd-K action availability, no submit receipt, unchanged side-effect counts and state fingerprints, focus/selection/filter preservation, stale/wrong-surface rejection, AFK-safe flags, and cleanup.
For mini/full transition layout continuity, prove Mini/Full/hide-show/return-to-origin transitions preserve rem/scale, window/content/input/list/footer bounds, native footer identity, focus ring bounds, selected row visibility above footer, no input/footer overlap, no clipping, no popup/main clobbering, screenshot-to-semantics alignment, strict capture target, blank screenshot rejection, stale mode rejection, AFK-safe flags, and cleanup.
For filter input decoration chip layout, prove rendered input text, stripped search text, chip ranges/roles/bounds, cursor and placeholder bounds, measured/available width, visible text, decoration/input generations, stale decoration clearing, no chip text/cursor/placeholder/footer overlap, no horizontal clipping, accessible full text, screenshot-to-semantics alignment, strict capture target, stale generation rejection, AFK-safe flags, and cleanup.
For focus ring viewport integrity, prove focused semantic id and owner, focus ring/focused element/viewport/content/footer/popup bounds, ring visibility, no clipping, ring within viewport and above footer, no footer/popup occlusion, stable tab order, preserved selection/scroll anchor, focus restoration after Escape, no activation/submit, stale focus rejection, AFK-safe flags, and cleanup.
For warning banner action/dismiss semantics, prove banner identity, visible text/bounds/text bounds, action and dismiss semantic ids, hover/focus state, action-vs-dismiss click receipts, non-color state cue, contrast ratio, no footer/input obstruction, stale banner rejection, AFK-safe flags, and cleanup.
For SelectPrompt keyboard multi-selection state parity, prove choice count, focused/selected/checked/visible row ids, selection count and footer labels, filter and selection generations, Cmd-A/space/range-toggle receipts, filter preservation, checked rows matching state/elements, no submit/activation, stale selection rejection, AFK-safe flags, and cleanup.
For File Search safe preview sanitization, prove selected row/file identity, preview generation/source/render kind/title/visible text/bounds/text bounds, byte limit/truncation, binary/missing/unsupported fallbacks, private path redaction, no raw path leak, no network/external service, no Quick Look/native picker/pasteboard mutation, stale preview rejection, and cleanup.
For HotkeyPrompt transient capture/cancel UX, prove prompt type and surface identity, capture panel/input semantics, placeholder text, captured chord tokens and HotkeyInfo, simulateKey capture, Escape/Cmd-W cancellation, null submit on cancel, unchanged config fingerprints, no global hotkey registration, no shortcut recorder route, parent focus restoration, stale/wrong-surface rejection, AFK-safe flags, and cleanup.
For Process Manager sort/header/detail panel stability, prove fixture identity, table header semantic ids, sort key/direction/generation, non-selectable section headers, selected process/pid identity, detail panel generation/source/title/metrics, CPU/memory/PID parity, row reanchor after sort, header aria-sort labels, disabled kill action, no process signal, stale sort/detail rejection, AFK-safe flags, and cleanup.
For EnvPrompt redacted status/error recovery, prove prompt/fixture identity, status generation/kind/text, inline error and first-invalid-field semantics, masked value visibility, secret redaction and fingerprinting, no raw secret leak, no secret/config writes, valid edit clearing errors, disabled submit reason, focus preservation, stale/wrong-field rejection, AFK-safe flags, and cleanup.
For command palette breadcrumb route-stack UX, prove host/topmost owner identity, route stack depth, breadcrumb trail labels and semantic ids, active route id, parent/child snapshots, drill-down push receipt, breadcrumb and Escape back receipts, preserved search text, restored selection/scroll anchor, no pre-drill onSelect, no accidental execution, stale route rejection, AFK-safe flags, and cleanup.
For root source-chip action semantics, prove rendered input text, stripped search text, source filter set, source chip semantic ids and roles, remove/clear-all/toggle-exclude/action-menu receipts, decoration generation, preflight indicators, non-selectable status chips, grouped rows suppress disallowed sources, blocked history recall, selection preservation, no status-as-action subject, stale chip rejection, AFK-safe flags, and cleanup.
For recent/history dedupe root grouping stability, prove fixture snapshot and source catalog generations, query/root/passive frame keys, visible result roles, group order, contiguous Files section, stable Search Files continuation, dedupe keys and rejected collisions, metadata-only history rows, no transcript/body leaks, stable selection key, row fingerprints across cycles, fallback suppression, stale passive publish rejection, AFK-safe flags, and cleanup.
For inline attachment preview chip stability, prove fixture attachment set identity, host surface/composer generation, chip semantic ids/kinds/labels/bounds, redacted preview fingerprints, overflow/focus/remove/reorder receipts, cursor and selection preservation, no raw path/content leak, no native picker/screen capture/pasteboard/network, stale attachment rejection, AFK-safe flags, and cleanup.
For window title/status semantics, prove resolved target identity, automation/native/semantic titles, visible status text, title/status generations, transition receipts, detached window parity, attached popup parent title preservation, error recovery, stale title/status rejection, no focus steal, AFK-safe flags, and cleanup.
For menu syntax capture validation chips, prove fixture catalog identity, filter input text, main hint snapshot, capture validation status, status chip labels, missing/malformed/unresolved field labels, fragment preview rows, priority choices row, Enter prevention when invalid, no payload write, no handler spawn, stale validation rejection, AFK-safe flags, and cleanup.
For ACP footer activity indicators, prove fixture agent event stream identity, host/footer owner, native footer surface id, GPUI/native footer dot status, activity status transitions, context-capture/tool-call/plan-update/permission-wait/cancelled/idle states, footer repaint generation, stable pulse tokens, preserved model label, no global AI footer button, no agent process spawn, no security prompt, stale/wrong-host rejection, AFK-safe flags, and cleanup.
For ACP model/history popover visual state, prove popup catalog identity, family/automation id/kind, anchor and popup bounds, selected/focused row ids, row visual-state tokens, current-model and history-recency badges, redacted history previews, empty/loading/error-recovered states, synopsis bounds, filter selection preservation, stale/wrong-popup rejection, AFK-safe flags, and cleanup.
For ACP context insertion preview parity, prove source/destination identities, portal session id, selection and preview generations, selected row id/title/kind/fingerprint, accepted context URI, inserted token alias and preview fingerprint, composer generation and replacement range, row preview matching inserted context, selection preservation, stale/selection-drift/wrong-destination rejection, no raw content leak, no picker/Quick Look/pasteboard/network/submit, AFK-safe flags, and cleanup.
For ACP slash/mention provider visibility, prove provider hint catalog identity, popup family, trigger/query text, readiness generation, provider visibility rows, hint text, unavailable/loading/error-recovered/filtered-empty states, hidden-until-resource-available behavior, provider-specific rows for slash and mention, selected/focused row ids, disabled-provider refusal, stale generation rejection, no native picker/Quick Look/network/submit, AFK-safe flags, and cleanup.
For ACP composer token keyboard edit parity, prove host surface identity, composer generation, token semantic ids/kinds/aliases/bounds, cursor-before/after-token positions, atomic Backspace/Delete removal, range removal, move-left/move-right receipts, token order before/after, pending context preservation, slash skill context preservation, pasted metadata preservation, cursor selection preservation, no partial token text leak, stale/duplicate/wrong-host rejection, no system pasteboard/native input/submit, AFK-safe flags, and cleanup.
For ACP transcript stream retry virtualization, prove fixture transcript/thread generations, virtualized message window, visible/message row ids, stream run/chunk sequence, monotonic chunk append, scroll anchors, bottom stickiness, user-scrolled-away preservation, assistant error identity/text, retry button identity, retry draft/request/recovery generations, no stale error after recovery, stale chunk and wrong-message retry rejection, stable virtualized row identity, no blank rows, no transcript body leaks, no agent process spawn/security prompt/network/submit, AFK-safe flags, and cleanup.
For ACP plugin skill entry thread affinity, prove fixture skill catalog identity, entry path, host surface identity, resolved ACP target, target thread id, embedded/detached thread reuse, selected skill id and file fingerprint, slash token text/range, pending skill context part URI, context binding to the target thread, composer generation, return origin snapshot, stale launcher and detached thread rejection, no auto-submit/agent process/security prompt/network, AFK-safe flags, and cleanup.
For Notes cart ACP handoff dedupe, prove sandbox notes store identity, fixture note ids, active note id, cart snapshot generation, cart item ids/dedupe keys, rejected duplicates, handoff session id, destination host identity/generation, staged context URIs, inline aliases, redacted preview fingerprints, dry-run consume generation, cancel restoration, switch-note cleanup, wrong-note and stale-cart rejection, no raw note body leak, no user notes mutation/network/agent process, AFK-safe flags, and cleanup.
For root Files source-filter pagination footer UX, prove fixture file provider identity, source filters, rendered and stripped input text, root frame key, provider/page generations, page size, visible file row ids/fingerprints, Search Files continuation row, selected stable key before/after, selected row visible above the footer, main list scroll metrics, near-bottom page request, page append and delayed provider publish not replacing the frame, duplicate key rejection, fallback suppression, non-selectable status chips, Quick Look/native picker/pasteboard/network/submit refusal, stale page rejection, AFK-safe flags, and cleanup.
For File Search directory breadcrumb restoration, prove fixture directory tree identity, redacted breadcrumb segments, only-in-filter chip identity, rendered and stripped search text, visible file rows, directory rows before/after navigation, selected file before/after, breadcrumb click reanchor, filter preservation, back/forward stack depth, scroll anchor restoration, preview generation, no raw path leak, native picker and Quick Look refusal, stale directory rejection, wrong-origin rejection, AFK-safe flags, and cleanup.
For Emoji Picker skin-tone/category UX, prove fixture emoji catalog identity, category tabs, selected category, sticky header bounds, skin-tone palette identity/bounds, variant ids, selected skin-tone token, row ids, ZWJ sequence ids, grapheme fingerprints, search generation, highlighted ranges, accessible label parity, preview glyph bounds, palette dismissal, category-switch selection preservation, no pasteboard mutation or emoji insertion, stale palette rejection, wrong-category rejection, AFK-safe flags, and cleanup.
For root Windows source-filter activation refusal, prove fixture window provider identity, source filters, rendered and stripped input text, root frame key, window snapshot and z-order generations, visible window row ids/fingerprints, selected stable key before/after, selected row visibility, actions subject stable key, dry-run activation receipt, Enter activation refusal, no native window activation or focus steal, duplicate key rejection, stale snapshot rejection, non-selectable status chips, AFK-safe flags, and cleanup.
For Notes markdown preview scroll sync, prove sandbox notes store identity, fixture notes, active note before/after, markdown fixture ids, editor and preview generations, rendered markdown block ids, preview fingerprints, cursor and selection ranges, editor and preview scroll anchors, sync delta, split pane bounds, preview-toggle and switch-note cleanup, focus restoration, no user notes mutation or raw note body leak, stale preview and wrong-note rejection, no pasteboard/network/external service/native input, AFK-safe flags, and cleanup.
For Quick Terminal ANSI scrollback search readability, prove fixture terminal transcript identity, terminal surface id, transcript generation, ANSI/SGR runs, wide-cell graphemes, combining marks, hyperlink spans with redacted fingerprints, stderr blocks, prompt continuation rows, viewport rows/range, search generation, search hit ids, highlighted cell ranges, selected hit visibility, wrap markers, cursor/prompt/footer bounds, stale transcript rejection, no shell command spawn, no raw hyperlink leak, no pasteboard/network/external service/native input, AFK-safe flags, and cleanup.
For script output inspector folding recovery, prove fixture script run id, output fixtures, stream generation, stdout/stderr blocks, ANSI stack frames, JSON lines, progress rewrite generation, exit badge kind/bounds, filter text and highlight ranges, stderr fold before/after, stack expanded state, clear-filter restoration, dry-run retry receipt, no handler spawn or process kill, selection scroll anchor restoration, no interleave drift, stale output and wrong-run rejection, no pasteboard/network/external service/native input, AFK-safe flags, and cleanup.
For App Launcher icon-grid keyboard navigation, prove fixture app catalog identity, grid generation, visible app ids, icon bounds/fingerprints, selected app before/after, selected cell bounds/visibility, keyboard neighbor map, row/column count, filter generation, rendered/stripped query, empty state, preview panel, truncated-name tooltip, no text overlap or footer collision, Enter launch refusal, stale catalog and wrong-app rejection, no app launch/pasteboard/network/native input, AFK-safe flags, and cleanup.
For Browser History time-grouped privacy, prove fixture history provider identity, history generation, time bucket ids, sticky header bounds, visible visit ids/fingerprints, favicon fallback ids, redacted URL fingerprints, rendered/stripped query, selection before/after, duplicate collapse, no raw private URL leak, no favicon network request, browser activation refusal, stale history and wrong-visit rejection, no pasteboard/network/native input, AFK-safe flags, and cleanup.
For Settings preferences search/reset preview, prove sandbox config identity, preference section ids, visible preference ids, control bounds and accessible names, values before/preview, dirty preference ids, rendered/stripped query, search highlights, reset preview/cancel restoration, disabled-control refusal, no config write or secret leak, stale preference and wrong-preference rejection, no pasteboard/network/native input, AFK-safe flags, and cleanup.
For Settings read-only detail panel navigation, prove fixture catalog id, settings surface id, selected section before/after filtering, detail panel generation, visible row labels and bounds, text/detail/footer bounds, empty-state copy, disabled Apply/Save reason, unchanged config fingerprint, no setup/security prompt, stale detail and wrong-section rejection, no System Settings/TCC/config write/native input, AFK-safe flags, and cleanup.
For Design Picker preview restore visuals, prove fixture design catalog id, active design before preview, preview id/generation, theme token fingerprints before/preview/restore, visible picker rows/labels/bounds/text bounds, selected preview row visibility, screenshot-to-semantics target identity when capture is enabled, Escape/Cmd-W restore preview-only state, unchanged persisted design/config fingerprint, stale preview and wrong-surface rejection, no design/config write/native input, AFK-safe flags, and cleanup.
For Dictation History transcript preview redaction, prove fixture dictation store id, transcript row ids, transcript/query generations, selected transcript before/after filters, preview generation/source/render kind, visible preview text bounds, redacted transcript fingerprint, missing-audio fallback copy, emoji/grapheme bounds, footer/input non-overlap, no raw transcript/audio path leak, no microphone/media permission request, stale transcript and wrong-row rejection, no System Settings/TCC/native input, AFK-safe flags, and cleanup.

The Pattern

Every verification follows the same core loop:

1. Build Only What the Change Can Break

cargo build 2>&1 | tail -5

Only rebuild when the touched files can invalidate the binary or helper you need to exercise.

Docs, skills, notes, or source-audit-only changes: skip build.
Bun or shell harness changes with no Rust protocol changes: reuse the current debug binary if it already exists and a healthy session can start.
Rust or runtime changes: run cargo build.

If you do build, it must complete with Finished. If it fails, fix the build error first.
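The build rules above can be sketched as a small classifier over the changed paths. This is a minimal sketch, not a helper from the repo: the name `needs_cargo_build` is hypothetical, and the mapping (Rust sources and Cargo manifests trigger a rebuild, everything else does not) is an assumption that approximates "Rust or runtime changes".

```shell
# Hypothetical helper: decide whether a set of changed paths requires `cargo build`.
# Assumption: Rust sources and Cargo manifests are the only files that can
# invalidate the debug binary; docs, skills, and Bun/shell harness files cannot.
needs_cargo_build() {
  local path
  for path in "$@"; do
    case "$path" in
      *.rs|Cargo.toml|*/Cargo.toml|Cargo.lock|*/Cargo.lock)
        return 0  # Rust or runtime change: rebuild
        ;;
    esac
  done
  return 1  # docs, skills, notes, or harness-only change: skip the build
}
```

A caller might run `needs_cargo_build $(git diff --name-only) && cargo build 2>&1 | tail -5`, keeping in mind that unquoted expansion breaks on paths containing spaces.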

2. Reuse or Start a Session

# First look for a healthy reusable session
bun scripts/agentic/session-state.ts --list
bash scripts/agentic/session.sh status default

# Start or resume a named session — works from any shell
# session.sh waits for the APP_READY log marker instead of sleeping
SESSION_JSON="$(bash scripts/agentic/session.sh start default 2>/dev/null)"
APP_PID="$(printf '%s' "$SESSION_JSON" | jq -r '.pid')"
PIPE="$(printf '%s' "$SESSION_JSON" | jq -r '.pipe')"
LOG="$(printf '%s' "$SESSION_JSON" | jq -r '.log')"
READY="$(printf '%s' "$SESSION_JSON" | jq -r '.ready // false')"
READY_WAIT_MS="$(printf '%s' "$SESSION_JSON" | jq -r '.readyWaitMs // 0')"

# Fallback only if readiness marker was not observed.
if [ "$READY" != "true" ]; then
  sleep 0.5
fi

Session commands:

bash scripts/agentic/session.sh start [NAME]    # Create or resume (default: "default")
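bash scripts/agentic/session.sh send NAME CMD    # Send JSON command (fire-and-forget)
bash scripts/agentic/session.sh rpc NAME CMD --expect TYPE  # Send request, wait for typed response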
bash scripts/agentic/session.sh status [NAME]    # Check session state (JSON)
bash scripts/agentic/session.sh stop [NAME]      # Stop and clean up
bun scripts/agentic/session-state.ts --session NAME  # Detailed state report
bun scripts/agentic/session-state.ts --list          # List all sessions

All commands emit stable JSON envelopes on stdout (schemaVersion, status, payload). Diagnostics go to stderr.

start is idempotent — re-running it resumes an existing healthy session.

Alternative (legacy, single-shell only):

PIPE=$(mktemp -u)
mkfifo "$PIPE"
export SCRIPT_KIT_AI_LOG=1
./target/debug/script-kit-gpui < "$PIPE" > /tmp/sk-test.log 2>&1 &
APP_PID=$!
exec 3>"$PIPE"
sleep 3

3. Show the Window Only When Needed

# Session-based (any shell)
bash scripts/agentic/session.sh send default '{"type":"show"}'
sleep 0.3  # macOS focus-settling delay

The app starts hidden. State-only proofs should usually skip this step entirely.

Show the window only for screenshots, native input, or other proofs that require the real visible surface.

4. Interact

Send commands via the session. Common commands:

S="bash scripts/agentic/session.sh send default"

# Set filter text
$S '{"type":"setFilter","text":"search term"}'

# Read current state without touching focus
bash scripts/agentic/session.sh rpc default '{"type":"getState","requestId":"s1"}' --expect stateResult

# Discover visible elements (returns semantic IDs)
bash scripts/agentic/session.sh rpc default '{"type":"getElements","requestId":"e1"}' --expect elementsResult

# Discover an attached popup or detached surface directly by target
bash scripts/agentic/session.sh rpc default '{"type":"getElements","requestId":"e2","target":{"type":"kind","kind":"actionsDialog","index":0}}' --expect elementsResult

# Select element by semantic ID (from getElements response)
bash scripts/agentic/session.sh rpc default '{"type":"batch","requestId":"b1","commands":[{"type":"selectBySemanticId","semanticId":"choice:0:apple","submit":true}]}' --expect batchResult

# When supported, mutate popup state directly instead of typing through native focus
bash scripts/agentic/session.sh rpc default '{"type":"batch","requestId":"b2","target":{"type":"kind","kind":"actionsDialog","index":0},"commands":[{"type":"setInput","text":"alias"}]}' --expect batchResult

# Trigger a built-in view
$S '{"type":"triggerBuiltin","name":"clipboard"}'
$S '{"type":"triggerBuiltin","name":"tab-ai"}'
$S '{"type":"triggerBuiltin","name":"emoji"}'
$S '{"type":"triggerBuiltin","name":"apps"}'
$S '{"type":"triggerBuiltin","name":"file-search"}'

# Simulate keys (dispatches to current view; not suitable for interceptor bugs)
$S '{"type":"simulateKey","key":"enter","modifiers":[]}'
$S '{"type":"simulateKey","key":"escape","modifiers":[]}'
$S '{"type":"simulateKey","key":"k","modifiers":["cmd"]}'
$S '{"type":"simulateKey","key":"w","modifiers":["cmd"]}'

# Prefer GPUI event dispatch over simulateKey when you need the real key pipeline
bash scripts/agentic/session.sh rpc default '{"type":"simulateGpuiEvent","requestId":"g1","target":{"type":"main"},"event":{"type":"keyDown","key":"down","modifiers":[]}}' --expect simulateGpuiEventResult

# Type individual characters (for views with text input)
$S '{"type":"simulateKey","key":"h","modifiers":[]}'

# Query ACP state (returns input, cursor, picker, accepted item, thread status)
bash scripts/agentic/session.sh rpc default '{"type":"getAcpState","requestId":"acp1"}' --expect acpStateResult

5. Capture Screenshots

mkdir -p .test-screenshots
bash scripts/agentic/session.sh send default '{"type":"captureWindow","title":"","path":"'"$(pwd)"'/.test-screenshots/step-01.png"}'
sleep 1

title is a substring match; "" matches any window.
For embedded ACP in the main Script Kit window, use title: "" or the resolver-driven verify-shot.ts / window.ts flow. Do not assume the title contains ACP Chat.
Path must be absolute — use $(pwd)/ prefix.
Runtime captureWindow does not allow arbitrary /tmp/*.png output paths.
Always sleep 1 after capture so the file write can finish.
The screenshot must come from the real runtime surface you are verifying, not a synthetic component window.
Read the PNG to visually verify. Never assume correctness without checking.
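The path rules above can be enforced before sending the command. The following is a minimal sketch under stated assumptions: `screenshot_path` and `capture_command` are hypothetical names, any path under `/tmp/` is treated as refused (the document only says arbitrary `/tmp/*.png` paths are not allowed), and the path is assumed to contain no characters that need JSON escaping.

```shell
# Hypothetical helper: build a screenshot path under BASE/.test-screenshots/
# and reject paths that are relative or live under /tmp.
screenshot_path() {
  local base="$1" label="$2" path
  path="${base}/.test-screenshots/${label}.png"
  case "$path" in
    /tmp/*) echo "captureWindow does not accept /tmp output paths: $path" >&2; return 1 ;;
    /*)     printf '%s\n' "$path" ;;
    *)      echo "path must be absolute: $path" >&2; return 1 ;;
  esac
}

# Build the captureWindow JSON for a label; the empty title matches any window.
capture_command() {
  local path
  path="$(screenshot_path "$1" "$2")" || return 1
  printf '{"type":"captureWindow","title":"","path":"%s"}\n' "$path"
}
```

In a real run the result would be passed to the session, for example `bash scripts/agentic/session.sh send default "$(capture_command "$(pwd)" step-01)"`, followed by the 1s wait.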

6. Read Logs

grep -i "keyword" "$LOG" | head -20             # session-based: $LOG from step 2
grep -i "keyword" /tmp/sk-test.log | head -20   # legacy single-shell launch

Log format: TIMESTAMP|LEVEL|CATEGORY|cid=CORRELATION_ID message
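When grep output needs to be filtered by level or correlation id, the documented format can be split into fields. This is a sketch, assuming the first three fields never contain `|` and the correlation id contains no spaces; `parse_log_line` is a hypothetical name, and the sample line in the usage note is made up for illustration.

```shell
# Split one log line of the form
#   TIMESTAMP|LEVEL|CATEGORY|cid=CORRELATION_ID message
# into tab-separated fields: timestamp, level, category, cid, message.
# Returns 1 for lines that do not follow the format.
parse_log_line() {
  local line="$1" ts level category rest cid message
  ts="${line%%|*}";       rest="${line#*|}"
  level="${rest%%|*}";    rest="${rest#*|}"
  category="${rest%%|*}"; rest="${rest#*|}"
  case "$rest" in
    cid=*) ;;
    *) return 1 ;;                      # not a structured log line
  esac
  rest="${rest#cid=}"
  cid="${rest%% *}"                     # everything up to the first space
  message="${rest#* }"                  # everything after the first space
  [ "$message" = "$rest" ] && message=""  # line had no message after the cid
  printf '%s\t%s\t%s\t%s\t%s\n' "$ts" "$level" "$category" "$cid" "$message"
}
```

For example, `parse_log_line '2026-05-19T04:08:00Z|INFO|ACP|cid=abc123 picker opened'` yields the five fields separated by tabs, which can then be piped to `cut -f4` to collect correlation ids.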

7. Cleanup

# Session-based (preferred)
bash scripts/agentic/session.sh stop default

# Verify the session is actually gone before reporting success
bash scripts/agentic/session.sh status default

# Legacy fd 3 cleanup (single-shell only)
# exec 3>&-
# rm -f "$PIPE"
# kill $APP_PID 2>/dev/null || true
# wait $APP_PID 2>/dev/null || true

Cleanup is mandatory, even after failures or interrupted runs.

Do not report PASS or FAIL until the session you started has been stopped.
If you launched Script Kit via session.sh, run session.sh stop NAME and verify the session is no longer alive.
If you launched Script Kit directly, kill that specific PID and wait for it.
Do not leave orphan script-kit-gpui processes behind from agentic testing.
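For the direct-launch case, the kill-and-wait rule can be wrapped so it also confirms the process is gone. This is a minimal sketch; `stop_launched_pid` is a hypothetical name, and `wait` only blocks for processes started by the current shell, so the final check is reliable only in that case.

```shell
# Sketch for the direct-launch case: stop one specific PID and confirm it is gone.
# Only the PID passed in is touched; nothing is killed by name or pattern.
stop_launched_pid() {
  local pid="$1"
  kill "$pid" 2>/dev/null || true      # ask the process to terminate (SIGTERM)
  wait "$pid" 2>/dev/null || true      # reap it if it is a child of this shell
  if kill -0 "$pid" 2>/dev/null; then  # signal 0 only checks that the PID exists
    echo "process $pid is still alive" >&2
    return 1
  fi
  return 0
}
```

Registering it with `trap 'stop_launched_pid "$APP_PID"' EXIT` right after launch is one way to make cleanup run after failures and interrupted runs as well.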

8. Report

PASS: build succeeded + expected screenshots match + expected log output + cleanup confirmed
FAIL: describe what went wrong with evidence (screenshot, log line), then still clean up the launched process/session

Timing Guidelines

Action	Wait strategy
App startup	`session.sh start` readiness wait; fallback 0.5s only if `ready=false`
Warm session reuse	Prefer 0-1s `status` / resume over creating a fresh process
State-only proof	Aim for 3-10s total; no screenshot or OS focus
`show` window	0.3s macOS focus-settling delay
`setFilter`	1s sleep or waitFor stateMatch
`triggerBuiltin` (opens new view)	waitFor appropriate condition
`simulateKey` (view transition)	1.5s sleep
`simulateKey` (text input)	0.1s sleep
`captureWindow`	1s sleep (file write)
ACP context bootstrap	`waitFor(acpReady, timeout=8000)`
ACP picker open	`waitFor(acpPickerOpen, timeout=3000)`
ACP picker accept	`waitFor(acpItemAccepted, timeout=3000)`
ACP response streaming	10-20s or `waitFor(acpStatus)`

Rule: Use waitFor for all ACP state transitions. Only use fixed sleeps for macOS focus-settling (0.3s) and file I/O (1s screenshot write).

Rule: Do not add a fixed sleep 3 after session.sh start. The session wrapper is responsible for readiness. Only use the 0.5s fallback when ready=false.

Rule: If a non-visual proof is trending beyond the 3-10s state-only budget, stop and redesign around getElements, getState, waitFor, batch, exact targets, or session reuse before escalating.

Session Management

Use scripts/agentic/session.sh instead of hand-rolling mkfifo + exec 3> in ad hoc shells.

Rules:

Always use session.sh start instead of manual mkfifo + exec 3> for new verification flows
Use session.sh send for fire-and-forget stdin commands like show, triggerBuiltin, setFilter, and captureWindow
Use session.sh rpc for protocol requests that expect a typed response like getAcpState, getElements, waitFor, batch, and inspectAutomationWindow
Check session health with session.sh status or session-state.ts before sending commands
Stop sessions with session.sh stop when done — do not leave orphan processes
Treat cleanup as part of the test itself: a run is incomplete until the session is stopped and verified dead

Field Notes

These are practical lessons from real ACP verification runs in this repo.

If session.sh start reports a dead session even though the log reached STARTUP_READY, inspect the log before assuming the app crashed. In some debug runs the wrapper/forwarder dies while script-kit-gpui is still healthy. When that happens, switch to the legacy single-shell FIFO fallback so you can keep stdin open yourself.
window.ts and macos-input.ts --ensure-focus may fail against the debug binary because the process name is script-kit-gpui, not the bundled Script Kit app identity. If focus/capture helpers cannot find the app, use System Events targeting process "script-kit-gpui" directly.
For debug-only window capture, direct region screenshots via screencapture -R<x,y,w,h> can be more reliable than the bundle-oriented window resolver. Use runtime automation bounds or System Events window position/size to compute the region.
ACP pasted-text verification needs two deletes when the cursor is immediately after a newly inserted token: the first backspace removes the trailing space, the second removes the token atomically. Query getAcpState after each step so you do not misread a correct first delete as a failure.

Screenshot Assertion (verify-shot.ts)

Use verify-shot.ts for automated screenshot + state verification. It enforces the correct ACP verification order: state receipt first, screenshot second.

# Basic: capture screenshot with ACP state assertions
bun scripts/agentic/verify-shot.ts --session default \
  --label step-name \
  --acp-status idle \
  --acp-picker-closed \
  --acp-context-ready

# Assert picker is open after typing @
bun scripts/agentic/verify-shot.ts --session default \
  --label picker-open \
  --acp-picker-open

# Assert item was accepted after Enter/Tab
bun scripts/agentic/verify-shot.ts --session default \
  --label item-accepted \
  --acp-picker-closed \
  --acp-item-accepted

# State-only (skip screenshot)
bun scripts/agentic/verify-shot.ts --session default \
  --label quick-check \
  --skip-screenshot \
  --acp-input-contains "@context"

# Screenshot-only (skip state query)
bun scripts/agentic/verify-shot.ts --session default \
  --label visual-check \
  --skip-state

Available assertions:

Flag	Checks
`--acp-status STATUS`	ACP status equals value (idle, streaming, etc.)
`--acp-picker-open`	Picker overlay is visible
`--acp-picker-closed`	Picker overlay is closed
`--acp-input-contains STR`	Input text contains substring
`--acp-input-match STR`	Input text matches exactly
`--acp-cursor-at N`	Cursor at character index N
`--acp-item-accepted`	A picker item was accepted (lastAcceptedItem non-null)
`--acp-accepted-label STR`	lastAcceptedItem.label equals STR
`--acp-accepted-trigger STR`	lastAcceptedItem.trigger equals STR (@ or /)
`--acp-accepted-via KEY`	Probe confirms acceptance via enter or tab
`--acp-cursor-after-accepted N`	Probe confirms cursor landed at index N after acceptance
`--acp-context-ready`	Context bootstrap complete
`--acp-no-selection`	No text selection active (hasSelection is false)
`--acp-has-selection`	Text selection is active (hasSelection is true)
`--acp-no-permission`	No pending permission (hasPendingPermission is false)
`--acp-has-permission`	Pending permission present (hasPendingPermission is true)
`--acp-visible-start N`	inputLayout.visibleStart equals N (first visible char index)
`--acp-visible-end N`	inputLayout.visibleEnd equals N (last visible char index)
`--acp-cursor-in-window N`	inputLayout.cursorInWindow equals N (cursor position in viewport)

Exit codes: 0 = pass, 1 = assertion failure, 2 = infrastructure error.
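A wrapper script can branch on these exit codes instead of parsing output. The sketch below only maps the three documented codes to labels; `classify_verify_exit` and the label strings are hypothetical.

```shell
# Map a verify-shot.ts exit code to the outcome documented above.
classify_verify_exit() {
  case "$1" in
    0) echo "pass" ;;
    1) echo "assertion-failure" ;;
    2) echo "infrastructure-error" ;;
    *) echo "unknown" ;;   # anything outside the documented codes
  esac
}
```

Typical use is `classify_verify_exit "$?"` immediately after a verify-shot.ts invocation. Separating code 1 from code 2 keeps a real regression distinct from a harness problem such as a dead session.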

Canonical input-stability proof

Use visible-text-window assertions to verify single-line input rendering and cursor tracking without a screenshot:

bun scripts/agentic/verify-shot.ts --session default \
  --label input-stability \
  --skip-screenshot \
  --acp-visible-start 12 \
  --acp-visible-end 52 \
  --acp-cursor-in-window 39

This proves the cursor is within the visible window and the viewport bounds are stable, which catches scroll jumps, layout shifts, and cursor-out-of-view regressions.

Recipe Orchestrator (index.ts) — Preferred ACP Verification

Default Surface Proof (Preferred)

Use surface-proof as the default seconds-first proof command for an already-open product surface. For main-hosted surfaces, enter through the real runtime command and keep the proof state-first.

bash scripts/agentic/session.sh start default
bash scripts/agentic/session.sh send default '{"type":"triggerBuiltin","name":"clipboard"}' --await-parse
bun scripts/agentic/index.ts surface-proof --session default --kind main
bash scripts/agentic/session.sh stop default
bash scripts/agentic/session.sh status default

# Advanced exact-target proofs when a popup or detached surface already exists:
bun scripts/agentic/index.ts surface-proof --session default --kind promptPopup --index 0
bun scripts/agentic/index.ts surface-proof --session default --kind acpDetached --index 0

Sample output shape:

{
  "schemaVersion": 1,
  "recipe": "surface-proof",
  "status": "pass",
  "summary": "State-first main proof succeeded for main",
  "proofBundle": {
    "schemaVersion": 2,
    "scenario": "main-window-exact-id",
    "surfaceClass": "main",
    "resolvedTarget": {
      "windowId": "main",
      "windowKind": "Main"
    },
    "targetIdentity": { "stable": true },
    "usage": {
      "stateFirst": true,
      "usedGetState": true,
      "usedGetElements": true,
      "usedScreenshot": false,
      "usedNativeInput": false,
      "usedShow": false,
      "usedFixedSleepMs": 0
    },
    "capabilities": {
      "state": true,
      "elements": true,
      "nativeInputRequired": false,
      "screenshotRequired": false
    },
    "state": { "type": "stateResult" },
    "elements": { "type": "elementsResult" },
    "warnings": []
  }
}
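The usage block in this receipt is what distinguishes a state-first proof from one that fell back to screenshots or native input. A minimal check, assuming jq is installed and the receipt arrives on stdin with the field names from the sample above; `surface_proof_is_state_first` is a hypothetical name, and a missing `used*` field is treated as "not used".

```shell
# Return success only if a surface-proof receipt (read from stdin) passed and
# stayed state-first: no screenshot, no native input, no show, no fixed sleeps.
# Requires jq. Field names follow the sample output shape.
surface_proof_is_state_first() {
  jq -e '
    .status == "pass"
    and .proofBundle.usage.stateFirst
    and (.proofBundle.usage.usedScreenshot | not)
    and (.proofBundle.usage.usedNativeInput | not)
    and (.proofBundle.usage.usedShow | not)
    and .proofBundle.usage.usedFixedSleepMs == 0
  ' >/dev/null
}
```

If the recipe prints only the receipt on stdout (an assumption), it can be piped straight in: `bun scripts/agentic/index.ts surface-proof --session default --kind main | surface_proof_is_state_first`.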

Canonical ACP proof commands

# Full ACP picker accept — choose key with --key enter|tab
bun scripts/agentic/index.ts acp-accept --session default --key enter
bun scripts/agentic/index.ts acp-accept --session default --key tab --vision

# Target a specific ACP window (detached/popup) — resolve exact identity first
RESOLVED="$(bun scripts/agentic/automation-window.ts resolve --session default --kind acpDetached --index 0)"
TARGET="$(printf '%s' "$RESOLVED" | jq -c '.targetJson')"
SURFACE_ID="$(printf '%s' "$RESOLVED" | jq -r '.surfaceId')"
bun scripts/agentic/index.ts acp-accept --session default --key enter \
  --target-json "$TARGET" --surface "$SURFACE_ID" --vision

Target threading (non-negotiable for multi-window ACP)

When verifying a detached or popup ACP window, resolve one target once and reuse it for every RPC and native input step in the entire run.

Canonical rule:

Discover the surface (e.g., bun scripts/agentic/window.ts list).
Pick one --target-json object (e.g., {"type":"kind","kind":"acpDetached","index":0}).
Pass that same target to every ACP RPC: getAcpState, getAcpTestProbe, resetAcpTestProbe, waitFor, and batch.
Pass the matching --surface value to native input so focus and proof stay on the same window.
Never mix focused-window ACP RPCs with surface-targeted native input in the same verification run. This causes cross-window false proof where you drive one ACP surface and verify another.

When --target-json is omitted, RPCs default to the main ACP view (existing behavior).
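One way to keep a run honest about target threading is to build every RPC payload from a single variable. This is a sketch, not repo code: `acp_rpc_payload` is a hypothetical helper, and it assumes ACP RPCs accept a top-level `target` field in the same position that the getElements and batch examples use.

```shell
# The one target chosen for this run; every request below reuses it verbatim.
TARGET='{"type":"kind","kind":"acpDetached","index":0}'

# Hypothetical helper: build an RPC payload that always carries $TARGET.
# Assumption: the "target" field sits at the top level of the request.
acp_rpc_payload() {
  local type="$1" request_id="$2"
  printf '{"type":"%s","requestId":"%s","target":%s}\n' "$type" "$request_id" "$TARGET"
}

# Print the payloads for three ACP RPCs in the same run.
acp_rpc_payload resetAcpTestProbe p0
acp_rpc_payload getAcpState acp1
acp_rpc_payload getAcpTestProbe p1
```

Each payload would then go through `bash scripts/agentic/session.sh rpc default "$(acp_rpc_payload getAcpState acp1)"` with the matching `--expect` type. Because no call site spells out its own target, a focused-window RPC cannot slip into the run by accident.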

What acp-accept guarantees:

Resets ACP test probe before native interaction (no stale accepted items)
Uses macos-input.ts --ensure-focus for native typing and acceptance
Uses state-only checks for ACP-ready and picker-open (no intermediate screenshots)
Waits for acpAcceptedViaKey (key-specific proof, not generic acpItemAccepted)
Keeps exactly one final screenshot as visual proof
Emits vision crops only when --vision is requested
When --vision is used, surfaces the full proof bundle (with state, probe, screenshot, visionCrops) as proofBundle in the recipe receipt

Other recipes

# Check all prerequisites
bun scripts/agentic/index.ts preflight --session default

# Open ACP and verify ready state (state-only, no screenshot)
bun scripts/agentic/index.ts acp-open --session default

# Compatibility aliases (same as --key enter / --key tab)
bun scripts/agentic/index.ts acp-enter-accept --session default
bun scripts/agentic/index.ts acp-tab-accept --session default

# Hard-scenario recipes
bun scripts/agentic/index.ts acp-detached-target-threading-stress \
  --session default --kind acpDetached --index 0 --min-targets 2 --key enter --vision --json
bun scripts/agentic/index.ts acp-prompt-popup-parity \
  --session default --families mention,model-selector,local-history --json
bun scripts/agentic/index.ts notes-acp-delayed-action-origin-stress \
  --session default --drift generation --json
bun scripts/agentic/index.ts file-portal-origin-roundtrip \
  --session default --origin acp --portal file-search --selection file --query AGENTS.md --json
bun scripts/agentic/index.ts permission-privacy-preflight \
  --session default --kinds accessibility,screen-recording,microphone --json
bun scripts/agentic/index.ts shortcut-recorder-focus-capture \
  --session default --surface shortcuts --action test-agentic-shortcut --chord cmd+shift+7 --sandbox-config --json
bun scripts/agentic/index.ts template-prompt-automation-parity-stress \
  --session default --template 'Hello {{name}}' --field name --value Ada --forced-value forced-template-result --json
bun scripts/agentic/index.ts current-app-commands-frontmost-stress \
  --session default --alias 'Do in Current Command' --query 'close tab' --json
bun scripts/agentic/index.ts actions-captured-subject-frame-stress \
  --session default --source root-file --action quick-look --mutation filter-selection-cache-frame --json
bun scripts/agentic/index.ts drop-prompt-native-drop-privacy-stress \
  --session default --file-name agentic-drop.txt --size 12 --json
bun scripts/agentic/index.ts path-prompt-filesystem-edge-stress \
  --session default --json
bun scripts/agentic/index.ts screenshot-identity-acp-context-stress \
  --session default --source tab-ai-screenshot --json
bun scripts/agentic/index.ts clipboard-history-portal-range-stress \
  --session default --portal-id 'kit://clipboard-history?id=agentic' --range composer:0..0 --json
bun scripts/agentic/index.ts browser-tabs-cache-identity-stress \
  --session default --source browser-tabs --json
bun scripts/agentic/index.ts scroll-selection-reanchor-stress \
  --session default --kinds clipboard,browser-history,current-app-commands,file-search --json
bun scripts/agentic/index.ts accessibility-tree-semantic-parity-stress \
  --session default --surfaces main,actionsDialog,promptPopup --json
bun scripts/agentic/index.ts rtl-bidi-emoji-text-rendering-stress \
  --session default --surface acp-composer --text 'abc שלום 👩🏽‍💻 é مرحبا 123' --json
bun scripts/agentic/index.ts high-volume-virtualized-list-stability-stress \
  --session default --surface clipboard-history --fixture-count 5000 --filter-cycles 8 --scroll-cycles 12 --json
bun scripts/agentic/index.ts input-modality-transition-ownership-stress \
  --session default --surface main --interleave pointer-hover,keyboard-nav,trackpad-scroll,wheel-scroll,shortcut --cycles 8 --json
bun scripts/agentic/index.ts multi-context-attachment-dedupe-provenance-stress \
  --session default --origins file,screenshot,selected-text,mcp-resource,clipboard-snippet --destinations acp-composer,notes --reorder-cycles 3 --json
bun scripts/agentic/index.ts visual-contrast-readable-state-stress \
  --session default --surfaces main,actionsDialog,promptPopup,acp-composer,notes --themes light,dark --scale-factors 1,1.25,1.5 --states active,inactive,disabled,focused,error,loading --json
bun scripts/agentic/index.ts empty-error-retry-state-ux-stress \
  --session default --surfaces main,clipboard-history,emoji-picker,file-search --query 'agentic-loop-eighteen-no-results-zzzz' --json
bun scripts/agentic/index.ts form-validation-inline-recovery-stress \
  --session default --surface fields-prompt --fields email,required-text,number --invalid email:not-an-email,required-text:,number:not-a-number --valid email:ada@example.com,required-text:Ada,number:42 --json
bun scripts/agentic/index.ts navigation-back-stack-history-stress \
  --session default --origin main --surfaces clipboard-history,emoji-picker,file-search,actionsDialog --transitions triggerBuiltin,cmd-k,escape,back --json

State-only vs screenshot checkpoints

Checkpoint	Screenshot?	Probe?	Why
ACP ready	No	No	`waitFor(acpReady)` is sufficient proof; screenshot is waste
Picker open	No	No	`waitFor(acpPickerOpen)` is sufficient proof
Final accepted	Yes	Yes	The only checkpoint that needs visual + probe evidence

Rule: Intermediate checkpoints use state-only verification (--skip-screenshot --skip-probe). Only the final acceptance step captures a screenshot and queries the probe.

Receipt shape

Each recipe returns a machine-readable JSON receipt:

{
  "schemaVersion": 1,
  "recipe": "acp-enter-accept",
  "status": "pass",
  "steps": [
    { "name": "acp-open", "status": "pass" },
    { "name": "reset-probe", "status": "pass" },
    { "name": "type-at-trigger", "status": "pass" },
    { "name": "wait-accepted-via-key", "status": "pass" },
    { "name": "verify-accepted", "status": "pass" }
  ]
}
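A receipt of this shape can be gated on mechanically. The following sketch assumes jq is installed and treats any step status other than "pass" as a failure; `receipt_all_pass` is a hypothetical name.

```shell
# Return success only if the receipt (read from stdin) and every step in it
# report "pass". Requires jq.
receipt_all_pass() {
  jq -e '.status == "pass" and (.steps | all(.status == "pass"))' >/dev/null
}
```

Checking the individual steps as well as the top-level status guards against a receipt whose summary and step list disagree.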

When --vision is used, a proofBundle field is added containing the verify-shot receipt with state, probe, screenshot, and visionCrops for direct machine consumption.

The wrapper does not replace the lower-level commands — use session.sh, macos-input.ts, window.ts, and verify-shot.ts directly when you need finer control.

ACP Golden Path (Critical)

The mandatory verification flow for any ACP interaction testing. Prefer the canonical CLI (bun scripts/agentic/index.ts acp-accept) over reconstructing the manual steps below.

Canonical (one command, fully non-interactive)

bash scripts/agentic/session.sh start default
bun scripts/agentic/index.ts acp-accept --session default --key enter --vision
# The recipe returns a machine-readable JSON receipt with proofBundle.
# Parse proofBundle.state, proofBundle.probe, proofBundle.screenshot, proofBundle.visionCrops
# to verify ACP behavior programmatically, then read the written PNG for final visual confirmation.
bash scripts/agentic/session.sh stop default

Exact detached ACP proof (preferred)

bash scripts/agentic/session.sh start default
bun scripts/agentic/index.ts scenario \
  --session default \
  --scenario detached-acp-exact-id \
  --index 0
bash scripts/agentic/session.sh stop default

Canonical with target threading (detached/popup ACP)

bash scripts/agentic/session.sh start default

# Resolve exact target and surface identity once
RESOLVED="$(bun scripts/agentic/automation-window.ts resolve --session default --kind acpDetached --index 0)"
TARGET="$(printf '%s' "$RESOLVED" | jq -c '.targetJson')"
SURFACE_ID="$(printf '%s' "$RESOLVED" | jq -r '.surfaceId')"

# Inspect the resolved window to get the OS window id used for capture
INSPECTED="$(bun scripts/agentic/automation-window.ts inspect --session default --id "$(printf '%s' "$RESOLVED" | jq -r '.automationWindowId')")"
WINDOW_ID="$(printf '%s' "$INSPECTED" | jq -r '.osWindowId')"

bun scripts/agentic/index.ts acp-accept --session default --key enter \
  --target-json "$TARGET" --surface "$SURFACE_ID" --vision
bun scripts/agentic/verify-shot.ts --session default --label detached-proof \
  --target-json "$TARGET" --capture-window-id "$WINDOW_ID"
# Confirm proofBundle.state.resolvedTarget.windowKind == "acpDetached"
# Confirm captureTarget.requestedWindowId == captureTarget.actualWindowId
bash scripts/agentic/session.sh stop default

Manual (when you need finer control)

1. session start                               → session alive
2. show                                        → window visible
3. triggerBuiltin tab-ai                       → ACP opens
4. waitFor(acpReady, timeout=8000)             → context bootstrapped (deterministic)
5. focus window                                → frontmost confirmed
6. native type @ (macos-input.ts --ensure-focus) → open picker
7. waitFor(acpPickerOpen, timeout=3000)        → picker open (deterministic)
8. native Enter or Tab (macos-input.ts --ensure-focus) → accept picker item
9. waitFor(acpAcceptedViaKey, timeout=3000)    → key-specific acceptance (deterministic)
10. verify-shot.ts with --acp-accepted-via     → state + probe + screenshot proof

Key tools in the golden path:

Tool	Role
`session.sh`	Cross-shell session management, RPC, lifecycle
`macos-input.ts`	Native macOS keyboard/mouse with `--ensure-focus`
`window.ts`	Window discovery, focus, activation, quartz capture
`verify-shot.ts`	State + probe + screenshot bundle with strict capture
`automation-window.ts`	Exact ACP target-to-surface resolver
`scenario.ts`	Replayable proof-bundle scenarios for cross-window regression
`index.ts`	Orchestrator that composes all of the above correctly

State receipt before screenshot is non-negotiable. If the state says the picker is still open but the screenshot looks fine, the test must FAIL.
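The precedence rule can be written down as a tiny decision function. This is an illustrative sketch only: `final_verdict` and its "pass" inputs are hypothetical, standing in for the outcome of the state receipt and of the human or vision review of the PNG.

```shell
# Combine a state receipt and a screenshot review into one verdict.
# The state result is checked first and is never overridden by the screenshot.
final_verdict() {
  local state_result="$1" screenshot_result="$2"
  if [ "$state_result" != "pass" ]; then
    echo "FAIL"   # e.g. state says the picker is still open
  elif [ "$screenshot_result" != "pass" ]; then
    echo "FAIL"   # state is fine but the visible surface is wrong
  else
    echo "PASS"
  fi
}
```

With this ordering, `final_verdict fail pass` prints FAIL, which is the case the rule calls out: a screenshot that looks fine cannot rescue a failed state receipt.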

Any remaining sleeps in the recipes are brief macOS focus-settling delays (~300ms) with explicit comments. They are not proof of ACP state.

Verification Recipes

See references/recipes.md for named verification patterns.

Other hard-scenario recipes:

bun scripts/agentic/index.ts long-text-wrap-resize-surface-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog --widths mini,narrow,full --fixtures long-name,long-path,long-description,multiline-snippet --json
bun scripts/agentic/index.ts actions-command-discoverability-noop-stress --session default --hosts main,clipboard-history,emoji-picker,file-search,app-launcher --states actionable,disabled,no-op --json
bun scripts/agentic/index.ts dense-list-detail-preview-readability-stress --session default --surfaces file-search,sdk-reference,script-template-catalog --query agentic-loop-nineteen-preview --filter-cycles 4 --selection-cycles 8 --resize-cycles 3 --json
bun scripts/agentic/index.ts toast-notification-queue-lifecycle-stress --session default --surface main --fixtures success,duplicate,persistent,dismiss,autohide --cycles 3 --json
bun scripts/agentic/index.ts destructive-confirm-modal-safety-stress --session default --host main --fixture agentic-destructive-dry-run --paths cancel,confirm,stale-confirm --dry-run-only --json
bun scripts/agentic/index.ts loading-skeleton-progress-restoration-stress --session default --surfaces sdk-reference,script-template-catalog --fixture delayed-local --cycles 4 --json
bun scripts/agentic/index.ts icon-image-fallback-redaction-stress --session default --surfaces app-launcher,file-search,clipboard-history --fixtures missing-file,corrupt-png,private-local-path,data-uri-redacted --json
bun scripts/agentic/index.ts footer-status-persistence-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog --transitions filter,selection,cmd-k,escape,clear-filter --json
bun scripts/agentic/index.ts keyboard-hint-label-parity-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog,menuSyntaxTriggerPopup --families footer,row-accessory,tooltip,action-catalog --json
bun scripts/agentic/index.ts row-state-parity-without-pointer-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog --states selected,focused,hovered,selected-hovered --json
bun scripts/agentic/index.ts quiet-chrome-card-nesting-stress --session default --surfaces main,clipboard-history,emoji-picker,file-search,actionsDialog,promptPopup --chrome quiet --json
bun scripts/agentic/index.ts scroll-shadow-sticky-header-density-stress --session default --surfaces clipboard-history,emoji-picker,file-search,app-launcher,actionsDialog --scroll-positions top,middle,bottom --density compact,default --json
bun scripts/agentic/index.ts popup-focus-keycap-visual-semantics-stress --session default --surfaces actionsDialog,menuSyntaxTriggerPopup,confirmPrompt --json
bun scripts/agentic/index.ts reduced-motion-animation-disable-stress --session default --surfaces main,actionsDialog,menuSyntaxTriggerPopup --fixture reduced-motion --json
bun scripts/agentic/index.ts command-search-highlighting-accessory-badges-stress --session default --hosts main,actionsDialog,app-launcher,menuSyntaxTriggerPopup --query agentic-loop-twenty-three --json
bun scripts/agentic/index.ts clipboard-copy-visual-feedback-stress --session default --hosts file-search,actionsDialog,app-launcher --fixture agentic-copy-preview --pasteboard-scope fixture --no-system-pasteboard --json
bun scripts/agentic/index.ts portal-cancel-return-state-restoration-stress --session default --origins acp-composer,notes --portal file-search --query AGENTS.md --cancel-methods escape,back --fixture repo-file --no-native-picker --json
bun scripts/agentic/index.ts tooltip-hover-focus-affordance-stress --session default --surfaces main,actionsDialog,app-launcher --targets truncated-row,disabled-action,footer-button --fixture agentic-tooltips --input-modes protocol-hover,keyboard-focus --no-native-pointer --json
bun scripts/agentic/index.ts shortcut-recorder-cancel-layering-stress --session default --surface shortcuts --action test-agentic-shortcut --cancel-methods escape,cmd-w,backdrop,parent-click --input-modes protocol-key,protocol-click --sandbox-config --no-config-write --json
bun scripts/agentic/index.ts inline-popover-anchor-resize-stress --session default --families acp-slash,acp-mention,menu-syntax-colon --widths mini,narrow,full --fixture agentic-inline-popover --input-modes protocol-key,protocol-resize --no-native-input --json
bun scripts/agentic/index.ts disabled-footer-hit-target-refusal-stress --session default --surfaces drop-prompt,fields-prompt,path-prompt --fixtures empty-drop,invalid-fields,missing-path --input-modes enter,footer-shortcut,protocol-footer-click --no-native-pointer --dry-run-only --json
bun scripts/agentic/index.ts mini-full-transition-layout-continuity-stress --session default --surfaces main,mini-prompt,fields-prompt,actionsDialog --transitions mini-to-full,full-to-mini,hide-show,return-to-origin --fixture agentic-mini-full-layout --input-modes protocol-key,protocol-resize --no-native-input --no-native-pointer --no-system-pasteboard --local-fixture-only --json
bun scripts/agentic/index.ts filter-input-decoration-chip-layout-stress --session default --surfaces main --queries 'f: AGENTS.md,c: agentic,~/script,:actions,;note,!command,literal\\:chip' --widths mini,narrow,full --scale-factors 1,1.25,1.5 --fixture agentic-filter-input-decorations --input-modes protocol-set-filter,protocol-resize --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --local-fixture-only --json
bun scripts/agentic/index.ts focus-ring-viewport-integrity-stress --session default --surfaces main,actionsDialog,fields-prompt,path-prompt --fixture agentic-focus-rings --input-modes protocol-key,simulate-gpui-event --steps tab,shift-tab,up,down,escape --no-native-input --no-native-pointer --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts warning-banner-action-dismiss-semantics-stress --session default --surface main --fixtures warning,actionable,dismissible,error --input-modes protocol-hover,protocol-click,protocol-key --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --local-fixture-only --json
bun scripts/agentic/index.ts select-prompt-multiselect-keyboard-state-stress --session default --surface select-prompt --fixture agentic-multiselect --choices 24 --selection-steps space,cmd-a,filter-preserve,clear-filter,range-toggle,escape-restore --input-modes protocol-key,batch --no-native-input --no-native-pointer --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts file-search-preview-sanitization-stress --session default --surface file-search --fixture agentic-safe-preview --preview-fixtures text,binary,large-text,missing-file,private-path,unsupported-kind --selection-cycles 8 --filter-cycles 4 --input-modes protocol-set-filter,protocol-key,batch --no-native-input --no-native-pointer --no-native-picker --no-quick-look --no-system-pasteboard --local-fixture-only --json
bun scripts/agentic/index.ts hotkey-prompt-transient-capture-cancel-stress --session default --surface hotkey-prompt --fixture agentic-transient-hotkey --chords cmd+shift+7,ctrl+space --cancel-methods escape,cmd-w --input-modes protocol-key,simulate-key --no-native-input --no-native-pointer --no-config-write --no-global-hotkey-registration --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts process-manager-sort-detail-panel-stability-stress --session default --surface process-manager --fixture agentic-process-table --sort-keys name,cpu,memory,pid --selection-cycles 8 --filter-cycles 4 --input-modes protocol-click,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-process-kill --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts env-prompt-redacted-status-error-recovery-stress --session default --surface env-prompt --fixture agentic-env-status --status-fixtures missing-secret,parse-error,masked-existing,valid-edit --input-modes protocol-set-input,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --no-secret-write --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts command-palette-breadcrumb-route-stack-stress --session default --host main --fixture agentic-actions-breadcrumbs --drill-path parent-action,child-action --filter 'switch' --back-methods escape,breadcrumb-click --input-modes protocol-key,protocol-click,batch --no-native-input --no-native-pointer --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts root-source-chip-action-semantics-stress --session default --queries 'f: AGENTS.md,c: agentic,n: welcome,-c: noise' --actions remove-chip,clear-all,toggle-exclude,open-chip-actions --input-modes protocol-click,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-config-write --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts recent-history-dedupe-root-grouping-stress --session default --fixture agentic-root-recents --sources files,notes,clipboard,dictation,acp-history --query agentic-loop-29-dupe --cycles 6 --input-modes protocol-set-filter,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts inline-attachment-preview-chip-stability-stress --session default --hosts acp-composer,notes --fixture agentic-inline-attachments --origins local-file,fixture-image,fixture-text,script-resource --chip-actions focus,preview,remove,reorder,overflow --input-modes protocol-set-input,protocol-click,batch --no-native-input --no-native-pointer --no-native-picker --no-screen-capture --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts window-title-status-semantics-stress --session default --surfaces main,acp-composer,actionsDialog,promptPopup,notes --states idle,busy,error,dirty,ready --transitions triggerBuiltin,cmd-k,escape,hide-show --input-modes protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts menu-syntax-capture-validation-chip-stress --session default --fixture agentic-capture-validation --cases missing-body-date,missing-date,ready,malformed-url,unresolved-date,dynamic-schema --input-modes protocol-set-filter,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts acp-footer-activity-indicator-stress --session default --hosts acp-composer,notes --fixture agentic-acp-footer-activity --activity-fixtures context-capture,tool-call,plan-update,permission-wait,cancelled,idle-recovered --input-modes protocol-state,batch --agent-fixture scripted-local --no-native-input --no-native-pointer --no-security-prompts --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts acp-model-history-popover-visual-state-stress --session default --families model-selector,local-history --fixture agentic-acp-popover-visual-state --states idle,filtered,empty,loading,current-selection,error-recovered --selection-cycles 8 --filter-cycles 4 --input-modes protocol-set-input,protocol-key,batch --no-native-input --no-native-pointer --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
bun scripts/agentic/index.ts acp-context-insertion-preview-parity-stress --session default --sources file-search,browser-history,dictation-history,notes --destination acp-composer --fixture agentic-context-preview-parity --selection-cycles 6 --filter-cycles 4 --insert-modes protocol-accept,batch --input-modes protocol-set-filter,protocol-key,batch --no-native-input --no-native-pointer --no-native-picker --no-quick-look --no-system-pasteboard --no-network --no-submit --dry-run-only --local-fixture-only --json
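
All of the recipes above share one shape: recipe name, `--session`, recipe-specific flags with comma-joined lists, and a trailing `--json`. A minimal sketch of assembling such an argv from data follows; `buildRecipeArgs` is a hypothetical helper, not something the repo provides, and it only formats arguments without running anything.

```typescript
// A flag is a single value, a comma-joined list, or a bare switch (true).
type FlagValue = string | number | string[] | true;

// Hypothetical helper: builds the argv for one orchestrator recipe.
function buildRecipeArgs(
  recipe: string,
  flags: Record<string, FlagValue>,
  session = "default",
): string[] {
  const args = ["bun", "scripts/agentic/index.ts", recipe, "--session", session];
  for (const [name, value] of Object.entries(flags)) {
    args.push(`--${name}`);
    if (value === true) continue; // bare switch such as --dry-run-only
    args.push(Array.isArray(value) ? value.join(",") : String(value));
  }
  args.push("--json"); // every recipe above emits JSON
  return args;
}

// Reproduces the toast-notification recipe line from the list above.
const toastArgs = buildRecipeArgs("toast-notification-queue-lifecycle-stress", {
  surface: "main",
  fixtures: ["success", "duplicate", "persistent", "dismiss", "autohide"],
  cycles: 3,
});
```

Keeping the flags as data makes it easier to see which safety switches (for example `--dry-run-only` or `--local-fixture-only`) a given run carries before it is launched.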

Adjacent Skills

Use adjacent skills when the work crosses boundaries:

$testing-quality-gates for choosing narrow build/test gates.
$protocol-automation when stdin JSON, receipts, target identity, waitFor, or batch are the behavior owner.
The domain skill for the active surface, such as $acp-chat-core, $actions-popups, $keyboard-focus-routing, $launcher-surface-contracts, or $window-resizing.

Migration Notes

Key Gotchas

simulateKey does NOT go through GPUI's intercept_keystrokes(). Use triggerBuiltin for ACP Chat entry, not simulateKey Tab.
AcpChatView accepts single-char simulateKey for typing, enter for submit, w+cmd for close.
Attached popups like ActionsDialog and PromptPopup can expose targeted state snapshots even when they do not expose an independent GPUI key handle. Read state from the popup target first; only escalate to parent-window or native input when you must drive real key delivery.
simulateGpuiEvent is better than simulateKey for interceptor bugs, but handle_unavailable means the target has no usable runtime handle for that path. Treat that as a proof-design problem, not a cue to spam retries.
The app window auto-hides when focus is lost. If captures fail with "Window not found", the window was dismissed.
captureWindow filters out windows under 100x100 (tray icons).
To see the setup card, unset API keys first: unset ANTHROPIC_API_KEY.
For ACP picker testing, use native macOS input (macos-input.ts --ensure-focus) instead of simulateKey. Synthetic keys bypass GPUI's native key interception and do not faithfully exercise picker selection behavior.
Use getAcpState to verify picker acceptance, cursor landing, and input content. Do not rely solely on screenshots for ACP state verification.
Use waitFor commands via session.sh rpc for deterministic ACP state transitions. Do not use fixed sleeps as proof of ACP state.
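
The difference between a deterministic wait and a fixed sleep can be shown with a small polling loop that stops as soon as the state matches and reports a timeout otherwise. This is a conceptual sketch only: the repo's waitFor runs through session.sh rpc and its implementation is not shown here. `pollUntil`, the injected clock, and the fake state source are all hypothetical.

```typescript
interface PollResult<T> {
  ok: boolean; // true if the condition held before the deadline
  value: T; // last observed state
  elapsedMs: number;
}

// Polls probe() until done(value) holds or timeoutMs elapses.
// The clock is injected so the loop can be exercised without real delays.
function pollUntil<T>(
  probe: () => T,
  done: (value: T) => boolean,
  timeoutMs: number,
  intervalMs: number,
  clock: { now: () => number; sleep: (ms: number) => void },
): PollResult<T> {
  const start = clock.now();
  let value = probe();
  while (!done(value) && clock.now() - start < timeoutMs) {
    clock.sleep(intervalMs);
    value = probe();
  }
  return { ok: done(value), value, elapsedMs: clock.now() - start };
}

// Fake clock and fake state: the picker "opens" 200 ms in.
let fakeTime = 0;
const fakeClock = {
  now: () => fakeTime,
  sleep: (ms: number) => {
    fakeTime += ms;
  },
};
const fakeState = () => ({ pickerOpen: fakeTime >= 200 });

const opened = pollUntil(fakeState, (s) => s.pickerOpen, 3000, 50, fakeClock);
```

With the fake clock, `opened.ok` is true after 200 ms of simulated time, well short of the 3000 ms budget. A fixed sleep would either wait the full budget or return before the state had changed, and in neither case would it prove anything about the state.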
