agent-device-evidence
| name | agent-device-evidence |
| description | Records iOS/Android native MP4 evidence for test/repro flows extracted from an Expensify GitHub PR or issue. Use when the user asks to "record the flow for PR |
| allowed-tools | Bash(agent-device *) Bash(gh pr view *) Bash(gh issue view *) Bash(mkdir -p *) Bash(file *) Bash(test *) Bash(date *) Read Write |
Records iOS: Native and Android: Native MP4 evidence for the test or repro steps declared in an Expensify GitHub PR or issue. The source of truth is the test/repro steps themselves, not the surrounding code or context - the skill works equally well on a PR's ### Tests section, an issue's ## Action Performed: block, or any future Markdown body where steps are clearly authored.
Specializes the agent-device skill: delegates device lifecycle (bundle ID, Metro, device pick, session, open) to its Bring-up, then captures one artifact per declared flow per platform, writes a JSON manifest, and surfaces local file paths.
The skill is autonomous and non-interactive. It never pauses for user input mid-run. All inputs are provided at invocation time; all failures surface as structured errors with exit codes.
HybridApp-only (the parent skill's pre-flight enforces this). Standalone (non-HybridApp) builds are out of scope - production mobile evidence runs against HybridApp.
In scope: iOS: Native (iOS Simulator), Android: Native (Android Emulator), HybridApp dev build only. Inputs may come from PRs or issues - the skill does not gate on code changes.
Out of scope:

- Android: mWeb Chrome, iOS: mWeb Safari, iOS: mWeb Chrome, Windows: Chrome, MacOS: Chrome / Safari. Decline with exit 4 PLATFORM_UNSUPPORTED and point to a browser-driver skill (playwright-app-testing).
- Standalone (non-HybridApp) builds. Decline with exit 7 BRING_UP_FAILED per the parent skill's gate.
| Input | Source | Required |
|---|---|---|
| Source URL (PR or issue) | First positional arg, e.g. https://github.com/Expensify/App/pull/89475 or .../issues/89855 | Yes |
| --platforms ios,android | Flag | No (default: derived) |
| -e KEY=VALUE step-param overrides | Repeatable | No |
Bare numbers are rejected (PRs and issues share the GitHub number namespace; the URL path is the safe disambiguator). No interactive prompts.
Detect source kind from the URL: /pull/N → PR, /issues/N → issue. Anything else → exit 8 BAD_INPUT.
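The detection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual parser: the function name `detect_source`, the `BadInput` exception, and the exact URL pattern (an `https://github.com/<owner>/<repo>/...` prefix) are assumptions made for the example.

```python
import re
from typing import Tuple

EXIT_BAD_INPUT = 8  # BAD_INPUT in the exit-code table

# Assumed URL shape: https://github.com/<owner>/<repo>/(pull|issues)/<number>
_SOURCE_URL = re.compile(
    r"^https://github\.com/[^/\s]+/[^/\s]+/(pull|issues)/(\d+)(?:[/?#].*)?$"
)


class BadInput(Exception):
    """Hypothetical error type: URL is missing, malformed, or not a PR/issue URL."""

    exit_code = EXIT_BAD_INPUT


def detect_source(url: str) -> Tuple[str, int]:
    """Return ("pr" | "issue", number) for a GitHub PR or issue URL."""
    match = _SOURCE_URL.match(url.strip())
    if match is None:
        # Covers bare numbers such as "89475" as well as unrelated URLs.
        raise BadInput(f"not a recognised PR/issue URL: {url!r}")
    kind = "pr" if match.group(1) == "pull" else "issue"
    return kind, int(match.group(2))
```

Because a bare number never matches the pattern, the PR/issue ambiguity is rejected up front rather than guessed at.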
Fetch the source body:
- PR: `gh pr view <num> --json title,body`
- Issue: `gh issue view <num> --json title,body,labels`

Platform resolution - in priority order:

1. `--platforms` arg (CSV, wins all).
2. `### Tests` body - iOS only, Android only, On iOS:, On Android:.
3. `## Platforms:` checkbox list. Filled boxes denote where the bug reproduces; restrict to the matching native platforms.
4. Otherwise, ios and android.

Aliases: iOS: Native ≡ iOS: App (both → ios); Android: Native ≡ Android: App (both → android). All mWeb / Windows / MacOS variants are out of scope.
If the only platforms matched are out of scope (e.g. an issue checks only MacOS: Chrome / Safari), exit 4 PLATFORM_UNSUPPORTED.
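As a rough illustration of the priority order, the sketch below resolves platforms from the three inputs. The names `resolve_platforms`, `PlatformUnsupported`, `tests_body`, and `checked_boxes` are hypothetical, and the plain substring match on the Tests body is a simplifying assumption; the real parsing rules live in references/steps-parsing.md.

```python
from typing import Iterable, List, Optional

EXIT_PLATFORM_UNSUPPORTED = 4
SUPPORTED = ("ios", "android")

# Aliases from the text: "Native" and "App" labels map to the same platform.
ALIASES = {
    "ios: native": "ios",
    "ios: app": "ios",
    "android: native": "android",
    "android: app": "android",
}


class PlatformUnsupported(Exception):
    """Hypothetical error type for exit 4 PLATFORM_UNSUPPORTED."""

    exit_code = EXIT_PLATFORM_UNSUPPORTED


def _ordered(found: Iterable[str]) -> List[str]:
    found = set(found)
    return [p for p in SUPPORTED if p in found]


def resolve_platforms(
    platforms_arg: Optional[str] = None,
    tests_body: str = "",
    checked_boxes: Iterable[str] = (),
) -> List[str]:
    # 1. An explicit --platforms CSV wins over everything else.
    if platforms_arg:
        requested = {p.strip().lower() for p in platforms_arg.split(",") if p.strip()}
        picked = _ordered(requested)
        if not picked:
            raise PlatformUnsupported(f"unsupported platforms: {platforms_arg}")
        return picked
    # 2. Hints in the Tests body (assumed here to be plain substring matches).
    hinted = set()
    if "iOS only" in tests_body or "On iOS:" in tests_body:
        hinted.add("ios")
    if "Android only" in tests_body or "On Android:" in tests_body:
        hinted.add("android")
    if hinted:
        return _ordered(hinted)
    # 3. Checked boxes from the Platforms list, mapped through the aliases.
    checked = [label.strip().lower() for label in checked_boxes]
    if checked:
        mapped = {ALIASES[label] for label in checked if label in ALIASES}
        if not mapped:
            raise PlatformUnsupported("only out-of-scope platforms are checked")
        return _ordered(mapped)
    # 4. Nothing narrowed the choice: run both native platforms.
    return list(SUPPORTED)
```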
Steps parsing - extract the steps section and produce a flow list (see below). If the flow list is empty, exit 3 NO_FLOWS.
See references/steps-parsing.md.
Simple map: flow steps → .ad script. If the steps haven't changed, reuse the cached script and skip the warm-up.

- Location: ~/.cache/agent-device-evidence/.ad-cache/<fingerprint>.ad
- Fingerprint: sha256(precondition + json(steps) + platform). Platform is included so iOS and Android don't share an entry (different selectors).
- On hit: copy the cached script to $TEST_FLOW.ad, mark cached: true in the manifest, skip Phase 1, proceed to Phase 2.

The skill does not delete, invalidate, or retry cache entries. If a cached .ad is stale, the flow is marked phase2_failed. To recover, edit the steps (which changes the fingerprint) or wipe ~/.cache/agent-device-evidence/.ad-cache/ externally.
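One way to compute such a fingerprint is sketched below. The formula itself comes from the text; the function name `flow_fingerprint`, the compact JSON encoding of the step list, and plain string concatenation without separators are assumptions, since the exact serialization is not specified here.

```python
import hashlib
import json
from typing import List


def flow_fingerprint(precondition: str, steps: List[str], platform: str) -> str:
    """sha256(precondition + json(steps) + platform) as a hex digest."""
    # Pin the JSON options so the same steps always serialize identically.
    steps_json = json.dumps(steps, separators=(",", ":"), ensure_ascii=False)
    payload = precondition + steps_json + platform
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Editing any step, the precondition, or the platform yields a different digest, which is why editing the steps is enough to bypass a stale entry.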
Two phases per flow. Lifecycle delegated to the parent skill's bring-up. Phase 1 is skipped on cache hit (see above).
Run the agent-device bring-up for the target platform. The parent skill resolves bundle ID, starts Metro, picks/confirms the device, manages session, and opens the app for sanity verification. Capture the resolved $APP_ID (bundle ID) and $DEVICE_NAME for re-opens in Phases 1 and 2.
- On bring-up failure, exit 7 BRING_UP_FAILED and surface the parent skill's error verbatim.
- flows/README.md.
- Device choice follows `agent-device devices --json` order, deterministically. Log the choice in the manifest under device_selected.
- Always reset sessions not created in the current invocation - run `agent-device close --shutdown --session <name>` without prompting. Phase 1 and Phase 2 both rely on cold starts, so reuse of stale sessions is never desired here.

Close the bring-up session so each phase starts cold:
agent-device close
Set up run directory - persistent, append-only:
SOURCE_KIND=<pr|issue>; SOURCE_NUM=<num>; RUN_TS=$(date -u +%Y%m%dT%H%M%SZ)
RUN_DIR="$HOME/.cache/agent-device-evidence/$SOURCE_KIND-$SOURCE_NUM/$RUN_TS"
mkdir -p "$RUN_DIR/ios" "$RUN_DIR/android"
Goal: produce a deterministic .ad script of the successful command sequence, plus per-step still candidates. Drives autonomously from cold start. No recording.
Skip if cached. Before any device work, consult the Phase 1 cache. On hit, copy the cached .ad to $TEST_FLOW.ad, mark cached: true in the manifest, and proceed straight to Phase 2.
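The cache check can be pictured as a small copy-on-hit helper. This is a sketch under assumptions: `restore_cached_script` and its parameters are hypothetical names, and the manifest entry is modelled as a plain dict rather than the schema in references/manifest-schema.md.

```python
import shutil
from pathlib import Path


def restore_cached_script(
    cache_dir: Path, fingerprint: str, test_flow_ad: Path, manifest_entry: dict
) -> bool:
    """Copy <fingerprint>.ad to the flow's script path on a cache hit."""
    cached = cache_dir / f"{fingerprint}.ad"
    if not cached.is_file():
        return False  # cache miss: the caller runs Phase 1 as usual
    test_flow_ad.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cached, test_flow_ad)
    manifest_entry["cached"] = True  # recorded so the manifest reflects the skip
    return True
```

The helper only reads from the cache, in keeping with the rule that cache entries are never deleted or invalidated by the skill.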
On cache miss:
Open the app with the bring-up's resolved values:
agent-device open "$APP_ID" --device "$DEVICE_NAME"
Drive setup actions based on the flow's Precondition: block (if any) and what the steps imply. Setup actions go into the .ad script up to the marker; everything after the marker is what Phase 2 records.
Drive the test flow - one numbered step at a time. For each step:
- Append the successful action to $TEST_FLOW.ad. Do not append actions that needed retries on different selectors.
- Record step parameters under params: in the manifest.
- Each `agent-device snapshot` (taken for selector matching) is saved as a candidate still - flow-<id>-step-<n>-<label>.png. Free side-effect.

Verify final state - agent-device is exists "<selector>" on the post-condition implied by the last step.
Close session - agent-device close.
Sanity-check the script is non-empty:
# record_status is a hypothetical helper that writes the per-flow status to the manifest
test -s "$TEST_FLOW.ad" || { record_status "phase1_failed: empty script"; continue; }
Write to cache - on success, copy $TEST_FLOW.ad to ~/.cache/agent-device-evidence/.ad-cache/<fingerprint>.ad and write the meta sidecar.
Goal: clean MP4 of only the test-flow steps. No snapshots on camera, no retries, no LLM thinking time.
Open the app fresh with the bring-up's resolved values:
agent-device open "$APP_ID" --device "$DEVICE_NAME"
Replay setup silently - everything in the .ad script up to the marker. Off-camera. The app reaches the test starting state.
Start recording:
agent-device record start "$RUN_DIR/$PLATFORM/flow-$ID.mp4" --fps 24
Android: `adb screenrecord` has a 3-min hard cap. Per-flow MP4s rarely hit this; if a flow exceeds it, mark `status: phase2_failed` and continue.
Replay test-flow portion:
agent-device replay "$TEST_FLOW.ad" --from-marker
Stop recording:
agent-device record stop
Close session - agent-device close.
Verify artifact:
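# record_status is a hypothetical helper that writes the per-flow status to the manifest
test -s "$RUN_DIR/$PLATFORM/flow-$ID.mp4" && file "$RUN_DIR/$PLATFORM/flow-$ID.mp4" \
  || { record_status "phase2_failed"; continue; }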
On Phase 2 replay failure: mark the flow phase2_failed and continue to the next flow.
Multiple flows in one PR share a single Phase 2 session (one agent-device open + replay-to-marker), with record start / record stop per flow. State carries between flows unless Phase 1 flagged requires_cold_start: true for a flow, in which case Phase 2 closes and re-opens before that flow.
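The session-sharing rule can be made concrete with a small planner that turns a flow list into an ordered action sequence. This is only a sketch: `plan_phase2` is a hypothetical name, the flows are modelled as dicts with `id` and `requires_cold_start` keys, and the action strings are labels for illustration rather than literal agent-device commands.

```python
from typing import Dict, List


def plan_phase2(flows: List[Dict]) -> List[str]:
    """Return the ordered Phase 2 actions for a list of flows."""
    actions: List[str] = []
    session_open = False
    for flow in flows:
        # A flow flagged by Phase 1 forces a close and re-open before it runs.
        if session_open and flow.get("requires_cold_start"):
            actions.append("close")
            session_open = False
        if not session_open:
            actions += ["open", "replay-to-marker"]
            session_open = True
        flow_id = flow["id"]
        actions += [
            f"record start flow-{flow_id}",
            f"replay flow-{flow_id}",
            "record stop",
        ]
    if session_open:
        actions.append("close")
    return actions
```

For two flows where only the second sets `requires_cold_start`, the plan contains two `open` actions; without the flag it contains one, and state carries over between the recordings.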
For flows classified kind: still, the skill runs `agent-device screenshot` and writes flow-<id>.png instead of recording. No record start/stop.

~/.cache/agent-device-evidence/
├── .ad-cache/                          # cross-source Phase 1 cache (see "Phase 1 cache")
│   ├── <fingerprint>.ad
│   └── <fingerprint>.meta.json
└── <source-kind>-<source-num>/         # per-source run output, e.g. pr-89475/ or issue-89855/
    └── <run-ts>/
        ├── manifest.json
        ├── ios/
        │   ├── flow-1.mp4
        │   ├── flow-1-step-2-tap-signin.png
        │   ├── flow-2.png              (still-only flow)
        │   └── ...
        └── android/
            └── ...
Run output is persistent across reboots and append-only - the skill never deletes prior runs or cache entries.
See references/manifest-schema.md.
After all platforms, the skill prints the run directory and lists per-flow paths. The user drags each artifact into the PR's ### Screenshots/Videos section (or attaches to the issue, depending on source). The skill never edits the source.
| Code | Meaning |
|---|---|
| 0 | All applicable flows produced an artifact (or the run was best-effort with at least one usable artifact; per-flow status reflects reality). |
| 3 | NO_FLOWS - steps section unparseable or empty after stripping. |
| 4 | PLATFORM_UNSUPPORTED - mWeb / Desktop / Windows requested or only out-of-scope platforms checked on the source. |
| 5 | PHASE1_TOTAL_FAILURE - every flow failed Phase 1. |
| 6 | PHASE2_TOTAL_FAILURE - every flow failed Phase 2 despite Phase 1 success. |
| 7 | BRING_UP_FAILED - parent skill bring-up failed (missing dev build, HybridApp gate, Metro start, simulator boot, etc.). Parent error is surfaced verbatim. |
| 8 | BAD_INPUT - source URL is missing, malformed, or not a recognised PR/issue URL. |
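To show how per-flow results might roll up into the run-level codes 0, 3, 5, and 6, here is a hedged sketch. The function name `run_exit_code` and the `"ok"` success status are hypothetical, and the handling of a mixed run (some flows fail Phase 1, the rest fail Phase 2, no artifact at all) is an assumption: the table does not cover that case, and this sketch reports it as 6.

```python
from typing import List

EXIT_OK = 0
EXIT_NO_FLOWS = 3
EXIT_PHASE1_TOTAL_FAILURE = 5
EXIT_PHASE2_TOTAL_FAILURE = 6


def run_exit_code(statuses: List[str]) -> int:
    """Map per-flow statuses to a run-level exit code."""
    if not statuses:
        return EXIT_NO_FLOWS
    # Best-effort success: at least one flow produced a usable artifact.
    if any(status == "ok" for status in statuses):
        return EXIT_OK
    # Statuses may carry a reason suffix, e.g. "phase1_failed: empty script".
    if all(status.startswith("phase1_failed") for status in statuses):
        return EXIT_PHASE1_TOTAL_FAILURE
    # Assumption: any other all-failed run is reported as a Phase 2 failure.
    return EXIT_PHASE2_TOTAL_FAILURE
```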
| Cap | Value |
|---|---|
| Phase 1 timeout | 5 min per flow |
| Phase 2 timeout | 3 min per flow (Android cap) |
| Max driver actions | 50 per flow |
Hitting any cap marks the flow phase1_failed / phase2_failed and proceeds to the next flow.
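A cap check along these lines could run after each driver action. The values come from the table; the name `cap_status`, the strict greater-than comparison, and applying the action cap in both phases are assumptions made for this sketch.

```python
from typing import Optional

PHASE_TIMEOUT_S = {1: 5 * 60, 2: 3 * 60}  # per-flow timeouts from the caps table
MAX_DRIVER_ACTIONS = 50  # per flow


def cap_status(phase: int, elapsed_s: float, driver_actions: int) -> Optional[str]:
    """Return the failure status if a cap was exceeded, else None."""
    if elapsed_s > PHASE_TIMEOUT_S[phase] or driver_actions > MAX_DRIVER_ACTIONS:
        return f"phase{phase}_failed"
    return None
```

A non-None result marks the current flow as failed; the caller then moves on to the next flow rather than aborting the run.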
See references/error-handling.md.
The skill must not attempt any of the following. If a request implies one of these, refuse or delegate.
- Browser and desktop platforms (iOS: mWeb Safari, Android: mWeb Chrome, MacOS: Chrome / Safari) - belong in playwright-app-testing or a future browser-driver skill. Exit 4 PLATFORM_UNSUPPORTED.
- Calling `agent-device metro prepare`, `xcrun simctl`, or `is-hybrid-app.sh` directly.