| name | web-asset-capture |
| description | Explore a website (public OR behind a login) and capture high-quality assets —
retina stills and crisp motion video (.mp4; .webm opt-in) — into the asset library, by
driving a REAL Chrome attached over CDP. Use when: (1) an agent needs UI footage/
stills from a page, especially an authenticated dashboard/app/console, (2) someone
says "capture my app / grab the dashboard / record this flow", (3) you need
UI-as-texture material (FloatingScreen / CaptureSurface / screen-rec) for a video.
The agent EXPLORES first (it does not assume what's on the site), reports an
inventory, then captures what the designer chooses. Never one-shot.
|
Web Asset Capture (agentic)
Two layers, one clean split. The agent owns judgment; scripts own mechanics.
- Explore + decide = Playwright MCP — the model's native browser tools
(
browser_navigate, browser_snapshot, browser_click, browser_take_screenshot,
…). The agent wanders an unknown app, reads accessibility snapshots, finds what
exists. ALL navigation judgment lives here.
- Capture =
scripts/capture.mjs — deterministic. Retina stills + crisp
screencast → x264 CRF18 motion, written to the asset library + a single-writer
catalog. MECHANICS only — zero judgment.
Both drive the same real Chrome — a shared, logged-in profile attached over CDP —
so the agent sees exactly what you see, login included.
The browser: one shared logged-in profile, attached over CDP
- Profile:
/Users/tk/.capture-chrome — repo-agnostic; log ALL your accounts into
it once. Not /tmp (survives reboot).
- It runs as real Google Chrome with
--remote-debugging-port=9222
(CDP endpoint http://127.0.0.1:9222). It's a NON-default profile, so Chrome v136+
honors the debug flag (it silently ignores it on the Default profile).
- Playwright MCP is wired to it (
--cdp-endpoint http://127.0.0.1:9222, project-
local scope). capture.mjs attaches to the same endpoint. They share the session.
- Bring it up:
node scripts/capture.mjs ensure → {"endpoint":"…","running":true}.
Idempotent — launches the shared Chrome only if the port is down.
- Re-auth (the one manual touch): sessions eventually expire. When a probe lands on
a login wall, run
node scripts/capture.mjs ensure --reauth --url <loginUrl> — it
opens a HEADED window; log in once (real Chrome, so Google SSO works); the session
persists on disk. NEVER capture a login wall as if it were content.
Why CDP-attach and not a fresh headless launch: on macOS a freshly-launched
Playwright Chrome can't get Keychain access to decrypt Chrome's cookies → it loads
logged-OUT, even on a profile that's logged in. The real, user-launched Chrome holds the
live decrypted session; we attach to it (connectOverCDP(..., {noDefaults:true}) — Chrome
148 rejects the default Browser.setDownloadBehavior attach call, so noDefaults is
required). Full rationale + the rejected alternatives:
shared/research/playwright-logged-in-session-reuse-2026-06-01.md.
What to capture & how much (the judgment layer)
This is the part the agent OWNS — read it, interpret it, decide. It is guidance, not a
schema; nothing here is hard-coded in capture.mjs. The DOM tells you what's available;
the video's intent tells you what's needed. Capture is demand-driven, not supply-driven
— every asset must earn its place by expressing a beat. If nothing in the edit will use it,
don't capture it.
Two inputs to the decision:
- Supply (DOM, via MCP snapshot) — what this specific site actually offers: a hero, a
card grid, a diagram, a chart, console views.
- Demand (the video's intent) — what the piece is trying to say (its claims / narrative
spine / audience). For a real video, ask for this. With no script yet, propose a DEFAULT
expressive manifest from the DOM + the roles below, then prune against intent. DOM alone
gets you a candidate; intent prunes it.
Asset roles — what each is good for (soft taxonomy; map the page's surfaces onto these):
- Hero / establishing — "this is the product." ~1.
- Atomic proof unit — variety + specificity (a single real card / row). Capture a
representative handful (6–12) of the recognizable ones, never the full set.
- Scale shot — magnitude ("86 models") — the wide grid that reads as "a lot." ~1.
A plain
still --selector of a tall virtualized grid UNDER-MOUNTS: it captures the
element's full bounds but only the rows that mounted during settle (the tokenrouter
models wall left ~2/3 whitespace). For a scale shot or the full set use cards
(scroll-mounts each step) or still --scroll-y (mount below-fold rows first).
- Diagram / concept — the mental model. 1–2.
- Data / evidence — one capture per distinct quantitative claim (chart, stat, number).
- Flow / motion — how it works / what it feels like (a real action / scroll / zoom).
One per process beat.
- State / detail — an emphasis close-up (hover, selected, single stat) — only when a
beat lands on it.
How much: derived from the beats, not from the page. For repeating sets the rule is
representative, not exhaustive — enough to land the claim + a small margin.
Extract vs skip: skip site chrome that won't express anything — top nav bands (and their
PII), cookie bars, footers, breadcrumbs, pagination chrome, empty states. A surface is worth
capturing only if you can name the beat it serves.
Capture sequencing: issue the captures sequentially — compose a sequence of
deterministic capture.mjs calls from the manifest (one still per element, one cards per
grid, one record per motion beat). No parallel multi-tab: cards already amortizes the
attach over a whole grid, stills are cheap, and motion must be sequential (screencast is
per-page and x264 is CPU-bound — parallel records contend and corrupt timing). Different views
= re-navigate the same tab or a fresh call.
Downstream cropping is deferred (open): deliver clean atomic assets and let
reframing / zoom / masking happen at compose time in Remotion, per scene, where the scene
knows how it wants to use the asset. Whether any capture-side secondary crop is needed gets
decided only after a full end-to-end loop has run.
The loop: explore → anchor → capture-by-selector → ONE contact-sheet QC
Anchor-first. Coordinates come from the DOM, never from eyeballing pixels. Resolve
each asset to a stable selector, then let locator.screenshot() auto-scroll to it and
crop AT its rendered bounds — pixel-exact, device-scaled, one shot per element. Vision is
QC-only: a SINGLE contact sheet, read once. This dissolves the old full-capture → magick-
regrid → Read → re-grid → crop-by-guessed-pixels loop (that loop stalled the agent at the
600s watchdog and produced DPR/pitch artifacts). Brief:
shared/research/efficient-element-asset-capture-2026-06-01.md.
- Ensure the browser:
node scripts/capture.mjs ensure.
- Explore + anchor with Playwright MCP — do NOT assume what's on the site:
browser_navigate to a page; let it settle.
browser_snapshot → the accessibility tree with stable [ref=eN] ids
(10–100× cheaper than screenshots). Read it to see what exists; act by ref,
never from pixels.
- Resolve each target to a stable selector: prefer a
data-* / role / aria
hook; fall back to the repeating grid-child class (e.g. .semi-card.cursor-pointer
on the tokenrouter wall). For a repeating grid, infer ONE selector from a couple of
sample cards, then replicate to all (CherryPick pattern). Avoid fragile positional
selectors (.nth(8)); durable anchors survive re-render.
- When several siblings share a class (e.g. one
.semi-card of five), anchor by a
unique descendant via :has() — never by text or :nth. The dashboard chart card
resolves cleanly as
.semi-card:has(.semi-card-header-wrapper-title > div.lg\:justify-between).
Cheap-revalidate the disambiguating selector with capture.mjs check (expect
count:1) before trusting it.
browser_click / scroll / type by ref to discover states (modals, tabs, empty vs
populated). Re-snapshot after every navigation — refs go stale on re-render.
- Enumerate nav/link roles to build a site map of reachable views. Report an
inventory: "here's what exists + the selector for each." Nothing is captured
one-shot.
- Reuse first:
node scripts/capture.mjs assets [--source <name>] [--query <substr>]
reads the catalog so you don't re-shoot what you already have.
- Capture by anchor (deterministic):
- Single element, retina still:
node scripts/capture.mjs still --url <u> --selector <css> --source <name> --id <assetId>
(element-exact locator.screenshot; animations frozen; ~retina. --clip x,y,w,h /
--full / viewport remain as fallbacks for canvas / non-anchorable UI only.)
- A whole repeating grid in ONE attach:
node scripts/capture.mjs cards --url <u> --selector <css> [--scroll-selector <css>] [--scroll-step <px>] [--max <n>] [--no-contact] --source <name> --id-prefix <p>
captures EVERY match of --selector, amortizing one CDP attach over all N cards.
For virtualized grids it steps --scroll-selector (the inner overflow container,
not the window) in --scroll-step increments and collects newly-mounted matches
each step, deduped by innerText hash. Each card → <id-prefix>-NN.png cropped at its
measured boundingBox() (recorded as the catalog bbox — evidence, not a guess) and
all crops are montaged into <id-prefix>-contact.png.
- Crisp motion:
node scripts/capture.mjs record --url <u> --seconds <n> --motion <scroll|into-view|hold|zoom> [--target <css>] [--logged-out] [--webm] --source <name> --id <assetId>
(emits <id>.mp4, retina, CRF18 two-pass. --logged-out records via a FRESH headless
browser — a clean public-page session with NO logged-in chrome — for a PUBLIC marketing
shot, still retina (logged-in is the default; see Auth/data sensitivity). --webm ALSO emits a VP9 <id>.webm —
default OFF: no visible gain, larger + ~2x encode; Remotion decodes the mp4 fine.)
- All upsert
shared/asset-library/<source>/catalog.json (single writer → never
drifts; preserves any curated views[]).
- QC via ONE contact sheet (project habit — never "watch the video", and never N
per-card Reads):
cards hands you a single montaged <id-prefix>-contact.png. Read it
ONCE and reject bad cells. For a still/record, read the one captured still / first-
mid-last frames. Right view? Crisp text (not VP8 mush)? No PII band? A clean exit that
landed on a login wall or empty state is still a FAIL.
Site memory (site-map.json)
A per-site learned anchor cache the agent reads/writes —
shared/asset-library/<source>/site-map.json (data, not code). Read it FIRST, before
opening Playwright MCP. It records, per view: the URL, each anchor's selector + role +
count/bbox + confidence + lastVerified, plus nav (label→url) and piiZones (the
top nav/header band that holds account email + balance — anchor BELOW it).
- Reuse high-confidence anchors WITHOUT re-snapshotting. Only explore views/anchors
that are unknown or stale.
- Cheap-revalidate before trusting:
node scripts/capture.mjs check --url <u> --selector <css> (token-free probe → {count, firstBbox, visible, ok}). Confirm
ok:true (and count matches) before you capture against a cached selector.
- Snapshot once PER PAGE and extract ALL that page's anchors in one pass — not one
snapshot per asset.
- Write new/updated anchors back, evidence-stamped: selector, role, count/bbox,
lastVerified. On drift (a check returns ok:false or a changed count), re-explore
that ONE anchor and update its lastVerified + confidence. Never fake a lastVerified.
- Goal: confirm less each run. The site-map is a cache, not a source of truth.
Storage layout — one folder per product (reproducible contract)
Every product gets ONE home: shared/asset-library/<product>/, and capture.mjs writes
there by role automatically — stills/, cards/, motion/, plus _contact/ (QC
sheets) — with catalog.json (single-writer machine inventory) and site-map.json (anchor
cache) at the root. This layout IS the contract: re-running the node on the same site
reproduces the same structure and the same kinds of assets — the role folders and catalog
shape don't change run to run.
The product folder is the home for all of that product's surfaces, not just captures.
Generated / downstream surfaces live here too as sibling folders — e.g. frames/ for
Remotion-rendered frames, processed/ for any compose-time crops/masks. Same folder,
different surfaces. Archived pre-anchor experiments go under _legacy/ (moved, never
deleted; never cataloged).
Efficient crop (anchor-first) — why this, not the regrid loop
Per the brief (shared/research/efficient-element-asset-capture-2026-06-01.md): stop
asking vision for coordinates. locator.screenshot() auto-scrolls to the element and
clips at its true rendered bounds — pixel-exact, cropped at capture, no grid overlay, no
eyeballing. Measuring each card's live boundingBox() and clipping per-card (with a
pinned deviceScaleFactor) kills the DPR / fractional-layout artifacts (the 1190≠1082
pitch bug, the banner-shift) at the source. Batch all matches through ONE attach (cards)
so N cards share one connect — the whole point of the verb. Then ONE vision pass over the
contact sheet replaces N regrids + N Reads. Set-of-mark / "ask the VLM for boxes" is the
documented failure mode (54% hallucinated boxes on dense pages) — don't reintroduce it.
Motion: when it beats a still, and which preset
- Most "motion" is the camera, not the site — and that's honest, we supply it:
scroll — slow eased pan down a page (or to --target).
into-view — bring --target to center, then dwell.
zoom — gentle push toward --target.
hold captures the page's OWN in-place animation. Use it only when the element
actually animates — verify first (a 1–2s hold test, then look). If the page is
static, do NOT fabricate motion: use a camera move or a still, and flag it.
- Prefer a still when the value is a single settled state (a stat, a config panel, a
list). A few decisive frames beat a long noisy scroll.
Best-practice recipes (2026 research — baked into the tools; mirror them in exploration)
- Perception: snapshot the a11y tree (interactive-only, compact), act by ref;
annotated screenshot only for icon-only / canvas / visual-state cases.
- Stabilize before capture:
domcontentloaded + a bounded ~4s networkidle wait
(console SPAs stream telemetry and never go fully idle — a hard networkidle wait burns
~45s/capture for nothing), then document.fonts.ready + in-viewport image decode (no
mid-reflow text, no blank placeholders), a known-heading wait, dismiss cookie/login
overlays, freeze animations for stills. capture.mjs does all this in navigateAndSettle.
- Retina everywhere: deviceScaleFactor 2 — pinned via CDP
Emulation on BOTH the
CDP-attached and the --logged-out launched path; stills + motion come back ~3840×2160.
- Recording is two-pass: live
x264 -preset ultrafast -crf 18 (must beat realtime so
CDP never backpressures) → deliverable mp4 (x264 CRF18). A VP9 webm is opt-in via
--webm (default OFF: no visible gain over the mp4, inconsistent size, ~2x encode time;
Remotion OffthreadVideo decodes the mp4 fine). NEVER the built-in VP8 recordVideo
(mosquito-noise text). yuv420p + color_range tv so Remotion decodes at true levels.
Auth / data sensitivity / safety
Capture the REAL, logged-in UI by default — show the true form of the data. This is the
operator's OWN product and OWN account; they see this data anyway, and empty logged-out states
make worse assets (no real rows, no real numbers). Do NOT reflexively log out, blank panels, or
anchor around the account chrome just to dodge ordinary account info.
- Fine to show (do NOT scrub): account email, name, avatar, balances, spend/usage history,
dashboard numbers, model lists. This ordinary account data is exactly what makes a demo read
as real.
- NEVER leak — mask or skip ONLY these live credentials: full API-key / secret-token
values, credit-card / payment numbers, passwords. If one is shown unmasked, frame to
exclude it, mask it, or skip that shot and flag it — never publish a working secret. (Most
consoles already mask keys, e.g.
sk-…**** — just verify before shooting.)
--logged-out is an OPTIONAL tool, not a sensitivity mandate — use it only when you
specifically want a clean PUBLIC marketing capture (the logged-out landing with Login /
Sign-Up). For anything where the real data IS the point, capture logged-in.
- Never fabricate footage — real site only. If a view is gated and the session is
dead,
ensure --reauth. If a view can't be reached, FLAG it as a gap; don't fake it.
Tooling map
scripts/capture.mjs — ensure | still | record | cards | check | assets (this skill's
mechanics). cards = batch anchor-first element capture (whole repeating grid, one
attach, one contact sheet). check = token-free selector probe (re-validate a cached
site-map anchor: {count, firstBbox, visible, ok}) — a PROBE, not a capture.
- Playwright MCP (
playwright server, --cdp-endpoint http://127.0.0.1:9222) — the
exploration/action tools the agent calls natively.
scripts/cdp-probe.mjs — minimal "attach + screenshot a URL" probe (handy sanity check).
scripts/capture-site.mjs — LEGACY public-page tool that launches its OWN browser
(probe/sections/full + storageState). For logged-in + motion, use the CDP-attach path
above; keep capture-site.mjs only for quick public-page stills.
references/video-quality-upgrade.md — the screencast→ffmpeg rationale.
references/library-and-manifest.md — asset-library layout + catalog conventions.