| name | explainer |
| description | ~60-80s explainer video for any URL — GitHub repo, product page, docs site, blog post, or launch. Canonical workflow for URL walkthroughs. Use when the user asks to "explain this URL / repo / website / product", "make a walkthrough video for [url]", "demo this site", "Loom-style explainer of [url]", "explainer for github.com/...", or "explain this product link". Drives a real browser through the URL, generates an avatar lipsync, and composites in a 1280×800 macOS Sonoma frame with a 246-pixel bottom-left avatar circle. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); other URLs use a generic page-walkthrough flow. |
| argument-hint | <url> [--focus angles] [--avatar url] [--voice id] [--live-url url] [--lipsync-provider pika|kling] [--no-captions] [--preview] |
/pika:explainer
Generate a ~60–80s URL explainer video: drive a real browser through the URL along a beat-sheet timeline, generate an avatar lipsync of the narration, and composite it all in a 1280×800 macOS Sonoma frame with a 240-pixel inner avatar (246-pixel outer including 3px white stroke ring) at canvas (20, 476) and element-targeted zoom on every mid-section beat. Works on any URL — product pages, docs sites, blog posts, launches. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); all other URLs use a generic page-walkthrough flow.
Usage: /pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]
Behavior
Defaults — fire fast, no mid-flow confirmation
- Use identity-store defaults silently for avatar / voice. Never ask "should I use your avatar?" or "which voice?" before firing. Honor explicit overrides (
--avatar, --voice) when supplied; otherwise resolve via identity_avatar_url / identity_voice_id and proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null).
- No mid-flow "type yes to proceed" gates by default. Step 5 preview is opt-in via
--preview (for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing.
- Do not solicit
--focus either. Make a confident first attempt from page structure; users re-run with --focus "X" if the angle missed.
These defaults match industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return. Account credit balance + provider failover (Step 9) are the canonical guardrails.
Local avatar images on Claude Desktop
Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as --avatar, pause Step 1 and kindly send them this — something like:
Heads up — pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar:
- Paste a URL if it's already hosted (Imgur, S3, your site) — fastest
- Attach the image file so I can upload it before generation.
When a local file arrives, convert it to a public URL with upload_asset and use the returned public_url as --avatar <url> before Step 1. Already-hosted https://... URLs work as-is and skip this entirely. If no avatar is supplied at all, the identity-store default fires.
Step 0 — Resolve URL (empty-args menu)
Strip flags (--focus, --avatar, --voice, --live-url, --lipsync-provider, --no-captions, --preview, --skip-preview, --yes) and key=value parameters from $ARGUMENTS. If what remains contains no https://... URL (or is empty / whitespace-only), print this menu verbatim as your full response, then stop and wait for the user's next message. Calling a tool here risks recording or explaining the wrong page. If $ARGUMENTS already carries a URL, skip this step silently and proceed to Step 1.
Which URL would you like me to walk through? Works on any of:
- A GitHub repo — e.g.
https://github.com/anthropics/claude-code (activates repo-aware mode: README scan + live-demo detection)
- A product page / launch page — e.g.
https://pika.art
- A docs site — e.g.
https://docs.anthropic.com
- A blog post / article URL
Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates — pass --preview if you want a 3-second lipsync sanity check first.
Reply with the URL and I'll start.
Tip: you don't need to type /pika:explainer — just say things like "walk me through ", "make a demo video of ", or "explain this repo: " and I'll fire this skill automatically.
When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.
Step 1 — Parse input + detect mode
Required: url (must be https://...).
Optional: --avatar <url> (overrides identity-store default), --voice <minimax-voice-id>, --focus "..." (editorial guidance woven into vo_text), --live-url <url> (force-supply live demo URL — GitHub mode only), --lipsync-provider <pika|kling> (defaults to pika — parrot a2v, ~2-5 min wall-clock, slightly more dramatic head motion. Pass kling for tighter face-centered output at ~5-30 min wall-clock — Kling produces minimal-head-motion presenter shots but is the long-pole stage; reserve for high-stakes renders), --no-captions (skip the Step 11 caption burn — default is captions on), --preview (opt-in to the Step 5 preview gate — ~3s lipsync of "Hi, I'm your presenter" for testing new avatar/voice combos before the long-pole render; default is no preview). --skip-preview and --yes are accepted as no-ops for backward compatibility.
Mode detection:
- GitHub mode — URL host is
github.com AND path matches /{owner}/{repo} (no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors.
- Generic-URL mode — anything else (a product page, docs site, blog post, deeper GitHub path like
/blob/HEAD/path). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.
Avatar resolution (silent — never ask the user):
- If
--avatar <url> was passed, use it.
- Else call
mcp__pika__identity_avatar_url. If non-null, use it.
- Else (fresh user, no identity avatar set yet) call
mcp__pika__generate_image once with prompt "professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting" and use the returned URL. Do not ask the user "should I generate one?" — just generate silently.
Voice resolution (silent — never ask the user):
-
If --voice <id> was passed, use it.
-
Else call mcp__pika__identity_voice_id. If non-null, use it.
-
Else pick a casual MiniMax speech-2.8-hd preset matching the resolved avatar's apparent gender:
- Female-coded avatar →
English_PlayfulGirl (warm, casual, clearly female-voiced — verified)
- Male-coded avatar →
English_Jovialman (warm, casual male)
- Unclear / gender-neutral →
English_Jovialman (default)
Determine gender from mcp__pika__identity_persona_read (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do not call analyze_media for this — it's not worth the extra ~30s round-trip. Do not ask the user.
Do NOT use English_FriendlyPerson — despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. English_PlayfulGirl is the canonical casual-female pick. Other verified-female alternates: English_Upbeat_Woman, English_LovelyGirl, English_radiant_girl.
The flow below is annotated per step: GitHub-only, Generic-only, or Both modes.
Step 2 — Read source (no MCP call)
Both modes: use Claude's WebFetch on the input URL to pull the page's main content (h1, hero section, headings, primary copy).
GitHub mode additions: also fetch top-level file tree, (best-effort) package.json / pyproject.toml, and GitHub API repo metadata via gh api repos/{owner}/{repo} for homepage, description, language, topics. Detect a candidate live_url in this priority:
- User-supplied
--live-url.
- GitHub API
meta.homepage field — set when the maintainer configured the repo's homepage in GitHub settings (matches tarball repo_analyzer.py:66-77).
package.json "homepage" field.
- First match in README of
https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*.
- Any other URL in README that the badge area / "Live Demo" / "Project Page" / "Demo" text points at. The allowlist regex above misses arbitrary custom domains (e.g.
<project>-project-page.com); when the README explicitly designates a project page, prefer that over the github.io fallback.
- GitHub Pages convention
https://{owner}.github.io/{repo} — but only if the deep tree contains a frontend signal (one of index.html, App.tsx, App.jsx, App.vue, app.py, main.py).
If no candidate resolves, the beat sheet skips beats 6–7.
Generic-URL mode: the input URL itself is the only URL the beats walk through — no live_url inference, no extra metadata fetches. Skip Step 2.5 and Step 3.0; jump straight to Step 3.
Step 2.5 — Verify live_url reachability (GitHub mode only, no MCP call)
If a candidate live_url was selected, verify it serves real content before authoring beats 6–7. Use WebFetch on the candidate and check the response:
- If the response status is 4xx / 5xx, drop
live_url to None and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages — recording that 404 page wastes ~12s of the explainer on wrong content.
- If the response renders the GitHub Pages "404 — There isn't a GitHub Pages site here." template (heuristic: response body contains
"There isn't a GitHub Pages site here"), drop live_url and skip beats 6–7.
- Otherwise, keep
live_url for beats 6–7.
This mirrors the original tarball's requests.head(live_url, timeout=6, allow_redirects=True) reachability gate.
Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)
Before authoring beats for a non-GitHub URL, WebFetch the input URL and inspect the response. This step prevents three common Generic-URL failure modes: (a) recording a captcha / bot-block page instead of content, (b) the cookie/consent banner eating the first ~3 seconds of video, (c) generic CSS selectors missing the page's actual hero / sections.
A. Bot-block / captcha detection — abort if matched:
If the response body contains any of:
"Verify you are human" / "verify you are not a robot"
"captcha" / "CAPTCHA" / "reCAPTCHA"
"403 Forbidden" / "Access Denied"
"Just a moment" + cf-chl-bypass (Cloudflare challenge)
"We're sorry, something went wrong" (Amazon-style bot block)
- A
<title> or h1 of just "Robot Check" / "Are you a robot?"
→ ABORT with a clear error to the user: "Generic-URL mode can't render this site — the page is showing a bot-detection / captcha challenge under headless Chrome. Try a different URL, or run a real-user version of the page first to verify it loads cleanly."
B. Cookie / consent-banner detection — defuse with extra_css + optional click:
Scan the response for these patterns (case-insensitive):
- IDs / classes starting with
onetrust-, truste-, cookie-banner, cookie-consent, gdpr-, consent-, cmp-
- Buttons matching
(?i)accept (all )?cookies / (?i)agree.{0,10}cookies / (?i)i (accept|agree)
- Apple-specific banner: id
ac-gdpr-banner or class as-globalfooter-curtain
- Google consent:
[role="dialog"] with text "Before you continue"
If detected, set cookie_banner_present = true. Defense in depth — the recording uses BOTH:
- CSS injection (
extra_css) in the capture_website call to hide common banners universally — even if the click below misses, the banner is visually gone.
- A
click timed_action at at_s: 0.0 against the most likely dismissal selector (extracted from the WebFetch DOM, e.g. #onetrust-accept-btn-handler, [aria-label*="Accept all" i], button[id*="accept"]).
The extra_css payload (use this verbatim — covers ~80% of consent platforms):
#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; } /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }
C. Real-DOM element identification — emit concrete selectors:
Generic CSS selectors (h1, [class*="hero"], section h2) work on semantic / well-marked-up sites but miss obfuscated class names on big-name corporate sites (apple.com uses tile-headline / as-headline-section-title, not hero-*). For each beat, prefer the actual DOM elements observed in the WebFetch:
- Read the rendered HTML/markdown WebFetch returned. Note the page's actual primary
<h1> text and class.
- Note the page's section structure (h2 headings + their parent containers).
- Note any prominent CTA / signup / pricing element.
- Emit
zoom_target.selector using the actual class or id observed, falling back to semantic structure (main > section:nth-of-type(N) h2) when class names look auto-generated (Tailwind _1a2b3c, CSS modules module__hero___xYz).
D. SPA / lazy-render detection — bump initial wait:
If the WebFetch response has fewer than 3 visible headings / minimal text content, the page may be SPA-rendered post-domcontentloaded. Emit a longer initial wait action ({type: "wait", at_s: 0.0, ms: 2500}) before any beat fires, instead of the default 600ms settle.
E. --focus is honored when supplied (do not solicit):
Without --focus, select beats from generic structure cues — proceed silently with a confident first attempt. Do not ask the user "what should I focus on?" before firing; users iterate by re-running with --focus "the X feature" if the first pass misses the angle they wanted. With --focus supplied, anchor beat selection on the phrase: uses concrete page sections that match it, ignores irrelevant marketing chrome.
Step 3.0 — Required README section scan (GitHub mode only, no MCP call)
Before authoring the beat sheet, scan the README (case-insensitive, full-text) for any of these section names. If a match is found, you must add a dedicated beat for that section in Step 3, replacing one of the generic beats 4–5 if necessary:
| README contains... | Required beat |
|---|
how it works | scroll_to that heading; zoom article h2:has(#user-content-how-it-works) |
audio layer / audio timeline | scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading |
claude code / mcp integration | scroll_to that section; zoom article pre or .highlight (terminal screenshot / code block) |
architecture / system design | scroll_to that section; zoom article h2:has(#user-content-architecture) |
features (when prominent at top) | scroll_to that heading; zoom article h2:has(#user-content-features) |
getting started / quick start / installation | scroll_to that heading; zoom article h2:has(#user-content-installation) (or the matching slug) — falls back to article pre if you want the install code block instead |
usage / examples | scroll_to that heading; zoom article h2:has(#user-content-usage) (or the matching slug) — or the first code block under it |
GitHub heading slug rule: lowercase, spaces → dashes, strip non-[a-z0-9-] characters. So "How it works" → #user-content-how-it-works, "Quick Start" → #user-content-quick-start. GitHub injects the <a id="user-content-{slug}"> anchor inside each rendered <hN>, so hN:has(#user-content-{slug}) reliably grabs the heading element across any GitHub README.
Selector contract: bbox_selector needs to be vanilla CSS that resolves via document.querySelector (capture_website runs the post-action smooth-scroll JS via page.evaluate, which uses the browser's native selector engine). Avoid Playwright extensions like :has-text("..."), text=..., or :visible: those resolve in Playwright's page.query_selector (so the bbox capture finds the element) but silently fail in the smooth-scroll's document.querySelector (so the page never scrolls to the target, and bbox.y ends up at document-Y instead of top - 60 px, which trips Step 8b's bbox.y > recording_viewport.h degenerate filter and falls back to default-position zoom). CSS Level 4 :has(...) is vanilla and supported in modern Chromium.
These sections are the highest-information visuals in most explainer-worthy repos. Missing them produces a generic walkthrough; including them gives the explainer a concrete "show, don't tell" beat. The original tarball SKILL.md flagged the first four with SPECIAL rules in the Gemini prompt; this Step 3.0 promotes them from incidental guidance to a hard requirement and adds three more high-signal headings common in OSS READMEs.
Step 3 — Author beat sheet (main thread, no MCP call)
Write a JSON array of 8–10 beats, with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words (assuming a speaking rate of 2.5 words/sec). Each beat:
{
"t_start": 0.0,
"t_end": 7.5,
"action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
"zoom_target": { "selector": "...", "description": "..." },
"vo_text": "exact words to speak — 1 to 2 conversational sentences"
}
Hard constraints (validate before emitting the beat sheet — reject the draft if any fails):
- Every beat needs all five fields:
t_start, t_end, action (with type and url), zoom_target (with selector), vo_text. Missing fields ⇒ reject and re-author. (Mirrors tarball's github_explainer.py:183-190 validation pass.)
t_start of beat 0 = 0.0; t_end[i] == t_start[i+1] (continuity).
len(vo_text.split()) / 2.5 ≈ t_end - t_start per beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten the vo_text until it fits.
- Total
t_end of last beat ≤ 80 seconds. (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load — going over 80s risks a 20-min Kling timeout.)
- Total spoken word count between 165 and 200 words.
- Every beat's
zoom_target.selector needs to be a valid CSS selector for the page that beat lands on. GitHub mode prefers GitHub-specific selectors: h1.f1, #readme, article h2, .blob-code-inner, .highlight, .octicon-star, nav. Generic-URL mode prefers robust generic selectors: h1, [role="main"], main, header, nav, .hero, .feature, section h2, [class*="cta"], [class*="hero"], button, a[href]. Selectors need to resolve on the rendered page after the beat's action settles — verify against the DOM you can see via WebFetch before emitting.
vo_text is 1-2 conversational sentences. Dev voice. No stage directions. No markdown.
action.url is a valid https://... URL when action.type == "navigate"; required.
Self-check before Step 4: verify total_words is in [165, 200] AND total_seconds (= beats[-1].t_end) is in [65, 80]. If either misses bounds, re-author the beat sheet — do not proceed to TTS. (No need to "print" anywhere — this is an internal draft validation; just reject the draft and re-author until it passes.)
Structural skeleton — GitHub mode (load-bearing for the visual contract — match origin, but Step 3.0 overrides if applicable):
- Beat 1:
navigate repo root, zoom h1.f1 (repo title), hook sentence.
- Beats 2–3:
navigate to specific source files (https://github.com/{owner}/{repo}/blob/HEAD/<path>), zoom .blob-code-inner or .highlight. Pick files that match the narration's claim — don't navigate to a file you won't talk about.
- Beats 4–5:
scroll_to README sections, zoom article h2 or #readme. If Step 3.0 surfaced required sections, replace these slots with the required ones.
- Beats 6–7 (only if
live_url survived Step 2.5): navigate to live_url, zoom nav / h1 / .hero / main / button / .feature.
- Beat 8: back to repo root, zoom
.octicon-star, outro.
Structural skeleton — Generic-URL mode:
- Beat 1:
navigate to the input URL, zoom h1 or [class*="hero"] h1 (the page's primary headline), hook sentence.
- Beats 2–3:
scroll_to the page's hero / value-prop / first feature section. Zoom .hero, [class*="hero"], [class*="feature"], or section:nth-of-type(1) h2. Pick visible elements the narration references.
- Beats 4–5:
scroll_to deeper sections — feature lists, screenshots, pricing, social proof. Zoom section h2, [class*="feature"] img, [class*="testimonial"], [class*="pricing"], or any prominent semantic element on the page.
- Beats 6–7:
scroll_to CTA / signup / demo embed. Zoom [class*="cta"], button, a[class*="button"], or [id*="signup"]. (No live-demo navigation in generic mode — the input URL IS the demo.)
- Beat 8:
scroll_to footer / closing element, zoom footer h2, footer, or back to top with h1. Outro sentence.
If --focus is supplied, weave its angles into vo_text without mutating the structural skeleton. Prefer CSS selectors over text_content in zoom_target.selector — bbox capture is selector-only (see Known gaps).
Step 4 — TTS
Call mcp__pika__generate_speech with provider: "minimax-tts", text: <full vo_text join>, optional voice_id. Capture result.audio_url (the dispatcher returns audio under audio_url, not url) and result.duration_seconds. Voice defaults to identity-store injection in plugin mode.
Stale-voice fallback detection (AGNT-231): the dispatcher retries once with the default Calm_Woman voice on Minimax status_code:2054 (voice id not found — typically a per-agent workspace pointer that Minimax auto-deleted after 7 days of inactivity). On retry success the response carries two extra fields beyond the documented schema (passthrough): voice_id_requested (the planted-but-stale id the worker tried first) and fallback_reason: "invalid_minimax_voice_id". If you see fallback_reason == "invalid_minimax_voice_id" in the response, surface a one-line note to the user along the lines of: "your registered voice expired on Minimax (auto-GC'd after 7 days of inactivity); we used the system default. Re-clone via clone_voice if you want personalization back." The render does NOT fail — it just uses the default voice — so this is informational, not a retry trigger.
Cookie-banner audio padding (Generic-URL mode with cookie_banner_present == true from Step 2.6 §B): prepend MiniMax's pause marker <#1.5#> to the text: argument before calling generate_speech. MiniMax's speech-2.8-hd honors <#N#> as N-second silence; the returned audio_url and duration_seconds include the 1.5s lead-in natively. This aligns the audio with the screen recording's cookie-dismissal +1.5s offset applied in Step 4.5.
Fallback (only if smoke-test shows the marker is ignored on this voice): call generate_speech normally, then mcp__pika__edit_audio_mix to overlay the result onto a 1.5s silent base at offset 1.5s. Then call mcp__pika__analyze_media(url=<padded_audio_url>) to probe the padded duration and rebind duration_seconds = result.duration_seconds before Step 4.5 consumes it. analyze_media is the single authoritative duration probe — do not rely on edit_audio_mix's return payload (its duration field is not contractually guaranteed).
Step 4.5 — Audio length verification + beat-sheet rescale
Applied to audio_duration_seconds post Step 4 (which includes any cookie lead-in pad). End state: beats[].t_start / t_end are absolute wall-clock seconds matching the audio playback timeline. All beats[] mutations happen here; Steps 6 and 8 are read-only consumers.
Gate 1 — Kling stall ceiling (provider cap, raw audio_duration_seconds):
If audio_duration_seconds > 90, abort and re-author the beat sheet with a tighter word budget. Kling avatar/image2video stalls past ~90s.
Gate 2 — Degenerate TTS (spoken-content length):
Compute narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0). If narration_duration < 30s, retry Step 4 once (and recompute narration_duration from the retry's audio). If the retry also returns narration_duration < 30s, abort and investigate — likely failure modes: truncated MiniMax response, silent audio, vo_text not joined correctly.
Gate 3 — Rescale:
narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)
scale = narration_duration / beats[-1].t_end
- If
scale < 0.5 or scale > 1.5, abort and re-author. Structurally broken TTS (or wildly off word budget); rescaling won't save it.
- For each beat:
beat.t_start *= scale; beat.t_end *= scale
- If
cookie_banner_present: for each beat, beat.t_start += 1.5; beat.t_end += 1.5
- Final clamp:
beats[-1].t_end = audio_duration_seconds (exact). Guarantees float equality of the invariant regardless of cookie mode or accumulated float drift.
After Gate 3 passes, emit a one-line operator log to surface the scale value for post-run diagnosis:
Rescaled beats by scale=X.XX (audio=Y.YYs, narration_duration=Z.ZZs, cookie_pad=W.Ws)
Advisory (not a gate): scale near 1.0 is ideal. scale > 1.2 means audio is meaningfully slower than predicted — visuals feel "stretched" but stay in-sync. scale < 0.85 means audio is faster — visuals feel "rushed" but in-sync. Both pass the gates; if the user reports "feels off-pace" rather than "out of sync," re-author with a tighter / looser word budget.
Step 5 — Preview gate (opt-in via --preview)
Skip Step 5 entirely by default. Proceed directly to Step 6 unless the user explicitly passed --preview — do not generate a preview, do not ask for confirmation. This matches industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return; account credit balance + provider failover are the canonical guardrails.
--skip-preview and --yes are accepted as no-ops for backward compatibility — they were the old opt-out flags.
If --preview was supplied:
-
mcp__pika__generate_speech with text: "Hi, I'm your presenter. Let's explore this repo together." → preview_audio_url.
-
mcp__pika__generate_lipsync with provider: <resolved_lipsync_provider> (defaults to pika; honor --lipsync-provider kling if supplied), image: <avatar>, audio: preview_audio_url → preview_lipsync_url (bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio — the preview's job is to confirm the avatar+voice+provider combo before the long-pole render.
-
Present to the user verbatim:
Preview ready: <preview_lipsync_url>
This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio).
Reply yes to proceed, or anything else to cancel.
-
Match ^(yes|go|proceed|confirm|y)$ (case-insensitive). Anything else → STOP, no further MCP calls.
Step 6 — Build timed_actions and record
Translate the beat sheet into capture_website timed_actions. One timed_action per beat — set bbox_selector to the beat's zoom_target.selector and capture_website captures the post-action bbox of that element internally (legacy 600 ms settle → smooth-scroll-to-top - 60 px → 1300 ms post-anim → measure, all server-side).
For each beat in order, emit one entry:
navigate beats: {type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}. The worker navigates, waits to absolute at_s + 0.6 s, scrolls bbox_selector into view, and measures the bbox — all without the caller scheduling a follow-up step.
scroll_to / hover beats: {type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}. The action's own selector drives the page scroll; bbox_selector drives the bbox measurement (it can be the same selector or different — usually the same). (capture_website has no hover; scroll-into-view is the analog.)
Do NOT prepend the eight-step intro scroll-through that the tarball ran. The lipsync audio is timed from t=0 of the beat sheet; a prepended intro shifts the screen recording forward by ~3 s while leaving the audio un-shifted, causing audio/video desync. The capture_website recording begins at t=0 with beat 0's URL already loaded — that's the orientation the tarball's intro scroll provided, minus the desync.
Call mcp__pika__capture_website:
url: <beat 0's action.url>
timed_actions: <the N-element list built above> (one entry per beat)
duration_s: ceil(audio_duration_seconds) — beats[].t_start and t_end have already been rescaled to the TTS audio timeline by Step 4.5, so duration_s is simply the audio length. The old max(...) defense against TTS overrun is no longer needed.
Generic-URL mode additions (per Step 2.6 pre-flight):
extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B> — defensive: hides common consent platforms via display: none !important; so even if the optional click misses, the banner is invisible in the recording.
- Prepend a
wait action {type: "wait", at_s: 0.0, ms: 2500} for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires.
- If
cookie_banner_present from Step 2.6 §B, also prepend a click action {type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>}. The beats[] array has already been shifted by +1.5s in Step 4.5 to account for the cookie-dismissal lag, and the TTS audio has already been padded with 1.5s of silence (Step 4); no further shifting is required here. Beat 1's timed_action.at_s reads beats[0].t_start directly, which is 1.5 in cookie mode.
- No cookie banner action needed if
cookie_banner_present == false; just the prepended wait action.
Capture video_url, recording_viewport, action_bboxes. The result returns recording_viewport: {w, h} and action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}] alongside video_url.
action_bboxes[].idx semantics: the idx field is the position in the input timed_actions array.
- GitHub mode: with one timed_action per beat,
idx maps 1:1 to beat index — Step 8 uses entry.idx directly as beat_idx.
- Generic-URL mode: the prepended
wait (and optional cookie-dismissal click) shift the array by 1 or 2. Compute beat_idx = entry.idx - prepend_count where prepend_count is 1 (wait only) or 2 (wait + click). Skip entries where beat_idx < 0 (those are the prepended setup actions, not beats).
The selector field on each entry reports bbox_selector (i.e. zoom_target.selector), not the action's own selector.
Step 7 — Browser chrome
mcp__pika__edit_browser_frame:
video_url: <Step 6 video_url>
url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)
tab_title: <30-char title> — GitHub mode: (meta.description or repo_name or "")[:30]. Generic-URL mode: the page's <title> (from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard against None/empty.
Returns framed_url (1280×800 Sonoma + chrome).
Step 8 — Build zoom_keyframes and apply
Constants:
INTRO_BEATS = 2 — gates by beat-sheet index. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).
HOLD_GAP = 0.6 — seconds of 1.0× before each zoom-in and after each zoom-out.
MIN_BEAT_DUR = 1.5 — beats shorter than this are skipped (no room for a meaningful zoom).
SCALE = 1.35 (precise element-targeted zoom).
FALLBACK_SCALE = 1.25 (default-position fallback when no usable bbox).
FALLBACK_RAMP = 0.4.
Note: beats[].t_start / t_end were rescaled (and cookie-shifted if applicable) to the audio timeline by Step 4.5. HOLD_GAP (0.6s), MIN_BEAT_DUR (1.5s), and the 1.0s interior-interval check all operate on those final values — they are real visual seconds on the rendered video.
edit_browser_frame's inner-content offsets: CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637 (verified against the worker's edit_browser_frame/main.py).
Coord transform (recording px → framed px):
cx_framed = 56 + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637 / recording_viewport.h)
Build the zoom list with a per-beat default + bbox override pattern. The legacy rig followed an "every non-intro beat gets a zoom — bbox-derived if available, default-position otherwise" rule. Reproduce that here:
Step 8a — Pre-fill default-position keyframes for every non-intro, long-enough beat.
Constants for the default position:
DEFAULT_CX = 56 + 1168 // 2 (screen center of the framed canvas)
DEFAULT_CY = 108 + 637 // 3 (upper-third of the content area, where most GitHub UI prominence lives)
Walk the beat sheet from index INTRO_BEATS (= 2) to the end. For each beat:
- If
t_end - t_start < MIN_BEAT_DUR (1.5s), skip — too short for a meaningful zoom.
- Compute the keyframe's interior interval as
[t_start + HOLD_GAP, t_end - HOLD_GAP]. If that interval is shorter than 1.0s, skip.
- Otherwise pre-fill that beat's slot in a per-beat map (call it
zoom_keyframes_by_beat[beat_idx]) with {cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)} plus the trimmed t_start/t_end.
Step 8b — Override with bbox-derived precise zoom where action_bboxes provided a usable measurement.
For each entry in action_bboxes:
beat_idx = entry.idx (since Step 6 emits one timed_action per beat). If beat_idx < INTRO_BEATS, skip.
- If
entry.found is false, skip.
- If the beat isn't already in
zoom_keyframes_by_beat (was filtered out in Step 8a by MIN_BEAT_DUR/1.0s rules), skip.
- Filter degenerate bboxes: skip if
bbox.y > recording_viewport.h (offscreen capture — page didn't scroll the element into view in time) or bbox.h > recording_viewport.h * 1.5 (full-page <main> element — yields a meaningless zoom center).
- Compute
cx_framed/cy_framed from the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with {cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}.
Final list: sort the values of zoom_keyframes_by_beat by t_start to produce the zoom_keyframes array.
This guarantees every non-intro, long-enough beat gets a zoom — precise when bbox capture worked, default-positioned otherwise. Avoids the "flat video for the whole runtime" failure mode.
If len(zoom_keyframes) > 0, call mcp__pika__edit_animate_zoom with video_url: framed_url, zoom_keyframes. Returns zoomed_url. Otherwise (no qualifying beats — should be rare given Step 3's 65-80s constraint) skip and use framed_url as zoomed_url.
Step 9 — Lipsync the full audio
mcp__pika__generate_lipsync:
provider: <resolved_lipsync_provider> — default: pika (parrot a2v). Honor --lipsync-provider kling if explicitly passed.
image: <avatar>
audio: <Step 4 audio_url>
- kling-only knobs (since
pika-mcp-server BACK-339, 2026-05-10): when provider == "kling", add mode: "pro" and prompt: "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter" for the polished-presenter feel. Both are silently ignored on pika (parrot has its own driver).
Provider tradeoffs:
| Provider | Wall-clock | Head motion | When to use |
|---|
pika (default) | ~2–5 min | Slightly more dramatic, naturalistic | Default for most runs — fast iteration, watchable output, ~10× faster than kling |
kling (opt-in) | ~5–30 min | Minimal, face-centered, presenter-style | High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole |
Server-side-await covers the call inline; if the response shape is {task_id, status: "queued"}, poll mcp__pika__task_status in a tight loop (no sleep) until the status reaches a terminal state (completed, failed, or cancelled). On completed, capture lipsync_url. On failed / cancelled, fall back to the other provider (kling ↔ pika) per the failover note below.
Failover:
- If
pika fails (rare — parrot a2v is robust at typical explainer audio lengths) → retry once with provider: "kling".
- If
kling stalls past the worker's 1200s ceiling (visible as repeated processing status with no completion) → fall back to provider: "pika". Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.
Why pika is the default:
- Speed — typical explainer wall-clock drops from ~10–15 min to ~5–7 min total because lipsync is the long pole.
- Quality is good enough — parrot a2v is naturalistic; the slight extra head motion reads as engaging rather than distracting in a 60-80s clip with avatar circle PiP.
- Kling-mode-pro polish is mostly invisible inside the 246-pixel circle anyway — face area is too small for the minimal-head-motion difference to register on most viewers.
For the canonical "polished presenter" feel of the original tarball reference output, pass --lipsync-provider kling explicitly.
Step 10 — PiP composite
mcp__pika__edit_pip:
main_video_url: <zoomed_url>
overlay_video_url: <lipsync_url>
shape: "circle"
size_px: 246 ← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball's CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2
stroke_width_px: 3
stroke_color: "white"
position_px: {x: 20, y: 476} ← 800 − 246 − 78 for dock clearance (matches tarball's H − CIRCLE_OUT − 78)
Pass size_px, not size; the fields are mutually exclusive. Returns final_url.
Master-duration / audio-source contract (matching tarball github_explainer.py:418-419, 531-533, 578-582): edit_pip uses shortest=1 semantics by default, which means the composite's duration is the shorter of (zoomed screen recording) and (lipsync video). Step 6's duration_s = ceil(audio_duration_seconds) ensures the screen recording length matches the lipsync exactly (Step 4.5 rescaled beats to the audio timeline). The composite duration is set by the lipsync via edit_pip's shortest=1 semantics. Audio comes from the lipsync video's audio track (the lipsync embeds the original TTS audio); the standalone audio_url is not re-mixed. If the lipsync video is shorter than the screen recording (Kling sometimes trims trailing silence), the screen will get cut off at the lipsync end — accept this; the alternative (looping the screen) is worse for explainer content.
Step 11 — Burn captions
Call mcp__pika__add_captions(video_url=<final_url>, style="classic"). classic renders a bottom subtitle bar — the right register for an explainer video (use tiktok / hormozi / karaoke only when the user explicitly asks for word-level highlight). The audio is extracted server-side from the PiP composite's lipsync track, so transcription matches the narration verbatim. Capture the result as captioned_url.
Skip this step only if the user passed --no-captions (parsed in Step 1) — the default is captions on. (Note: /pika:podcast does not burn captions — narration in an explainer is more transcription-friendly than fast two-host dialogue.)
Step 12 — Return
Emit captioned_url (or final_url if Step 11 was skipped) on one line: Done: <url>.
Load-bearing phrases
These anchors preserve the visual contract across page types:
| Phrase | Where | Why load-bearing |
|---|
vanilla CSS that resolves via document.querySelector | Selector contract | Keeps scroll, bbox capture, and zoom targeting aligned inside capture_website. |
GitHub URLs activate repo-aware mode | Mode detection | Prevents generic product-page beats from replacing README/code walkthrough beats. |
8-10 beats, 65-80 seconds, 165-200 words | Beat-sheet authoring | Keeps narration, screen recording, lipsync, and captions within the reliable duration envelope. |
all beats[] mutations happen here | Audio rescale step | Ensures later capture/zoom/composite steps consume one stable timeline. |
extra_css cookie-banner hiding payload | Generic URL pre-flight | Reduces first-frame banner occlusion when a banner click misses. |
Engine choice: Pika lipsync default, Kling opt-in
Default to Pika/parrot lipsync because it is faster and keeps most explainers in a short iteration loop. Use Kling only when the user explicitly requests --lipsync-provider kling or when a high-stakes render needs a more centered presenter look and can tolerate a much longer long-pole stage. Screen capture, browser frame, zoom, PiP, and captions remain deterministic edit/composite steps around that lipsync choice.
Runtime expectations
Typical wall-clock is 5-10 minutes with Pika lipsync, or 10-30+ minutes with Kling lipsync:
| Step | Wall clock | Notes |
|---|
| URL read + pre-flight | 10-60s | GitHub README scan or generic URL DOM/cookie checks |
| TTS + audio rescale | 30-90s | Beat timing is normalized after actual audio length |
| Screen recording | 60-180s | Depends on page load and navigation count |
| Browser frame + zooms | 1-3 min | Deterministic edit/composite stages |
| Lipsync | 2-5 min Pika / 5-30 min Kling | Kling is opt-in because it is the long pole |
| PiP + captions | 1-3 min | Captions skipped when --no-captions is set |
Known gaps (carried as follow-up server-side work)
Kling avatar mode:"pro" and prompt not exposed. Resolved by pika-mcp-server BACK-339 (PR #186, shipped 2026-05-10): generate_lipsync now wires both mode (e.g. "pro") and prompt end-to-end for the kling provider. To enable polished-presenter mode here, pass --lipsync-provider kling and the Step 9 call should add mode: "pro" plus a prompt like "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter". Real quality lever for reducing dramatic head motion in the lipsync — no longer a server-side gap.
- No caller-controlled white-frame trim on the screen recording.
capture_website has internal trim heuristics but doesn't expose them to the caller. Visible as a brief white flash at the start of the explainer when the page is still loading. The 800ms wait action at at_s: 0.0 mitigates this somewhat by giving the page time to paint, but doesn't trim already-recorded white frames. Worker enhancement.
- No
networkidle wait on per-beat navigation. Tarball uses page.goto(url, wait_until="networkidle", timeout=20000) plus wait_for_timeout(600) after every navigate. capture_website settles to domcontentloaded plus the bbox-capture branch's 600 ms post-action settle (server-side, when bbox_selector is set), but SPA blob pages whose final render happens after domcontentloaded can still get bbox'd against unmounted code blocks. Worker enhancement: expose a wait_until knob on timed_actions[].navigate.
- No per-step output-size verification gates. Tarball's
verify() helper at github_explainer.py:35-39 checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extra mcp__pika__analyze_media call per step (~30s overhead each). Worth adding once user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11.
text_content bbox capture not implemented. capture_website v1 returns action_bboxes only for steps with a CSS selector. text_content-only steps produce no entry. Prefer CSS selectors in zoom_target for guaranteed zoom coverage.
- Beat-sheet wording is non-deterministic. Running the same input twice produces different vo_text and different zoom positions. Visual kind is the contract, not pixel-exact reproduction.
- Generic-URL mode quality varies by site. Modern indie / SaaS landing pages with semantic markup (
<h1> + clear <section> + named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) bot detection — the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these but the heuristics aren't exhaustive; (b) obfuscated class names — tile-headline instead of hero-title defeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) scroll-triggered animations don't play — IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright's scrollIntoView; the recorded frame may be a static placeholder; (d) lazy-loaded images — picture/source elements with loading="lazy" may not have resolved by the 600ms-or-2500ms settle window; the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass --focus "the X feature" to anchor beat selection, accept that big-name sites need a follow-up server PR (cookie-banner click retry + wait_until=networkidle + animation-trigger via IntersectionObserver polyfill).
- Cookie-banner click is single-attempt. Step 2.6 §B emits one
click against the dismissal selector extracted from the WebFetch DOM. If the WebFetch's HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses — the extra_css payload is the load-bearing defense. Worker enhancement: support a list of fallback selectors per click action so the worker tries each in order.
- Step 8b ↔ Step 6
idx-mapping mismatch in Generic-URL mode. Step 6 maps beat_idx = entry.idx - prepend_count, but Step 8b uses beat_idx = entry.idx naively. Pre-existing bug independent of rescaling.
Auth
If any call returns 401: the user's OAuth token has expired or hasn't been issued. The next authenticated MCP call triggers OAuth automatically (browser opens for @pika.art Google login). For non-interactive environments, set MCP_AUTH_TOKEN.
Examples
GitHub-mode (repo-aware: README scan + live-demo detection):
/pika:explainer https://github.com/leigest519/OpenGame
/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"
/pika:explainer https://github.com/openai/whisper --preview (opt-in to the preview gate when testing a new avatar)
Generic-URL mode (any non-GitHub URL — drives through the page directly):
/pika:explainer https://pika.art
/pika:explainer https://vercel.com --focus "the deployment workflow"
/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins
/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview