| name | video-composer |
| description | Orchestrate transcript-to-video composition for this Remotion/Hyperframes vlog system. Use when analyzing an input recording or transcript, splitting it into shots, defining the global visual spec, deciding which assets can be generated in parallel, and producing composer artifacts for downstream shot, motion, presenter, generated-video, caption, Remotion, or QC skills. |
Video Composer
Overview
Own the whole video before any specialist writes a clip. Read the transcript and source constraints, decide the narrative structure, define the shared production spec, and emit compact artifacts that downstream nodes can execute without inventing a different style.
For any new recorded-video job, first read VIDEO_GENERATION_WORKFLOW.md at the repo root. Treat it as the canonical entry contract for content preparation, inbox handling, timing, composition, build, QC, and feedback updates.
Inputs
- Transcript with timestamps, preferably word-level.
- New user recordings may be dropped into
shared/inbox/ with arbitrary filenames. Pick the newest plausible recording unless there are multiple new candidates; copy the selected file into shared/recordings/ under the lesson id before composing.
- Source video metadata: duration, resolution, fps, audio notes, and presenter framing.
- Timing source: word-level ASR, phrase-level cue map, explicit manual cue timestamps, or audio-pause alignment when ASR is unavailable. For this workstation, prefer the Omniscience Sherpa-ONNX ASR model at
/Users/tk/Desktop/Omniscience/agent/models/voice/asr via npm run asr:recording before using pause-only timing. Do not treat rough script timestamps as final timing for a real recording.
- Available component catalog, if present.
- User goal, audience, subject, and target style.
Core Rule
Think in detail; output only what the pipeline needs. Do not push every editorial thought into JSON fields. Use internal judgment to choose shots, assets, and transitions, then emit a small set of production artifacts.
Workflow
- Identify the thesis: write one sentence for what the whole video teaches or proves.
- Verify timing before shot splitting. If a real recording is present and only rough script timestamps exist, run
LESSON_ID=<id> npm run asr:recording when the Omniscience ASR runtime exists; otherwise use ffmpeg silence/pause boundaries to create an audio-derived approximation and mark the confidence.
- Split the transcript by idea responsibility, not sentence boundaries.
- Assign each shot a role: setup, handoff, proof, demonstration, generated insert, detail zoom, recap, memory hook, or transition bridge.
- Select the active design system before defining style. If
.agents/design-systems/registry.yaml exists, read it, load the active design-system skill, and combine it with the base skills. Timing review stays with the global ASR timing skill, not the design system. Disabled design systems stay unused unless the user explicitly switches.
- Define the global production spec before assigning assets. Include design-system id, type scale, color, component catalog, motion defaults, video-frame styling, presenter-safe zones, reserved content rectangles, and continuity rules.
- Run a layout-fit pass before any atom is assigned. Inventory content count, text length, safe zones, presenter dock, and minimum readable sizes; choose the simplest layout that fits. Do not force side-by-side layouts when a top-down, single-focus, or sequential build would avoid collisions.
- Decide the human/content relationship for each shot. Use the presenter when trust, energy, or direct explanation matters. Let the content canvas dominate when the viewer needs to inspect or understand visual material.
- Select creator effects only after the layout and presenter/content relationship are clear. Record purpose, owner, cue source, safe zones, audio/caption budget, fallback, and QC targets for every creator effect; reject effects whose only reason is style or energy.
- Create a caption and widget budget. Burned-in captions are rare emphasis, not transcript; widgets must orient, signal, compare, transform, inspect, or recap without duplicating captions.
- Create an asset dependency graph. Mark independent nodes parallel; mark transitions or generated clips that depend on neighboring shots as linear. Prefer no custom transition atom unless the transition itself teaches the relationship between shots.
- Emit shot briefs for downstream specialists. Each brief must include the global spec, neighboring shot summaries, layout decision, creator effect contract when applicable, input/output expectations, and dependency constraints.
- Emit a final assembly intent after all shot assets are specified. Keep final render JSON small and component-oriented.
Transition grammars
The four frame-driven transition grammars are catalog capabilities in the registry under transitionGrammars (shared/capabilities/video-production-registry.json). Discover them there, then select one per seam via a top-level transitions[] array on the plan — promo plans today, lesson plans later — with the shape {at, grammar, frames}, where at is the outgoing shot's index and an absent seam means a hard cut. They compose through Remotion <TransitionSeries>, which owns the overlap timing, and are content-agnostic: any grammar can sit between any two scenes.
Select by what the seam must teach (text quoted from the registry — keep this table in lockstep with transitionGrammars):
| grammar | useWhen | avoidWhen |
|---|
blur-reveal | The universal scene-in: the incoming scene comes into focus (blur high to 0) while the outgoing blurs and fades out. | The seam wants energy or motion continuity — a push-through or match-cut reads stronger than a soft dissolve. |
push-through | A 2D camera-push-through: zoom in until the outgoing layer blurs and scales out, then the incoming scene emerges through it (pairs with a 3D pushIn move). | The beat is a quiet hold-to-hold — the aggressive zoom over-dramatizes a calm transition. |
match-cut | Two clips meet at a fixed peak-velocity frame: the outgoing accelerates to a peak at the seam, hard cut, the incoming continues from that velocity and decelerates. | The two scenes share no continuous motion to match — the hard cut just reads as a jump. |
luma-wipe | An AE luma-matte look: a frame-driven soft-edged white-on-black mask sweeps across the outgoing layer to reveal the incoming underneath. | A hard cut or a focus-pull reads better — a feathered wipe can feel ornamental on a fast beat. |
Typical frames: ~14–20 frames (≈0.5–0.7s at 30fps). match-cut tends shorter (the hard cut is the trick); blur-reveal can run a touch longer. The validator warns if frames would consume more than a neighbor shot's whole length.
A transition must TEACH the relationship between shots — energy, motion continuity, or a reveal — never decoration. If no grammar's useWhen fits the beat (or its avoidWhen applies), leave the seam a hard cut.
Required Outputs
composition-brief.md: thesis, audience, structure, pacing, and visual motif.
global-style-spec.json: shared visual constraints all nodes must ingest.
section-map.json: shot list with role, transcript range, visual responsibility, and neighboring context.
asset-jobs.json: asset generation jobs and dependency graph.
transition-map.json: transitions between or inside shots, including which transitions depend on actual rendered assets.
final-render-plan.json: compact Remotion-facing plan after assets are ready.
Also record a concise caption/widget policy in the style spec or brief: frequency budget, reason test, and duplication rule.
Also record a concise layout policy in the style spec or brief: chosen template, why it fits the content, safe zones, minimum readable dimensions, and which decorative details are allowed only after fit is proven.
Also record a concise creator-effect policy in the style spec or brief: allowed effect purposes, owner defaults, cue timing policy, audio/caption budgets, fallback rule, and QC targets. Use shared/research/2026-creator-effects/composer/effect-selection-rubric.md as the current rubric.
For recorded input, also emit timing-cues.json or include an equivalent cue block: timing source, cue phrase, timestamp, confidence, and whether it is ASR-derived, manual, or provisional.
If the recording is Chinese or the ASR cue confidence is low, use the global ASR timing review skill to emit a human-review transcript ledger before final timing is accepted. Programmatic cue matching is not sufficient by itself.
For presenter-aware gesture videos, every render plan must include a debugging boolean on the overlay segment. debugging: true shows the MediaPipe face, hand, fingertip, and active-gesture tracker layer. debugging: false still runs the tracker and uses it to drive the effect, but hides diagnostic boxes and shows the polished staged animation.
For gesture-triggered UI, cue timing only opens the reaction window. The detected hand box, fingertip, or palm box chooses the event, direction, and reveal progress. The final visual must then stabilize into an intentional composition: a pulled canvas sheet, a hand-swept rail or pushed-in image, a freeze gate, or a detail zoom. Do not accept cards that jitter by following raw bounding boxes every frame.
See references/artifacts.md for artifact shapes and references/global-style-spec.md for the current Apple-Keynote-style baseline.
Delegation Rules
- Use
shot-brief-writer to convert composer decisions into exact sub-node specs.
- Use
presenter-framing when deciding where the human should appear and how large.
- Use
apple-keynote-motion for motion continuity and build style.
- Use
hyperframes-atom-author for HTML/GSAP atoms.
- Use
generated-video-director for inserted generated-video clips.
- Use
caption-and-emphasis for text overlays and emphasis.
- Use
remotion-composer only after assets and shot briefs are ready.
- Use
render-qc after any render or visual preview.
Splitting Guidance
A shot may contain internal motion and multiple layers. Do not split every sentence into a shot. Split when the visual responsibility changes:
- presenter leads
- presenter hands off to content
- content becomes inspectable
- generated video carries the idea
- a transition itself explains the relationship
- recap returns to the presenter or a clean memory hook
Transitions may be between shots, inside a shot, or their own shot. Choose based on meaning, not convenience.
Dependency Guidance
Start all independent asset jobs together. Keep a job linear only when it needs previous output, such as:
- a match transition that depends on the outgoing and incoming frame geometry
- a generated clip that must inherit colors, camera, or object position from a prior shot
- a final assembly that needs actual rendered atom durations
- QC that must inspect completed media
Prefer a DAG over a flat list when planning work.