---
name: video-analyser
description: Analyse a video file — primarily a screen recording of a bug — to extract errors, UI state, and reproduction steps. Resolves input from a Linear ticket URL, a local file path, or a direct video URL. Extracts keyframes with ffmpeg, runs optional Tesseract OCR and Whisper audio transcription, then delivers structured findings. Trigger phrases: "analyse this video", "analyze this recording", "what does this video show", "extract bugs from this recording", "analyse this screen recording", "investigate this mp4", "investigate this mov", "analyse this clip", "look at this screen capture", "what is happening in this video", "analyse this screen capture", "video-analyser", "/video-analyser".
license: MIT
disable-model-invocation: true
metadata: {"author":"mthines","version":"1.0.0","workflow_type":"analysis","tags":["video","screen-recording","bug-analysis","ffmpeg","ocr","vision","frame-extraction"]}
---
# Video Analyser Skill
Analyse a video file to extract bugs, errors, UI state, and reproduction steps.
The pipeline uses ffmpeg for frame extraction, optional tesseract OCR for text, and optional whisper for audio narration.
The default path (8 keyframes at 768 px) is the Pareto-optimal setting for screen recordings: best quality-per-token on the legibility curve.
## Prerequisites

### Required tools

| Tool | Check | If missing |
|---|---|---|
| ffmpeg | `which ffmpeg` | Print: ffmpeg is required. Install with: `brew install ffmpeg` (macOS) or `apt install ffmpeg` (Linux). Then exit. |
| ffprobe | `which ffprobe` | Print: ffprobe is required. It ships with ffmpeg — reinstall ffmpeg. Then exit. |
### Optional tools (silent degradation)

| Tool | If present | If absent |
|---|---|---|
| tesseract | Enable OCR mode for text-heavy frames | Skip silently; use vision-only mode |
| whisper | Enable audio transcription when the user mentions narration or voiceover | Skip the audio step silently |
Run tool detection before any other step.

```bash
which ffmpeg || { echo "ffmpeg is required. Install with: brew install ffmpeg (macOS) or apt install ffmpeg (Linux)."; exit 1; }
which ffprobe || { echo "ffprobe is required. It ships with ffmpeg — reinstall ffmpeg."; exit 1; }
OCR_ENABLED=false; which tesseract >/dev/null 2>&1 && OCR_ENABLED=true
AUDIO_ENABLED=false
```
### Minimum ffmpeg version

Require ffmpeg 4 or later: treat 4 as the floor for consistent behaviour of the select='eq(pict_type\,I)' filter and -vsync vfr. Document this requirement in the bail message if detection fails, as in the sketch below.
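A minimal sketch of the version gate (the banner format parsed here is an assumption; some builds report versions like "n4.4.2", which the non-numeric guard treats as unparseable):

```bash
# First banner line typically looks like: "ffmpeg version 6.1.1 Copyright ..."
FFMPEG_MAJOR=$(ffmpeg -version 2>/dev/null | head -n1 | sed -E 's/^ffmpeg version ([0-9]+).*/\1/')
case "$FFMPEG_MAJOR" in
  ''|*[!0-9]*) echo "Warning: could not parse the ffmpeg version; continuing." ;;
  *) [ "$FFMPEG_MAJOR" -ge 4 ] || { echo "ffmpeg 4 or later is required (found major version $FFMPEG_MAJOR)."; exit 1; } ;;
esac
```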
## Temp Directory

Declare the temp directory and cleanup trap at the very start of execution, before source resolution. This ensures downloaded files, extracted frames, and audio are always removed, even on error.

```bash
WORK_DIR=$(mktemp -d /tmp/video-analyser-XXXXXX)
trap 'rm -rf "$WORK_DIR"' EXIT
```

Use $WORK_DIR as the staging area for all intermediate files throughout the pipeline.
## Step 1 — Source Resolution

Resolve the user's input to a local file path before running anything else. Walk this table top-to-bottom; the first matching rule wins.

| Input shape | Detection rule | Resolution steps |
|---|---|---|
| Linear ticket URL | Input matches `linear\.app/.+/issue/` | Follow the Linear resolution procedure below. |
| Local file path | Input starts with `/`, `./`, `~/`, or `~` | Follow the local path procedure below. |
| Direct video URL | Input matches `^https?://` and does not match `linear\.app` | Download: `curl -fL -o "$WORK_DIR/input.mp4" "$INPUT"`. Use `$WORK_DIR/input.mp4` as VIDEO_PATH. |
| Bare filename | Input contains no `/` and no `http` | Prepend `$PWD/`; then follow the local path procedure. |
| Unresolvable | None of the above match | Print: Cannot resolve input to a video file. Provide a Linear ticket URL, a local file path, or a direct URL to a video file. Then exit. |
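A minimal sketch of the dispatch (resolve_linear and resolve_local are hypothetical stand-ins for the two procedures below):

```bash
if [[ "$INPUT" =~ linear\.app/.+/issue/ ]]; then
  resolve_linear "$INPUT"                         # Linear ticket URL
elif [[ "$INPUT" == /* || "$INPUT" == ./* || "$INPUT" == "~"* ]]; then
  resolve_local "$INPUT"                          # local file path (covers ~/ and bare ~)
elif [[ "$INPUT" =~ ^https?:// ]]; then
  curl -fL -o "$WORK_DIR/input.mp4" "$INPUT"      # direct video URL
  VIDEO_PATH="$WORK_DIR/input.mp4"
elif [[ "$INPUT" != */* && "$INPUT" != *http* ]]; then
  resolve_local "$PWD/$INPUT"                     # bare filename
else
  echo "Cannot resolve input to a video file. Provide a Linear ticket URL, a local file path, or a direct URL to a video file."
  exit 1
fi
```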
### Linear ticket URL

- Check Linear MCP availability. Do not attempt to scrape the Linear web UI as a fallback.

```bash
if ! mcp list 2>/dev/null | grep -q linear; then
  echo "Linear MCP is not configured. Download the video from the ticket and re-invoke with a local file path."
  exit 1
fi
```
- Extract the issue ID from the URL (e.g., XYZ-123 from https://linear.app/team/issue/XYZ-123); see the sketch after this list.
- Call Linear MCP get_issue with the issue ID.
- Scan the description field and all comments for attachment URLs matching `\.(mp4|mov|webm|avi)` (case-insensitive).
- If no video attachment is found, print: No video attachment found in Linear issue <ID>. Attach a .mp4, .mov, .webm, or .avi file to the issue and retry. Then exit.
- Download the first matching attachment.

```bash
curl -fL -o "$WORK_DIR/input.mp4" "$ATTACHMENT_URL"
VIDEO_PATH="$WORK_DIR/input.mp4"
```
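A minimal sketch of the issue-ID extraction (assumes the ID is the team-key-dash-number segment after /issue/):

```bash
# https://linear.app/team/issue/XYZ-123 -> XYZ-123
ISSUE_ID=$(echo "$INPUT" | sed -E 's#.*/issue/([A-Za-z]+-[0-9]+).*#\1#')
```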
### Local file path

- Expand ~ to the home directory.

```bash
VIDEO_PATH="${INPUT/#\~/$HOME}"
```

- Confirm the file exists.

```bash
test -f "$VIDEO_PATH" || { echo "File not found: $VIDEO_PATH"; exit 1; }
```

- Check the file extension. If the extension is not .mp4, .mov, .webm, or .avi, print a warning but continue.

```bash
case "${VIDEO_PATH##*.}" in
  mp4|mov|webm|avi) ;;
  *) echo "Warning: unrecognised extension '${VIDEO_PATH##*.}'. Continuing anyway." ;;
esac
```
## Step 2 — Video Probe

Run ffprobe to extract metadata before any frame work.

```bash
PROBE=$(ffprobe -v quiet -print_format json -show_streams -show_format "$VIDEO_PATH")
DURATION_S=$(echo "$PROBE" | python3 -c "import sys,json; d=json.load(sys.stdin); print(float(d['format']['duration']))")
WIDTH=$(echo "$PROBE" | python3 -c "import sys,json; d=json.load(sys.stdin); v=[s for s in d['streams'] if s.get('codec_type')=='video'][0]; print(v['width'])")
HEIGHT=$(echo "$PROBE" | python3 -c "import sys,json; d=json.load(sys.stdin); v=[s for s in d['streams'] if s.get('codec_type')=='video'][0]; print(v['height'])")
HAS_AUDIO=$(echo "$PROBE" | python3 -c "import sys,json; d=json.load(sys.stdin); print('true' if any(s.get('codec_type')=='audio' for s in d['streams']) else 'false')")
```
## Step 3 — Bail Gates

Check these conditions in order before proceeding. Stop immediately when a gate triggers.

| Condition | Action |
|---|---|
| DURATION_S > 600 | Print the trim command below, then exit. |
| DURATION_S < 1 | Print: Video is less than 1 second — too short to analyse. Then exit. |
| WIDTH < 100 | Print a warning; continue (unusual resolution, but allow). |

Trim message to print when duration exceeds 600 s (a sketch of all three gates follows):

```
Video is longer than 10 minutes (${DURATION_S}s). Trim it first:
  ffmpeg -i "$VIDEO_PATH" -ss 0 -t 600 -c copy "$WORK_DIR/trimmed.mp4"
Re-invoke with: $WORK_DIR/trimmed.mp4
```
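A minimal sketch of the gates (DURATION_S is a float, so the comparisons go through bc):

```bash
if [ "$(echo "$DURATION_S > 600" | bc)" -eq 1 ]; then
  echo "Video is longer than 10 minutes (${DURATION_S}s). Trim it first:"
  echo "  ffmpeg -i \"$VIDEO_PATH\" -ss 0 -t 600 -c copy \"$WORK_DIR/trimmed.mp4\""
  echo "Re-invoke with: $WORK_DIR/trimmed.mp4"
  exit 1
fi
if [ "$(echo "$DURATION_S < 1" | bc)" -eq 1 ]; then
  echo "Video is less than 1 second — too short to analyse."
  exit 1
fi
if [ "$WIDTH" -lt 100 ]; then
  echo "Warning: unusually low resolution (${WIDTH} px wide). Continuing anyway."
fi
```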
## Step 4 — Determine Frame Count (N)

Use this table to select N. Apply the first matching row.

| Video duration | User intent signal | N frames | Note |
|---|---|---|---|
| 0–600 s | None (default) | 8 | Pareto-optimal default |
| 0–600 s | Contains "fine-grained", "detailed", or "every frame" | 16 | Escalation |
| 181–600 s | "fine-grained" + explicit opt-in after cost warning | 24 | Show cost warning first |
| > 600 s | Any | — | Bail at Step 3 |

Cost warning for the 24-frame opt-in. Before proceeding to N=24, print:

```
Note: 24 frames at 768 px ≈ 18,864 tokens ≈ $0.057 on Sonnet 4.6.
Proceed? (yes/no)
```

Wait for explicit confirmation. If the user does not confirm, fall back to N=16. A sketch of the selection logic follows.
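A rough sketch of the selection (USER_MSG and USER_CONFIRMED_24 are hypothetical placeholders for the user's request text and their answer to the cost warning):

```bash
N=8  # Pareto-optimal default
if echo "$USER_MSG" | grep -qiE "fine-grained|detailed|every frame"; then
  N=16  # escalation
  # 24-frame tier: only for 181-600 s clips, and only after explicit opt-in.
  if [ "$(echo "$DURATION_S > 180" | bc)" -eq 1 ] && [ "$USER_CONFIRMED_24" = "true" ]; then
    N=24
  fi
fi
```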
## Step 5 — Extract Frames

Use the keyframe-first strategy. I-frames in screen recordings mark scene transitions (page loads, error dialogs, modal appearances) and carry the most diagnostic signal.

### Step 5a — Keyframe extraction

```bash
ffmpeg -y -i "$VIDEO_PATH" \
  -vf "select='eq(pict_type\,I)',scale=768:-2" \
  -vsync vfr -q:v 2 \
  "$WORK_DIR/frame_%04d.jpg" 2>/dev/null

IFRAME_COUNT=$(ls "$WORK_DIR"/frame_*.jpg 2>/dev/null | wc -l)
```

If $VIDEO_PATH is a .webm file and IFRAME_COUNT is 0, the container may not expose keyframe metadata. Fall back immediately to uniform sampling (Step 5b) with IFRAME_COUNT=0.
### Step 5b — Uniform fill (when keyframes are sparse)

| Condition | Action |
|---|---|
| IFRAME_COUNT >= N | Use the first N keyframes; skip Step 5b. |
| IFRAME_COUNT < N | Run uniform extraction; merge with keyframes. |
| IFRAME_COUNT == 0 | Run uniform extraction only. |

Uniform extraction command:

```bash
FILL_NEEDED=$((N - IFRAME_COUNT))
FPS=$(echo "scale=4; $FILL_NEEDED / $DURATION_S" | bc)
ffmpeg -y -i "$VIDEO_PATH" \
  -vf "fps=${FPS},scale=768:-2" \
  -q:v 2 \
  "$WORK_DIR/uniform_%04d.jpg" 2>/dev/null
```
### Step 5c — Select final N frames

- List all keyframes first (frame_*.jpg), sorted by name.
- Append uniform frames (uniform_*.jpg), sorted by name.
- Keep the first N entries.
- Copy the selected frames to $WORK_DIR/selected_%04d.jpg, as in the sketch below.

If fewer than N total frames were extracted, proceed with however many were extracted. Do not bail on a low frame count.
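A minimal sketch of the selection and copy (alphabetical sorting puts frame_* before uniform_*, which matches the keyframes-first rule):

```bash
i=0
for FRAME in $(ls "$WORK_DIR"/frame_*.jpg "$WORK_DIR"/uniform_*.jpg 2>/dev/null); do
  i=$((i + 1))
  [ "$i" -gt "$N" ] && break
  cp "$FRAME" "$(printf '%s/selected_%04d.jpg' "$WORK_DIR" "$i")"
done
```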
### Scaling note

If scale=768:-2 produces an error (odd-dimension video), use scale=768:trunc(ow/a/2)*2 instead.
## Step 6 — OCR (conditional)

Run OCR when ALL of the following are true:

- OCR_ENABLED=true (tesseract is present).
- The user's goal is text or error extraction (the default — skip only if the user explicitly requests visual-only analysis).

For each selected frame:

```bash
for FRAME in "$WORK_DIR"/selected_*.jpg; do
  FRAME_ID=$(basename "$FRAME" .jpg)
  tesseract "$FRAME" "$WORK_DIR/ocr_${FRAME_ID}" -l eng 2>/dev/null
done
```

Tesseract writes the output to $WORK_DIR/ocr_selected_*.txt. Pass the OCR text to the analysis prompt as <ocr_frame_N>TEXT</ocr_frame_N> blocks, one per frame; OCR text is cheaper context than asking vision to re-read the same text. See the sketch below.
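A minimal sketch of assembling the blocks (written to a hypothetical $WORK_DIR/ocr_blocks.txt for Step 8):

```bash
i=0
for TXT in "$WORK_DIR"/ocr_selected_*.txt; do
  [ -f "$TXT" ] || continue
  i=$((i + 1))
  printf '<ocr_frame_%d>%s</ocr_frame_%d>\n' "$i" "$(cat "$TXT")" "$i"
done > "$WORK_DIR/ocr_blocks.txt"
```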
Non-English UI text: the default language is -l eng. If the user specifies a language (e.g., "the UI is in German"), substitute -l <lang> (e.g., -l deu). Tesseract language codes follow ISO 639-3.
## Step 7 — Audio Transcription (conditional)

Run audio transcription when ALL of the following are true:

- whisper is available on $PATH.
- HAS_AUDIO=true (the video has an audio stream).
- The user's message contains "narration", "voiceover", "audio", or "they said".

Extract audio and transcribe:

```bash
ffmpeg -y -i "$VIDEO_PATH" -vn -acodec pcm_s16le -ar 16000 "$WORK_DIR/audio.wav" 2>/dev/null
whisper "$WORK_DIR/audio.wav" --model small --output_format txt --output_dir "$WORK_DIR" 2>/dev/null
```

Append the transcript to the analysis prompt as <audio_transcript>TEXT</audio_transcript>.
## Step 8 — Assemble Analysis Prompt

Use the cheapest delivery mechanism for the chosen frame count. For N ≤ 16, inline base64 is typically cheapest; for N > 16, prefer the Files API if the runtime supports it. The executor decides at runtime.

Construct the prompt as follows:

```
System:
  You are analysing a screen recording of a software application.
  Your goal: <USER_GOAL> (default: "identify bugs, errors, UI state issues, and reproduction steps").
  You will receive <N> frames extracted from the recording.
  Cross-reference OCR text blocks with what you see in each frame.
  Do not hallucinate text — if OCR and vision disagree, note both.

For each frame (in order):
  [image content block — JPEG]
  [if OCR enabled: <ocr_frame_N>OCR TEXT HERE</ocr_frame_N>]

[if audio transcription enabled:]
  <audio_transcript>TRANSCRIPT TEXT HERE</audio_transcript>

Request:
  Return findings in the structured format specified in Step 9.
```
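A minimal sketch of assembling the textual portion of the prompt (the image blocks themselves are attached by the runtime; ocr_blocks.txt is the hypothetical file from Step 6, and audio.txt is whisper's output from Step 7):

```bash
{
  echo "You are analysing a screen recording of a software application."
  echo "Your goal: ${USER_GOAL:-identify bugs, errors, UI state issues, and reproduction steps}."
  echo "You will receive $N frames extracted from the recording."
  echo "Cross-reference OCR text blocks with what you see in each frame."
  echo "Do not hallucinate text — if OCR and vision disagree, note both."
  [ -f "$WORK_DIR/ocr_blocks.txt" ] && cat "$WORK_DIR/ocr_blocks.txt"
  [ -f "$WORK_DIR/audio.txt" ] && printf '<audio_transcript>%s</audio_transcript>\n' "$(cat "$WORK_DIR/audio.txt")"
} > "$WORK_DIR/prompt.txt"
```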
## Step 9 — Deliver Structured Output

Return findings in this exact schema. Do not omit sections; use "None detected" for sections with no findings.

```markdown
## Video Analysis

### Recording summary
- Duration: <X> seconds
- Resolution: <W>×<H>
- Frames analysed: <N> (<method: keyframe | uniform | hybrid>)
- OCR: <enabled | disabled>
- Audio transcription: <enabled | disabled>

### Findings

#### Errors and exceptions
<List each error message, stack trace fragment, or exception found. Quote exact text where available.>

#### UI state at key moments
<For each significant frame transition, describe what the UI shows and what changed.>

#### Reproduction steps inferred
<Numbered list of steps to reproduce the observed behaviour.>

#### Recommended next steps
<Concrete actions the developer should take — e.g., "Check the network tab at frame 4", "Add error boundary at component X".>
```
## Token Budget

The default setting (8 frames, 768 px) is the Pareto-optimal point on the quality-vs-token curve for screen recordings.

768 px is the knee of the legibility curve for screen recordings: UI text (error messages, stack traces, console output) is fully legible at 768 px. Going lower (512 px) risks misreading small-font terminal output; going higher (1568 px) roughly doubles the token cost with no meaningful legibility gain for screen content.

8 frames is the sweet spot for recordings under 3 min. I-frame sampling captures the scene transitions that carry the diagnostic signal, and beyond ~10 frames, incremental frames are typically near-duplicates that add tokens without adding context. The arithmetic behind the table below is sketched after it.
| Frames | Resolution | Tokens/frame | Total image tokens | Cost (Sonnet 4.6, $3/M) | Use when |
|---|---|---|---|---|---|
| 8 (default) | 768 px | ~786 | ~6,288 | ~$0.019 | Standard bug recording; Pareto-optimal default |
| 16 (escalation) | 768 px | ~786 | ~12,576 | ~$0.038 | User says "fine-grained" or "detailed" |
| 24 (long-clip opt-in) | 768 px | ~786 | ~18,864 | ~$0.057 | Clip 181–600 s + user explicitly opts in after cost warning |
| 8 | 1568 px (cap) | ~1,568 | ~12,544 | ~$0.038 | Only if user reports text still unreadable at 768 px |
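A sketch of the arithmetic: the per-frame figure is consistent with the common approximation tokens ≈ width × height / 750 applied to a 768 × 768 frame (an assumption; exact accounting varies by model and aspect ratio).

```bash
# 768 * 768 / 750 ~= 786 tokens per frame.
# Default tier: 8 frames * 786 tokens, priced at $3 per million input tokens.
echo "scale=4; 8 * 786 * 3 / 1000000" | bc   # 0.0188 -> ~$0.019
```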
For clips longer than 10 min (after the user trims), or for frame counts > 24, consider Gemini's native video API as an escalation path.
Gemini charges per-second rather than per-frame and handles long clips natively.
## Sampling Method Reference

| Condition | Method |
|---|---|
| I-frame count ≥ N | Keyframe extraction only |
| I-frame count > 0 and < N | Hybrid: I-frames + uniform fill |
| I-frame count = 0 (e.g., some .webm) | Uniform time-based sampling only |
## Risks and Mitigations

| Risk | Likelihood | Mitigation |
|---|---|---|
| select='eq(pict_type\,I)' rejected on older ffmpeg builds | Low | Document the minimum version (ffmpeg 4+); bail via the version check if needed. |
| scale=768:-2 fails on odd-dimension video | Low | Substitute scale=768:trunc(ow/a/2)*2; documented in Step 5. |
| Tesseract produces garbage on non-English UI text | Medium | Default is -l eng; the user can override with a language code. |
| Very short clips (< 1 s) yield 0 frames | Low | Bail gate in Step 3. |
| Claude vision hallucinates text not in frames | Low | Cross-reference OCR output when available; note disagreements. |
| Linear attachment URL requires authenticated download | Medium | Use the Linear MCP tool to obtain a pre-signed URL rather than a raw curl. |