vss-generate-video-report
Use this skill when producing a VSS analysis report — Mode A per-clip VLM, Mode B incident-range via video-analytics. Not for standalone video summarization, real-time alerts, or ad-hoc Q&A.
| name | vss-generate-video-report |
| description | Use this skill when producing a VSS analysis report — Mode A per-clip VLM, Mode B incident-range via video-analytics. Not for standalone video summarization, real-time alerts, or ad-hoc Q&A. |
| license | Apache-2.0 |
| metadata | {"version":"3.2.0","author":"NVIDIA Video Search and Summarization team","github-url":"https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization","tags":"nvidia blueprint operational"} |
Generate a video analysis report by routing to one of two backends — never via POST /generate on the VSS agent.
| Mode | Backend |
|---|---|
| A. Video clip | /vss-manage-video-io-storage → clip URL → VLM chat/completions |
| B. Incident range | /vss-query-analytics → incident list → narrative report |
If the request is ambiguous (e.g. "report on <sensor>" with no time range and no incident wording), default to Mode A. Ask only if the user mentions both a sensor and a time range. See Examples below for the request phrasings that route to each mode.
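- Confirm that the profile the chosen mode needs is running; hand off to /vss-deploy-profile if its probe fails.
- Rewrite every clip URL with the $VSS_PUBLIC_HOST:$VSS_PUBLIC_PORT one-liner (Browser-playable clip URL) before embedding it in the report.

Output contract for evaluators: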
- Mode A output begins with the heading `# Video Analysis Report`.
- Mode B output begins with the heading `# Incident Range Report` (never `# Incident Report` or sensor-named variants).
- Mode B output includes `## Basic Information` with the exact required rows from the template (Report Identifier, Range, Scope, Total Incidents, Confirmed / Rejected / Unverified).

Examples:
- "… <sensor-id>" → Mode A
- "… <sensor> last hour" → Mode B
- "… <sensor> between <t1> and <t2>" → Mode B

Do not use this skill when the request is one of the following:
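- A request that /vss-ask-video handles.
- A request that /vss-search-archive handles.
- A request that /vss-query-analytics handles.
- A request that /vss-deploy-profile handles.
- A request that /vss-manage-alerts handles.

Never route reports through VSS-agent POST /generate.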
Mode A needs the VSS base profile (VST + VLM NIM). Mode B needs the VSS alerts profile (VA-MCP + Elasticsearch).
Probe:
# Mode A — VST + VLM reachability
curl -sf --max-time 5 "http://${HOST_IP}:30888/vst/api/v1/sensor/version" >/dev/null
# Mode B — VA-MCP
curl -sf --max-time 5 "http://${HOST_IP}:9901/" >/dev/null
If the probe fails, hand off to /vss-deploy-profile with -p base (Mode A) or -p alerts (Mode B). Always confirm the deploy with the user first.
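The probe-then-hand-off rule above can be sketched as two small lookup helpers. The function names `probe_url_for_mode` and `profile_for_mode` are hypothetical; the ports, paths, and profile names are the ones stated above. This is a sketch under those assumptions, not part of the skill itself.

```shell
# Hypothetical helpers: map a report mode (A or B) to its probe URL and to
# the profile name passed to /vss-deploy-profile when that probe fails.
probe_url_for_mode() {
  case "$1" in
    A) printf 'http://%s:30888/vst/api/v1/sensor/version\n' "${HOST_IP}" ;;
    B) printf 'http://%s:9901/\n' "${HOST_IP}" ;;
    *) return 1 ;;
  esac
}

profile_for_mode() {
  case "$1" in
    A) echo base ;;
    B) echo alerts ;;
    *) return 1 ;;
  esac
}

# Usage (needs network access, so shown as a comment):
#   if ! curl -sf --max-time 5 "$(probe_url_for_mode A)" >/dev/null; then
#     echo "Probe failed; confirm with the user before deploying -p $(profile_for_mode A)"
#   fi
```

Keeping the mapping in one place means the probe and the hand-off cannot drift apart for a given mode.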
VST returns clip URLs using the agent-internal ${HOST_IP}:30888 host:port.
Keep that original URL as VIDEO_URL for local / in-cluster VLM frame pulls.
Do not rewrite the VLM input URL just to make it browser-playable.
Only create BROWSER_CLIP_URL for URLs shown in the rendered report. The
deploy layer exports the browser-facing host:port as $VSS_PUBLIC_HOST /
$VSS_PUBLIC_PORT (and scheme as $VSS_PUBLIC_HTTP_PROTOCOL) in every
profile .env — Brev or bare-metal — so the report-link rewrite is:
: "${VSS_PUBLIC_HOST:?Set VSS_PUBLIC_HOST before rewriting clip URLs}"
: "${VSS_PUBLIC_PORT:?Set VSS_PUBLIC_PORT before rewriting clip URLs}"
VSS_PUBLIC_HTTP_PROTOCOL="${VSS_PUBLIC_HTTP_PROTOCOL:-http}"
BROWSER_CLIP_URL=$(echo "$RAW_URL" | sed -E "s|^https?://[^/]+|${VSS_PUBLIC_HTTP_PROTOCOL}://${VSS_PUBLIC_HOST}:${VSS_PUBLIC_PORT}|")
If either required public host value is missing, omit the report-facing clip
link and call out that a browser-playable URL could not be produced; do not
block the local VLM analysis path. Apply the rewrite to every clip URL
surfaced in the rendered report (Mode A Step 4 Clip URL row; Mode B
per-incident clip sub-bullet). Leave the VLM video_url content block in Mode A
Step 3 on the original internal URL when the VLM is local / in-cluster.
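The "omit the link, do not block" behaviour described above could be wrapped as follows. `rewrite_clip_url` is a hypothetical name; unlike the `${VAR:?}` guards in the one-liner, this sketch returns a non-zero status instead of aborting the shell, so the VLM analysis path can continue.

```shell
# Hypothetical wrapper around the report-link rewrite. Prints the
# browser-facing URL, or returns 1 (printing nothing) when the public host
# or port is unset, so the caller can omit the link and carry on.
rewrite_clip_url() {
  raw_url=$1
  if [ -z "${VSS_PUBLIC_HOST:-}" ] || [ -z "${VSS_PUBLIC_PORT:-}" ]; then
    return 1
  fi
  proto=${VSS_PUBLIC_HTTP_PROTOCOL:-http}
  # Replace only the scheme://host:port prefix; path and query are kept.
  printf '%s\n' "$raw_url" |
    sed -E "s|^https?://[^/]+|${proto}://${VSS_PUBLIC_HOST}:${VSS_PUBLIC_PORT}|"
}

# Usage:
#   if BROWSER_CLIP_URL=$(rewrite_clip_url "$VIDEO_URL"); then
#     echo "Clip URL: $BROWSER_CLIP_URL"
#   else
#     echo "A browser-playable clip URL could not be produced."
#   fi
```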
If the VSS lvs profile is deployed — curl -sf --max-time 5 "http://${HOST_IP}:38111/v1/ready" returns HTTP 200 — run /vss-summarize-video to produce the summary, then paste its output into the report template in Step 4 and skip Steps 1–3 (the VLM-direct path). Run Steps 1–3 only when /v1/ready is non-200.
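The readiness branch above reduces to a decision on one HTTP status code. In this sketch `mode_a_path` and its two return labels are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: choose the Mode A path from the HTTP status of the
# lvs /v1/ready probe. Only 200 selects /vss-summarize-video; every other
# status (including curl's 000 for "no response") selects Steps 1-3.
mode_a_path() {
  case "$1" in
    200) echo summarize ;;
    *)   echo vlm_direct ;;
  esac
}

# Usage (needs network access, so shown as a comment):
#   status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
#     "http://${HOST_IP}:38111/v1/ready")
#   mode_a_path "$status"
```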
Hand off to /vss-manage-video-io-storage to:
1. List sensors and confirm the named <sensor-id> exists (upload first if not).
2. Fetch /storage/<streamId>/timelines for the recorded range when the user did not supply startTime / endTime.
3. Request a clip URL:
curl -s "http://${HOST_IP}:30888/vst/api/v1/storage/file/<streamId>/url?startTime=<startTime>&endTime=<endTime>&container=mp4&disableAudio=true" | jq -r .videoUrl
That gives a direct mp4 URL that the local / in-cluster VLM can pull frames from. Bind it to VIDEO_URL (used by the VLM in Step 3) and set RAW_URL="$VIDEO_URL" before applying the report-link rewrite to produce BROWSER_CLIP_URL for Step 4 — the user's browser cannot reach $VIDEO_URL directly.
Mode A requires the selected VLM endpoint to be able to fetch VIDEO_URL.
Local NIM/RT-VLM deployments normally can; remote endpoints generally cannot
fetch localhost, private HOST_IP, or VST-internal URLs. If the live
VLM_ENDPOINT is remote, surface that reachability requirement instead of
making a chat request that will fail after /v1/models succeeds.
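One way to surface the reachability requirement before calling a remote endpoint is a rough host check on VIDEO_URL. `url_is_private` is a hypothetical helper and the rules are a heuristic: it does not handle IPv6 literals or userinfo in the URL, and it does not verify that the host is a dotted-quad address.

```shell
# Heuristic, hypothetical check: does the URL point at a host that a remote
# VLM endpoint is unlikely to reach? Returns 0 for localhost, loopback,
# RFC 1918 ranges, or a host equal to HOST_IP (treating HOST_IP as
# unreachable from outside is an assumption); returns 1 otherwise.
url_is_private() {
  host=${1#*://}      # drop the scheme
  host=${host%%/*}    # drop the path
  host=${host%%:*}    # drop the port
  if [ -n "${HOST_IP:-}" ] && [ "$host" = "$HOST_IP" ]; then
    return 0
  fi
  case "$host" in
    localhost|127.*|10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if url_is_private "$VIDEO_URL"; then
#     echo "VIDEO_URL is not reachable from a remote VLM endpoint."
#   fi
```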
The deploy may serve the VLM through either of two stacks. Both expose an OpenAI-compatible chat/completions API — pick whichever is live:
| Backend | Env vars | Typical host endpoint | Picked when |
|---|---|---|---|
| NIM Cosmos | VLM_BASE_URL, VLM_NAME, VLM_MODE, VLM_MODEL_TYPE | ${VLM_BASE_URL}/v1 (no trailing /v1 on the env var; the agent appends it) | VLM_MODEL_TYPE != rtvi and VLM_MODE ∈ {local, local_shared, remote} and VLM_BASE_URL is non-empty |
| RT-VLM Cosmos | RTVI_VLM_BASE_URL, RTVI_VLM_MODEL_TO_USE, VLM_MODEL_TYPE | ${RTVI_VLM_BASE_URL}/v1 — if unset, derive from ${HOST_IP} (http://${HOST_IP}:8018/v1 for alerts, http://${HOST_IP}:30082/v1 for base) | VLM_MODEL_TYPE = rtvi, or VLM_MODE=none, or VLM_BASE_URL empty; also the only path for warehouse |
Read the live values off the running agent container — do not guess:
docker exec vss-agent sh -lc '
for k in HOST_IP VLM_MODE VLM_MODEL_TYPE VLM_BASE_URL VLM_NAME RTVI_VLM_BASE_URL RTVI_VLM_MODEL_TO_USE; do
v="$(printenv "$k")"
[ -n "$v" ] && printf "%s=%s\n" "$k" "$v"
done
'
Do not require RTVI_VLM_ENDPOINT from vss-agent env; several profiles do not inject it.
Selection rule:
if [ "${VLM_MODEL_TYPE:-}" = "rtvi" ]; then
VLM_BACKEND="rtvlm"
VLM_ENDPOINT="${RTVI_VLM_BASE_URL:+${RTVI_VLM_BASE_URL%/}/v1}"
[ -z "${VLM_ENDPOINT}" ] && VLM_ENDPOINT="http://${HOST_IP}:8018/v1" # alerts default
VLM_MODEL="${RTVI_VLM_MODEL_TO_USE}"
elif [ -n "${VLM_BASE_URL}" ] && [ "${VLM_MODE}" != "none" ]; then
VLM_BACKEND="nim_cosmos"
VLM_ENDPOINT="${VLM_BASE_URL%/}/v1"
VLM_MODEL="${VLM_NAME}"
else
VLM_BACKEND="rtvlm"
VLM_ENDPOINT="${RTVI_VLM_BASE_URL:+${RTVI_VLM_BASE_URL%/}/v1}"
[ -z "${VLM_ENDPOINT}" ] && VLM_ENDPOINT="http://${HOST_IP}:30082/v1" # base default
VLM_MODEL="${RTVI_VLM_MODEL_TO_USE}"
fi
Probe /v1/models before sending a chat request to confirm the chosen endpoint is alive and the model is loaded:
curl -sf --max-time 5 "${VLM_ENDPOINT}/models" | jq -r '.data[].id'
If the probe fails or the listed ids don't include ${VLM_MODEL}, fall back to the other backend (or surface the error — never silently pick a model that isn't on the server).
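The "is the model actually on the server" check can be made explicit with an exact-match test over the listed ids. `model_listed` is a hypothetical helper name; it reads the newline-separated ids that the `jq -r '.data[].id'` probe prints.

```shell
# Hypothetical helper: succeed only if the wanted model id appears, as an
# exact whole-line match, in a newline-separated list of ids on stdin.
# An empty model name never matches.
model_listed() {
  [ -n "$1" ] || return 1
  grep -Fxq -- "$1"
}

# Usage (needs network access, so shown as a comment):
#   if ! curl -sf --max-time 5 "${VLM_ENDPOINT}/models" |
#        jq -r '.data[].id' | model_listed "${VLM_MODEL}"; then
#     echo "Model ${VLM_MODEL} is not served at ${VLM_ENDPOINT}"
#   fi
```

Using `grep -Fx` avoids treating a prefix of a model id as a match.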
Use the OpenAI-compatible chat/completions endpoint with a video_url content block — the same payload shape and multimodal settings video_understanding builds in src/vss_agents/tools/video_understanding.py (_build_vlm_messages + the Cosmos base_vlm.bind(...) call).
The frame sampling and visual-token (pixel) budget must mirror the live video_understanding settings for the active profile. Send mm_processor_kwargs and media_io_kwargs so the direct call uses the same frame sampling and pixel budget as the in-agent video_understanding tool — omitting them lets the VLM apply its own defaults, so the output diverges from the agent path.
PROMPT='Describe in detail what happens in the video, with timestamps (start–end in seconds from clip start) for each segment or event. Cover scenes, objects, people, vehicles, and notable actions.'
# Reasoning is OFF by default — matches the base-profile video_understanding config (`reasoning: false`).
# video_understanding.py uses config.reasoning unless the caller overrides it, so default to non-reasoning.
# Append the Cosmos Reason 2 reasoning suffix ONLY when the user explicitly asks for reasoning
# (drop it for non-cosmos-reason2 VLMs). With reasoning off, the response has no <think> block.
if [ "${REASONING:-false}" = "true" ]; then
PROMPT="${PROMPT}
Answer the question using the following format:
<think>
Your reasoning.
</think>
Write your final answer immediately after the </think> tag."
fi
# If Step 3 is run standalone, derive missing backend from current env/model.
[ -z "${VLM_BACKEND:-}" ] && {
if [ "${VLM_MODEL_TYPE:-}" = "rtvi" ]; then
VLM_BACKEND="rtvlm"
elif [[ "${VLM_MODEL:-}" == nvidia/cosmos* ]]; then
VLM_BACKEND="nim_cosmos"
else
VLM_BACKEND="rtvlm"
fi
}
# Multimodal settings — resolve from the live agent config file path, not hardcoded candidates.
CFG_JSON=$(
  docker exec vss-agent python3 -c '
import json, os, yaml
p = os.getenv("VSS_AGENT_CONFIG_FILE")
if not p:
    raise SystemExit("VSS_AGENT_CONFIG_FILE is not set in vss-agent")
# Relative config paths are resolved against /vss-agent.
if not os.path.isabs(p):
    p = os.path.join("/vss-agent", p.lstrip("./"))
with open(p, encoding="utf-8") as f:
    cfg = yaml.safe_load(f) or {}
vu = (cfg.get("functions", {}) or {}).get("video_understanding", {}) or {}
print(json.dumps({
    "max_fps": int(vu.get("max_fps", 2)),
    "max_frames": int(vu.get("max_frames", 30)),
    "min_pixels": int(vu.get("min_pixels", 3136)),
    "max_pixels": int(vu.get("max_pixels", 8388608)),
}))
'
)
[ -n "${CFG_JSON}" ] || { echo "Failed to read video_understanding config from vss-agent"; exit 1; }
jq -e . >/dev/null <<< "${CFG_JSON}" || { echo "Invalid config JSON from vss-agent"; exit 1; }
MAX_FPS="$(jq -r '.max_fps' <<< "${CFG_JSON}")"
MAX_FRAMES="$(jq -r '.max_frames' <<< "${CFG_JSON}")"
MIN_PIXELS="$(jq -r '.min_pixels' <<< "${CFG_JSON}")"
MAX_PIXELS="$(jq -r '.max_pixels' <<< "${CFG_JSON}")"
# num_frames = min(int(clip_seconds) * max_fps, max_frames), min 1 — matches video_understanding.py.
# clip_seconds (Step 1 endTime-startTime) may be fractional; truncate to integer seconds — bash $((...))
# is integer-only and errors on "15.0"/"1.5". Default 15s -> caps at MAX_FRAMES.
CLIP_SECONDS=$(awk -v s="${CLIP_SECONDS:-15}" 'BEGIN{printf "%d", s}')
NUM_FRAMES=$(( CLIP_SECONDS * MAX_FPS ))
[ "$NUM_FRAMES" -gt "$MAX_FRAMES" ] && NUM_FRAMES=$MAX_FRAMES
[ "$NUM_FRAMES" -lt 1 ] && NUM_FRAMES=1
# Only apply Cosmos mm/media kwargs on the NIM Cosmos path.
# RT-VLM mode uses its own server-side preprocessing and should not receive these kwargs.
MM_KWARGS=""
if [ "${VLM_BACKEND}" = "nim_cosmos" ]; then
case "$VLM_MODEL" in
*cosmos-reason2*) MM_KWARGS=", \"mm_processor_kwargs\": {\"size\": {\"shortest_edge\": ${MIN_PIXELS}, \"longest_edge\": ${MAX_PIXELS}}}, \"media_io_kwargs\": {\"video\": {\"num_frames\": ${NUM_FRAMES}}}" ;;
*cosmos*) MM_KWARGS=", \"mm_processor_kwargs\": {\"videos_kwargs\": {\"min_pixels\": ${MIN_PIXELS}, \"max_pixels\": ${MAX_PIXELS}}}, \"media_io_kwargs\": {\"video\": {\"num_frames\": ${NUM_FRAMES}}}" ;;
*) MM_KWARGS="" ;;
esac
fi
curl -s --connect-timeout 5 --max-time 120 -X POST "${VLM_ENDPOINT}/chat/completions" \
-H "Content-Type: application/json" \
-d @- <<EOF | jq -r '.choices[0].message.content'
{
"model": $(jq -n --arg v "${VLM_MODEL}" '$v'),
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": $(jq -n --arg v "${PROMPT}" '$v')},
{"type": "video_url", "video_url": {"url": $(jq -n --arg v "${VIDEO_URL}" '$v')}}
]
}
],
"max_tokens": 1024,
"temperature": 0.0${MM_KWARGS}
}
EOF
The kwargs block is backend-aware: on nim_cosmos, Reason2 variants (nvidia/cosmos-reason2*) use mm_processor_kwargs.size{shortest_edge,longest_edge} and other NIM Cosmos variants (nvidia/cosmos*) use mm_processor_kwargs.videos_kwargs{min_pixels,max_pixels}; both also send media_io_kwargs.video.num_frames. On rtvlm, no Cosmos kwargs are sent.
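The frame-count rule used for media_io_kwargs.video.num_frames, min(int(clip_seconds) * max_fps, max_frames) with a floor of 1, can be checked in isolation. `num_frames` is a hypothetical helper name; the arithmetic follows the comments in the script above.

```shell
# Sketch of the frame-count rule: truncate clip_seconds to an integer,
# multiply by max_fps, cap at max_frames, and never return less than 1.
# Arguments: clip_seconds max_fps max_frames
num_frames() {
  awk -v s="$1" -v fps="$2" -v cap="$3" 'BEGIN {
    n = int(s) * fps
    if (n > cap) n = cap
    if (n < 1) n = 1
    print n
  }'
}

num_frames 15 2 30    # → 30 (15 * 2 hits the cap)
num_frames 1.5 2 30   # → 2  (1.5 truncates to 1 second)
num_frames 0.4 2 30   # → 1  (floor of 1 frame)
```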
If the VLM returns a <think>…</think> block (Cosmos Reason reasoning mode), keep only the text after </think> as the report body.
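Dropping the reasoning block could look like the sketch below. `strip_think` is a hypothetical helper; it keeps the text after the last `</think>` tag (a design choice for responses that contain more than one) and passes through input that has no tag.

```shell
# Hypothetical helper: read the VLM response on stdin, keep only the text
# after the last </think> tag, and drop leading whitespace. Input without
# a </think> tag is returned unchanged.
strip_think() {
  awk '
    { buf = buf $0 "\n" }
    END {
      tag = "</think>"
      while ((i = index(buf, tag)) > 0) buf = substr(buf, i + length(tag))
      sub(/^[ \t\n]+/, "", buf)
      printf "%s", buf
    }'
}

# Usage:
#   REPORT_BODY=$(printf '%s\n' "$VLM_RESPONSE" | strip_think)
```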
Copy assets/video-analysis-report.md, fill every placeholder, and return the rendered markdown to the user. Keep the source asset unchanged. Before rendering, verify BROWSER_CLIP_URL is set and non-empty, then replace <BROWSER_CLIP_URL> with that exact value in the Clip URL row. Never leave the placeholder in the output, never include template instructions in a filled cell, and never use the raw HOST_IP:30888 URL.
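The placeholder substitution can be done literally, so that characters such as `&` in the clip URL's query string are not reinterpreted. `fill_clip_url` is a hypothetical helper; it assumes the URL contains no backslashes, because `awk -v` processes escape sequences.

```shell
# Hypothetical helper: replace every <BROWSER_CLIP_URL> placeholder in the
# template text on stdin with the given URL, using literal string
# operations. Fails without output when the URL is empty.
fill_clip_url() {
  url=$1
  [ -n "$url" ] || return 1
  awk -v ph='<BROWSER_CLIP_URL>' -v url="$url" '{
    out = ""
    while ((i = index($0, ph)) > 0) {
      out = out substr($0, 1, i - 1) url
      $0 = substr($0, i + length(ph))
    }
    print out $0
  }'
}

# Usage:
#   fill_clip_url "$BROWSER_CLIP_URL" < assets/video-analysis-report.md > report.md
#   ! grep -q '<BROWSER_CLIP_URL>' report.md   # no placeholder may remain
```

Writing to a separate output file leaves the source asset unchanged.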
- start_time / end_time must be ISO 8601 UTC (YYYY-MM-DDTHH:MM:SS.sssZ). Resolve relative phrases ("last hour", "today") against the current host clock.
- If the request names a sensor, set source + source_type=sensor. Otherwise leave both unset for an all-sensors query.

Hand off to /vss-query-analytics (initialize → tools/call) with:
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "video_analytics__get_incidents",
"arguments": {
"source": "<sensor-id-or-omit>",
"source_type": "sensor",
"start_time": "<ISO>",
"end_time": "<ISO>",
"max_count": 100,
"includes": ["objectIds", "info"]
}
},
"id": 1
}
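Before sending the request, the start_time / end_time values can be checked against the required YYYY-MM-DDTHH:MM:SS.sssZ shape. `is_iso_utc_ms` is a hypothetical validator; it checks the pattern only, not whether the date exists. The `date -d` form in the usage comment assumes GNU date on the host.

```shell
# Hypothetical validator for the YYYY-MM-DDTHH:MM:SS.sssZ shape.
# Pattern check only: it does not reject impossible dates such as month 13.
is_iso_utc_ms() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]{3}Z$'
}

# Usage, resolving "last hour" against the host clock (GNU date assumed):
#   end_time=$(date -u +%Y-%m-%dT%H:%M:%S.000Z)
#   start_time=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S.000Z)
#   is_iso_utc_ms "$start_time" && is_iso_utc_ms "$end_time" || echo "bad time range"
```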
Read-only boundary (mandatory):
For each incident keep: id, sensorId, timestamp, end, category, place.name, info.verdict, info.reasoning, objectIds, and the clip URL (commonly info.clip_url, clip_url, or whichever clip-pointer field the response carries). Apply the $VSS_PUBLIC_HOST:$VSS_PUBLIC_PORT rewrite (see Browser-playable clip URL above) to every clip URL before pasting it into the report — the raw value is a HOST_IP:30888 URL the user's browser cannot reach.
Copy assets/incident-range-report.md, then group by sensor (or by category if no sensor scope), tally verdicts, and list each incident with timestamp / category / verdict / reasoning. Keep the source asset unchanged. Every incident clip value must be a rewritten browser-playable URL; omit the clip line when the incident carries no clip URL. Never include template instructions in a filled cell.
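The verdict tally for the Confirmed / Rejected / Unverified row might be computed as below. This is a sketch under loud assumptions: `tally_verdicts` is a hypothetical helper, the literal verdict strings "confirmed" and "rejected" are assumed rather than taken from the analytics API, and anything else (including an empty verdict) is counted as unverified.

```shell
# Hypothetical tally over one verdict per line on stdin.
# ASSUMPTION: verdicts are the strings "confirmed" / "rejected" (any case);
# every other value, including an empty line, counts as unverified.
tally_verdicts() {
  awk '
    { v = tolower($0) }
    v == "confirmed" { c++; next }
    v == "rejected"  { r++; next }
    { u++ }
    END {
      printf "total=%d confirmed=%d rejected=%d unverified=%d\n", NR, c, r, u
    }'
}

# Usage: feed it the info.verdict value of each kept incident, one per line.
#   printf 'confirmed\nrejected\n\n' | tally_verdicts
```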
If get_incidents returns zero results, STOP and return exactly a one-line empty-range statement naming the requested range and scope. Do not render the full Incident Range template, do not invent incidents, do not seed test data, and do not fall back to Mode A.
- If any curl, VLM call, or /vss-query-analytics request fails, stop the workflow and report the failing endpoint, HTTP status or command error, and the next useful recovery step. Do not fabricate a report from partial or missing data.
- Treat missing optional fields (such as info.reasoning, objectIds, clip URL) as omissions in the report, but treat missing id, timestamp, or category as a data-quality error that should be reported.

Related skills:
- /vss-manage-video-io-storage: sensor list, timelines, and clip URL for Mode A Step 1.
- /vss-query-analytics: incident retrieval (and verdict / reasoning enrichment) for Mode B Step 2.
- /vss-ask-video: ad-hoc VLM Q&A on a single clip (not a structured report).
- /vss-summarize-video: used by Mode A to produce the summary body when the lvs profile is deployed; the report template (Step 4) is still filled here.