vss-deploy-dense-captioning
Use this skill when deploying standalone RT-VLM dense captioning or calling its REST API (uploads, captions, streams, chat-completions, Kafka). Not for VSS profile deploy or video-search ingestion.
| name | vss-deploy-dense-captioning |
| description | Use this skill when deploying standalone RT-VLM dense captioning or calling its REST API (uploads, captions, streams, chat-completions, Kafka). Not for VSS profile deploy or video-search ingestion. |
| license | Apache-2.0 |
| metadata | {"version":"3.2.0","github-url":"https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization","tags":"nvidia blueprint operational deployment"} |
Stand up the RT-VLM dense-captioning microservice on its own and exercise every endpoint it exposes (file upload, generate_captions, stream add/delete, chat-completions, Kafka topics).
For standalone RT-VLM deployment:
- $NGC_CLI_API_KEY for docker login nvcr.io, image pulls, and local NGC model/artifact downloads.
- curl, jq, and any writable working directory for the standalone compose copy.

For API calls against an existing service:
- $BASE_URL.
- $RTVI_VLM_API_KEY or $NGC_CLI_API_KEY, depending on how the service was configured.

For full VSS profile deployment:
- Use ../vss-deploy-profile/SKILL.md; this skill does not deploy full VSS profiles.

Follow the routing tables and step-by-step workflows below. Each section that ends in workflow, quick start, or flow is intended to be executed top-to-bottom. Detailed reference material lives in references/; execute the documented workflows directly unless a future revision names a concrete helper.
Worked end-to-end examples are kept under evals/ (each *.json manifest contains a runnable scenario) and inline in the per-workflow curl blocks below. Run a Tier-3 evaluation with nv-base validate <this-skill-dir> --agent-eval to replay them.
- Keep NGC_CLI_API_KEY, RTVI_VLM_API_KEY, and .env files out of git and out of logs; do not echo credential values or include them in final responses.
- Docker access and sudo are effectively root-level privileges. Use the non-interactive sudo -n guard in the deploy reference and stop for host-owner action when passwordless sudo is unavailable.
- /docs or /health; redeploy via vss-deploy-profile or the matching vss-deploy-* skill.
- NGC_CLI_API_KEY. Solution: docker login nvcr.io and re-export the key before retrying.
- docker compose down.

RT-VLM is NVIDIA's real-time vision-language microservice: decode video (file or
RTSP), segment it into chunks, run a VLM (cosmos-reason1, cosmos-reason2, or any
OpenAI-compatible model), stream dense captions back over SSE/HTTP, and publish
captions, incident alerts, and errors to Kafka. Use this skill to deploy the
standalone RT-VLM service when a full VSS profile is not already running, then call
its /v1/... API for caption generation, file upload, live-stream management, health
checks, NIM-compatible chat completions, or Prometheus metrics. API reference:
https://docs.nvidia.com/vss/latest/real-time-vlm-api.html.
If the user asks to deploy a full VSS profile, use
../vss-deploy-profile/SKILL.md. That skill
owns profile routing, generated.env, resolved.yml, multi-service sizing, and
full-stack deploy/teardown.
If the user asks for standalone RT-VLM dense captioning, or no VSS profile is
already running, use the standalone RT-VLM flow in
references/deploy-rt-vlm-service.md
before calling the API. This follows the same compose-centric pattern as
vss-deploy-profile: gather context, run preflights, work from a local copy,
dry-run with docker compose config, review, deploy, then wait for health.
Always follow this sequence. Never skip the dry-run.
# 1. Copy deploy/docker/services/rtvi/rtvi-vlm/rtvi-vlm-docker-compose.yml
# into any writable standalone working directory.
# 2. Derive RTVI_VLM_IMAGE_TAG from that compose copy.
# 3. Strip the standalone-only dangling depends_on block from the copy.
# 4. Create a gitignored .env with the required RT-VLM values.
# 5. Prepare host bind paths such as $VSS_DATA_DIR/data_log/vst/clip_storage.
# Use `sudo -n` for ownership fixes; if passwordless sudo is unavailable,
# stop and ask the host owner to run the printed command manually.
# 6. docker compose --env-file .env -f rtvi-vlm-docker-compose.yml config --quiet
# 7. docker pull the exact RT-VLM image tag.
# 8. docker compose ... up -d rtvi-vlm, wait for ready, then smoke test.
Run preflights before any pull or up; stop and fix failures here before
debugging RT-VLM itself:
nvidia-smi --query-gpu=index,name --format=csv,noheader
nvidia-container-cli info
docker compose version
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
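The four preflight commands above can be wrapped so that the first failure stops the sequence with a clear label. This is a minimal sketch: `preflight` and `run_preflights` are hypothetical helper names, and the labels are illustrative only.

```shell
# Sketch only: preflight and run_preflights are hypothetical helpers.

# Run one check quietly; print PASS/FAIL with its label and return its status.
preflight() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS  %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label" >&2
    return 1
  fi
}

# Run the documented checks in order and stop at the first failure.
run_preflights() {
  preflight "GPU visible to the driver" \
    nvidia-smi --query-gpu=index,name --format=csv,noheader &&
  preflight "NVIDIA container toolkit" \
    nvidia-container-cli info &&
  preflight "Docker Compose v2" \
    docker compose version &&
  preflight "GPU visible inside a container" \
    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
}
```

Call `run_preflights || exit 1` before any pull or up. To see the underlying error of a failed check, rerun that one command directly.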
For standalone single-file deployments, do not run the raw
deploy/docker/services/rtvi/rtvi-vlm/rtvi-vlm-docker-compose.yml directly: it
contains depends_on references to sibling VLM/NIM services that are only
defined in the full VSS/met-blueprints compose project. The standalone reference
shows how to copy the compose file, derive the current image tag from it, strip
the depends_on block, and validate the result before up.
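As an illustration of the strip step, the sketch below removes every `depends_on:` block from a compose copy with awk. It is not the procedure from references/deploy-rt-vlm-service.md, which remains authoritative; `strip_depends_on` is a hypothetical name, and the sketch assumes space-indented YAML in which the block's children are indented deeper than the `depends_on:` key.

```shell
# Sketch only: strip_depends_on is a hypothetical helper.
# Prints the file given as $1 with every depends_on block removed.
# A block is the `depends_on:` line plus the following lines that are
# indented deeper than it; blank lines directly after it are dropped too.
strip_depends_on() {
  awk '
    # Number of leading spaces on a line.
    function indent(s,   t) { t = s; sub(/^ +/, "", t); return length(s) - length(t) }

    skipping {
      if ($0 ~ /^[[:space:]]*$/ || indent($0) > base) next
      skipping = 0            # first line at or above the key indent ends the block
    }
    /^ *depends_on:/ { base = indent($0); skipping = 1; next }
    { print }
  ' "$1"
}
```

A typical use writes to a new file and then validates it, for example `strip_depends_on rtvi-vlm-docker-compose.yml > standalone.yml` followed by `docker compose --env-file .env -f standalone.yml config --quiet` (the output file name is illustrative). Review the diff against the original before deploying.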
For agent-driven validation, never let sudo prompt interactively. Before any
privileged ownership or Docker operation, use the non-interactive guard in
references/deploy-rt-vlm-service.md:
prefer plain docker; otherwise use sudo -n docker; if sudo -n fails, stop
with the exact manual command for the host owner instead of retrying with
interactive sudo or weakening permissions.
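The decision order described above (plain docker, then `sudo -n docker`, then stop) might look like the following. This is an assumption-based sketch, not the guard from the deploy reference; `run_docker` is a hypothetical name.

```shell
# Sketch only: run_docker is a hypothetical helper, not the reference guard.
# Runs `docker <args>` without ever triggering an interactive sudo prompt.
run_docker() {
  if docker info >/dev/null 2>&1; then
    docker "$@"                       # the current user can reach the daemon
  elif sudo -n docker info >/dev/null 2>&1; then
    sudo -n docker "$@"               # passwordless sudo is available
  else
    # Neither path works non-interactively: stop and hand over the command.
    echo "Passwordless sudo is unavailable. Ask the host owner to run:" >&2
    echo "  sudo docker $*" >&2
    return 1
  fi
}
```

For example, `run_docker compose --env-file .env -f rtvi-vlm-docker-compose.yml config --quiet` either runs the dry-run or prints the exact command for the host owner and returns non-zero.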
If docker pull fails with a containerd snapshotter/unpack error on Docker 28+,
apply the /etc/docker/daemon.json containerd-snapshotter=false fix in the
standalone reference before retrying.
Minimum standalone .env values:
| Host env var | Required when | Purpose |
|---|---|---|
| NGC_CLI_API_KEY | Standalone deploy path | NGC registry image pull and NGC model/artifact download |
| RTVI_VLM_API_KEY or NGC_CLI_API_KEY | Authenticated API calls | RT-VLM bearer auth after the service is running |
| RTVI_VLM_PORT | Always | Host API port mapped to container 8000 |
| HOST_IP | Always | Kafka bootstrap host (${HOST_IP}:9092) |
| VSS_DATA_DIR | Always | Required clip-storage bind mount |
| RTVI_VLM_MODEL_TO_USE | Always for standalone | Backend selector; use cosmos-reason2 for the default local model or openai-compat for a remote/sibling endpoint |
| RTVI_VLM_MODEL_PATH | Local self-hosted model | Source-backed Cosmos Reason 2 path: ngc:nim/nvidia/cosmos-reason2-8b:hf-1208 |
| RTVI_VLM_ENDPOINT | RTVI_VLM_MODEL_TO_USE=openai-compat | Remote/sibling OpenAI-compatible VLM endpoint |
| VLM_NAME | RTVI_VLM_MODEL_TO_USE=openai-compat | Model/deployment name exposed by that endpoint |
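One way to turn the table above into a .env file without echoing any credential is sketched below. `write_rtvlm_env` is a hypothetical helper; it treats RTVI_VLM_MODEL_PATH and RTVI_VLM_API_KEY as optional, which is an assumption, and it writes plain NAME=value lines, so values containing spaces or quotes would need extra handling.

```shell
# Sketch only: write_rtvlm_env is a hypothetical helper.
# Writes NAME=value lines for variables already set in the current shell.
write_rtvlm_env() {
  local out="${1:-.env}" name value dir
  local required="NGC_CLI_API_KEY RTVI_VLM_PORT HOST_IP VSS_DATA_DIR RTVI_VLM_MODEL_TO_USE"
  local all="$required RTVI_VLM_API_KEY RTVI_VLM_MODEL_PATH RTVI_VLM_ENDPOINT VLM_NAME"

  # The openai-compat backend additionally needs an endpoint and a model name.
  if [ "${RTVI_VLM_MODEL_TO_USE:-}" = "openai-compat" ]; then
    required="$required RTVI_VLM_ENDPOINT VLM_NAME"
  fi

  # Validate first so that a partial file is never written.
  for name in $required; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2   # name only, never a value
      return 1
    fi
  done

  (
    umask 077    # owner-only permissions for a newly created file
    for name in $all; do
      eval "value=\${$name:-}"
      if [ -n "$value" ]; then printf '%s=%s\n' "$name" "$value"; fi
    done > "$out"
  ) || return 1

  # Keep the file out of git by listing it in the sibling .gitignore.
  dir=$(dirname "$out")
  grep -qxF "$(basename "$out")" "$dir/.gitignore" 2>/dev/null ||
    basename "$out" >> "$dir/.gitignore"
}
```

Export the values in the shell first, then run `write_rtvlm_env .env` inside the standalone working directory. The helper reports only variable names on failure, in line with the rule about keeping credential values out of logs.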
export BASE_URL="http://localhost:${RTVI_VLM_PORT:-8018}" # host-side RT-VLM port
export API_KEY="${NGC_CLI_API_KEY:-${RTVI_VLM_API_KEY:-}}" # bearer token used by host-side curl commands
: "${API_KEY:?Set NGC_CLI_API_KEY or RTVI_VLM_API_KEY before calling authenticated endpoints}"
Every request below uses Authorization: Bearer $API_KEY. Health endpoints
(/v1/health/*, /v1/ready, /v1/live, /v1/startup) typically work without auth.
Smoke test before use:
curl -fsS "$BASE_URL/v1/health/ready"
MODEL_ID="$(curl -fsS "$BASE_URL/v1/models" -H "Authorization: Bearer $API_KEY" | jq -r '.data[0].id // .id')"
curl -fsS "$BASE_URL/openapi.json" | jq -r '.paths | keys[]' | sort
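Because the service can take a while to become ready after `up -d`, the readiness probe is usually polled rather than called once. The loop below is a sketch: `wait_ready` is a hypothetical helper, and the 600-second default timeout and 5-second interval are assumptions, not documented values.

```shell
# Sketch only: wait_ready is a hypothetical helper; defaults are assumptions.
# Usage: wait_ready BASE_URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_ready() {
  local base_url="$1" timeout="${2:-600}" interval="${3:-5}" waited=0
  # curl -f makes HTTP errors (for example 503 during startup) count as failures.
  until curl -fsS -o /dev/null "$base_url/v1/health/ready" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "RT-VLM not ready after ${timeout}s; inspect: docker logs vss-rtvi-vlm" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Run `wait_ready "$BASE_URL" || exit 1` before the smoke test so that a slow startup is not mistaken for a broken deployment.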
When a task or eval names RTSP_SAMPLE_URL, treat that exact environment
variable as a required input. Verify it is set and non-empty before probing or
registering any stream; if it is missing, stop with a clear failure message. Do
not derive a substitute from NvStreamer, VIOS, sample-data bundles, or any other
fallback, because that validates a different stream than the caller requested.
: "${RTSP_SAMPLE_URL:?Set RTSP_SAMPLE_URL to a reachable RTSP sample stream before RTSP validation}"
case "$RTSP_SAMPLE_URL" in
  rtsp://*) ;;
  *) echo "RTSP_SAMPLE_URL must be an rtsp:// URL, got: $RTSP_SAMPLE_URL" >&2; exit 1 ;;
esac
if command -v ffprobe >/dev/null 2>&1; then
  ffprobe -v error -rtsp_transport tcp \
    -select_streams v:0 -show_entries stream=codec_type \
    -of csv=p=0 "$RTSP_SAMPLE_URL" | grep -qx video
elif command -v gst-discoverer-1.0 >/dev/null 2>&1; then
  gst-discoverer-1.0 "$RTSP_SAMPLE_URL" | grep -qi 'video'
else
  echo "Install ffprobe or gst-discoverer-1.0 before RTSP validation." >&2
  exit 1
fi
# 1. Upload the video, capture its file id
FILE_ID=$(curl -fsS -X POST "$BASE_URL/v1/files" \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@/path/to/warehouse.mp4" \
  -F "purpose=vision" \
  -F "media_type=video" | jq -r '.id')

# 2. Generate captions + alerts (SSE stream of chunked responses)
curl -N -X POST "$BASE_URL/v1/generate_captions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"id\": \"$FILE_ID\",
    \"prompt\": \"Write a concise dense caption for each 10-second segment of this warehouse video.\",
    \"model\": \"$MODEL_ID\",
    \"chunk_duration\": 10,
    \"stream\": true
  }"
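With stream set to true, the response arrives as server-sent events, so a small filter helps when piping it into other tools. The sketch below keeps only the `data:` payloads. `sse_data` is a hypothetical helper, and the `[DONE]` sentinel is an assumption borrowed from OpenAI-style streams; if the service never sends it, the filter simply runs until the connection closes.

```shell
# Sketch only: sse_data is a hypothetical helper.
# Reads an SSE stream on stdin and prints one data payload per line.
sse_data() {
  local line cr
  cr=$(printf '\r')
  while IFS= read -r line || [ -n "$line" ]; do
    line=${line%"$cr"}                       # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]"|"data:[DONE]") break ;; # assumed end-of-stream sentinel
      "data: "*) printf '%s\n' "${line#data: }" ;;
      "data:"*)  printf '%s\n' "${line#data:}" ;;
    esac                                     # comments and blank lines are skipped
  done
}
```

Appending `| sse_data | jq -c .` to the `curl -N` caption request prints each chunk as compact JSON, assuming every payload is a JSON object.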
Use the live OpenAPI as the source of truth before calling optional endpoints:
curl -fsS "$BASE_URL/openapi.json" | jq -r '.paths | keys[]' | sort
Core paths for VSS 3.2 are:
- POST /v1/files for multipart media upload; pass the returned file id into caption generation and delete the file when finished.
- POST /v1/generate_captions for file or stream captioning. Use the exact model id returned by GET /v1/models; aliases such as cosmos-reason2 are backend selectors, not request model ids.
- POST /v1/streams/add, GET /v1/streams/get-stream-info, and DELETE /v1/streams/delete/{stream_id} for RTSP lifecycle. Parse stream ids from results[0].id.
- POST /v1/chat/completions for OpenAI-compatible text and multimodal calls. Current 26.05 builds return HTTP 400 for text-only /v1/completions; treat that as expected when validating legacy behavior.
- GET /v1/health/ready, /v1/models, /v1/assets/stats, and /v1/metrics for service probes. Do not assume /v1/license exists unless OpenAPI lists it.

Detailed endpoint schemas, response shapes, CV-style singular stream endpoints, and 26.05 compatibility notes live in references/api-surface-26.05.md.
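The RTSP lifecycle paths listed above can be wrapped in two small helpers. This is a sketch under stated assumptions: `add_stream` and `delete_stream` are hypothetical names, and the request body is read from a caller-supplied JSON file because its schema is not given here (take it from the live OpenAPI or references/api-surface-26.05.md). Both helpers expect BASE_URL and API_KEY to be set as in the export lines earlier.

```shell
# Sketch only: add_stream and delete_stream are hypothetical helpers.

# $1: path to the JSON request body. Prints the new stream id.
add_stream() {
  curl -fsS -X POST "$BASE_URL/v1/streams/add" \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    --data @"$1" | jq -er '.results[0].id'   # -e: non-zero exit when the id is missing
}

# $1: stream id returned by add_stream.
delete_stream() {
  curl -fsS -X DELETE "$BASE_URL/v1/streams/delete/$1" \
    -H "Authorization: Bearer $API_KEY"
}
```

A caller might run `STREAM_ID=$(add_stream body.json) || exit 1` and then `trap 'delete_stream "$STREAM_ID"' EXIT`, so the stream is unregistered even when a later captioning step fails. The file name body.json is illustrative.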
- Upload with POST /v1/files, call /v1/generate_captions with the returned file id, use stream=true for SSE, then delete the file to release storage.
- When a task names RTSP_SAMPLE_URL, use that exact URL and run the RTSP Sample Stream Guard before registration. Do not derive a replacement stream from NvStreamer or VIOS when RTSP_SAMPLE_URL is empty; fail fast instead. Require an actual video stream/caps entry before registration; add the stream, caption it, then unregister it.
- Anomaly Detected: Yes/No line. Kafka publication is server-side config, additive to HTTP responses, and documented in references/kafka-workflows.md.
- Check the vss-rtvi-vlm environment for topic names. In a full VSS alerts real-time profile, use the existing VSS Kafka container mdx-kafka for CLI checks and final incident-consumer commands. For standalone validation, use a broker that advertises ${HOST_IP}:9092; never stop or replace a pre-existing broker without user confirmation.

Common causes: 400 for invalid request shape or model id, 401/403 for missing or wrong bearer token, 404 for deleted files/streams or unsupported endpoints, 413 for oversized uploads, 422 for schema validation, 429 for too much concurrency, 500 for inference/runtime failures, and 503 while startup is still in progress. Inspect docker logs vss-rtvi-vlm for service-side failures.
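The status-code list above maps directly onto a lookup that scripts can print next to a failed call. A minimal sketch, with `explain_status` as a hypothetical helper name:

```shell
# Sketch only: explain_status is a hypothetical helper that restates the
# common causes listed above for a given HTTP status code.
explain_status() {
  case "$1" in
    2??)     echo "ok" ;;
    400)     echo "invalid request shape or model id" ;;
    401|403) echo "missing or wrong bearer token" ;;
    404)     echo "deleted file/stream or unsupported endpoint" ;;
    413)     echo "oversized upload" ;;
    422)     echo "schema validation failed" ;;
    429)     echo "too much concurrency" ;;
    500)     echo "inference/runtime failure; inspect docker logs vss-rtvi-vlm" ;;
    503)     echo "startup still in progress" ;;
    *)       echo "unlisted status $1" ;;
  esac
}

# Example (not executed here): capture the status code separately from the body.
#   status=$(curl -sS -o response.json -w '%{http_code}' \
#     "$BASE_URL/v1/models" -H "Authorization: Bearer $API_KEY")
#   echo "$status: $(explain_status "$status")"
```

Dropping `-f` in the example is deliberate: with `-f`, curl discards the error body, which is often the most useful part of a 400 or 422 response.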