
multimodal-llm

Updated June 20, 2026 at 10:50

Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling v3, Sora 2, Veo 3.1 std/lite/fast, Runway Gen-4.5 via `gen4_turbo`), or building multimodal AI pipelines.

Source

yonatangross

yonatangross/orchestkit

name	multimodal-llm
license	MIT
compatibility	Claude Code 2.1.183+.
author	OrchestKit
version	2.1.1
description	Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling v3, Sora 2, Veo 3.1 std/lite/fast, Runway Gen-4.5 via `gen4_turbo`), or building multimodal AI pipelines.
tags	["vision","audio","video","multimodal","image","speech","transcription","tts","kling","sora","veo","video-generation"]
user-invocable	false
disable-model-invocation	true
context	fork
complexity	high
persuasion-type	reference
effort	high
metadata	{"category":"mcp-enhancement"}
allowed-tools	["Read","Glob","Grep","WebFetch","WebSearch"]

Multimodal LLM Patterns

Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling v3, Sora 2, Veo 3.1 std/lite/fast tiers, Runway Gen-4.5 via gen4_turbo).

Canonical model IDs (pinned against yonatan-hq/platform/apps/api/app/config.py):

Provider	Model IDs
Anthropic	`claude-opus-4-8` (recommended; 2,576 px budget, production default), `claude-opus-4-7`, `claude-opus-4-6`, `claude-sonnet-4-6`, `claude-haiku-4-5-20251001`. `claude-fable-5` is unavailable: Anthropic suspended access for all customers per a US export-control directive (2026-06-12); do not pin or recommend it
OpenAI	`gpt-5.5` (current flagship)
Google	`gemini-3.1-pro-preview` (flagship), `gemini-3.1-flash-lite-preview` (cost)
Veo	`veo-3.1-generate-preview` / `veo-3.1-lite-generate-preview` / `veo-3.1-fast-generate-preview`
Kling	`kling-v3` (`model_name` field in Kling API)
Runway	`gen4_turbo` (product label: Gen-4.5)

Quick Reference

Category	Rules	Impact	When to Use
Vision: Image Analysis	1	HIGH	Image captioning, VQA, multi-image comparison, object detection
Vision: Document Understanding	1	HIGH	OCR, chart/diagram analysis, PDF processing, table extraction
Vision: Model Selection	1	MEDIUM	Choosing provider, cost optimization, image size limits
Audio: Speech-to-Text	1	HIGH	Transcription, speaker diarization, long-form audio
Audio: Text-to-Speech	1	MEDIUM	Voice synthesis, expressive TTS, multi-speaker dialogue
Audio: Model Selection	1	MEDIUM	Real-time voice agents, provider comparison, pricing
Video: Model Selection	1	HIGH	Choosing video gen provider (Kling, Sora, Veo, Runway)
Video: API Patterns	1	HIGH	Async task polling, SDK integration, webhook callbacks
Video: Multi-Shot	1	HIGH	Storyboarding, character elements, scene consistency

Total: 9 rules across 3 categories (Vision, Audio, Video Generation)

Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.

Rule	File	Key Pattern
Image Analysis	`rules/vision-image-analysis.md`	Base64 encoding, multi-image, bounding boxes
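
The resize step can be sketched as a pure dimension calculation. This is a minimal sketch, not the rule file's implementation: `fit_within` is a hypothetical helper, the 2048 px default is taken from the Common Mistakes list in this document, and the actual pixel resampling is assumed to be done by whatever image library you already use.

```python
def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Return (width, height) scaled down so the longer side is <= max_side.

    Aspect ratio is preserved; images already within the limit are unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to the nearest pixel, but never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4096, 3072))  # (2048, 1536)
print(fit_within(800, 600))    # (800, 600)
```

Computing the target size first keeps the decision separate from the resampling call, so the same limit can be reused across providers with different caps.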

Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

Rule	File	Key Pattern
Document Vision	`rules/vision-document.md`	PDF page ranges, detail levels, OCR strategies

Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

Rule	File	Key Pattern
Vision Models	`rules/vision-models.md`	Provider comparison, token costs, image limits

Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

Rule	File	Key Pattern
Speech-to-Text	`rules/audio-speech-to-text.md`	Gemini long-form, GPT-4o-Transcribe, AssemblyAI features

Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

Rule	File	Key Pattern
Text-to-Speech	`rules/audio-text-to-speech.md`	Gemini TTS, voice config, auditory cues

Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

Rule	File	Key Pattern
Audio Models	`rules/audio-models.md`	Real-time voice comparison, STT benchmarks, pricing

Video: Model Selection

Choose the right video generation provider based on use case, duration, and budget.

Rule	File	Key Pattern
Video Models	`rules/video-generation-models.md`	Kling vs Sora vs Veo vs Runway, pricing, capabilities
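
As a rough illustration of budget-based selection, the sketch below compares the two approximate prices quoted in this document's Key Decisions table. The function names are hypothetical, and treating the Kling price as flat per video regardless of clip length is a simplifying assumption; check current provider pricing before relying on the numbers.

```python
# Approximate prices quoted in this document's Key Decisions table.
VEO_LITE_USD_PER_SECOND = 0.05       # veo-3.1-lite-generate-preview, ~$0.05/s
KLING_STANDARD_USD_PER_VIDEO = 0.20  # kling-v3 Standard, ~$0.20/video


def veo_lite_cost(duration_s: float, clips: int = 1) -> float:
    """Estimated USD cost of `clips` Veo lite clips of `duration_s` seconds each."""
    return round(VEO_LITE_USD_PER_SECOND * duration_s * clips, 2)


def kling_standard_cost(clips: int = 1) -> float:
    """Estimated USD cost of `clips` Kling Standard videos (flat per-video price assumed)."""
    return round(KLING_STANDARD_USD_PER_VIDEO * clips, 2)


def cheaper_option(duration_s: float) -> str:
    """Return the model ID with the lower estimated cost for one clip."""
    if veo_lite_cost(duration_s) < kling_standard_cost():
        return "veo-3.1-lite-generate-preview"
    return "kling-v3"


print(veo_lite_cost(8, clips=10))     # 4.0
print(kling_standard_cost(clips=10))  # 2.0
print(cheaper_option(3))              # veo-3.1-lite-generate-preview
```

Under these assumptions the per-second price wins only for clips shorter than about four seconds; cost is one input alongside the capability differences the rule file covers.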

Video: API Patterns

Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.

Rule	File	Key Pattern
API Integration	`rules/video-generation-patterns.md`	Kling REST, fal.ai SDK, Vercel AI SDK, task polling
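
The async task polling pattern can be sketched independently of any provider. In this minimal sketch, `fetch_status` is a hypothetical callable that wraps your provider's "get task" request, and the status strings `"succeeded"` and `"failed"` are assumptions; real APIs use their own field names and values.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    *,
    interval_s: float = 5.0,
    max_interval_s: float = 30.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll a video generation task until it finishes, fails, or times out."""
    deadline = clock() + timeout_s
    delay = interval_s
    while True:
        task = fetch_status()
        status = task.get("status")
        if status == "succeeded":
            return task
        if status == "failed":
            raise RuntimeError(f"video task failed: {task.get('error', 'unknown error')}")
        if clock() + delay > deadline:
            raise TimeoutError(f"task still {status!r} after {timeout_s}s")
        sleep(delay)
        # Back off gradually so long renders do not hammer the API.
        delay = min(delay * 1.5, max_interval_s)


# Demo with a fake status source instead of a real HTTP call.
_fake = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "succeeded", "video_url": "https://example.invalid/clip.mp4"},
])
result = poll_until_done(lambda: next(_fake), sleep=lambda _s: None)
print(result["video_url"])
```

Injecting `sleep` and `clock` keeps the loop testable without waiting. Where a provider supports webhook callbacks, those avoid polling altogether.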

Video: Multi-Shot

Generate multi-scene videos with consistent characters using storyboarding and character elements.

Rule	File	Key Pattern
Multi-Shot	`rules/video-multi-shot.md`	Kling v3 character elements, 6-shot storyboards, identity binding
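
One way to keep a storyboard consistent before submitting it is to validate it locally. The sketch below is purely illustrative: `Shot`, `Storyboard`, and their fields are hypothetical names and do not mirror Kling's request schema; only the six-shot limit comes from the pattern named above.

```python
from dataclasses import dataclass, field

MAX_SHOTS = 6  # from the "6-shot storyboards" pattern in this document


@dataclass(frozen=True)
class Shot:
    prompt: str
    character_ids: tuple[str, ...] = ()  # characters that appear in this shot


@dataclass
class Storyboard:
    characters: dict[str, str]  # character id -> reference image path (hypothetical)
    shots: list[Shot] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the storyboard is empty, too long, or inconsistent."""
        if not self.shots:
            raise ValueError("storyboard has no shots")
        if len(self.shots) > MAX_SHOTS:
            raise ValueError(f"storyboard has {len(self.shots)} shots; limit is {MAX_SHOTS}")
        for index, shot in enumerate(self.shots, start=1):
            unknown = [c for c in shot.character_ids if c not in self.characters]
            if unknown:
                raise ValueError(f"shot {index} references undefined characters: {unknown}")


board = Storyboard(
    characters={"hero": "refs/hero.png"},
    shots=[
        Shot("Hero walks into a rainy alley", ("hero",)),
        Shot("Close-up: hero looks up at a neon sign", ("hero",)),
    ],
)
board.validate()
print(len(board.shots))  # 2
```

Declaring every character once and referencing it by id in each shot is the local analogue of binding one identity across clips.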

Key Decisions

Decision	Recommendation
High accuracy vision	`claude-opus-4-8` (production default; 2,576 px vision budget, 3× what Opus 4.6 allotted). `claude-fable-5` was the SOTA option, but it is unavailable: access is suspended for all customers per a US export-control directive (2026-06-12)
Long documents	`gemini-3.1-pro-preview` (1M+ context)
Cost-efficient vision	`gemini-3.1-flash-lite-preview` (replaces Gemini 2.5 Flash, which is deprecated in Oct 2026)
Video analysis	`gemini-3.1-pro-preview` (native video, supersedes 2.5 Pro)
Voice assistant	Grok Voice Agent on Grok 4.20 (fastest, <1s)
Emotional voice AI	Gemini Live API
Long audio transcription	`gemini-3.1-pro-preview` (9.5hr)
Speaker diarization	AssemblyAI or Gemini
Self-hosted STT	Whisper Large V3
Character-consistent video	`kling-v3` (Character Elements 3.0)
Narrative video / storytelling	Sora 2 (best cause-and-effect coherence)
Cinematic B-roll	`veo-3.1-generate-preview` (camera control + polished motion)
Budget drafts	`veo-3.1-lite-generate-preview` (~$0.05/s, 720/1080p)
Mid-tier fast renders	`veo-3.1-fast-generate-preview`
Professional VFX	Runway `gen4_turbo` (Act-Two motion transfer)
High-volume social video	`kling-v3` Standard (~$0.20/video)
Open-source video gen	Wan 2.6 or LTX-2
Lip-sync / avatar video	`kling-v3` (native lip-sync API)
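
Some of the recommendations above can be encoded as a lookup so that model IDs live in one place. This is a minimal sketch: the use-case keys and function names are hypothetical, only rows with a concrete model ID are included, and the values are copied from the table rather than verified against any provider.

```python
# Use-case key (hypothetical naming) -> model ID from the Key Decisions table.
RECOMMENDED = {
    "high_accuracy_vision": "claude-opus-4-8",
    "long_documents": "gemini-3.1-pro-preview",
    "cost_efficient_vision": "gemini-3.1-flash-lite-preview",
    "video_analysis": "gemini-3.1-pro-preview",
    "long_audio_transcription": "gemini-3.1-pro-preview",
    "character_consistent_video": "kling-v3",
    "cinematic_b_roll": "veo-3.1-generate-preview",
    "budget_drafts": "veo-3.1-lite-generate-preview",
    "mid_tier_fast_renders": "veo-3.1-fast-generate-preview",
    "professional_vfx": "gen4_turbo",
}

# Model IDs this document says must not be pinned or recommended.
UNAVAILABLE = {"claude-fable-5"}


def pick_model(use_case: str) -> str:
    """Return the recommended model ID for a use case, or raise KeyError."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        raise KeyError(f"no recommendation for use case {use_case!r}") from None


def check_pinned(model_id: str) -> str:
    """Reject model IDs that are marked unavailable."""
    if model_id in UNAVAILABLE:
        raise ValueError(f"{model_id} is unavailable; pick another model")
    return model_id


print(pick_model("budget_drafts"))  # veo-3.1-lite-generate-preview
```

A single table like this makes model deprecations a one-line change instead of a search across call sites.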

Example

```python
import base64

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# Base64-encode the image for the request body.
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,  # always set an explicit output cap
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"},
    ]}],
)
print(response.content[0].text)
```
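
For multi-image comparison, the same content-block shape used in the example above can be built in a loop. This is a sketch under assumptions: `image_block` and `build_comparison_content` are hypothetical helpers, every image is assumed to be PNG unless stated otherwise, and the demo uses placeholder bytes without calling any API.

```python
import base64


def image_block(data: bytes, media_type: str = "image/png") -> dict:
    """Build one base64 image content block in the shape used by the example above."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("utf-8"),
        },
    }


def build_comparison_content(images: list[bytes], question: str) -> list[dict]:
    """Interleave 'Image N:' labels with image blocks, then append the question."""
    content: list[dict] = []
    for index, data in enumerate(images, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append(image_block(data))
    content.append({"type": "text", "text": question})
    return content


# Placeholder bytes stand in for real image files in this demo.
content = build_comparison_content([b"first", b"second"], "What changed between these images?")
print(len(content))  # 5
```

The resulting list would be passed as `messages=[{"role": "user", "content": content}]`; labeling each image gives the prompt a way to refer to them unambiguously.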

Common Mistakes

Not setting max_tokens on vision requests (responses truncated)
Sending oversized images without resizing (>2048px)
Using high detail level for simple yes/no classification
Using STT+LLM+TTS pipeline instead of native speech-to-speech
Not leveraging barge-in support for natural voice conversations
Using deprecated models (GPT-4V, Whisper-1)
Ignoring rate limits on vision and audio endpoints
Calling video generation APIs synchronously (they're async — poll or use callbacks)
Generating separate clips without character elements (characters look different each time)
Using Sora for high-volume social content (expensive, slow — use Kling Standard instead)
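
The rate-limit item in the list above is commonly handled with retries and exponential backoff. The sketch below is generic: `RateLimited` is a hypothetical stand-in for whatever exception your SDK raises on HTTP 429, and `request` is any zero-argument callable wrapping the real API call.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an SDK's rate-limit error."""

    def __init__(self, retry_after_s: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after_s = retry_after_s


def call_with_backoff(
    request: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on RateLimited with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise
            backoff = min(base_delay_s * 2 ** (attempt - 1), max_delay_s)
            # Prefer the server's hint when present; otherwise use full jitter.
            delay = exc.retry_after_s if exc.retry_after_s is not None else random.uniform(0, backoff)
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")


# Demo: a fake request that is rate limited twice, then succeeds.
_attempts = {"n": 0}


def _fake_request() -> str:
    _attempts["n"] += 1
    if _attempts["n"] < 3:
        raise RateLimited()
    return "ok"


print(call_with_backoff(_fake_request, sleep=lambda _s: None))  # ok
```

Jitter spreads retries from concurrent workers so they do not all hit the endpoint again at the same moment.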

Related Skills

ork:rag-retrieval - Multimodal RAG with image + text retrieval
ork:llm-integration - General LLM function calling patterns
streaming-api-patterns - WebSocket patterns for real-time audio
ork:demo-producer - Terminal demo videos (VHS, asciinema) — not AI video gen
