| name | multimodal-llm |
| license | MIT |
| compatibility | Claude Code 2.1.76+. |
| author | OrchestKit |
| version | 2.1.1 |
| description | Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling v3, Sora 2, Veo 3.1 std/lite/fast, Runway Gen-4.5 via `gen4_turbo`), or building multimodal AI pipelines. |
| tags | ["vision","audio","video","multimodal","image","speech","transcription","tts","kling","sora","veo","video-generation"] |
| user-invocable | false |
| disable-model-invocation | true |
| context | fork |
| complexity | high |
| persuasion-type | reference |
| effort | high |
| metadata | {"category":"mcp-enhancement"} |
| allowed-tools | ["Read","Glob","Grep","WebFetch","WebSearch"] |
Multimodal LLM Patterns
Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling v3, Sora 2, Veo 3.1 std/lite/fast tiers, Runway Gen-4.5 via gen4_turbo).
Canonical model IDs (pinned against yonatan-hq/platform/apps/api/app/config.py):
| Provider | Model IDs |
|---|
| Anthropic | claude-opus-4-7 (latest), claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-20251001 |
| OpenAI | gpt-5.2 (current flagship) |
| Google | gemini-3.1-pro-preview (flagship), gemini-3.1-flash-lite-preview (cost) |
| Veo | veo-3.1-generate-preview / veo-3.1-lite-generate-preview / veo-3.1-fast-generate-preview |
| Kling | kling-v3 (model_name field in Kling API) |
| Runway | gen4_turbo (product label: Gen-4.5) |
Quick Reference
| Category | Rules | Impact | When to Use |
|---|
| Vision: Image Analysis | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| Vision: Document Understanding | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| Vision: Model Selection | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| Audio: Speech-to-Text | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| Audio: Text-to-Speech | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| Audio: Model Selection | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |
| Video: Model Selection | 1 | HIGH | Choosing video gen provider (Kling, Sora, Veo, Runway) |
| Video: API Patterns | 1 | HIGH | Async task polling, SDK integration, webhook callbacks |
| Video: Multi-Shot | 1 | HIGH | Storyboarding, character elements, scene consistency |
Total: 9 rules across 3 categories (Vision, Audio, Video Generation)
Vision: Image Analysis
Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.
| Rule | File | Key Pattern |
|---|
| Image Analysis | rules/vision-image-analysis.md | Base64 encoding, multi-image, bounding boxes |
Vision: Document Understanding
Extract structured data from documents, charts, and PDFs using vision models.
| Rule | File | Key Pattern |
|---|
| Document Vision | rules/vision-document.md | PDF page ranges, detail levels, OCR strategies |
Vision: Model Selection
Choose the right vision provider based on accuracy, cost, and context window needs.
| Rule | File | Key Pattern |
|---|
| Vision Models | rules/vision-models.md | Provider comparison, token costs, image limits |
Audio: Speech-to-Text
Convert audio to text with speaker diarization, timestamps, and sentiment analysis.
| Rule | File | Key Pattern |
|---|
| Speech-to-Text | rules/audio-speech-to-text.md | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |
Audio: Text-to-Speech
Generate natural speech from text with voice selection and expressive cues.
| Rule | File | Key Pattern |
|---|
| Text-to-Speech | rules/audio-text-to-speech.md | Gemini TTS, voice config, auditory cues |
Audio: Model Selection
Select the right audio/voice provider for real-time, transcription, or TTS use cases.
| Rule | File | Key Pattern |
|---|
| Audio Models | rules/audio-models.md | Real-time voice comparison, STT benchmarks, pricing |
Video: Model Selection
Choose the right video generation provider based on use case, duration, and budget.
| Rule | File | Key Pattern |
|---|
| Video Models | rules/video-generation-models.md | Kling vs Sora vs Veo vs Runway, pricing, capabilities |
Video: API Patterns
Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.
| Rule | File | Key Pattern |
|---|
| API Integration | rules/video-generation-patterns.md | Kling REST, fal.ai SDK, Vercel AI SDK, task polling |
Video: Multi-Shot
Generate multi-scene videos with consistent characters using storyboarding and character elements.
| Rule | File | Key Pattern |
|---|
| Multi-Shot | rules/video-multi-shot.md | Kling v3 character elements, 6-shot storyboards, identity binding |
Key Decisions
| Decision | Recommendation |
|---|
| High accuracy vision | claude-opus-4-7 (2,576 px, 3× Opus 4.6) or gpt-5.2 |
| Long documents | gemini-3.1-pro-preview (1M+ context) |
| Cost-efficient vision | gemini-3.1-flash-lite-preview (replaces Gemini 2.5 Flash, deprecates Oct 2026) |
| Video analysis | gemini-3.1-pro-preview (native video, supersedes 2.5 Pro) |
| Voice assistant | Grok Voice Agent on Grok 4.20 (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | gemini-3.1-pro-preview (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
| Character-consistent video | kling-v3 (Character Elements 3.0) |
| Narrative video / storytelling | Sora 2 (best cause-and-effect coherence) |
| Cinematic B-roll | veo-3.1-generate-preview (camera control + polished motion) |
| Budget drafts | veo-3.1-lite-generate-preview (~$0.05/s, 720/1080p) |
| Mid-tier fast renders | veo-3.1-fast-generate-preview |
| Professional VFX | Runway gen4_turbo (Act-Two motion transfer) |
| High-volume social video | kling-v3 Standard (~$0.20/video) |
| Open-source video gen | Wan 2.6 or LTX-2 |
| Lip-sync / avatar video | kling-v3 (native lip-sync API) |
Example
import anthropic, base64
client = anthropic.Anthropic()
with open("image.png", "rb") as f:
b64 = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
messages=[{"role": "user", "content": [
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
{"type": "text", "text": "Describe this image"}
]}]
)
Common Mistakes
- Not setting
max_tokens on vision requests (responses truncated)
- Sending oversized images without resizing (>2048px)
- Using
high detail level for simple yes/no classification
- Using STT+LLM+TTS pipeline instead of native speech-to-speech
- Not leveraging barge-in support for natural voice conversations
- Using deprecated models (GPT-4V, Whisper-1)
- Ignoring rate limits on vision and audio endpoints
- Calling video generation APIs synchronously (they're async — poll or use callbacks)
- Generating separate clips without character elements (characters look different each time)
- Using Sora for high-volume social content (expensive, slow — use Kling Standard instead)
Related Skills
ork:rag-retrieval - Multimodal RAG with image + text retrieval
ork:llm-integration - General LLM function calling patterns
streaming-api-patterns - WebSocket patterns for real-time audio
ork:demo-producer - Terminal demo videos (VHS, asciinema) — not AI video gen