multimodal-llm

Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling v3, Sora 2, Veo 3.1 std/lite/fast, Runway Gen-4.5 via `gen4_turbo`), or building multimodal AI pipelines.

Updated: June 13, 2026 at 20:40

Source: yonatangross/orchestkit

name	multimodal-llm
license	MIT
compatibility	Claude Code 2.1.170+.
author	OrchestKit
version	2.1.1
description	Vision, audio, video generation, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, generating AI video (Kling v3, Sora 2, Veo 3.1 std/lite/fast, Runway Gen-4.5 via `gen4_turbo`), or building multimodal AI pipelines.
tags	["vision","audio","video","multimodal","image","speech","transcription","tts","kling","sora","veo","video-generation"]
user-invocable	false
disable-model-invocation	true
context	fork
complexity	high
persuasion-type	reference
effort	high
metadata	{"category":"mcp-enhancement"}
allowed-tools	["Read","Glob","Grep","WebFetch","WebSearch"]

Multimodal LLM Patterns

Integrate vision, audio, and video generation capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, text-to-speech, and AI video generation (Kling v3, Sora 2, Veo 3.1 std/lite/fast tiers, Runway Gen-4.5 via `gen4_turbo`).

Canonical model IDs (pinned against yonatan-hq/platform/apps/api/app/config.py):

Provider	Model IDs
Anthropic	`claude-opus-4-8` (recommended; 2,576 px budget, production default), `claude-opus-4-7`, `claude-opus-4-6`, `claude-sonnet-4-6`, `claude-haiku-4-5-20251001`. `claude-fable-5` is unavailable: Anthropic suspended access for all customers per a US export-control directive (2026-06-12); do not pin or recommend it
OpenAI	`gpt-5.5` (current flagship)
Google	`gemini-3.1-pro-preview` (flagship), `gemini-3.1-flash-lite-preview` (cost)
Veo	`veo-3.1-generate-preview` / `veo-3.1-lite-generate-preview` / `veo-3.1-fast-generate-preview`
Kling	`kling-v3` (`model_name` field in Kling API)
Runway	`gen4_turbo` (product label: Gen-4.5)

Quick Reference

Category	Rules	Impact	When to Use
Vision: Image Analysis	1	HIGH	Image captioning, VQA, multi-image comparison, object detection
Vision: Document Understanding	1	HIGH	OCR, chart/diagram analysis, PDF processing, table extraction
Vision: Model Selection	1	MEDIUM	Choosing provider, cost optimization, image size limits
Audio: Speech-to-Text	1	HIGH	Transcription, speaker diarization, long-form audio
Audio: Text-to-Speech	1	MEDIUM	Voice synthesis, expressive TTS, multi-speaker dialogue
Audio: Model Selection	1	MEDIUM	Real-time voice agents, provider comparison, pricing
Video: Model Selection	1	HIGH	Choosing video gen provider (Kling, Sora, Veo, Runway)
Video: API Patterns	1	HIGH	Async task polling, SDK integration, webhook callbacks
Video: Multi-Shot	1	HIGH	Storyboarding, character elements, scene consistency

Total: 9 rules across 3 categories (Vision, Audio, Video Generation)

Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set max_tokens and resize images before encoding.

Rule	File	Key Pattern
Image Analysis	`rules/vision-image-analysis.md`	Base64 encoding, multi-image, bounding boxes
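
The "resize before encoding" advice can be sketched with the standard library alone. This is a minimal illustration, not the contents of `rules/vision-image-analysis.md`: `fit_within` and `image_block` are hypothetical helper names, the 2048 px cap is borrowed from the Common Mistakes list in this skill, and the actual pixel resampling is assumed to be done by an imaging library that is not shown here.

```python
import base64


def fit_within(width: int, height: int, max_edge: int = 2048) -> tuple[int, int]:
    """Return dimensions scaled down so the longer edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough; never upscale
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def image_block(data: bytes, media_type: str = "image/png") -> dict:
    """Build a base64 image content block in the shape used by the Example section."""
    encoded = base64.standard_b64encode(data).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": encoded},
    }
```

Compute the target size with `fit_within`, resample the image to that size, then pass the resized bytes to `image_block`.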

Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

Rule	File	Key Pattern
Document Vision	`rules/vision-document.md`	PDF page ranges, detail levels, OCR strategies
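
For PDF input, a request can carry the file as a base64 document block next to the text prompt. The sketch below assumes the Anthropic Messages API `document` content block; `pdf_message` is a hypothetical helper name, and the page-range and detail-level patterns in `rules/vision-document.md` are not reproduced here.

```python
import base64


def pdf_message(pdf_bytes: bytes, question: str) -> dict:
    """Build a user message pairing a base64 PDF document block with a text prompt."""
    encoded = base64.standard_b64encode(pdf_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": encoded,
                },
            },
            {"type": "text", "text": question},
        ],
    }
```

The returned dict can be passed as one element of the `messages` list in a `client.messages.create(...)` call.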

Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

Rule	File	Key Pattern
Vision Models	`rules/vision-models.md`	Provider comparison, token costs, image limits

Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

Rule	File	Key Pattern
Speech-to-Text	`rules/audio-speech-to-text.md`	Gemini long-form, GPT-4o-Transcribe, AssemblyAI features

Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

Rule	File	Key Pattern
Text-to-Speech	`rules/audio-text-to-speech.md`	Gemini TTS, voice config, auditory cues

Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

Rule	File	Key Pattern
Audio Models	`rules/audio-models.md`	Real-time voice comparison, STT benchmarks, pricing

Video: Model Selection

Choose the right video generation provider based on use case, duration, and budget.

Rule	File	Key Pattern
Video Models	`rules/video-generation-models.md`	Kling vs Sora vs Veo vs Runway, pricing, capabilities

Video: API Patterns

Integrate video generation APIs with proper async polling, SDKs, and webhook callbacks.

Rule	File	Key Pattern
API Integration	`rules/video-generation-patterns.md`	Kling REST, fal.ai SDK, Vercel AI SDK, task polling
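
Async task polling can be sketched independently of any provider. In the example below, `fetch_status` is a hypothetical callable that wraps a single status request, and the `"status"` key with its `"succeeded"` / `"failed"` values is an assumed schema, not the real response format of Kling, fal.ai, or any other API named above.

```python
import time
from typing import Callable


def poll_task(
    fetch_status: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reports a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_status()  # one status request per loop iteration
        if task["status"] == "succeeded":
            return task
        if task["status"] == "failed":
            raise RuntimeError(f"video task failed: {task.get('error', 'unknown error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("video task did not finish before the timeout")
        sleep(interval_s)  # wait before asking again
```

Injecting `sleep` keeps the loop testable without real delays. Where a provider supports webhook callbacks, those avoid the polling loop entirely.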

Video: Multi-Shot

Generate multi-scene videos with consistent characters using storyboarding and character elements.

Rule	File	Key Pattern
Multi-Shot	`rules/video-multi-shot.md`	Kling v3 character elements, 6-shot storyboards, identity binding

Key Decisions

Decision	Recommendation
High accuracy vision	`claude-opus-4-8` (production default; 2,576 px vision budget, 3× that of Opus 4.6). `claude-fable-5` was the SOTA option, but access is suspended for all customers per a US export-control directive (2026-06-12), so it is unavailable
Long documents	`gemini-3.1-pro-preview` (1M+ context)
Cost-efficient vision	`gemini-3.1-flash-lite-preview` (replaces Gemini 2.5 Flash, which is deprecated in Oct 2026)
Video analysis	`gemini-3.1-pro-preview` (native video, supersedes 2.5 Pro)
Voice assistant	Grok Voice Agent on Grok 4.20 (fastest, <1s)
Emotional voice AI	Gemini Live API
Long audio transcription	`gemini-3.1-pro-preview` (9.5hr)
Speaker diarization	AssemblyAI or Gemini
Self-hosted STT	Whisper Large V3
Character-consistent video	`kling-v3` (Character Elements 3.0)
Narrative video / storytelling	Sora 2 (best cause-and-effect coherence)
Cinematic B-roll	`veo-3.1-generate-preview` (camera control + polished motion)
Budget drafts	`veo-3.1-lite-generate-preview` (~$0.05/s, 720/1080p)
Mid-tier fast renders	`veo-3.1-fast-generate-preview`
Professional VFX	Runway `gen4_turbo` (Act-Two motion transfer)
High-volume social video	`kling-v3` Standard (~$0.20/video)
Open-source video gen	Wan 2.6 or LTX-2
Lip-sync / avatar video	`kling-v3` (native lip-sync API)
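
One way to keep these recommendations out of scattered string literals is a single lookup table. The sketch below encodes a subset of the rows above; the task keys and the `pick_model` helper are hypothetical names chosen for illustration, while the model IDs are copied from the table.

```python
# Task keys are hypothetical labels; model IDs come from the Key Decisions table.
RECOMMENDED_MODELS = {
    "high_accuracy_vision": "claude-opus-4-8",
    "long_documents": "gemini-3.1-pro-preview",
    "cost_efficient_vision": "gemini-3.1-flash-lite-preview",
    "character_consistent_video": "kling-v3",
    "cinematic_b_roll": "veo-3.1-generate-preview",
    "budget_video_drafts": "veo-3.1-lite-generate-preview",
    "professional_vfx": "gen4_turbo",
}


def pick_model(task: str) -> str:
    """Return the recommended model ID for a task key, or raise on unknown keys."""
    try:
        return RECOMMENDED_MODELS[task]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED_MODELS))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
```

Failing loudly on unknown keys avoids silently falling back to a model that was never chosen for the task.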

Example

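import base64

import anthropic

# The client reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()

# Base64-encode the raw image bytes for the request body
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,  # always set explicitly on vision requests
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"},
    ]}],
)

# The first content block holds the model's text reply
print(response.content[0].text)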

Common Mistakes

Not setting max_tokens on vision requests (responses truncated)
Sending oversized images without resizing (>2048px)
Using high detail level for simple yes/no classification
Using STT+LLM+TTS pipeline instead of native speech-to-speech
Not leveraging barge-in support for natural voice conversations
Using deprecated models (GPT-4V, Whisper-1)
Ignoring rate limits on vision and audio endpoints
Calling video generation APIs synchronously (they're async — poll or use callbacks)
Generating separate clips without character elements (characters look different each time)
Using Sora for high-volume social content (expensive, slow — use Kling Standard instead)

Related Skills

ork:rag-retrieval - Multimodal RAG with image + text retrieval
ork:llm-integration - General LLM function calling patterns
streaming-api-patterns - WebSocket patterns for real-time audio
ork:demo-producer - Terminal demo videos (VHS, asciinema) — not AI video gen
