| name | read-media |
| description | Analyze media files (images, video, audio) using AI vision and audio models with a critical-first lens. Use for output-first verification — render your deliverable, then read_media to see what it actually looks like. Distinguishes fundamental issues from surface-level fixes. |
Read Media — Critical-First Analysis
Analyze media files using read_media from the media-tools MCP server. This
is the primary tool for output-first verification — experiencing your work
as a user would, not just reading code.
Critical-First Philosophy
The vision model is instructed to be a critical reviewer by default:
"You are reviewing this work critically. Be honest about what you see —
name specific problems, explain their impact, and distinguish between
issues that need a fundamental rethink vs issues that are easy fixes.
Don't sugarcoat."
When no prompt is provided, the default is:
"Analyze this {media_type}. What works, what's broken, and what would
a demanding user complain about? Be specific and critical."
Responses include a foundation_sound assessment that indicates whether the
approach itself needs rethinking or whether incremental fixes are sufficient.
When foundation_sound is false, treat it as a signal to consider
TRANSFORMATIVE changes rather than more polish.
You don't need to add "be critical" to your prompts — it's the default.
Focus your prompt on what to look at.
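As a rough sketch of acting on that assessment, the helper below maps a response to a next step. It assumes the response is available as a dict with a boolean `foundation_sound` key; the exact response shape is not specified here, and `next_step` is a hypothetical name used only for illustration.

```python
def next_step(result: dict) -> str:
    """Pick a follow-up strategy from a read_media response.

    Assumption: `result` is a dict with a boolean `foundation_sound` key.
    """
    sound = result.get("foundation_sound")
    if sound is False:
        # The approach itself was flagged: consider transformative changes.
        return "rethink"
    if sound is True:
        # The foundation holds: incremental fixes are enough.
        return "polish"
    # Field missing or not a boolean: read the full response by hand.
    return "inspect"


print(next_step({"foundation_sound": False}))
```

Returning a third value for a missing field avoids silently treating an incomplete response as approval.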
Quick Start
# Analyze a screenshot
read_media(prompt="What flaws or layout issues do you see?", file_paths=["screenshot.png"])
# Compare before and after
read_media(prompt="Compare these two versions — what improved and what regressed?",
file_paths=["v1.png", "v2.png"])
# Check a video
read_media(prompt="Is the animation smooth? Any glitches?", file_paths=["recording.mp4"])
# Verify audio output
read_media(prompt="Is the speech clear and natural?", file_paths=["output.mp3"])
Parameters
| Parameter | Description | Example |
|---|---|---|
| prompt | What to analyze; focus on what to look at | "Evaluate the visual hierarchy and spacing" |
| file_paths | List of file paths to analyze together | ["before.png", "after.png"] |
| continue_from | conversation_id from a previous call for follow-ups | "conv_abc123" |
| max_concurrent | Max parallel analyses (default 4) | 4 |
Before/After Comparison
All files are sent to the model in a SINGLE call — the model sees them
side-by-side. This is critical for comparison: if you send files in separate
calls, the model cannot compare them.
# CORRECT: both files in one call — model sees both
read_media(prompt="Compare before and after — what improved, what regressed?",
file_paths=["v1.png", "v2.png"])
# WRONG: separate calls — model has no memory of the first
read_media(prompt="Analyze this", file_paths=["v1.png"])
read_media(prompt="Now compare to previous", file_paths=["v2.png"]) # can't compare!
For follow-up questions on the same media, use continue_from:
# First call
result = read_media(prompt="Evaluate the design", file_paths=["page.png"])
# result contains a conversation_id (for example "conv_abc123"); pass that value as continue_from
# Follow-up (model remembers the image)
read_media(prompt="Now focus on the typography — is the hierarchy clear?",
continue_from="conv_abc123")
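To avoid hard-coding the ID, the follow-up arguments can be derived from the previous result. This is a minimal sketch that assumes the result is a dict exposing `conversation_id` as a top-level key; `follow_up_args` is a hypothetical helper, not part of the tool.

```python
def follow_up_args(previous_result: dict, prompt: str) -> dict:
    """Build keyword arguments for a follow-up read_media call.

    Assumption: `previous_result` is a dict with a `conversation_id` key.
    """
    conversation_id = previous_result.get("conversation_id")
    if not conversation_id:
        raise ValueError(
            "previous result has no conversation_id; resend the files instead"
        )
    # No file_paths here: the follow-up relies on the earlier conversation.
    return {"prompt": prompt, "continue_from": conversation_id}
```

Usage would then look like `read_media(**follow_up_args(result, "Now focus on the typography"))`.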
Supported Formats and Backends
Images
Formats: png, jpg, jpeg, gif, webp, bmp
| Backend | Default Model | API Key |
|---|---|---|
| Google Gemini (priority 1) | gemini-3.1-pro-preview | GOOGLE_API_KEY / GEMINI_API_KEY |
| OpenAI (priority 2) | gpt-5.4 | OPENAI_API_KEY |
| Claude (priority 3) | claude-sonnet-4-5-20250929 | ANTHROPIC_API_KEY |
| Grok (priority 4) | grok-4.20-0309-reasoning | XAI_API_KEY |
| OpenRouter (priority 5) | openai/gpt-5.2 | OPENROUTER_API_KEY |
Video
Formats: mp4, mov, avi, mkv, webm, gif, flv, wmv
| Backend | Default Model | Method | API Key |
|---|---|---|---|
| Google Gemini (priority 1) | gemini-3.1-pro-preview | Native video (no frame extraction) | GOOGLE_API_KEY |
| OpenAI (priority 2) | gpt-5.4 | Frame extraction (8 frames default) | OPENAI_API_KEY |
| Claude (priority 3) | claude-sonnet-4-5-20250929 | Frame extraction | ANTHROPIC_API_KEY |
Audio
Formats: mp3, wav, m4a, ogg, flac, aac
| Backend | Default Model | Mode | API Key |
|---|---|---|---|
| Google Gemini (priority 1) | gemini-3.1-pro-preview | Native audio understanding | GOOGLE_API_KEY |
| OpenAI (priority 2) | gpt-4o-audio-preview | Rich analysis (tone, emotion, pacing) | OPENAI_API_KEY |
The backend is selected automatically: the first backend in priority order whose API key is available is used.
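The selection rule can be illustrated with a short sketch. It encodes the image backend table above as data and returns the first entry with a non-empty key. This illustrates the stated rule only, not the server's actual code; `IMAGE_BACKENDS` and `select_backend` are hypothetical names.

```python
import os

# Image backends in priority order, with the environment variables that enable each.
IMAGE_BACKENDS = [
    ("Google Gemini", ("GOOGLE_API_KEY", "GEMINI_API_KEY")),
    ("OpenAI", ("OPENAI_API_KEY",)),
    ("Claude", ("ANTHROPIC_API_KEY",)),
    ("Grok", ("XAI_API_KEY",)),
    ("OpenRouter", ("OPENROUTER_API_KEY",)),
]


def select_backend(backends, env=None):
    """Return the name of the first backend whose API key is set, else None."""
    env = os.environ if env is None else env
    for name, key_names in backends:
        # A backend counts as available if any of its keys is non-empty.
        if any(env.get(key) for key in key_names):
            return name
    return None


print(select_backend(IMAGE_BACKENDS, {"OPENAI_API_KEY": "x", "XAI_API_KEY": "y"}))
```

Passing `env` explicitly keeps the function easy to test; leaving it out falls back to the real process environment.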
Verification by Deliverable Type
Match evidence to how the output is experienced. Classify by what happens
when a user opens it, not by file extension:
| What does it do? | Shallow (incomplete) | Full check (required) |
|---|---|---|
| Stays still (image, PDF, document) | File generates | Render and view every page/section with read_media |
| Moves (animation, video) | Single frame | Record video, review full motion sequence |
| Responds to input (website, app) | Screenshot looks good | Use it — click buttons, navigate, test states; screenshot each state |
| Produces output (script, API) | Runs without error | Test with varied inputs, capture output |
| Makes sound (audio, TTS) | File exists | Listen to the output via read_media |
Coverage Check Before Diagnosis
Before concluding something is broken:
- Capture the full artifact — all pages, sections, states (not just one viewport)
- Scroll through long pages and multi-state flows
- Check for capture artifacts: blank regions may come from timing, iframe, or canvas issues rather than from the work itself
- If the code suggests more output than was captured, fix the capture first
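One way to make the coverage check mechanical is to compare the states you expect against the captures you actually have. The sketch below assumes a naming convention where each capture file is named after its state (for example `checkout.png`); that convention and the `missing_captures` helper are illustrative assumptions, not something the tool requires.

```python
from pathlib import PurePath


def missing_captures(expected_states, captured_files):
    """Return the expected states that have no capture file yet.

    Assumption: each capture is named "<state>.<extension>".
    """
    captured = {PurePath(name).stem for name in captured_files}
    return [state for state in expected_states if state not in captured]


expected = ["home", "menu-open", "checkout"]
captured = ["home.png", "checkout.png"]
print(missing_captures(expected, captured))
```

A non-empty result means the capture is incomplete, so diagnosis should wait until the missing states are recorded.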
Prompts by Domain
Focus on what to look at — the critical lens is automatic:
| Domain | Prompt |
|---|---|
| Website/landing page | "Assess visual hierarchy, spacing, typography, and responsive layout. What feels generic?" |
| Generated image | "Does this match the brief? What's off about composition, color, or detail?" |
| Chart/diagram | "Is the data clearly communicated? What's misleading or hard to read?" |
| Video/animation | "Is the motion smooth and intentional? Any artifacts, jumps, or timing issues?" |
| Audio/TTS | "Is the speech clear, natural, and well-paced? Any distortion or pronunciation errors?" |
| Before/after | "What improved? What regressed? Be specific per dimension." |
Capturing Screenshots for Verification
npx -y playwright@latest screenshot http://localhost:8765 screenshot.png --full-page
npx -y playwright@latest screenshot http://localhost:8765 desktop.png --viewport-size=1440,900
npx -y playwright@latest screenshot http://localhost:8765 mobile.png --viewport-size=375,812
Then: read_media(prompt="Evaluate at desktop and mobile sizes", file_paths=["desktop.png", "mobile.png"])
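When several viewport sizes are needed, the commands above can be generated rather than typed. This sketch only builds the argument lists, reusing the same Playwright flags shown above; the `VIEWPORTS` mapping and `screenshot_commands` helper are hypothetical, and running the commands still requires Node.js and a server listening at the URL.

```python
import shlex

# Viewport names and sizes taken from the commands above.
VIEWPORTS = {"desktop": (1440, 900), "mobile": (375, 812)}


def screenshot_commands(url, viewports=None):
    """Build one Playwright screenshot command per viewport."""
    viewports = VIEWPORTS if viewports is None else viewports
    commands = []
    for name, (width, height) in viewports.items():
        commands.append([
            "npx", "-y", "playwright@latest", "screenshot",
            url, f"{name}.png", f"--viewport-size={width},{height}",
        ])
    return commands


# Print the commands for inspection instead of executing them.
for command in screenshot_commands("http://localhost:8765"):
    print(shlex.join(command))
```

To execute them, each list can be passed to `subprocess.run(command, check=True)`; the resulting files then go into a single read_media call so the model sees both sizes together.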
API Key Check
On session start, .massgen-quality/environment.json is written with availability
status. Check capabilities.has_vision to confirm read_media will work before
attempting calls.
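A minimal sketch of that check follows. It assumes the file is JSON with a nested `capabilities` object containing a boolean `has_vision`, which is inferred from the `capabilities.has_vision` path above; the full schema is not documented here, and `vision_available` is a hypothetical helper.

```python
import json
from pathlib import Path


def vision_available(path=".massgen-quality/environment.json") -> bool:
    """Return True only if the environment file reports vision support.

    Assumption: the file looks like {"capabilities": {"has_vision": true}}.
    """
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        # Missing or unreadable file: treat vision as unavailable.
        return False
    capabilities = data.get("capabilities") if isinstance(data, dict) else None
    if not isinstance(capabilities, dict):
        return False
    return capabilities.get("has_vision") is True
```

Treating every failure mode as "unavailable" keeps the caller's logic simple: skip read_media and report the missing capability instead of attempting a call that will fail.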