| name | evaluate-multimodal |
| description | Evaluate multimodal AI agents that process images, audio, PDFs, or other files. Sets up evaluations using LangWatch's LLM-as-judge with image inputs, Scenario's multimodal testing, and document parsing evaluation patterns. Use when your agent handles non-text inputs. |
| license | MIT |
| compatibility | Requires LangWatch SDK and optionally @langwatch/scenario. Works with Claude Code and similar coding agents. Uses the `langwatch` CLI for documentation and platform operations. |
| metadata | {"category":"recipe"} |
Evaluate Your Multimodal Agent
This recipe helps you evaluate agents that process images, audio, PDFs, or other non-text inputs.
Step 1: Identify Modalities
Read the codebase to understand what your agent processes:
- Images: classification, analysis, generation, OCR
- Audio: transcription, voice agents, audio Q&A
- PDFs/Documents: parsing, extraction, summarization
- Mixed: multiple input types in one pipeline
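As a starting point for this inventory, the sample files the agent already handles can be grouped by modality. The sketch below is a minimal illustration using only the standard library; the extension map and the function names `detect_modality` and `summarize_modalities` are hypothetical and not part of LangWatch or Scenario.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical extension-to-modality map; extend it to match the agent's real inputs.
MODALITY_BY_EXTENSION = {
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".gif": "image", ".webp": "image",
    ".wav": "audio", ".mp3": "audio", ".flac": "audio", ".ogg": "audio",
    ".pdf": "document", ".docx": "document", ".csv": "document",
}


def detect_modality(path: str) -> str:
    """Return the modality for a file path, or "unknown" if the extension is unmapped."""
    return MODALITY_BY_EXTENSION.get(Path(path).suffix.lower(), "unknown")


def summarize_modalities(paths: list[str]) -> dict[str, int]:
    """Count how many sample files fall under each modality."""
    counts: dict[str, int] = {}
    for path in paths:
        modality = detect_modality(path)
        counts[modality] = counts.get(modality, 0) + 1
    return counts
```

If the summary shows more than one modality, treat the agent as a mixed pipeline and plan one evaluation per modality.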
Step 2: Read the Relevant Docs
Use the langwatch CLI to fetch the right pages:
langwatch scenario-docs
langwatch scenario-docs multimodal/audio-to-text
langwatch scenario-docs multimodal/multimodal-files
langwatch docs
langwatch docs evaluations/experiments/sdk
langwatch docs evaluations/evaluators/list
For PDF evaluation specifically, reference the pattern from python-sdk/examples/pdf_parsing_evaluation.ipynb:
- Download/load documents
- Define extraction pipeline
- Use LangWatch experiment SDK to evaluate extraction accuracy
Step 3: Set Up Evaluation by Modality
Image Evaluation
LangWatch's LLM-as-judge evaluators can accept images. Create an evaluation that:
- Loads test images
- Runs the agent on each image
- Uses an LLM-as-judge evaluator to assess output quality
import langwatch
# `image_dataset` and `my_agent` are placeholders for your own dataset and agent.
experiment = langwatch.experiment.init("image-eval")
for idx, entry in experiment.loop(enumerate(image_dataset)):
    result = my_agent(image=entry["image_path"])
    experiment.evaluate(
        "llm_boolean",
        index=idx,
        data={
            "input": entry["image_path"],
            "output": result,
        },
        settings={
            "model": "openai/gpt-5-mini",
            "prompt": "Does the agent correctly describe/classify this image?",
        },
    )
Audio Evaluation
Use Scenario's audio testing patterns:
- Audio-to-text: verify transcription accuracy
- Audio-to-audio: verify voice agent responses
Read the dedicated guide:
langwatch scenario-docs multimodal/audio-to-text
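For the audio-to-text case, one common way to quantify transcription accuracy is word error rate (WER): the word-level edit distance between the reference transcript and the agent's transcript, divided by the number of reference words. The sketch below is a plain-Python illustration and is not taken from the Scenario guide; the function name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # No reference words: perfect only if the hypothesis is also empty.
        return 0.0 if not hyp else 1.0

    # Levenshtein distance over words, keeping one rolling row of the table.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One dropped word and one substituted word out of five reference words.
print(word_error_rate("book a table for two", "book table for you"))  # 0.4
```

A lower value is better. Punctuation and number formatting often dominate the error count, so consider normalizing both transcripts before comparing them.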
PDF/Document Evaluation
Follow the pattern from the PDF parsing evaluation example:
- Load documents (PDFs, CSVs, etc.)
- Define extraction/parsing pipeline
- Evaluate extraction accuracy against expected fields
- Use structured evaluation (exact match for fields, LLM judge for summaries)
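The exact-match half of the structured evaluation above can be sketched as a per-field comparison. This is a minimal illustration, not the notebook's implementation; `field_accuracy`, its normalization (collapse whitespace, ignore case), and the sample invoice fields are all hypothetical.

```python
def field_accuracy(expected: dict, extracted: dict) -> dict:
    """Compare extracted fields with expected ones using normalized exact match."""

    def normalize(value):
        # Collapse whitespace and ignore case; None stays None so a missing field never matches.
        return None if value is None else " ".join(str(value).split()).casefold()

    matches = {
        field: normalize(extracted.get(field)) == normalize(want)
        for field, want in expected.items()
    }
    accuracy = sum(matches.values()) / len(matches) if matches else 0.0
    return {"matches": matches, "accuracy": accuracy}


# Hypothetical invoice fields, for illustration only.
expected = {"invoice_number": "INV-001", "total": "120.50", "vendor": "Acme Corp"}
extracted = {"invoice_number": "inv-001", "total": "120.5", "vendor": " Acme  Corp "}
report = field_accuracy(expected, extracted)
print(report["matches"])  # {'invoice_number': True, 'total': False, 'vendor': True}
```

Note that `"120.50"` and `"120.5"` do not match here: numeric and date fields usually need type-aware comparison, and free-text fields such as summaries are better left to an LLM judge.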
File Analysis
For agents that process arbitrary files, read the file analysis guide:
langwatch scenario-docs multimodal/multimodal-files
Step 4: Generate Domain-Specific Test Data
For each modality, generate or collect test data that matches the agent's actual use case:
- If it's a medical imaging agent → use relevant medical image samples
- If it's a document parser → use real document types the agent encounters
- If it's a voice assistant → record realistic voice prompts
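One way to keep collected test data organized is a small manifest that pairs each file with its modality and expected outcome, checked before any evaluation runs. The sketch below assumes a hypothetical `MultimodalCase` record and `find_missing_files` helper; neither comes from LangWatch or Scenario.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class MultimodalCase:
    """One test case: a real file plus what the agent is expected to produce."""

    name: str
    modality: str  # e.g. "image", "audio", "document"
    path: str
    expected: str


def find_missing_files(cases: list[MultimodalCase]) -> list[str]:
    """Return the names of cases whose file does not exist on disk."""
    return [case.name for case in cases if not Path(case.path).is_file()]
```

Running `find_missing_files` at the start of an evaluation surfaces broken paths up front instead of partway through a run.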
Step 5: Run and Iterate
Run the evaluation, review results, fix issues, re-run until quality is acceptable.
Common Mistakes
- Do NOT evaluate multimodal agents with text-only metrics — use image-aware judges
- Do NOT skip testing with real file formats — synthetic descriptions aren't enough
- Do NOT forget to handle file loading errors in evaluations
- Do NOT use generic test images — use domain-specific ones matching the agent's purpose
- Always read the relevant `langwatch scenario-docs ...` page for the modality before writing code; multimodal patterns differ a lot from text-only ones
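On the file-loading point above, a minimal sketch of one approach is to load every test file up front and keep failures separate, so an unreadable file is reported rather than crashing the run or silently shrinking the dataset. The helper name `load_files_safely` and its `loader` parameter are hypothetical.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def load_files_safely(
    paths: list[str],
    loader: Callable[[str], object] | None = None,
) -> tuple[list[tuple[str, object]], list[tuple[str, str]]]:
    """Load each file, returning (loaded, failures) instead of raising on the first error."""
    if loader is None:
        # Default loader reads raw bytes; swap in an image, audio, or PDF parser as needed.
        loader = lambda path: Path(path).read_bytes()

    loaded: list[tuple[str, object]] = []
    failures: list[tuple[str, str]] = []
    for path in paths:
        try:
            loaded.append((path, loader(path)))
        except (OSError, ValueError) as error:
            # Record what failed and why so it can be reported next to the scores.
            failures.append((path, f"{type(error).__name__}: {error}"))
    return loaded, failures
```

Reporting the failure list alongside the evaluation scores makes it clear when a result is based on fewer samples than intended.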