evaluate-multimodal

Name: Evaluate Multimodal
Author: langwatch

Evaluate multimodal AI agents that process images, audio, PDFs, or other files. Sets up evaluations using LangWatch's LLM-as-judge with image inputs, Scenario's multimodal testing, and document parsing evaluation patterns. Use when your agent handles non-text inputs.

stars: 2

forks: 1

updated: 24 April 2026 at 09:38

SKILL.md


name	evaluate-multimodal
description	Evaluate multimodal AI agents that process images, audio, PDFs, or other files. Sets up evaluations using LangWatch's LLM-as-judge with image inputs, Scenario's multimodal testing, and document parsing evaluation patterns. Use when your agent handles non-text inputs.
license	MIT
compatibility	Requires LangWatch SDK and optionally @langwatch/scenario. Works with Claude Code and similar coding agents. Uses the `langwatch` CLI for documentation and platform operations.
metadata	{"category":"recipe"}

Evaluate Your Multimodal Agent

This recipe helps you evaluate agents that process images, audio, PDFs, or other non-text inputs.

Step 1: Identify Modalities

Read the codebase to understand what your agent processes:

Images: classification, analysis, generation, OCR
Audio: transcription, voice agents, audio Q&A
PDFs/Documents: parsing, extraction, summarization
Mixed: multiple input types in one pipeline
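When the codebase ships sample inputs or fixtures, a quick inventory of them can help confirm which modalities are actually in play. The following is a minimal sketch using only the Python standard library; the helper names `guess_modality` and `count_modalities` are hypothetical and not part of LangWatch or Scenario, and the mapping from MIME type to modality is an assumption you may want to adjust.

```python
import mimetypes
from collections import Counter


def guess_modality(path: str) -> str:
    """Map a file path to a coarse modality label based on its guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    if mime.startswith("image/"):
        return "image"
    if mime.startswith("audio/"):
        return "audio"
    if mime == "application/pdf":
        return "document"
    if mime.startswith("text/"):
        return "text"
    return "other"


def count_modalities(paths) -> Counter:
    """Count how many of the given paths fall into each modality."""
    return Counter(guess_modality(str(p)) for p in paths)
```

A result with more than one non-text label suggests the "Mixed" case, where each modality likely needs its own evaluation.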

Step 2: Read the Relevant Docs

Use the langwatch CLI to fetch the right pages:

langwatch scenario-docs                            # Index — locate multimodal pages
langwatch scenario-docs multimodal/audio-to-text   # Audio testing patterns
langwatch scenario-docs multimodal/multimodal-files # Generic file analysis patterns
langwatch docs                                     # LangWatch docs index
langwatch docs evaluations/experiments/sdk         # Experiment SDK basics
langwatch docs evaluations/evaluators/list         # Browse evaluator types

For PDF evaluation specifically, follow the pattern from python-sdk/examples/pdf_parsing_evaluation.ipynb:

Download/load documents
Define extraction pipeline
Use LangWatch experiment SDK to evaluate extraction accuracy

Step 3: Set Up Evaluation by Modality

Image Evaluation

LangWatch's LLM-as-judge evaluators can accept images. Create an evaluation that:

Loads test images
Runs the agent on each image
Uses an LLM-as-judge evaluator to assess output quality

import langwatch

# `image_dataset` and `my_agent` are placeholders for your own test set and agent.
# Each dataset entry is assumed to be a dict with an "image_path" key.
experiment = langwatch.experiment.init("image-eval")

for idx, entry in experiment.loop(enumerate(image_dataset)):
    # Run the agent under test on one image
    result = my_agent(image=entry["image_path"])
    experiment.evaluate(
        "llm_boolean",
        index=idx,
        data={
            "input": entry["image_path"],  # LLM-as-judge can view images
            "output": result,
        },
        settings={
            "model": "openai/gpt-5-mini",
            "prompt": "Does the agent correctly describe/classify this image?",
        },
    )

Audio Evaluation

Use Scenario's audio testing patterns:

Audio-to-text: verify transcription accuracy
Audio-to-audio: verify voice agent responses

Read the dedicated guide:

langwatch scenario-docs multimodal/audio-to-text
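For the audio-to-text case, one common way to quantify transcription accuracy is word error rate (WER): the word-level edit distance between the reference transcript and the agent's transcript, divided by the reference length. The sketch below is a plain-Python illustration, not a Scenario or LangWatch API; the function name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # No reference words: perfect only if the hypothesis is empty too
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A score like this could sit alongside an LLM judge: WER catches literal transcription drift, while the judge can assess whether the meaning survived.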

PDF/Document Evaluation

Follow the pattern from the PDF parsing evaluation example:

Load documents (PDFs, CSVs, etc.)
Define extraction/parsing pipeline
Evaluate extraction accuracy against expected fields
Use structured evaluation (exact match for fields, LLM judge for summaries)
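The exact-match half of that structured evaluation can be sketched in plain Python. This is an illustration under assumptions, not the code from the example notebook: the function name `field_accuracy`, the dict-based record shape, and the normalization (collapse whitespace, ignore case) are all hypothetical choices. Free-text fields such as summaries would go to an LLM judge instead.

```python
def field_accuracy(expected: dict, extracted: dict) -> dict:
    """Compare each expected field with the extracted value using normalized exact match."""

    def normalize(value) -> str:
        # Collapse runs of whitespace and ignore case before comparing
        return " ".join(str(value).split()).casefold()

    per_field = {
        name: name in extracted and normalize(extracted[name]) == normalize(want)
        for name, want in expected.items()
    }
    # With no expected fields there is nothing to get wrong
    score = sum(per_field.values()) / len(per_field) if per_field else 1.0
    return {"per_field": per_field, "score": score}
```

Note that this normalization treats "42.50" and "42.5" as different values; numeric or date fields may need their own comparison rules.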

File Analysis

For agents that process arbitrary files, read the file analysis guide:

langwatch scenario-docs multimodal/multimodal-files

Step 4: Generate Domain-Specific Test Data

For each modality, generate or collect test data that matches the agent's actual use case:

If it's a medical imaging agent → use relevant medical image samples
If it's a document parser → use real document types the agent encounters
If it's a voice assistant → record realistic voice prompts
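Once domain-specific samples are collected, they need to be turned into a dataset the evaluation loop can iterate over. The sketch below assumes a hypothetical layout in which each image sits in a folder named after its expected label (for example `samples/pneumonia/scan1.png`); the function name `build_image_dataset`, the `expected_label` key, and the list of suffixes are illustrative assumptions. The `image_path` key matches the one used in the image evaluation example.

```python
from pathlib import Path

# Assumed set of image extensions; extend it to match your own data
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def build_image_dataset(root: str) -> list[dict]:
    """Collect image files under `root`, labelling each with its parent folder name."""
    entries = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            entries.append(
                {"image_path": str(path), "expected_label": path.parent.name}
            )
    return entries
```

Keeping the expected label next to each path makes it straightforward to add an exact-match check alongside the LLM judge.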

Step 5: Run and Iterate

Run the evaluation, review the results, fix the issues you find, and re-run until quality is acceptable.
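To make "acceptable" concrete between runs, it can help to reduce each run to a pass rate and a list of failing cases. The sketch below is a hypothetical helper, not a LangWatch feature: the result shape (dicts with `id` and `passed` keys) and the default threshold of 0.9 are assumptions to replace with whatever your evaluation actually produces and requires.

```python
def summarize_run(results: list[dict], threshold: float = 0.9) -> dict:
    """Compute the pass rate of a run and list the cases that still fail."""
    total = len(results)
    failures = [r["id"] for r in results if not r["passed"]]
    pass_rate = (total - len(failures)) / total if total else 0.0
    return {
        "pass_rate": pass_rate,
        "failures": failures,  # inspect these first when iterating
        "acceptable": pass_rate >= threshold,
    }
```

Comparing the `failures` list across runs shows whether a fix resolved the intended cases or introduced regressions elsewhere.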

Common Mistakes

Do NOT evaluate multimodal agents with text-only metrics — use image-aware judges
Do NOT skip testing with real file formats — synthetic descriptions aren't enough
Do NOT forget to handle file loading errors in evaluations
Do NOT use generic test images — use domain-specific ones matching the agent's purpose
Always read the relevant langwatch scenario-docs ... page for the modality before writing code; multimodal patterns differ a lot from text-only ones
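On the file loading point: a missing or empty file should be recorded as a failed case rather than crashing the whole run. A minimal sketch, assuming a hypothetical helper named `load_bytes_or_error` that returns either the file contents or an error message:

```python
from pathlib import Path


def load_bytes_or_error(path: str):
    """Return (data, None) on success or (None, error_message) on failure."""
    try:
        data = Path(path).read_bytes()
    except OSError as exc:
        # Covers missing files, permission problems, and directories
        return None, f"{type(exc).__name__}: {exc}"
    if not data:
        return None, "file is empty"
    return data, None
```

Inside an evaluation loop, an entry whose load returns an error can be skipped or logged as a failure, so one bad fixture does not hide results for the rest.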

package.json

"author": "langwatch"

"repository": "langwatch/skills"
