ag2-multimodal-input

Name: Ag2 Multimodal Input
Author: ag2ai

Send images, audio, video, or documents into an AG2 beta `Agent` alongside text. Pass `ImageInput`, `AudioInput`, `VideoInput`, or `DocumentInput` as positional args to `agent.ask(...)`. Use when the user wants the agent to process non-text input: describe a photo, transcribe audio, summarise a PDF, analyse a video. Covers the per-provider support matrix, the four ways to source data (URL / path / bytes / file_id), Gemini-specific YouTube + media-resolution + clipping, OpenAI image-detail, Anthropic prompt-caching on attachments, and `FilesAPI` for upload lifecycle.


stars:240

forks:75

updated: 2026-04-30 23:37

Related skills in the same repository

ag2-knowledge-and-memory.md

from "ag2ai/build-with-ag2"

Persist agent state across runs, shape what the LLM sees per turn, and cap history to fit a context window. Covers `KnowledgeStore` (memory / sqlite / disk / redis), `KnowledgeConfig` (`store=`, `compact=`, `aggregate=`, `bootstrap=`), aggregation strategies (`WorkingMemoryAggregate`, `ConversationSummaryAggregate`), assembly policies (`WorkingMemoryPolicy`, `EpisodicMemoryPolicy`, `ConversationPolicy`, `SlidingWindowPolicy`, `TokenBudgetPolicy`, `AlertPolicy`), and compaction (`TailWindowCompact`, `SummarizeCompact`). Use when the user wants the agent to remember between conversations, manage long histories, or control prompt assembly.

2026-04-30, 240 stars

ag2-observers-and-alerts.md

from "ag2ai/build-with-ag2"

Monitor an AG2 beta agent's stream — log events, detect repeated tool calls, track token spend, build trigger-driven observers, route observer alerts to the model, and halt on FATAL conditions. Covers `@observer(...)` (stateless), `BaseObserver` (stateful), built-ins (`TokenMonitor`, `LoopDetector`), `Watch` primitives (`EventWatch`, `CadenceWatch`, `DelayWatch`, `IntervalWatch`, `CronWatch`, `AllOf`, `AnyOf`, `Sequence`), `ObserverAlert` (`Severity.INFO/WARNING/CRITICAL/FATAL`), `AlertPolicy`, and `HaltEvent`. Use when the user wants observability, runtime safety guards, alerts, or batch/time-based reactive logic.

2026-04-30, 240 stars

ag2-add-custom-tool.md

from "ag2ai/build-with-ag2"

Add a custom Python tool to an AG2 beta `Agent` using the `@tool` decorator. Use when the user wants to give an Agent a new capability backed by Python code (API calls, DB queries, computations, file ops). Covers sync and async tools, parameter typing, Pydantic schema customisation, returning typed `Input` / `ToolResult` (text / data / images / binary), `final=True` early-exit, and dependency injection via `Context` / `Inject` / `Variable` / `Depends`.

2026-04-30, 240 stars

ag2-ag-ui.md

from "ag2ai/build-with-ag2"

Expose an AG2 beta `Agent` over the AG-UI protocol so a frontend (CopilotKit, custom React/Next.js, or any AG-UI client) can stream responses, render tool calls, sync shared state, and surface human-input checkpoints. Wraps the agent with `AGUIStream(agent)` and mounts it in FastAPI via `stream.dispatch(...)` or `stream.build_asgi()`. Use when the user wants a web frontend in front of an AG2 agent rather than a CLI / script.

2026-04-30, 240 stars

ag2-hitl.md

from "ag2ai/build-with-ag2"

Pause an AG2 beta `Agent` mid-run to collect human input via `context.input()`, or gate a tool call with `approval_required()` middleware. Use when the user wants the agent to ask for confirmation, request missing info (passwords, API keys, data), or have a human approve sensitive / irreversible / expensive tool calls (sending emails, deleting records, payments).

2026-04-30, 240 stars

ag2-middleware.md

from "ag2ai/build-with-ag2"

Intercept the AG2 beta agent loop with `BaseMiddleware` — wrap full turns (`on_turn`), each LLM call (`on_llm_call`), each tool execution (`on_tool_execution`), or each human-input request (`on_human_input`). Use for retry, logging, history trimming, request mutation, tool auditing, guardrails, or rate limiting. Built-ins: `LoggingMiddleware`, `RetryMiddleware`, `HistoryLimiter`, `TokenLimiter`, `TelemetryMiddleware` (see `ag2-telemetry`). For per-tool hooks see also `ag2-add-custom-tool` tool-middleware section.

2026-04-30, 240 stars

package.json

"author": "ag2ai"

"repository": "ag2ai/build-with-ag2"


name	ag2-multimodal-input
description	Send images, audio, video, or documents into an AG2 beta `Agent` alongside text. Pass `ImageInput`, `AudioInput`, `VideoInput`, or `DocumentInput` as positional args to `agent.ask(...)`. Use when the user wants the agent to process non-text input — describe a photo, transcribe audio, summarise a PDF, analyse a video. Covers per-provider support matrix, the four ways to source data (URL / path / bytes / file_id), Gemini-specific YouTube + media-resolution + clipping, OpenAI image-detail, Anthropic prompt-caching on attachments, and `FilesAPI` for upload lifecycle.
license	Apache-2.0

Multimodal inputs

When to use

The user wants the agent to process non-text input: an image to describe, audio to transcribe, video to summarise, or a PDF / document to extract from. The same factory pattern works across providers; per-provider support varies.

60-second recipe

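import asyncio

from autogen.beta import Agent
from autogen.beta.config import GeminiConfig
from autogen.beta.events import ImageInput

agent = Agent(
    "vision",
    "You describe images.",
    config=GeminiConfig(model="gemini-3-flash-preview"),
)


async def main() -> None:
    image = ImageInput("https://example.com/photo.jpg")
    # Text prompt first, then any inputs as positional arguments
    reply = await agent.ask("Describe this image in detail.", image)
    print(reply.body)


# agent.ask is a coroutine, so a plain script needs an event loop to run it
asyncio.run(main())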

Multiple inputs in one ask are fine:

reply = await agent.ask(
    "Compare these two images.",
    ImageInput("https://example.com/before.jpg"),
    ImageInput("https://example.com/after.jpg"),
)

Input factories

Factory	Formats
`ImageInput(...)`	JPEG, PNG, GIF, WebP
`AudioInput(...)`	WAV, MP3, OGG, FLAC, AAC
`VideoInput(...)`	MP4, WebM, MOV, MKV, MPEG
`DocumentInput(...)`	PDF, TXT, HTML, Markdown, CSV, JSON, Office formats

Each accepts the same four data sources:

from autogen.beta.events import ImageInput

ImageInput("https://example.com/photo.jpg")     # URL
ImageInput(path="photo.jpg")                    # local file
ImageInput(data=raw_bytes, media_type="image/png")  # bytes
ImageInput(file_id="file-abc123")               # provider-uploaded
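
The bytes form needs a `media_type` supplied by hand. One way to avoid hard-coding it is to derive it from the filename with the standard library. The sketch below is only an illustration: `guess_media_type` and `bytes_source_kwargs` are hypothetical helpers, not part of AG2, and they do nothing more than build the `data=` / `media_type=` keyword arguments shown above.

```python
import mimetypes
from pathlib import Path


def guess_media_type(filename: str) -> str:
    """Return a MIME type for ``filename`` based on its extension.

    Raises ValueError for an unknown extension, so a wrong media_type
    is never passed along silently.
    """
    media_type, _encoding = mimetypes.guess_type(filename)
    if media_type is None:
        raise ValueError(f"cannot infer media type for {filename!r}")
    return media_type


def bytes_source_kwargs(path: str) -> dict:
    """Hypothetical helper: read a file and build the keyword arguments
    for the bytes form of an input factory."""
    return {
        "data": Path(path).read_bytes(),
        "media_type": guess_media_type(path),
    }
```

Under these assumptions, `ImageInput(**bytes_source_kwargs("photo.png"))` would be equivalent to passing `data=` and `media_type="image/png"` explicitly. When the file already lives on disk, `path=` is the simpler choice; the helper is mainly useful when bytes are produced or transformed in memory first.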

Provider matrix

Input type	OpenAI	OpenAI Responses	Gemini	Anthropic
Text	✓	✓	✓	✓
Image (URL)	✓	✓	✓	✓
Image (binary)	✓	✓	✓	✓
Audio (URL)	–	–	✓	–
Audio (binary)	✓	–	✓	–
Video (URL)	–	–	✓	–
Video (binary)	–	–	✓	–
Document (URL)	–	✓	✓	✓
Document (binary)	–	–	✓	✓
File ID	–	✓	–	✓

Unsupported combinations raise UnsupportedInputError with a clear message.

Gemini has the broadest multimodal support. If you don't know which provider to pick for a multimodal task, start there.
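
For choosing a provider programmatically, the matrix above can be transcribed into plain data. This is a sketch of the table only, not something AG2 exposes: `PROVIDERS`, `MATRIX`, and `providers_supporting` are hypothetical names, and the values are copied from the rows above.

```python
# Column order matches the provider matrix in this document.
PROVIDERS = ("OpenAI", "OpenAI Responses", "Gemini", "Anthropic")

# One row per input type; True means the table shows a tick for that provider.
MATRIX = {
    "Text": (True, True, True, True),
    "Image (URL)": (True, True, True, True),
    "Image (binary)": (True, True, True, True),
    "Audio (URL)": (False, False, True, False),
    "Audio (binary)": (True, False, True, False),
    "Video (URL)": (False, False, True, False),
    "Video (binary)": (False, False, True, False),
    "Document (URL)": (False, True, True, True),
    "Document (binary)": (False, False, True, True),
    "File ID": (False, True, False, True),
}


def providers_supporting(input_type: str) -> list:
    """Return the providers that the table marks as supporting ``input_type``."""
    row = MATRIX[input_type]  # KeyError for a row that is not in the table
    return [name for name, supported in zip(PROVIDERS, row) if supported]
```

A lookup such as `providers_supporting("Audio (binary)")` returns the providers worth trying before the request is ever sent. The table in the AG2 docs remains the source of truth, so a copy like this needs updating whenever it changes.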

Provider-specific niceties

Gemini — YouTube URLs work directly

from autogen.beta.events import VideoInput

video = VideoInput("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
reply = await agent.ask("Summarize this video.", video)

Gemini — large files (> 20MB) via Google Files API

from google import genai
from autogen.beta.events import VideoInput
import time

client = genai.Client()
uploaded = client.files.upload(file="large_video.mp4")
while uploaded.state.name == "PROCESSING":
    time.sleep(2)
    uploaded = client.files.get(name=uploaded.name)

video = VideoInput(uploaded.uri)
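
The polling loop above runs until the state changes, however long that takes. A bounded variant is sketched below. `wait_until_ready` is a hypothetical helper, not part of AG2 or the Google SDK; it takes the fetch and state check as callables, so it makes no assumptions about the client beyond what the snippet above already uses.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until_ready(
    fetch: Callable[[], T],
    is_processing: Callable[[T], bool],
    interval: float = 2.0,
    timeout: float = 300.0,
) -> T:
    """Call ``fetch`` repeatedly until ``is_processing`` returns False.

    Raises TimeoutError if the object is still processing once
    ``timeout`` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    current = fetch()
    while is_processing(current):
        if time.monotonic() >= deadline:
            raise TimeoutError("file is still processing")
        time.sleep(interval)
        current = fetch()
    return current
```

With the names from the snippet above, the loop could then be written as `uploaded = wait_until_ready(lambda: client.files.get(name=uploaded.name), lambda f: f.state.name == "PROCESSING")`. It is probably also worth inspecting the final state before building the `VideoInput`, since leaving `PROCESSING` does not by itself mean the upload succeeded.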

Gemini — `vendor_metadata`

Key	Purpose
`media_resolution`	`MEDIA_RESOLUTION_LOW/MEDIUM/HIGH/ULTRA_HIGH`; lower settings use fewer tokens at the cost of detail
`video_metadata`	Clipping (`start_offset`, `end_offset`) and `fps`
`display_name`	Display name for the file

ImageInput(data=raw, media_type="image/jpeg", vendor_metadata={"media_resolution": "MEDIA_RESOLUTION_LOW"})

VideoInput(path="lecture.mp4", vendor_metadata={
    "video_metadata": {"start_offset": "60s", "end_offset": "120s", "fps": 0.5},
})
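
The clipping dict above is nested and uses string offsets such as `"60s"`, which is easy to get wrong by hand. A small builder can keep the shape consistent. This is a sketch under the assumption that offsets are whole seconds formatted as in the example; `clip_metadata` is a hypothetical helper, not an AG2 function.

```python
from typing import Optional


def clip_metadata(start_s: int, end_s: int, fps: Optional[float] = None) -> dict:
    """Build a vendor_metadata dict for a video clip.

    Offsets are whole seconds and are rendered in the "60s" string form
    used in the example above. ``fps`` is included only when given.
    """
    if start_s < 0 or end_s <= start_s:
        raise ValueError("expected 0 <= start_s < end_s")
    video_metadata = {
        "start_offset": f"{start_s}s",
        "end_offset": f"{end_s}s",
    }
    if fps is not None:
        if fps <= 0:
            raise ValueError("fps must be positive")
        video_metadata["fps"] = fps
    return {"video_metadata": video_metadata}
```

`VideoInput(path="lecture.mp4", vendor_metadata=clip_metadata(60, 120, fps=0.5))` would then produce the same metadata as the literal dict above.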

OpenAI — image detail

ImageInput(data=raw, media_type="image/png", vendor_metadata={"detail": "low"})  # "low" | "high" | "auto"

Anthropic — File ID + prompt caching

import anthropic
from autogen.beta.events import ImageInput, DocumentInput

client = anthropic.Anthropic()
# Open the file in a context manager so the handle is closed after the upload
with open("photo.jpg", "rb") as f:
    uploaded = client.beta.files.upload(file=("photo.jpg", f, "image/jpeg"))

# filename determines block type (image vs document)
image = ImageInput(file_id=uploaded.id, filename="photo.jpg")

# Mark an attachment for prompt caching so later turns can reuse the cached content instead of reprocessing it
doc = DocumentInput(path="report.pdf", vendor_metadata={"cache_control": {"type": "ephemeral"}})

`FilesAPI` — upload lifecycle, provider-agnostic

For any provider that has a file API (OpenAIConfig, OpenAIResponsesConfig, AnthropicConfig, GeminiConfig):

from autogen.beta import FilesAPI
from autogen.beta.config import OpenAIResponsesConfig

files = FilesAPI(OpenAIResponsesConfig(model="gpt-5-mini"))

uploaded = await files.upload(path="report.pdf", purpose="assistants")
print(uploaded.file_id)

# Or from bytes (filename required)
uploaded = await files.upload(data=b"...", filename="hello.txt", purpose="assistants")

# List, read, delete
all_files = await files.list()
data = await files.read(uploaded.file_id)        # NotImplementedError on Gemini
await files.delete(uploaded.file_id)

Pass the file_id to DocumentInput, ImageInput, etc.:

from autogen.beta.events import DocumentInput

doc = DocumentInput(file_id=uploaded.file_id)
reply = await agent.ask("Summarize this report.", doc)
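
Uploaded files stay on the provider until they are deleted, so it can help to tie the upload, use, and delete steps together. The sketch below shows that pattern with `try` / `finally`. To stay runnable without credentials it uses `InMemoryFiles`, a stand-in written for this example that only mimics the `upload` and `delete` method names shown above; it is not the real `FilesAPI`, and `with_uploaded_file` is a hypothetical helper.

```python
import asyncio
from dataclasses import dataclass
from itertools import count


@dataclass
class Uploaded:
    file_id: str


class InMemoryFiles:
    """Stand-in for this example only; not the real FilesAPI."""

    def __init__(self) -> None:
        self.store = {}  # file_id -> bytes
        self._ids = count(1)

    async def upload(self, *, data: bytes, filename: str, purpose: str) -> Uploaded:
        file_id = f"file-{next(self._ids)}"
        self.store[file_id] = data
        return Uploaded(file_id)

    async def delete(self, file_id: str) -> None:
        self.store.pop(file_id, None)


async def with_uploaded_file(files, data: bytes, filename: str, use):
    """Upload, hand the file_id to ``use``, and always delete afterwards."""
    uploaded = await files.upload(data=data, filename=filename, purpose="assistants")
    try:
        return await use(uploaded.file_id)
    finally:
        # Runs even if ``use`` raises, so no file is left behind
        await files.delete(uploaded.file_id)


async def demo():
    files = InMemoryFiles()

    async def summarize(file_id: str) -> str:
        # Real code would build DocumentInput(file_id=file_id) and call agent.ask here
        return f"used {file_id}"

    result = await with_uploaded_file(files, b"...", "hello.txt", summarize)
    return result, files.store
```

Running `asyncio.run(demo())` returns the result of `summarize` together with an empty store, which shows that the file was removed after use. Whether immediate deletion is desirable depends on the workflow: a file that is reused across many turns is better left in place and deleted once at the end.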

Going deeper

website/docs/beta/inputs/inputs.mdx — full provider matrix and vendor_metadata reference.
website/docs/beta/advanced/files.mdx — FilesAPI reference (upload / list / read / delete).
For tools that return images / binary back to the LLM, see ag2-add-custom-tool (ImageInput, BinaryInput, ToolResult).

Common pitfalls

Picking a provider that doesn't support your input type raises UnsupportedInputError. Check the matrix; Gemini is broadest.
FilesAPI.read() on Gemini — raises NotImplementedError. Gemini doesn't expose download.
Calling files.upload(data=...) without filename= — raises ValueError. Filename is required for in-memory uploads.
Providing path= and data= to the same factory — pick one source. Same for file_id=.
Anthropic ImageInput(file_id=...) without filename= — Anthropic decides block type (image vs document) by filename extension. Pass it.
Gemini vendor_metadata keys are nested — video_metadata itself takes a dict. Check the doc table for shape.
Forgetting to wait for Gemini file processing — large uploads have a PROCESSING state. Poll client.files.get(name=...) until ready before referencing the URI.
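
The "pick one source" pitfall above can be caught before an input is constructed. The check below is a minimal sketch: `pick_source` is a hypothetical helper, and AG2's own validation may behave differently or raise a different error.

```python
def pick_source(url=None, path=None, data=None, file_id=None) -> str:
    """Return the name of the single source that was provided.

    Raises ValueError when no source or more than one source is given.
    """
    candidates = (("url", url), ("path", path), ("data", data), ("file_id", file_id))
    given = [name for name, value in candidates if value is not None]
    if len(given) != 1:
        raise ValueError(f"expected exactly one source, got {given or 'none'}")
    return given[0]
```

Calling `pick_source(path="photo.jpg")` returns `"path"`, while passing both `path=` and `data=` raises `ValueError` with the offending names listed.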
