一键在 Manus 中运行任何 Skill

ai-agent-development

AI agent architecture and development patterns including tool use, memory systems, planning loops, and multi-agent orchestration. Use when building AI agents, designing tool interfaces, or implementing agent evaluation.

在 Manus 中运行

星标0

分支0

更新时间2026年6月4日 19:36

来源

bendourthe

bendourthe/Nexus-Hub

打开 GitHub 仓库查看创作者相关仓库

安装命令

下载

在 Manus 中运行

文件资源管理器

5 个文件

SKILL.md

readonly

同仓库更多 Skills

同仓库

prompt-engineering

bendourthe/Nexus-Hub

Prompt engineering principles and techniques for LLM applications including system prompts, chain-of-thought, few-shot learning, and prompt evaluation. Use when designing prompts, optimizing LLM outputs, or building prompt pipelines.

2026-06-040

api-design

bendourthe/Nexus-Hub

API design principles for REST, GraphQL, and gRPC including versioning, pagination, error handling, and documentation. Use when designing new APIs, reviewing API contracts, or migrating between API styles.

2026-06-040

architecture-design

bendourthe/Nexus-Hub

System architecture design including requirements analysis, trade-off evaluation, ADRs, and system decomposition. Use when designing new systems, evaluating architectures, or documenting design decisions.

2026-06-040

business-analyst

bendourthe/Nexus-Hub

Business analysis expertise for translating business needs into technical requirements. Use when eliciting requirements from stakeholders, modeling business processes, writing functional specifications, performing gap analysis, defining data dictionaries, or creating acceptance test scenarios from business rules.

2026-06-040

product-manager

bendourthe/Nexus-Hub

Product management analysis for engineering-informed decision framing. Use when a task needs product framing, feature prioritization based on user impact and engineering reality, scope control to prevent complexity creep, or structured now/next/later sequencing with explicit tradeoffs.

2026-06-040

scrum-master

bendourthe/Nexus-Hub

Scrum and agile facilitation expertise for engineering teams. Use when planning sprints, facilitating retrospectives, removing blockers, tracking velocity and burndown, improving team processes, or coaching teams on agile principles and practices.

2026-06-040

一键运行任何 Skill

name	ai-agent-development
description	AI agent architecture and development patterns including tool use, memory systems, planning loops, and multi-agent orchestration. Use when building AI agents, designing tool interfaces, or implementing agent evaluation.
summary_l0	Build AI agents with tool use, memory, planning loops, and multi-agent orchestration
overview_l1	This skill provides comprehensive architecture and implementation guidance for building AI agents that reason, plan, use tools, and collaborate. Use it when designing agent architectures (ReAct, plan-and-execute, reflection), implementing tool-use patterns with function calling or MCP, building memory systems (short-term, long-term, episodic, semantic), creating planning and reasoning loops, orchestrating multi-agent systems, or adding guardrails and evaluation. Key capabilities include architecture pattern selection, tool integration schemas, memory system design, planning loop implementation, multi-agent topologies (supervisor, peer-to-peer, pipeline), guardrail enforcement, trajectory-based evaluation, and observability instrumentation. The expected output is production-grade agent code with structured logging, error handling, and cost tracking. Trigger phrases: build an agent, agent architecture, tool use, function calling, agent memory, planning loop, multi-agent, agent orchestration, agent evaluation, agent guardrails, ReAct pattern, MCP server.

AI Agent Development

Comprehensive patterns and implementation guidance for building AI agents that reason, plan, use tools, and collaborate. Covers the full spectrum from single-agent architectures to multi-agent orchestration, with production-grade error handling, observability, and evaluation.

Effort-level note: when agents run inside loops or parallel fan-out, default the effortLevel to high, not max. Iterative and concurrent runs compound effort-level cost without matching quality gains. See the Effort-Level Strategy section of prompt-engineering/SKILL.md for the decision table. A full Opus 4.7 Anti-Patterns table for agent workloads is maintained alongside this skill.

When to Use This Skill

Use this skill for:

Designing agent architectures (ReAct, plan-and-execute, reflection)
Implementing tool-use patterns with function calling or MCP
Building memory systems (short-term, long-term, episodic, semantic)
Creating planning and reasoning loops
Orchestrating multi-agent systems
Adding guardrails and safety layers to agents
Evaluating and testing agent behavior
Instrumenting agents for observability

Trigger phrases: "build an agent", "agent architecture", "tool use", "function calling", "agent memory", "planning loop", "multi-agent", "agent orchestration", "agent evaluation", "agent guardrails", "ReAct pattern", "MCP server"

What This Skill Does

Provides agent development expertise including:

Architecture Patterns: ReAct, plan-and-execute, reflection, and hybrid approaches
Tool Integration: Function calling schemas, MCP servers, tool selection logic
Memory Systems: Working memory, long-term stores, episodic recall, semantic search
Planning Loops: Goal decomposition, step execution, replanning on failure
Multi-Agent Orchestration: Supervisor, peer-to-peer, and pipeline topologies
Guardrails: Input validation, output filtering, budget limits, human-in-the-loop
Evaluation: Task completion metrics, trajectory analysis, regression testing
Observability: Structured logging, trace propagation, cost tracking

Instructions

Step 1: Choose an Agent Architecture

Select an architecture based on task complexity, latency requirements, and reliability needs.

Architecture Comparison:

Architecture	Best For	Latency	Reliability	Complexity
ReAct	Simple tool-use tasks	Low	Medium	Low
Plan-and-Execute	Multi-step workflows	Medium	High	Medium
Reflection	Quality-critical outputs	High	High	Medium
Multi-Agent	Complex domain tasks	High	Variable	High

ReAct (Reason + Act) Pattern:

The agent interleaves reasoning and action in a loop: think about what to do, execute a tool, observe the result, repeat.

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_codebase",
        "description": (
            "Search the codebase for files matching a regex pattern. "
            "Returns file paths and matching line contents. "
            "Use when looking for function definitions, imports, or string patterns."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "pattern": {
                    "type": "string",
                    "description": "Regex pattern to search for."
                },
                "file_glob": {
                    "type": "string",
                    "description": "Optional glob to filter files (e.g., '*.py').",
                    "default": "*"
                }
            },
            "required": ["pattern"]
        }
    },
    {
        "name": "read_file",
        "description": (
            "Read the full contents of a file by path. "
            "Use when you need to examine a specific file found via search."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Absolute or workspace-relative file path."
                }
            },
            "required": ["path"]
        }
    }
]


def run_react_agent(user_query: str, max_turns: int = 10) -> str:
    """Run a ReAct agent loop with Claude."""
    messages = [{"role": "user", "content": user_query}]

    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=(
                "You are a code analysis agent. Think step-by-step about what "
                "information you need, use tools to gather it, then provide "
                "your analysis. Always explain your reasoning before acting."
            ),
            tools=tools,
            messages=messages,
        )

        # If the model wants to use tools, execute them
        if response.stop_reason == "tool_use":
            # Append the assistant's response (contains tool_use blocks)
            messages.append({"role": "assistant", "content": response.content})

            # Execute each tool call and collect results
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })

            messages.append({"role": "user", "content": tool_results})
        else:
            # Model produced a final text response
            return extract_text(response.content)

    return "Agent reached maximum turn limit without completing."

Plan-and-Execute Pattern:

Separate planning from execution. The planner decomposes the goal into steps; the executor handles each step independently.

from dataclasses import dataclass, field


@dataclass
class Plan:
    goal: str
    steps: list[str] = field(default_factory=list)
    completed: list[int] = field(default_factory=list)
    results: dict[int, str] = field(default_factory=dict)


def create_plan(goal: str) -> Plan:
    """Use the LLM to decompose a goal into executable steps."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=(
            "You are a planning agent. Decompose the user's goal into "
            "a numbered list of concrete, independent steps. Each step "
            "should be actionable with available tools. Output ONLY the "
            "numbered list, nothing else."
        ),
        messages=[{"role": "user", "content": goal}],
    )
    steps = parse_numbered_list(extract_text(response.content))
    return Plan(goal=goal, steps=steps)


def execute_plan(plan: Plan) -> str:
    """Execute each step, replanning if a step fails."""
    for i, step in enumerate(plan.steps):
        if i in plan.completed:
            continue

        result = run_react_agent(
            f"Execute this step: {step}\n\n"
            f"Context from previous steps:\n{format_results(plan.results)}"
        )

        if is_failure(result):
            # Replan from this point forward
            revised = replan(plan, i, result)
            plan.steps = plan.steps[:i] + revised
            continue

        plan.results[i] = result
        plan.completed.append(i)

    return synthesize_results(plan)

Step 2: Implement Tool Integration

Tools are the agent's interface to the external world. Design them for LLM consumption (see the tool-design skill for detailed guidance).

Function Calling Schema Best Practices:

# Good: Specific, documented, constrained
file_edit_tool = {
    "name": "edit_file",
    "description": (
        "Replace a specific string in a file with new content. "
        "The old_string must appear exactly once in the file. "
        "Use when you need to modify existing code. "
        "Do NOT use for creating new files (use write_file instead)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "file_path": {
                "type": "string",
                "description": "Absolute path to the file to edit."
            },
            "old_string": {
                "type": "string",
                "description": "The exact text to find and replace. Must be unique in the file."
            },
            "new_string": {
                "type": "string",
                "description": "The replacement text. Must differ from old_string."
            }
        },
        "required": ["file_path", "old_string", "new_string"]
    }
}

MCP Server Integration:

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def connect_mcp_server(command: str, args: list[str]) -> ClientSession:
    """Connect to an MCP server and return a session."""
    server_params = StdioServerParameters(command=command, args=args)

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools_response = await session.list_tools()
            for tool in tools_response.tools:
                print(f"  {tool.name}: {tool.description}")

            return session


async def call_mcp_tool(session: ClientSession, name: str, arguments: dict):
    """Call a tool on the MCP server and return the result."""
    result = await session.call_tool(name, arguments=arguments)
    return result.content

Tool Execution with Error Handling:

import json
import traceback


def execute_tool(name: str, arguments: dict) -> str:
    """Execute a tool call with structured error handling."""
    try:
        if name == "search_codebase":
            return search_codebase(**arguments)
        elif name == "read_file":
            return read_file(**arguments)
        elif name == "edit_file":
            return edit_file(**arguments)
        else:
            return json.dumps({
                "error": f"Unknown tool: {name}",
                "available_tools": ["search_codebase", "read_file", "edit_file"],
                "suggestion": "Check the tool name and try again."
            })
    except FileNotFoundError as e:
        return json.dumps({
            "error": f"File not found: {e}",
            "suggestion": "Use search_codebase to find the correct file path."
        })
    except PermissionError as e:
        return json.dumps({
            "error": f"Permission denied: {e}",
            "recoverable": False,
            "suggestion": "This file cannot be modified. Ask the user for guidance."
        })
    except Exception as e:
        return json.dumps({
            "error": f"{type(e).__name__}: {e}",
            "traceback": traceback.format_exc(),
            "suggestion": "An unexpected error occurred. Try a different approach."
        })

Step 3: Build Memory Systems

Agents need memory to maintain context across turns, learn from past interactions, and recall relevant information.

Memory Type Overview:

Type	Scope	Storage	Use Case
Working	Current conversation	Message history	Immediate context
Episodic	Past interactions	Database	Recall similar tasks
Semantic	Domain knowledge	Vector store	Fact retrieval
Procedural	Learned workflows	Key-value store	Skill reuse

Working Memory (Conversation Buffer with Summarization):

from dataclasses import dataclass


@dataclass
class WorkingMemory:
    messages: list[dict]
    max_tokens: int = 100_000
    summary: str = ""

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if self.estimate_tokens() > self.max_tokens * 0.8:
            self.compact()

    def compact(self):
        """Summarize older messages to free context space."""
        # Keep the most recent messages intact
        keep_recent = 6
        old_messages = self.messages[:-keep_recent]
        recent_messages = self.messages[-keep_recent:]

        summary_prompt = (
            f"Previous summary: {self.summary}\n\n"
            f"New messages to summarize:\n"
            f"{format_messages(old_messages)}\n\n"
            "Produce a concise summary preserving key decisions, "
            "findings, and action items."
        )
        self.summary = call_llm_for_summary(summary_prompt)
        self.messages = recent_messages

    def get_context(self) -> list[dict]:
        """Return messages with summary prepended if available."""
        if self.summary:
            summary_msg = {
                "role": "user",
                "content": f"[Context from earlier in this conversation]\n{self.summary}"
            }
            return [summary_msg] + self.messages
        return self.messages

    def estimate_tokens(self) -> int:
        return sum(len(m["content"]) // 4 for m in self.messages)

Long-Term Episodic Memory:

import hashlib
from datetime import datetime


class EpisodicMemory:
    """Store and recall past agent interactions by similarity."""

    def __init__(self, vector_store, embedding_model):
        self.store = vector_store
        self.embedder = embedding_model

    def record_episode(self, task: str, trajectory: list[dict], outcome: str):
        """Save a completed task episode for future recall."""
        episode = {
            "task": task,
            "trajectory_summary": summarize_trajectory(trajectory),
            "outcome": outcome,
            "timestamp": datetime.utcnow().isoformat(),
            "id": hashlib.sha256(f"{task}{datetime.utcnow()}".encode()).hexdigest()[:16],
        }
        embedding = self.embedder.embed(f"{task} {outcome}")
        self.store.upsert(id=episode["id"], vector=embedding, metadata=episode)

    def recall(self, current_task: str, top_k: int = 3) -> list[dict]:
        """Recall past episodes similar to the current task."""
        query_embedding = self.embedder.embed(current_task)
        results = self.store.query(vector=query_embedding, top_k=top_k)
        return [
            {
                "task": r.metadata["task"],
                "approach": r.metadata["trajectory_summary"],
                "outcome": r.metadata["outcome"],
                "similarity": r.score,
            }
            for r in results
        ]

Step 4: Implement Planning and Reasoning Loops

Reflection Pattern (Self-Critique Loop):

def reflect_and_improve(task: str, max_iterations: int = 3) -> str:
    """Generate output, critique it, and iterate until quality threshold is met."""
    draft = generate_initial_response(task)

    for iteration in range(max_iterations):
        critique = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Task: {task}\n\n"
                        f"Current output:\n{draft}\n\n"
                        "Critique this output. Identify specific problems with:\n"
                        "1. Correctness: Are there factual or logical errors?\n"
                        "2. Completeness: Is anything missing?\n"
                        "3. Quality: Could the structure or clarity improve?\n\n"
                        "If the output is satisfactory, respond with ONLY 'APPROVED'.\n"
                        "Otherwise, list specific improvements needed."
                    ),
                }
            ],
        )

        critique_text = extract_text(critique.content)

        if "APPROVED" in critique_text:
            return draft

        # Revise based on critique
        revision = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Task: {task}\n\n"
                        f"Previous output:\n{draft}\n\n"
                        f"Critique:\n{critique_text}\n\n"
                        "Revise the output to address every critique point. "
                        "Produce the complete revised version."
                    ),
                }
            ],
        )
        draft = extract_text(revision.content)

    return draft

Goal Decomposition with Dependency Tracking:

@dataclass
class TaskNode:
    id: str
    description: str
    dependencies: list[str] = field(default_factory=list)
    status: str = "pending"  # pending | running | completed | failed
    result: str | None = None


def build_task_graph(goal: str) -> list[TaskNode]:
    """Decompose a goal into a dependency-aware task graph."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=(
            "Decompose the goal into tasks. For each task, specify an ID, "
            "description, and list of dependency IDs (tasks that must complete "
            "first). Output as JSON array."
        ),
        messages=[{"role": "user", "content": goal}],
    )
    raw_tasks = json.loads(extract_text(response.content))
    return [TaskNode(**t) for t in raw_tasks]


def execute_task_graph(tasks: list[TaskNode]) -> dict[str, str]:
    """Execute tasks respecting dependency order."""
    results = {}
    while any(t.status == "pending" for t in tasks):
        ready = [
            t for t in tasks
            if t.status == "pending"
            and all(
                dep_task.status == "completed"
                for dep_task in tasks
                if dep_task.id in t.dependencies
            )
        ]
        for task in ready:
            task.status = "running"
            context = {d: results[d] for d in task.dependencies if d in results}
            task.result = run_react_agent(
                f"Execute: {task.description}\nContext: {json.dumps(context)}"
            )
            task.status = "completed"
            results[task.id] = task.result

    return results

Step 5: Orchestrate Multi-Agent Systems

Supervisor Pattern (Hub-and-Spoke):

@dataclass
class AgentConfig:
    name: str
    system_prompt: str
    tools: list[dict]
    model: str = "claude-sonnet-4-20250514"


class SupervisorOrchestrator:
    """A supervisor agent delegates tasks to specialist agents."""

    def __init__(self, specialists: list[AgentConfig]):
        self.specialists = {s.name: s for s in specialists}

    def run(self, goal: str) -> str:
        """Supervisor decomposes goal and delegates to specialists."""
        specialist_descriptions = "\n".join(
            f"- {s.name}: {s.system_prompt[:100]}..."
            for s in self.specialists.values()
        )

        # Supervisor decides which specialists to invoke and in what order
        plan_response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            system=(
                "You are a supervisor agent. You delegate tasks to specialists.\n"
                f"Available specialists:\n{specialist_descriptions}\n\n"
                "For the given goal, output a JSON array of delegation steps:\n"
                '[{"specialist": "name", "task": "what to do", "depends_on": []}]'
            ),
            messages=[{"role": "user", "content": goal}],
        )

        delegations = json.loads(extract_text(plan_response.content))
        results = {}

        for step in delegations:
            specialist = self.specialists[step["specialist"]]
            context = {d: results[d] for d in step.get("depends_on", []) if d in results}
            result = self._run_specialist(
                specialist, step["task"], context
            )
            results[step["specialist"]] = result

        return self._synthesize(goal, results)

    def _run_specialist(
        self, config: AgentConfig, task: str, context: dict
    ) -> str:
        """Run a single specialist agent."""
        messages = [
            {
                "role": "user",
                "content": f"Task: {task}\nContext: {json.dumps(context)}",
            }
        ]
        response = client.messages.create(
            model=config.model,
            max_tokens=4096,
            system=config.system_prompt,
            tools=config.tools,
            messages=messages,
        )
        return extract_text(response.content)

    def _synthesize(self, goal: str, results: dict) -> str:
        """Combine specialist results into a final answer."""
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Original goal: {goal}\n\n"
                        f"Specialist results:\n{json.dumps(results, indent=2)}\n\n"
                        "Synthesize these results into a coherent final answer."
                    ),
                }
            ],
        )
        return extract_text(response.content)

Step 6: Add Guardrails and Safety

Input Validation Layer:

@dataclass
class GuardrailResult:
    allowed: bool
    reason: str = ""
    modified_input: str | None = None


class AgentGuardrails:
    """Safety layer wrapping agent execution."""

    def __init__(self, max_tool_calls: int = 50, max_cost_usd: float = 1.0):
        self.max_tool_calls = max_tool_calls
        self.max_cost_usd = max_cost_usd
        self.tool_call_count = 0
        self.estimated_cost = 0.0
        self.blocked_patterns = [
            r"rm\s+-rf\s+/",           # Destructive filesystem commands
            r"DROP\s+TABLE",            # SQL injection attempts
            r"curl.*\|\s*bash",         # Remote code execution
        ]

    def check_input(self, user_input: str) -> GuardrailResult:
        """Validate user input before agent processing."""
        import re
        for pattern in self.blocked_patterns:
            if re.search(pattern, user_input, re.IGNORECASE):
                return GuardrailResult(
                    allowed=False,
                    reason=f"Input contains blocked pattern: {pattern}"
                )
        return GuardrailResult(allowed=True)

    def check_tool_call(self, tool_name: str, arguments: dict) -> GuardrailResult:
        """Validate a tool call before execution."""
        self.tool_call_count += 1

        if self.tool_call_count > self.max_tool_calls:
            return GuardrailResult(
                allowed=False,
                reason=f"Tool call limit reached ({self.max_tool_calls})"
            )

        # Block destructive file operations outside workspace
        if tool_name in ("edit_file", "write_file", "delete_file"):
            path = arguments.get("file_path", "")
            if not path.startswith(("/workspace/", "./", "src/")):
                return GuardrailResult(
                    allowed=False,
                    reason=f"File operation outside workspace: {path}"
                )

        return GuardrailResult(allowed=True)

    def check_output(self, output: str) -> GuardrailResult:
        """Validate agent output before returning to user."""
        # Check for leaked secrets or sensitive data patterns
        import re
        sensitive_patterns = [
            r"(?:api[_-]?key|secret|password|token)\s*[:=]\s*\S+",
            r"sk-[a-zA-Z0-9]{20,}",
            r"-----BEGIN (?:RSA )?PRIVATE KEY-----",
        ]
        for pattern in sensitive_patterns:
            if re.search(pattern, output, re.IGNORECASE):
                return GuardrailResult(
                    allowed=False,
                    reason="Output may contain sensitive data. Redacting."
                )
        return GuardrailResult(allowed=True)

Step 7: Evaluate and Test Agents

Evaluation Framework:

@dataclass
class EvalCase:
    """A single evaluation case for an agent."""
    name: str
    input: str
    expected_output: str | None = None       # For exact match
    expected_contains: list[str] | None = None  # For partial match
    max_tool_calls: int = 20
    max_seconds: float = 60.0
    tags: list[str] = field(default_factory=list)


@dataclass
class EvalResult:
    case_name: str
    passed: bool
    output: str
    tool_calls: int
    duration_seconds: float
    failure_reason: str = ""


def run_evaluation(agent_fn, cases: list[EvalCase]) -> list[EvalResult]:
    """Run an evaluation suite against an agent function."""
    import time
    results = []

    for case in cases:
        start = time.time()
        try:
            output = agent_fn(case.input)
            duration = time.time() - start

            passed = True
            reason = ""

            if case.expected_output and output.strip() != case.expected_output.strip():
                passed = False
                reason = f"Expected '{case.expected_output}', got '{output[:200]}'"

            if case.expected_contains:
                missing = [s for s in case.expected_contains if s not in output]
                if missing:
                    passed = False
                    reason = f"Output missing expected strings: {missing}"

            if duration > case.max_seconds:
                passed = False
                reason = f"Timeout: {duration:.1f}s > {case.max_seconds}s"

            results.append(EvalResult(
                case_name=case.name,
                passed=passed,
                output=output[:500],
                tool_calls=0,  # Instrumented via guardrails
                duration_seconds=duration,
                failure_reason=reason,
            ))

        except Exception as e:
            results.append(EvalResult(
                case_name=case.name,
                passed=False,
                output="",
                tool_calls=0,
                duration_seconds=time.time() - start,
                failure_reason=str(e),
            ))

    return results


def print_eval_report(results: list[EvalResult]):
    """Print a summary report of evaluation results."""
    passed = sum(1 for r in results if r.passed)
    total = len(results)
    print(f"\nAgent Evaluation: {passed}/{total} passed ({100*passed/total:.0f}%)\n")

    for r in results:
        status = "PASS" if r.passed else "FAIL"
        print(f"  [{status}] {r.case_name} ({r.duration_seconds:.1f}s)")
        if not r.passed:
            print(f"         Reason: {r.failure_reason}")

Step 8: Instrument for Observability

Structured Logging and Tracing:

import logging
import uuid
from contextvars import ContextVar
from functools import wraps

trace_id_var: ContextVar[str] = ContextVar("trace_id", default="")

logger = logging.getLogger("agent")


def new_trace() -> str:
    """Start a new trace and return its ID."""
    tid = uuid.uuid4().hex[:12]
    trace_id_var.set(tid)
    return tid


def traced(func):
    """Decorator that adds trace context to log messages."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        trace_id = trace_id_var.get()
        logger.info(
            "step_start",
            extra={
                "trace_id": trace_id,
                "step": func.__name__,
                "args_summary": str(args)[:200],
            },
        )
        try:
            result = func(*args, **kwargs)
            logger.info(
                "step_end",
                extra={
                    "trace_id": trace_id,
                    "step": func.__name__,
                    "result_summary": str(result)[:200],
                },
            )
            return result
        except Exception as e:
            logger.error(
                "step_error",
                extra={
                    "trace_id": trace_id,
                    "step": func.__name__,
                    "error": str(e),
                },
            )
            raise
    return wrapper


@traced
def agent_step(task: str) -> str:
    """Example instrumented agent step."""
    return run_react_agent(task)

Cost Tracking:

@dataclass
class UsageTracker:
    """Track token usage and estimated cost across agent runs."""
    input_tokens: int = 0
    output_tokens: int = 0
    tool_calls: int = 0

    # Approximate pricing per million tokens (adjust to current rates)
    INPUT_COST_PER_M = 3.0
    OUTPUT_COST_PER_M = 15.0

    def record(self, response):
        """Record usage from an API response."""
        self.input_tokens += response.usage.input_tokens
        self.output_tokens += response.usage.output_tokens
        self.tool_calls += sum(
            1 for b in response.content if getattr(b, "type", "") == "tool_use"
        )

    @property
    def estimated_cost(self) -> float:
        return (
            (self.input_tokens / 1_000_000) * self.INPUT_COST_PER_M
            + (self.output_tokens / 1_000_000) * self.OUTPUT_COST_PER_M
        )

    def summary(self) -> str:
        return (
            f"Tokens: {self.input_tokens:,} in / {self.output_tokens:,} out | "
            f"Tool calls: {self.tool_calls} | "
            f"Est. cost: ${self.estimated_cost:.4f}"
        )

Best Practices

Start simple: Begin with a ReAct loop; add complexity only when needed
Fail loudly: Agents that silently fail are harder to debug than agents that error clearly
Limit tool count: Keep active tools to 10-20 for reliable selection (see tool-design skill)
Budget turns: Set explicit turn limits to prevent runaway agent loops
Log everything: Structured logs with trace IDs make debugging multi-step agents tractable
Test trajectories, not just outputs: Verify the agent took the right path, not just that it arrived at a plausible answer
Use cheaper models for planning: Planning steps need reasoning; execution steps often need tool use. Route accordingly
Keep memory lean: Summarize aggressively; stale context degrades quality faster than missing context
Add guardrails early: Safety checks are easier to add during development than to retrofit
Version your prompts: System prompts are code; treat them with the same versioning discipline

Common Patterns

Pattern 1: Retry with Backoff

When a tool call fails due to transient errors, retry with exponential backoff rather than immediately asking the user.

import time


def retry_tool_call(name: str, arguments: dict, max_retries: int = 3) -> str:
    """Retry a tool call with exponential backoff."""
    for attempt in range(max_retries):
        result = execute_tool(name, arguments)
        parsed = json.loads(result) if result.startswith("{") else {"output": result}

        if "error" not in parsed or not parsed.get("recoverable", True):
            return result

        wait = 2 ** attempt
        logger.warning(f"Tool {name} failed (attempt {attempt+1}), retrying in {wait}s")
        time.sleep(wait)

    return result  # Return last result even if failed

Pattern 2: Human-in-the-Loop Escalation

For high-stakes actions, pause and request confirmation before executing.

def human_in_the_loop(action_description: str, risk_level: str) -> bool:
    """Request human confirmation for risky actions."""
    if risk_level == "low":
        return True  # Auto-approve low-risk actions

    print(f"\n--- Agent requests approval ---")
    print(f"Action: {action_description}")
    print(f"Risk level: {risk_level}")
    response = input("Approve? [y/N]: ").strip().lower()
    return response == "y"

Pattern 3: Agent-as-Tool (Nested Agents)

Use one agent as a tool for another, creating hierarchical agent systems.

research_tool = {
    "name": "deep_research",
    "description": (
        "Perform in-depth research on a technical topic. "
        "Returns a detailed analysis with sources. "
        "Use for questions requiring multiple search steps."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {
                "type": "string",
                "description": "The research question to investigate."
            }
        },
        "required": ["question"]
    }
}


def execute_research_agent(question: str) -> str:
    """Dedicated research sub-agent with web search tools."""
    return run_react_agent(
        f"Research this thoroughly and provide a detailed answer: {question}",
    )

Common Rationalizations

Rationalization	Reality
"We don't need turn limits — the agent will stop when it's done"	Agents in infinite loops or with misconfigured tools have run for hours and incurred thousands of dollars in API costs before operators noticed; hard turn and cost limits are non-negotiable safety mechanisms, not optional guardrails.
"Guardrails slow down the agent and hurt task performance"	Guardrails that reject invalid tool calls prevent irreversible side effects (deleting production data, sending emails to real users) that no subsequent action can undo; the performance cost of validation is orders of magnitude cheaper than remediation.
"We'll add observability after we validate the agent works"	Agent failures in production are non-deterministic and context-dependent; without structured logs and trace IDs from the first deployment, diagnosing why the agent chose an unexpected action path is practically impossible.
"A single monolithic agent is simpler than multi-agent orchestration"	A single agent with 50+ tools saturates the context window with tool descriptions, degrading routing accuracy; multi-agent systems with 5-10 focused tools per agent consistently outperform monolithic agents on complex tasks.
"Memory is optional for agents that run on short tasks"	Short-task agents that interact with the same user across sessions without memory force users to re-explain context every time; compounding user friction is the primary reason real-world agent adoption stalls.
"We'll handle planning failures by restarting the agent"	Restart without replanning repeats the same failure; agents need explicit failure detection and a replanning step that incorporates the failure as new context, not a blind retry of the same plan.

Anti-Patterns (Opus 4.7)

Three anti-patterns that specifically hurt Opus 4.7 agent workloads. These are distinct from the Common Rationalizations above: they are patterns that feel correct and used to be correct for Opus 4.6, but compound cost without matching quality gains on 4.7.

Anti-pattern	Why it feels right	Why it's wrong in Opus 4.7	What to do instead
Fixed thinking budgets (`max_thinking_tokens=20000`)	Predictable cost per turn; easy to reason about.	Opus 4.7 scales thinking adaptively per turn. A fixed budget truncates reasoning on hard turns and wastes budget on easy ones.	Omit `max_thinking_tokens`; set `effortLevel` instead. Drop one tier (e.g., `xhigh` -> `high`) for cost control.
Excessive tool-calling as "thorough investigation"	"The agent looked at everything, so the answer is well-grounded."	Opus 4.7 prefers reasoning; too many tool calls burn tokens without adding signal and can crowd out the actual problem solving.	Reason first; invoke tools only when external state is needed. Explicit tool-invocation prompts beat implicit "check everything."
`max` effort level on extended runs (loop-operator, temporal workflows, multi-iteration agents)	"Best results always."	`max` compounds per iteration - 20 loop turns at `max` can 2-3x cost vs `xhigh` with no observable quality win on routine subtasks.	Default `xhigh` for runs; reserve `max` for single hard one-shot analyses. For parallel fan-out, de-escalate to `high`.

Cross-links:

Full decision guidance: prompt-engineering - Effort-Level Strategy and the Opus 4.7 Practices section.
Output-side token optimization (distinct from tool-call frequency): prompt-token-optimization.

Verification

Turn limit and cost limit are configured and enforced: agent stops and reports status when limits are reached
Input guardrail validates all tool call arguments before execution; invalid inputs are rejected with an error, not silently corrected
Structured logs with trace IDs exist for every agent run, covering tool calls, tool results, and planning decisions
Evaluation suite runs on at least three scenarios: happy path, tool failure, and ambiguous input
Memory layer (if applicable) persists and retrieves context across sessions: verified by running the same query in two separate sessions
Replanning logic is triggered on tool failure: confirmed by injecting a tool error and observing the agent adjusts its plan

References

references/lifecycle-hooks.md - The agent-being-built lifecycle hook pattern (on_turn_start / on_turn_end / on_tool_call_pre / on_tool_call_post / on_error) for audit logging, retries, persona shifts, cost caps, and structured error recovery. Distinct from AI-assistant runtime hooks.
references/multimodal-ingestion.md - Agent multimodal input patterns: the two ingestion shapes (in-memory bytes vs. filesystem path), supported media families, and mixed prompt lists. Agent input, as distinct from output evaluation.
references/sdk-structured-output.md - Constraining agent output to a Pydantic schema (the response-contract pattern), its failure modes, and the retry-then-fail-closed recovery. Output constraint, as distinct from output evaluation.

Related Skills

[[tool-design]] -- designing tools and APIs for agent consumption
[[prompt-engineering]] -- crafting effective system prompts for agents
[[rag-implementation]] -- building retrieval systems agents can query
[[context-manager]] -- managing context budgets in long agent sessions
[[ai-agent-governance]] -- compliance and governance for AI agent deployments

Version: 1.0.0 Last Updated: March 2026

Iterative Refinement Strategy

This skill is optimized for an iterative approach:

Execute: Perform the core steps defined above.
Review: Critically analyze the output (coverage, quality, completeness).
Refine: If targets aren't met, repeat the specific implementation steps with improved context.
Loop: Continue until the definition of done is satisfied.

name	ai-agent-development
description	AI agent architecture and development patterns including tool use, memory systems, planning loops, and multi-agent orchestration. Use when building AI agents, designing tool interfaces, or implementing agent evaluation.
summary_l0	Build AI agents with tool use, memory, planning loops, and multi-agent orchestration
overview_l1	This skill provides comprehensive architecture and implementation guidance for building AI agents that reason, plan, use tools, and collaborate. Use it when designing agent architectures (ReAct, plan-and-execute, reflection), implementing tool-use patterns with function calling or MCP, building memory systems (short-term, long-term, episodic, semantic), creating planning and reasoning loops, orchestrating multi-agent systems, or adding guardrails and evaluation. Key capabilities include architecture pattern selection, tool integration schemas, memory system design, planning loop implementation, multi-agent topologies (supervisor, peer-to-peer, pipeline), guardrail enforcement, trajectory-based evaluation, and observability instrumentation. The expected output is production-grade agent code with structured logging, error handling, and cost tracking. Trigger phrases: build an agent, agent architecture, tool use, function calling, agent memory, planning loop, multi-agent, agent orchestration, agent evaluation, agent guardrails, ReAct pattern, MCP server.

AI Agent Development

Effort-level note: when agents run inside loops or parallel fan-out, default the effortLevel to high, not max. Iterative and concurrent runs compound effort-level cost without matching quality gains. See the Effort-Level Strategy section of prompt-engineering/SKILL.md for the decision table. A full Opus 4.7 Anti-Patterns table for agent workloads is maintained alongside this skill.

When to Use This Skill

Use this skill for:

Designing agent architectures (ReAct, plan-and-execute, reflection)
Implementing tool-use patterns with function calling or MCP
Building memory systems (short-term, long-term, episodic, semantic)
Creating planning and reasoning loops
Orchestrating multi-agent systems
Adding guardrails and safety layers to agents
Evaluating and testing agent behavior
Instrumenting agents for observability

What This Skill Does

Provides agent development expertise including:

Architecture Patterns: ReAct, plan-and-execute, reflection, and hybrid approaches
Tool Integration: Function calling schemas, MCP servers, tool selection logic
Memory Systems: Working memory, long-term stores, episodic recall, semantic search
Planning Loops: Goal decomposition, step execution, replanning on failure
Multi-Agent Orchestration: Supervisor, peer-to-peer, and pipeline topologies
Guardrails: Input validation, output filtering, budget limits, human-in-the-loop
Evaluation: Task completion metrics, trajectory analysis, regression testing
Observability: Structured logging, trace propagation, cost tracking

Instructions

Step 1: Choose an Agent Architecture

Select an architecture based on task complexity, latency requirements, and reliability needs.

Architecture Comparison:

Architecture	Best For	Latency	Reliability	Complexity
ReAct	Simple tool-use tasks	Low	Medium	Low
Plan-and-Execute	Multi-step workflows	Medium	High	Medium
Reflection	Quality-critical outputs	High	High	Medium
Multi-Agent	Complex domain tasks	High	Variable	High

ReAct (Reason + Act) Pattern:

The agent interleaves reasoning and action in a loop: think about what to do, execute a tool, observe the result, repeat.

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_codebase",
        "description": (
            "Search the codebase for files matching a regex pattern. "
            "Returns file paths and matching line contents. "
            "Use when looking for function definitions, imports, or string patterns."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "pattern": {
                    "type": "string",
                    "description": "Regex pattern to search for."
                },
                "file_glob": {
                    "type": "string",
                    "description": "Optional glob to filter files (e.g., '*.py').",
                    "default": "*"
                }
            },
            "required": ["pattern"]
        }
    },
    {
        "name": "read_file",
        "description": (
            "Read the full contents of a file by path. "
            "Use when you need to examine a specific file found via search."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Absolute or workspace-relative file path."
                }
            },
            "required": ["path"]
        }
    }
]


def run_react_agent(user_query: str, max_turns: int = 10) -> str:
    """Run a ReAct agent loop with Claude."""
    messages = [{"role": "user", "content": user_query}]

    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=(
                "You are a code analysis agent. Think step-by-step about what "
                "information you need, use tools to gather it, then provide "
                "your analysis. Always explain your reasoning before acting."
            ),
            tools=tools,
            messages=messages,
        )

        # If the model wants to use tools, execute them
        if response.stop_reason == "tool_use":
            # Append the assistant's response (contains tool_use blocks)
            messages.append({"role": "assistant", "content": response.content})

            # Execute each tool call and collect results
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })

            messages.append({"role": "user", "content": tool_results})
        else:
            # Model produced a final text response
            return extract_text(response.content)

    return "Agent reached maximum turn limit without completing."

Plan-and-Execute Pattern:

Separate planning from execution. The planner decomposes the goal into steps; the executor handles each step independently.

from dataclasses import dataclass, field


@dataclass
class Plan:
    goal: str
    steps: list[str] = field(default_factory=list)
    completed: list[int] = field(default_factory=list)
    results: dict[int, str] = field(default_factory=dict)


def create_plan(goal: str) -> Plan:
    """Use the LLM to decompose a goal into executable steps."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=(
            "You are a planning agent. Decompose the user's goal into "
            "a numbered list of concrete, independent steps. Each step "
            "should be actionable with available tools. Output ONLY the "
            "numbered list, nothing else."
        ),
        messages=[{"role": "user", "content": goal}],
    )
    steps = parse_numbered_list(extract_text(response.content))
    return Plan(goal=goal, steps=steps)


def execute_plan(plan: Plan) -> str:
    """Execute each step, replanning if a step fails."""
    for i, step in enumerate(plan.steps):
        if i in plan.completed:
            continue

        result = run_react_agent(
            f"Execute this step: {step}\n\n"
            f"Context from previous steps:\n{format_results(plan.results)}"
        )

        if is_failure(result):
            # Replan from this point forward
            revised = replan(plan, i, result)
            plan.steps = plan.steps[:i] + revised
            continue

        plan.results[i] = result
        plan.completed.append(i)

    return synthesize_results(plan)

Step 2: Implement Tool Integration

Tools are the agent's interface to the external world. Design them for LLM consumption (see the tool-design skill for detailed guidance).

Function Calling Schema Best Practices:

# Good: Specific, documented, constrained
file_edit_tool = {
    "name": "edit_file",
    "description": (
        "Replace a specific string in a file with new content. "
        "The old_string must appear exactly once in the file. "
        "Use when you need to modify existing code. "
        "Do NOT use for creating new files (use write_file instead)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "file_path": {
                "type": "string",
                "description": "Absolute path to the file to edit."
            },
            "old_string": {
                "type": "string",
                "description": "The exact text to find and replace. Must be unique in the file."
            },
            "new_string": {
                "type": "string",
                "description": "The replacement text. Must differ from old_string."
            }
        },
        "required": ["file_path", "old_string", "new_string"]
    }
}

MCP Server Integration:

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def connect_mcp_server(command: str, args: list[str]) -> ClientSession:
    """Connect to an MCP server and return a session."""
    server_params = StdioServerParameters(command=command, args=args)

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools_response = await session.list_tools()
            for tool in tools_response.tools:
                print(f"  {tool.name}: {tool.description}")

            return session


async def call_mcp_tool(session: ClientSession, name: str, arguments: dict):
    """Call a tool on the MCP server and return the result."""
    result = await session.call_tool(name, arguments=arguments)
    return result.content

Tool Execution with Error Handling:

import json
import traceback


def execute_tool(name: str, arguments: dict) -> str:
    """Execute a tool call with structured error handling."""
    try:
        if name == "search_codebase":
            return search_codebase(**arguments)
        elif name == "read_file":
            return read_file(**arguments)
        elif name == "edit_file":
            return edit_file(**arguments)
        else:
            return json.dumps({
                "error": f"Unknown tool: {name}",
                "available_tools": ["search_codebase", "read_file", "edit_file"],
                "suggestion": "Check the tool name and try again."
            })
    except FileNotFoundError as e:
        return json.dumps({
            "error": f"File not found: {e}",
            "suggestion": "Use search_codebase to find the correct file path."
        })
    except PermissionError as e:
        return json.dumps({
            "error": f"Permission denied: {e}",
            "recoverable": False,
            "suggestion": "This file cannot be modified. Ask the user for guidance."
        })
    except Exception as e:
        return json.dumps({
            "error": f"{type(e).__name__}: {e}",
            "traceback": traceback.format_exc(),
            "suggestion": "An unexpected error occurred. Try a different approach."
        })

Step 3: Build Memory Systems

Agents need memory to maintain context across turns, learn from past interactions, and recall relevant information.

Memory Type Overview:

Type	Scope	Storage	Use Case
Working	Current conversation	Message history	Immediate context
Episodic	Past interactions	Database	Recall similar tasks
Semantic	Domain knowledge	Vector store	Fact retrieval
Procedural	Learned workflows	Key-value store	Skill reuse

Working Memory (Conversation Buffer with Summarization):

from dataclasses import dataclass


@dataclass
class WorkingMemory:
    messages: list[dict]
    max_tokens: int = 100_000
    summary: str = ""

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        if self.estimate_tokens() > self.max_tokens * 0.8:
            self.compact()

    def compact(self):
        """Summarize older messages to free context space."""
        # Keep the most recent messages intact
        keep_recent = 6
        old_messages = self.messages[:-keep_recent]
        recent_messages = self.messages[-keep_recent:]

        summary_prompt = (
            f"Previous summary: {self.summary}\n\n"
            f"New messages to summarize:\n"
            f"{format_messages(old_messages)}\n\n"
            "Produce a concise summary preserving key decisions, "
            "findings, and action items."
        )
        self.summary = call_llm_for_summary(summary_prompt)
        self.messages = recent_messages

    def get_context(self) -> list[dict]:
        """Return messages with summary prepended if available."""
        if self.summary:
            summary_msg = {
                "role": "user",
                "content": f"[Context from earlier in this conversation]\n{self.summary}"
            }
            return [summary_msg] + self.messages
        return self.messages

    def estimate_tokens(self) -> int:
        return sum(len(m["content"]) // 4 for m in self.messages)

Long-Term Episodic Memory:

import hashlib
from datetime import datetime


class EpisodicMemory:
    """Store and recall past agent interactions by similarity."""

    def __init__(self, vector_store, embedding_model):
        self.store = vector_store
        self.embedder = embedding_model

    def record_episode(self, task: str, trajectory: list[dict], outcome: str):
        """Save a completed task episode for future recall."""
        episode = {
            "task": task,
            "trajectory_summary": summarize_trajectory(trajectory),
            "outcome": outcome,
            "timestamp": datetime.utcnow().isoformat(),
            "id": hashlib.sha256(f"{task}{datetime.utcnow()}".encode()).hexdigest()[:16],
        }
        embedding = self.embedder.embed(f"{task} {outcome}")
        self.store.upsert(id=episode["id"], vector=embedding, metadata=episode)

    def recall(self, current_task: str, top_k: int = 3) -> list[dict]:
        """Recall past episodes similar to the current task."""
        query_embedding = self.embedder.embed(current_task)
        results = self.store.query(vector=query_embedding, top_k=top_k)
        return [
            {
                "task": r.metadata["task"],
                "approach": r.metadata["trajectory_summary"],
                "outcome": r.metadata["outcome"],
                "similarity": r.score,
            }
            for r in results
        ]

Step 4: Implement Planning and Reasoning Loops

Reflection Pattern (Self-Critique Loop):

def reflect_and_improve(task: str, max_iterations: int = 3) -> str:
    """Generate output, critique it, and iterate until quality threshold is met."""
    draft = generate_initial_response(task)

    for iteration in range(max_iterations):
        critique = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Task: {task}\n\n"
                        f"Current output:\n{draft}\n\n"
                        "Critique this output. Identify specific problems with:\n"
                        "1. Correctness: Are there factual or logical errors?\n"
                        "2. Completeness: Is anything missing?\n"
                        "3. Quality: Could the structure or clarity improve?\n\n"
                        "If the output is satisfactory, respond with ONLY 'APPROVED'.\n"
                        "Otherwise, list specific improvements needed."
                    ),
                }
            ],
        )

        critique_text = extract_text(critique.content)

        if "APPROVED" in critique_text:
            return draft

        # Revise based on critique
        revision = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Task: {task}\n\n"
                        f"Previous output:\n{draft}\n\n"
                        f"Critique:\n{critique_text}\n\n"
                        "Revise the output to address every critique point. "
                        "Produce the complete revised version."
                    ),
                }
            ],
        )
        draft = extract_text(revision.content)

    return draft

Goal Decomposition with Dependency Tracking:

@dataclass
class TaskNode:
    id: str
    description: str
    dependencies: list[str] = field(default_factory=list)
    status: str = "pending"  # pending | running | completed | failed
    result: str | None = None


def build_task_graph(goal: str) -> list[TaskNode]:
    """Decompose a goal into a dependency-aware task graph."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=(
            "Decompose the goal into tasks. For each task, specify an ID, "
            "description, and list of dependency IDs (tasks that must complete "
            "first). Output as JSON array."
        ),
        messages=[{"role": "user", "content": goal}],
    )
    raw_tasks = json.loads(extract_text(response.content))
    return [TaskNode(**t) for t in raw_tasks]


def execute_task_graph(tasks: list[TaskNode]) -> dict[str, str]:
    """Execute tasks respecting dependency order."""
    results = {}
    while any(t.status == "pending" for t in tasks):
        ready = [
            t for t in tasks
            if t.status == "pending"
            and all(
                dep_task.status == "completed"
                for dep_task in tasks
                if dep_task.id in t.dependencies
            )
        ]
        for task in ready:
            task.status = "running"
            context = {d: results[d] for d in task.dependencies if d in results}
            task.result = run_react_agent(
                f"Execute: {task.description}\nContext: {json.dumps(context)}"
            )
            task.status = "completed"
            results[task.id] = task.result

    return results

Step 5: Orchestrate Multi-Agent Systems

Supervisor Pattern (Hub-and-Spoke):

@dataclass
class AgentConfig:
    name: str
    system_prompt: str
    tools: list[dict]
    model: str = "claude-sonnet-4-20250514"


class SupervisorOrchestrator:
    """A supervisor agent delegates tasks to specialist agents."""

    def __init__(self, specialists: list[AgentConfig]):
        self.specialists = {s.name: s for s in specialists}

    def run(self, goal: str) -> str:
        """Supervisor decomposes goal and delegates to specialists."""
        specialist_descriptions = "\n".join(
            f"- {s.name}: {s.system_prompt[:100]}..."
            for s in self.specialists.values()
        )

        # Supervisor decides which specialists to invoke and in what order
        plan_response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=2048,
            system=(
                "You are a supervisor agent. You delegate tasks to specialists.\n"
                f"Available specialists:\n{specialist_descriptions}\n\n"
                "For the given goal, output a JSON array of delegation steps:\n"
                '[{"specialist": "name", "task": "what to do", "depends_on": []}]'
            ),
            messages=[{"role": "user", "content": goal}],
        )

        delegations = json.loads(extract_text(plan_response.content))
        results = {}

        for step in delegations:
            specialist = self.specialists[step["specialist"]]
            context = {d: results[d] for d in step.get("depends_on", []) if d in results}
            result = self._run_specialist(
                specialist, step["task"], context
            )
            results[step["specialist"]] = result

        return self._synthesize(goal, results)

    def _run_specialist(
        self, config: AgentConfig, task: str, context: dict
    ) -> str:
        """Run a single specialist agent."""
        messages = [
            {
                "role": "user",
                "content": f"Task: {task}\nContext: {json.dumps(context)}",
            }
        ]
        response = client.messages.create(
            model=config.model,
            max_tokens=4096,
            system=config.system_prompt,
            tools=config.tools,
            messages=messages,
        )
        return extract_text(response.content)

    def _synthesize(self, goal: str, results: dict) -> str:
        """Combine specialist results into a final answer."""
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Original goal: {goal}\n\n"
                        f"Specialist results:\n{json.dumps(results, indent=2)}\n\n"
                        "Synthesize these results into a coherent final answer."
                    ),
                }
            ],
        )
        return extract_text(response.content)

Step 6: Add Guardrails and Safety

Input Validation Layer:

@dataclass
class GuardrailResult:
    allowed: bool
    reason: str = ""
    modified_input: str | None = None


class AgentGuardrails:
    """Safety layer wrapping agent execution."""

    def __init__(self, max_tool_calls: int = 50, max_cost_usd: float = 1.0):
        self.max_tool_calls = max_tool_calls
        self.max_cost_usd = max_cost_usd
        self.tool_call_count = 0
        self.estimated_cost = 0.0
        self.blocked_patterns = [
            r"rm\s+-rf\s+/",           # Destructive filesystem commands
            r"DROP\s+TABLE",            # SQL injection attempts
            r"curl.*\|\s*bash",         # Remote code execution
        ]

    def check_input(self, user_input: str) -> GuardrailResult:
        """Validate user input before agent processing."""
        import re
        for pattern in self.blocked_patterns:
            if re.search(pattern, user_input, re.IGNORECASE):
                return GuardrailResult(
                    allowed=False,
                    reason=f"Input contains blocked pattern: {pattern}"
                )
        return GuardrailResult(allowed=True)

    def check_tool_call(self, tool_name: str, arguments: dict) -> GuardrailResult:
        """Validate a tool call before execution."""
        self.tool_call_count += 1

        if self.tool_call_count > self.max_tool_calls:
            return GuardrailResult(
                allowed=False,
                reason=f"Tool call limit reached ({self.max_tool_calls})"
            )

        # Block destructive file operations outside workspace
        if tool_name in ("edit_file", "write_file", "delete_file"):
            path = arguments.get("file_path", "")
            if not path.startswith(("/workspace/", "./", "src/")):
                return GuardrailResult(
                    allowed=False,
                    reason=f"File operation outside workspace: {path}"
                )

        return GuardrailResult(allowed=True)

    def check_output(self, output: str) -> GuardrailResult:
        """Validate agent output before returning to user."""
        # Check for leaked secrets or sensitive data patterns
        import re
        sensitive_patterns = [
            r"(?:api[_-]?key|secret|password|token)\s*[:=]\s*\S+",
            r"sk-[a-zA-Z0-9]{20,}",
            r"-----BEGIN (?:RSA )?PRIVATE KEY-----",
        ]
        for pattern in sensitive_patterns:
            if re.search(pattern, output, re.IGNORECASE):
                return GuardrailResult(
                    allowed=False,
                    reason="Output may contain sensitive data. Redacting."
                )
        return GuardrailResult(allowed=True)

Step 7: Evaluate and Test Agents

Evaluation Framework:

@dataclass
class EvalCase:
    """A single evaluation case for an agent."""
    name: str
    input: str
    expected_output: str | None = None       # For exact match
    expected_contains: list[str] | None = None  # For partial match
    max_tool_calls: int = 20
    max_seconds: float = 60.0
    tags: list[str] = field(default_factory=list)


@dataclass
class EvalResult:
    case_name: str
    passed: bool
    output: str
    tool_calls: int
    duration_seconds: float
    failure_reason: str = ""


def run_evaluation(agent_fn, cases: list[EvalCase]) -> list[EvalResult]:
    """Run an evaluation suite against an agent function."""
    import time
    results = []

    for case in cases:
        start = time.time()
        try:
            output = agent_fn(case.input)
            duration = time.time() - start

            passed = True
            reason = ""

            if case.expected_output and output.strip() != case.expected_output.strip():
                passed = False
                reason = f"Expected '{case.expected_output}', got '{output[:200]}'"

            if case.expected_contains:
                missing = [s for s in case.expected_contains if s not in output]
                if missing:
                    passed = False
                    reason = f"Output missing expected strings: {missing}"

            if duration > case.max_seconds:
                passed = False
                reason = f"Timeout: {duration:.1f}s > {case.max_seconds}s"

            results.append(EvalResult(
                case_name=case.name,
                passed=passed,
                output=output[:500],
                tool_calls=0,  # Instrumented via guardrails
                duration_seconds=duration,
                failure_reason=reason,
            ))

        except Exception as e:
            results.append(EvalResult(
                case_name=case.name,
                passed=False,
                output="",
                tool_calls=0,
                duration_seconds=time.time() - start,
                failure_reason=str(e),
            ))

    return results


def print_eval_report(results: list[EvalResult]):
    """Print a summary report of evaluation results."""
    passed = sum(1 for r in results if r.passed)
    total = len(results)
    print(f"\nAgent Evaluation: {passed}/{total} passed ({100*passed/total:.0f}%)\n")

    for r in results:
        status = "PASS" if r.passed else "FAIL"
        print(f"  [{status}] {r.case_name} ({r.duration_seconds:.1f}s)")
        if not r.passed:
            print(f"         Reason: {r.failure_reason}")

Step 8: Instrument for Observability

Structured Logging and Tracing:

import logging
import uuid
from contextvars import ContextVar
from functools import wraps

trace_id_var: ContextVar[str] = ContextVar("trace_id", default="")

logger = logging.getLogger("agent")


def new_trace() -> str:
    """Start a new trace and return its ID."""
    tid = uuid.uuid4().hex[:12]
    trace_id_var.set(tid)
    return tid


def traced(func):
    """Decorator that adds trace context to log messages."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        trace_id = trace_id_var.get()
        logger.info(
            "step_start",
            extra={
                "trace_id": trace_id,
                "step": func.__name__,
                "args_summary": str(args)[:200],
            },
        )
        try:
            result = func(*args, **kwargs)
            logger.info(
                "step_end",
                extra={
                    "trace_id": trace_id,
                    "step": func.__name__,
                    "result_summary": str(result)[:200],
                },
            )
            return result
        except Exception as e:
            logger.error(
                "step_error",
                extra={
                    "trace_id": trace_id,
                    "step": func.__name__,
                    "error": str(e),
                },
            )
            raise
    return wrapper


@traced
def agent_step(task: str) -> str:
    """Example instrumented agent step."""
    return run_react_agent(task)

Cost Tracking:

@dataclass
class UsageTracker:
    """Track token usage and estimated cost across agent runs."""
    input_tokens: int = 0
    output_tokens: int = 0
    tool_calls: int = 0

    # Approximate pricing per million tokens (adjust to current rates)
    INPUT_COST_PER_M = 3.0
    OUTPUT_COST_PER_M = 15.0

    def record(self, response):
        """Record usage from an API response."""
        self.input_tokens += response.usage.input_tokens
        self.output_tokens += response.usage.output_tokens
        self.tool_calls += sum(
            1 for b in response.content if getattr(b, "type", "") == "tool_use"
        )

    @property
    def estimated_cost(self) -> float:
        return (
            (self.input_tokens / 1_000_000) * self.INPUT_COST_PER_M
            + (self.output_tokens / 1_000_000) * self.OUTPUT_COST_PER_M
        )

    def summary(self) -> str:
        return (
            f"Tokens: {self.input_tokens:,} in / {self.output_tokens:,} out | "
            f"Tool calls: {self.tool_calls} | "
            f"Est. cost: ${self.estimated_cost:.4f}"
        )

Best Practices

Start simple: Begin with a ReAct loop; add complexity only when needed
Fail loudly: Agents that silently fail are harder to debug than agents that error clearly
Limit tool count: Keep active tools to 10-20 for reliable selection (see tool-design skill)
Budget turns: Set explicit turn limits to prevent runaway agent loops
Log everything: Structured logs with trace IDs make debugging multi-step agents tractable
Test trajectories, not just outputs: Verify the agent took the right path, not just that it arrived at a plausible answer
Use cheaper models for planning: Planning steps need reasoning; execution steps often need tool use. Route accordingly
Keep memory lean: Summarize aggressively; stale context degrades quality faster than missing context
Add guardrails early: Safety checks are easier to add during development than to retrofit
Version your prompts: System prompts are code; treat them with the same versioning discipline

Common Patterns

Pattern 1: Retry with Backoff

When a tool call fails due to transient errors, retry with exponential backoff rather than immediately asking the user.

import time


def retry_tool_call(name: str, arguments: dict, max_retries: int = 3) -> str:
    """Retry a tool call with exponential backoff."""
    for attempt in range(max_retries):
        result = execute_tool(name, arguments)
        parsed = json.loads(result) if result.startswith("{") else {"output": result}

        if "error" not in parsed or not parsed.get("recoverable", True):
            return result

        wait = 2 ** attempt
        logger.warning(f"Tool {name} failed (attempt {attempt+1}), retrying in {wait}s")
        time.sleep(wait)

    return result  # Return last result even if failed

Pattern 2: Human-in-the-Loop Escalation

For high-stakes actions, pause and request confirmation before executing.

def human_in_the_loop(action_description: str, risk_level: str) -> bool:
    """Request human confirmation for risky actions."""
    if risk_level == "low":
        return True  # Auto-approve low-risk actions

    print(f"\n--- Agent requests approval ---")
    print(f"Action: {action_description}")
    print(f"Risk level: {risk_level}")
    response = input("Approve? [y/N]: ").strip().lower()
    return response == "y"

Pattern 3: Agent-as-Tool (Nested Agents)

Use one agent as a tool for another, creating hierarchical agent systems.

research_tool = {
    "name": "deep_research",
    "description": (
        "Perform in-depth research on a technical topic. "
        "Returns a detailed analysis with sources. "
        "Use for questions requiring multiple search steps."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "question": {
                "type": "string",
                "description": "The research question to investigate."
            }
        },
        "required": ["question"]
    }
}


def execute_research_agent(question: str) -> str:
    """Dedicated research sub-agent with web search tools."""
    return run_react_agent(
        f"Research this thoroughly and provide a detailed answer: {question}",
    )

Common Rationalizations

Rationalization	Reality
"We don't need turn limits — the agent will stop when it's done"	Agents in infinite loops or with misconfigured tools have run for hours and incurred thousands of dollars in API costs before operators noticed; hard turn and cost limits are non-negotiable safety mechanisms, not optional guardrails.
"Guardrails slow down the agent and hurt task performance"	Guardrails that reject invalid tool calls prevent irreversible side effects (deleting production data, sending emails to real users) that no subsequent action can undo; the performance cost of validation is orders of magnitude cheaper than remediation.
"We'll add observability after we validate the agent works"	Agent failures in production are non-deterministic and context-dependent; without structured logs and trace IDs from the first deployment, diagnosing why the agent chose an unexpected action path is practically impossible.
"A single monolithic agent is simpler than multi-agent orchestration"	A single agent with 50+ tools saturates the context window with tool descriptions, degrading routing accuracy; multi-agent systems with 5-10 focused tools per agent consistently outperform monolithic agents on complex tasks.
"Memory is optional for agents that run on short tasks"	Short-task agents that interact with the same user across sessions without memory force users to re-explain context every time; compounding user friction is the primary reason real-world agent adoption stalls.
"We'll handle planning failures by restarting the agent"	Restart without replanning repeats the same failure; agents need explicit failure detection and a replanning step that incorporates the failure as new context, not a blind retry of the same plan.

Anti-Patterns (Opus 4.7)

Anti-pattern	Why it feels right	Why it's wrong in Opus 4.7	What to do instead
Fixed thinking budgets (`max_thinking_tokens=20000`)	Predictable cost per turn; easy to reason about.	Opus 4.7 scales thinking adaptively per turn. A fixed budget truncates reasoning on hard turns and wastes budget on easy ones.	Omit `max_thinking_tokens`; set `effortLevel` instead. Drop one tier (e.g., `xhigh` -> `high`) for cost control.
Excessive tool-calling as "thorough investigation"	"The agent looked at everything, so the answer is well-grounded."	Opus 4.7 prefers reasoning; too many tool calls burn tokens without adding signal and can crowd out the actual problem solving.	Reason first; invoke tools only when external state is needed. Explicit tool-invocation prompts beat implicit "check everything."
`max` effort level on extended runs (loop-operator, temporal workflows, multi-iteration agents)	"Best results always."	`max` compounds per iteration - 20 loop turns at `max` can 2-3x cost vs `xhigh` with no observable quality win on routine subtasks.	Default `xhigh` for runs; reserve `max` for single hard one-shot analyses. For parallel fan-out, de-escalate to `high`.

Cross-links:

Full decision guidance: prompt-engineering - Effort-Level Strategy and the Opus 4.7 Practices section.
Output-side token optimization (distinct from tool-call frequency): prompt-token-optimization.

Verification

Turn limit and cost limit are configured and enforced: agent stops and reports status when limits are reached
Input guardrail validates all tool call arguments before execution; invalid inputs are rejected with an error, not silently corrected
Structured logs with trace IDs exist for every agent run, covering tool calls, tool results, and planning decisions
Evaluation suite runs on at least three scenarios: happy path, tool failure, and ambiguous input
Memory layer (if applicable) persists and retrieves context across sessions: verified by running the same query in two separate sessions
Replanning logic is triggered on tool failure: confirmed by injecting a tool error and observing the agent adjusts its plan

References

references/lifecycle-hooks.md - The agent-being-built lifecycle hook pattern (on_turn_start / on_turn_end / on_tool_call_pre / on_tool_call_post / on_error) for audit logging, retries, persona shifts, cost caps, and structured error recovery. Distinct from AI-assistant runtime hooks.
references/multimodal-ingestion.md - Agent multimodal input patterns: the two ingestion shapes (in-memory bytes vs. filesystem path), supported media families, and mixed prompt lists. Agent input, as distinct from output evaluation.
references/sdk-structured-output.md - Constraining agent output to a Pydantic schema (the response-contract pattern), its failure modes, and the retry-then-fail-closed recovery. Output constraint, as distinct from output evaluation.

Related Skills

[[tool-design]] -- designing tools and APIs for agent consumption
[[prompt-engineering]] -- crafting effective system prompts for agents
[[rag-implementation]] -- building retrieval systems agents can query
[[context-manager]] -- managing context budgets in long agent sessions
[[ai-agent-governance]] -- compliance and governance for AI agent deployments

Version: 1.0.0 Last Updated: March 2026

Iterative Refinement Strategy

This skill is optimized for an iterative approach:

Execute: Perform the core steps defined above.
Review: Critically analyze the output (coverage, quality, completeness).
Refine: If targets aren't met, repeat the specific implementation steps with improved context.
Loop: Continue until the definition of done is satisfied.