| name | agent-design-review |
| description | Designs, reviews, and iterates on LLM agents and agent-like workflows. Use when asked to "design an agent", "review this agent", "improve our system prompt", "optimize prompts for caching", "improve tool calling", "reduce hallucinated tool calls", "add structured outputs", "decide if this should be multi-agent", "reduce false positives", "tune agent thresholds", or "build evals for this agent". Covers architecture choice, cache-friendly prompt templates, tool and schema design, runtime loops, trust boundaries, and eval-driven iteration. |
Agent Design Review
Design or review agents by identifying the success contract first, mapping the real execution path second, and changing the smallest layer that is actually causing failures.
Load only what applies:
| Need | Read |
|---|---|
| Choose architecture or multi-agent shape | references/principles.md |
| Rewrite prompts or improve cache reuse | references/prompt-and-caching.md |
| Draft or rewrite an actual system prompt | references/system-prompt-templates.md |
| Draft a provider-specific prompt for OpenAI Responses or Anthropic tool use | references/provider-specific-templates.md |
| Improve tool calling, tool schemas, or final outputs | references/tool-and-schema-design.md |
| Draft or fix actual tool schemas | references/tool-schema-examples.md |
| Review loops, approvals, side effects, or trust boundaries | references/runtime-and-guardrails.md |
| Build evals or decide how to iterate | references/evals-and-iteration.md |
| Review a classifier, matcher, router, extractor, ranker, or moderation agent | references/classifier-agents.md |
| See examples of strong and weak output | references/review-examples.md |
Step 1: Set Mode and Success Contract
Set the task mode first:
- design: a new agent or major redesign
- review: assess an existing agent and prioritize changes
- debug: explain why a current agent is failing and what to change first
Then write a short success contract:
- task the system must complete
- target quality or success metric
- unacceptable failures
- cost and latency budget
- operator or reviewer load constraints
- side effects the system may take
- tools, data sources, and approvals available
- current eval status
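One way to keep the contract honest is to record it as a small structure and flag whatever is still blank. The sketch below is illustrative only: the `SuccessContract` class, its field names, and the ticket-routing values are assumptions, not part of this skill.

```python
from dataclasses import dataclass


@dataclass
class SuccessContract:
    """Hypothetical container mirroring the contract fields listed above."""
    task: str
    success_metric: str
    unacceptable_failures: list[str]
    cost_latency_budget: str
    reviewer_load_limit: str
    allowed_side_effects: list[str]
    available_resources: list[str]  # tools, data sources, approvals
    eval_status: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty, so gaps surface early."""
        return [name for name, value in vars(self).items() if not value]


# Made-up example values for a ticket-routing agent.
contract = SuccessContract(
    task="Route inbound support tickets to the right queue",
    success_metric="",  # not agreed yet
    unacceptable_failures=["auto-closing a ticket without a reply"],
    cost_latency_budget="under 2 s and 1 cent per ticket",
    reviewer_load_limit="at most 50 manual reviews per day",
    allowed_side_effects=["assign queue label"],
    available_resources=["ticket text", "queue list"],
    eval_status="",  # no evals exist yet
)
print(contract.missing_fields())
```

Empty fields are worth raising with the user before any design or review work starts.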
If the user asks only for a prompt rewrite, still check whether retrieval, tools, thresholds, or runtime policy dominate the failures.
Step 2: Choose the Smallest Architecture That Works
Use references/principles.md.
Classify the system before proposing changes:
| Pattern | Use when | Avoid when |
|---|---|---|
| Deterministic workflow | The task is mostly rule-based or decomposes cleanly in code | The model must explore or use tools adaptively |
| Single agent | One prompt plus tools can reliably solve the task in a loop | Prompt complexity or tool overload makes behavior unstable |
| Multi-agent system | Distinct roles, tools, or trust boundaries must stay separate | You are adding agents without a measured bottleneck |
Prefer deterministic preprocessing, retrieval, routing, or thresholds before adding more agent autonomy.
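The table can be read as a decision order that tries the least autonomous option first. The sketch below encodes that reading; the function name and the four boolean flags are hypothetical simplifications, and a real review would weigh more evidence than four flags.

```python
def choose_pattern(
    rule_based: bool,
    needs_adaptive_tools: bool,
    separate_trust_boundaries: bool,
    measured_bottleneck: bool,
) -> str:
    """Pick the smallest pattern that fits, checking the simplest option first."""
    # Rule-based work with no adaptive tool use belongs in plain code.
    if rule_based and not needs_adaptive_tools:
        return "deterministic workflow"
    # Only split into several agents when boundaries must stay separate
    # AND a bottleneck has actually been measured.
    if separate_trust_boundaries and measured_bottleneck:
        return "multi-agent system"
    # Default: one prompt plus tools in a loop.
    return "single agent"


print(choose_pattern(False, True, True, False))
```

Note that separate trust boundaries alone do not trigger the multi-agent branch here; without a measured bottleneck the sketch falls back to a single agent.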
Step 3: Map the Real Execution Path
Write an execution-path summary that names:
- static instructions
- dynamic request context
- deterministic preprocessing and normalization
- retrieval or candidate generation
- tool list and tool descriptions
- loop and stop conditions
- final output schema
- post-model validation or sanitization
- automation thresholds and approval gates
- current evals, traces, tests, or queue feedback
For classifier-style systems, separate deterministic stages from model-driven stages. Do not review only the prompt if code outside the prompt decides most of the behavior.
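A quick way to see whether the prompt is even the right thing to review is to tag each stage as deterministic or model-driven and count which ones can settle the outcome. This is a minimal sketch with a hypothetical `Stage` type and a made-up matcher pipeline.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    kind: str              # "deterministic" or "model"
    decides_outcome: bool  # can this stage alone settle the final result?


def model_share(stages: list[Stage]) -> float:
    """Fraction of outcome-deciding stages that are model-driven."""
    deciding = [s for s in stages if s.decides_outcome]
    if not deciding:
        return 0.0
    return sum(s.kind == "model" for s in deciding) / len(deciding)


# Illustrative execution path for a matcher-style system.
path = [
    Stage("normalize input", "deterministic", False),
    Stage("candidate retrieval", "deterministic", True),
    Stage("LLM match decision", "model", True),
    Stage("confidence threshold gate", "deterministic", True),
    Stage("output sanitization", "deterministic", False),
]
share = model_share(path)
print(f"model-driven share of deciding stages: {share:.2f}")
```

In this example only one of three deciding stages is model-driven, which suggests retrieval and the threshold gate deserve review before the prompt does.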
Step 4: Identify the Primary Bottleneck
Inspect the highest-risk layer first:
| Layer | Check |
|---|---|
| Architecture | Is this over-agentized? |
| Prompt | Is policy explicit, structured, and stable enough for caching? |
| Retrieval | Is the right evidence or candidate set available before the model decides? |
| Tools | Are tool interfaces narrow, typed, and easy to choose correctly? |
| Output contract | Are actions and state machine-checkable? |
| Runtime | Are retries, stop conditions, and fallbacks explicit? |
| Boundaries | Are approvals, auth, and trust boundaries enforced outside the prompt? |
| Thresholds | Do confidence and automation gates map to real consequences? |
| Evals | Can proposed changes be measured? |
Do not default to prompt rewrites if retrieval, thresholds, or post-model guards dominate the failures.
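If failed traces have already been labeled with the layer that caused them, a simple tally can point at the primary bottleneck. The sketch below assumes such labels exist; the `rank_layers` helper and the sample labels are hypothetical.

```python
from collections import Counter

# Layer names taken from the table above, lowercased.
LAYERS = {
    "architecture", "prompt", "retrieval", "tools", "output contract",
    "runtime", "boundaries", "thresholds", "evals",
}


def rank_layers(failure_labels: list[str]) -> list[tuple[str, int]]:
    """Count failures per layer and return layers ordered by failure count."""
    unknown = set(failure_labels) - LAYERS
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    return Counter(failure_labels).most_common()


# Made-up labels from six failed traces.
labels = ["retrieval", "prompt", "retrieval", "thresholds", "retrieval", "prompt"]
ranking = rank_layers(labels)
print(ranking[0])
```

A raw count ignores severity, so treat the top entry as a starting hypothesis rather than a verdict.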
Step 5: Follow the Relevant Review or Design Path
Review or Debug Path
- Summarize the execution path.
- Name the primary bottleneck.
- Report findings ordered by severity.
- Recommend the smallest effective changes first.
- Add an eval plan that can prove whether the changes helped.
For each finding, include:
- layer
- evidence from prompt, tools, code, traces, or tests
- likely impact on quality, cost, or operator load
- smallest effective change
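The finding fields above map naturally onto a record that can be sorted by severity before reporting. This is a sketch only: the `Finding` class and the four-level severity scale are assumptions, since this skill does not prescribe a scale.

```python
from dataclasses import dataclass

# Assumed severity scale; lower rank sorts first.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Finding:
    layer: str
    evidence: str
    impact: str
    smallest_change: str
    severity: str


def order_findings(findings: list[Finding]) -> list[Finding]:
    """Return findings with the most severe first (stable for ties)."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])


findings = [
    Finding("prompt", "policy split across three sections",
            "inconsistent refusals", "merge into one policy block", "medium"),
    Finding("boundaries", "refund tool callable without approval",
            "unreviewed side effects", "enforce approval in code", "critical"),
]
for f in order_findings(findings):
    print(f"[{f.severity}] {f.layer}: {f.smallest_change}")
```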
Design Path
- Define the success contract.
- Justify the architecture choice.
- Draft a stable prompt template.
- Define tool contracts and typed outputs.
- Define loop policy, approvals, and fallback behavior.
- Define the eval plan before extensive iteration.
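Loop policy, approvals, and fallback behavior are easier to reason about when they live in code rather than in the prompt. The following is a minimal sketch of a bounded loop with an approval gate; `run_loop`, the action dictionary shape, and the scripted steps are all hypothetical, and tool execution is left out to keep the control flow visible.

```python
from typing import Callable

Action = dict  # {"type": "final", "answer": ...} or {"type": "tool", "name": ..., "args": ...}


def run_loop(
    step: Callable[[list[Action]], Action],
    needs_approval: Callable[[Action], bool],
    approve: Callable[[Action], bool],
    max_steps: int = 5,
) -> dict:
    """Run the agent until it finishes, is blocked, or hits the step limit."""
    history: list[Action] = []
    for _ in range(max_steps):
        action = step(history)
        if action["type"] == "final":
            return {"status": "done", "answer": action["answer"], "steps": len(history)}
        # The gate is enforced here, outside the prompt.
        if needs_approval(action) and not approve(action):
            return {"status": "blocked", "action": action, "steps": len(history)}
        history.append(action)  # a real loop would also run the tool here
    # Explicit fallback instead of looping forever.
    return {"status": "step_limit", "steps": len(history)}


# Scripted stand-in for a model: look up an order, then try to refund it.
script = [
    {"type": "tool", "name": "lookup_order", "args": {"id": "42"}},
    {"type": "tool", "name": "issue_refund", "args": {"id": "42"}},
    {"type": "final", "answer": "refunded"},
]
result = run_loop(
    step=lambda history: script[len(history)],
    needs_approval=lambda a: a["name"] == "issue_refund",
    approve=lambda a: False,  # reviewer declines
)
print(result["status"], result["steps"])
```

Every exit path returns an explicit status, so callers can distinguish a finished run from a blocked or exhausted one.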
If you write a prompt, return a cache-friendly prompt skeleton with clear slots for dynamic inputs rather than an unstructured wall of text.
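A prompt skeleton of this kind might look like the sketch below, assuming the provider's cache matches on a shared prompt prefix. The role text, policy lines, slot names, and `build_prompt` helper are illustrative placeholders.

```python
# Static part: identical on every request, so it can form a reusable prefix.
STATIC_PREFIX = """\
# Role
You are a ticket-routing assistant.

# Policy
- Choose exactly one queue from the allowed list.
- If no queue fits, answer "needs_review".
"""

# Dynamic part: per-request slots, always placed after the static prefix.
DYNAMIC_TEMPLATE = """\
# Allowed queues
{queues}

# Ticket
{ticket_text}
"""


def build_prompt(queues: list[str], ticket_text: str) -> str:
    """Fill the dynamic slots and append them to the unchanged static prefix."""
    dynamic = DYNAMIC_TEMPLATE.format(
        # Sorting keeps the rendered text stable when input order varies.
        queues="\n".join(f"- {q}" for q in sorted(queues)),
        ticket_text=ticket_text.strip(),
    )
    return STATIC_PREFIX + "\n" + dynamic


prompt = build_prompt(["technical", "billing"], "My app crashes on login.")
print(prompt)
```

Keeping timestamps, user names, and other per-request values out of `STATIC_PREFIX` is what makes the prefix stable across requests.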
If you write tool schemas, return concrete schema drafts with parameter descriptions, enums, and required fields instead of only high-level advice.
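As an illustration of a concrete schema draft, the sketch below describes one tool in JSON Schema style and checks a call against it. The tool name, its fields, and the `check_call` helper are hypothetical, and the envelope key (`parameters` here) varies by provider.

```python
ASSIGN_QUEUE_TOOL = {
    "name": "assign_queue",
    "description": "Assign a support ticket to exactly one queue. "
                   "Call only after reading the full ticket text.",
    "parameters": {
        "type": "object",
        "properties": {
            "ticket_id": {
                "type": "string",
                "description": "Ticket identifier exactly as given in the request.",
            },
            "queue": {
                "type": "string",
                "enum": ["billing", "technical", "account"],
                "description": "Destination queue. Must be one of the listed values.",
            },
            "reason": {
                "type": "string",
                "description": "One-sentence justification for the choice.",
            },
        },
        "required": ["ticket_id", "queue"],
        "additionalProperties": False,
    },
}


def check_call(tool: dict, args: dict) -> list[str]:
    """Return a list of problems with a proposed call (empty means it passes).

    Covers only required fields, unknown fields, and enums; it is not a
    full JSON Schema validator.
    """
    params = tool["parameters"]
    problems = [
        f"missing required field: {name}"
        for name in params["required"] if name not in args
    ]
    for name, value in args.items():
        spec = params["properties"].get(name)
        if spec is None:
            problems.append(f"unknown field: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"invalid value for {name}: {value!r}")
    return problems


print(check_call(ASSIGN_QUEUE_TOOL, {"queue": "sales"}))
```

Running a check like this after the model proposes a call catches hallucinated arguments before any side effect happens.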
Output Format
When reviewing or debugging, produce:
- Success contract
- Execution-path summary
- Architecture verdict
- Primary bottleneck
- Findings
- Suggested changes
- Eval plan
When designing, produce:
- Success contract
- Proposed execution path
- Architecture rationale
- Prompt skeleton
- Tool and schema design
- Runtime policy and guardrails
- Eval plan
Exit Criteria
The work is complete only when:
- the success contract is explicit
- the architecture choice is justified
- the primary bottleneck is named
- prompt, tools, outputs, runtime, boundaries, and eval gaps are each addressed or explicitly ruled out
- recommendations are ordered from smallest effective change to larger redesign
- the eval plan can measure improvement
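For the last criterion, the eval plan needs at least a fixed case set and a before/after comparison. The sketch below shows the smallest version of that idea; the `pass_rate` and `compare` helpers, the stub predictors, and the four cases are made up for illustration.

```python
from typing import Callable

Case = tuple[str, str]  # (input text, expected label)


def pass_rate(predict: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the prediction equals the expected label."""
    passed = sum(predict(text) == expected for text, expected in cases)
    return passed / len(cases)


def compare(baseline: Callable[[str], str],
            candidate: Callable[[str], str],
            cases: list[Case]) -> dict:
    """Score both versions on the same cases and report the difference."""
    b, c = pass_rate(baseline, cases), pass_rate(candidate, cases)
    return {"baseline": b, "candidate": c, "delta": c - b}


cases = [
    ("refund please", "billing"),
    ("app crashes", "technical"),
    ("charged twice", "billing"),
    ("reset password", "account"),
]

# Stubs standing in for the current agent and a proposed change.
def baseline(text: str) -> str:
    return "billing"

def candidate(text: str) -> str:
    return "technical" if "crash" in text else "billing"

report = compare(baseline, candidate, cases)
print(report)
```

Four cases are far too few to trust a delta; the point is that both versions run on the same frozen set so the comparison is meaningful.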