| name | agent-finops |
| description | Design cost-efficient AI agent architectures. Use when optimizing token usage, selecting model tiers, budgeting compute costs, implementing caching strategies, or designing plan-and-execute patterns for cost reduction. Covers model tiering (frontier for planning, cheap for execution), token budgeting, response caching, the plan-and-execute cost reduction pattern (up to 90% savings), and cost monitoring. Based on emerging FinOps-for-AI trends, heterogeneous model architectures, and production cost optimization practices. |
Agent FinOps
Design cost-efficient AI agent architectures with model tiering, token budgeting, and caching.
Workflow
Cost Optimization Workflow
- Audit current token usage per agent component
- Classify tasks by complexity (planning vs execution)
- Assign model tiers to each task class
- Implement caching for repeated queries
- Set up cost monitoring and alerts
Cost Audit Workflow
- Measure tokens consumed per agent interaction
- Identify the most expensive operations
- Check for cacheable or downgradable operations
- Calculate potential savings from model tiering
- Generate cost reduction recommendations
Model Tiering
Three model tiers for different task complexities. Read references for templates.
| Tier | Model Class | Use For | Reference |
|---|
| Frontier | GPT-4o, Claude Opus, Gemini Ultra | Complex reasoning, planning, orchestration | references/01-model-tiering.md |
| Mid-Tier | GPT-4o-mini, Claude Sonnet, Gemini Pro | Standard tasks, code generation | references/01-model-tiering.md |
| Economy | GPT-3.5, Claude Haiku, Gemini Flash | High-frequency, simple execution | references/01-model-tiering.md |
Plan-and-Execute Pattern
Read the reference for the cost reduction architecture.
| Component | Description | Reference |
|---|
| Planner | Frontier model creates strategy (high cost, low frequency) | references/02-plan-and-execute.md |
| Executor | Economy model follows plan (low cost, high frequency) | references/02-plan-and-execute.md |
| Verifier | Mid-tier model checks results (medium cost, as needed) | references/02-plan-and-execute.md |
Token Optimization
Read the reference for token reduction strategies.
| Strategy | Savings | Reference |
|---|
| Response Caching | 40-80% for repeated queries | references/03-token-optimization.md |
| Structured Outputs | 20-40% vs free-form text | references/03-token-optimization.md |
| Context Compression | 30-50% on conversation history | references/03-token-optimization.md |
| Batch Processing | 10-30% on similar requests | references/03-token-optimization.md |
Cost Monitoring
Read the reference for monitoring and alerting setup.
| Metric | Description | Reference |
|---|
| Cost per Interaction | Average spend per user session | references/04-cost-monitoring.md |
| Token Efficiency | Useful output tokens / total tokens | references/04-cost-monitoring.md |
| Cache Hit Rate | Percentage of requests served from cache | references/04-cost-monitoring.md |
| Model Tier Distribution | Percentage of requests per tier | references/04-cost-monitoring.md |
Anti-Patterns
- Frontier Everything — using the most expensive model for all tasks
- No Caching — regenerating identical responses repeatedly
- Token Bloat — verbose system prompts consuming budget on every call
- Invisible Costs — no monitoring, no budget alerts, surprise bills
- Premature Optimization — optimizing cost before validating quality
Validation Scripts
Estimate agent operational costs with automated scoring (0-10):
python3 scripts/estimate_cost.py <prompt_file> [--strict]
Detects model references across 12 LLMs, calculates per-call and monthly costs (1K/10K calls), checks for tiering/caching/budget strategies, and flags cost anti-patterns (premium models for all requests, full history inclusion, disabled caching).