llm-cost-optimizer
// Analyze and reduce LLM spend by mapping call-site overrides to managed profiles (Balanced / Quality / Speed). Covers spend analysis, profile assignment, and config correctness.
| name | LLM Cost Optimizer |
| description | Analyze and reduce LLM spend by mapping call-site overrides to managed profiles (Balanced / Quality / Speed). Covers spend analysis, profile assignment, and config correctness. |
| metadata | {"vellum":{"emoji":"💸"}} |
This skill walks through analyzing and reducing LLM spend on a Vellum assistant. There are three layers:
- Provider connections (e.g. anthropic-managed, my-personal-key)
- Profiles: balanced, quality-optimized, cost-optimized.
- Call-site overrides (llm.callSites.<id>): per-task model/profile pinning. Falls back to llm.default when absent.

UI labels for the three managed profiles:
- balanced → Balanced (Sonnet, good for agent loop)
- quality-optimized → Quality (Opus, for hard tasks)
- cost-optimized → Speed (Haiku, for utility/background tasks)

If llm.default is Opus (or any expensive model), every call site without an explicit override burns that rate. Don't rely on just patching a few overrides; use the complete turnkey blob in Step 5 to cover every call site at once.
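To see how much of the configuration rides on llm.default, a minimal sketch can list the call sites that have no explicit override. The function name and the sample IDs are illustrative only; the real list of call sites comes from your own config.

```python
def fallback_call_sites(known_ids, overrides):
    """Return the call sites that have no explicit override and therefore
    resolve to llm.default (per the fallback rule described above)."""
    return sorted(cs for cs in known_ids if cs not in overrides)


# Illustrative input only: three call-site IDs, one of them overridden.
known = ["mainAgent", "memoryExtraction", "conversationTitle"]
overrides = {"mainAgent": {"profile": "balanced"}}

print(fallback_call_sites(known, overrides))
```

An empty result means nothing falls back to llm.default, so an expensive default can no longer leak into background tasks.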
# Weekly totals
assistant usage totals --range week
# Break down by call site (most useful — shows what's expensive)
assistant usage breakdown --group-by call_site --range week
# Break down by model
assistant usage breakdown --group-by model --range week
# Break down by profile
assistant usage breakdown --group-by inference_profile --range week
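Once you have a per-call-site breakdown, ranking by share of total spend makes the expensive call sites obvious. The sketch below assumes you have already copied the breakdown into (call_site, cost) pairs by hand. The CLI's output format is not specified here, and the figures are made up for illustration.

```python
def rank_by_spend(rows):
    """Sort (call_site, cost) pairs by cost, descending, and attach each
    call site's share of the total as a percentage."""
    total = sum(cost for _, cost in rows)
    if total == 0:
        return []
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(name, cost, round(100 * cost / total, 1)) for name, cost in ranked]


# Made-up numbers, not real usage data.
sample = [("memoryExtraction", 2.0), ("mainAgent", 6.0), ("conversationTitle", 2.0)]
for name, cost, share in rank_by_spend(sample):
    print(f"{name:20s} {cost:8.2f} {share:5.1f}%")
```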
Check llm.default — if it's pointing at Opus, that's your biggest risk:
assistant config get llm.default
assistant config get llm.callSites
assistant config get llm.profiles
assistant inference providers connections list
| Profile | Call Sites |
|---|---|
| balanced (Sonnet) | mainAgent, subagentSpawn, compactionAgent, analyzeConversation, patternScan, narrativeRefinement, memoryRouter, memoryConsolidation |
| cost-optimized (Haiku) | Everything else: memory extraction/retrieval, UI copy, classifiers, summarization, background tasks |
| quality-optimized (Opus) | Do not pin. Reserved for on-demand user escalation via /model |
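The assignment rule in the table can be written as a small lookup. This is a sketch of the table's policy, not a platform API; the names `BALANCED_CALL_SITES` and `recommended_profile` are hypothetical.

```python
# Call sites the table above assigns to the balanced profile.
BALANCED_CALL_SITES = frozenset({
    "mainAgent", "subagentSpawn", "compactionAgent", "analyzeConversation",
    "patternScan", "narrativeRefinement", "memoryRouter", "memoryConsolidation",
})


def recommended_profile(call_site):
    """Return the profile the table recommends for a call site.

    quality-optimized is never returned: the table reserves it for
    on-demand escalation rather than pinning."""
    return "balanced" if call_site in BALANCED_CALL_SITES else "cost-optimized"


print(recommended_profile("mainAgent"))      # balanced
print(recommended_profile("commitMessage"))  # cost-optimized
```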
assistant config set llm.callSites.<key> '{...}' with a JSON object replaces the entire llm.callSites block, not just that key.
- Safe (single leaf key): assistant config set llm.callSites.mainAgent.profile balanced
- Safe: set llm.callSites as a single JSON blob (see Step 5)
- Unsafe: assistant config set llm.callSites.memoryExtraction '{"profile":"cost-optimized"}' (wipes all other overrides)

Pin call sites by profile, not by model.

❌ Wrong (shows "Custom" with empty provider/model in UI, won't track profile updates):
assistant config set llm.callSites.memoryExtraction.model claude-haiku-4-5-20251001
✅ Correct (shows "Speed" in UI):
assistant config set llm.callSites.memoryExtraction.profile cost-optimized
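Because a JSON-object write replaces the whole llm.callSites block, one way to change a single call site with a JSON value is read-modify-write: start from the current block, change one entry, and write the full block back. A minimal sketch, assuming you paste in the current block from `assistant config get llm.callSites` yourself. The helper name is hypothetical, and the sketch only builds the command string without running it.

```python
import json
import shlex


def full_block_command(current_call_sites, call_site, override):
    """Build one `assistant config set llm.callSites '<json>'` command that
    keeps every existing override and changes only `call_site`."""
    merged = dict(current_call_sites)  # copy so the caller's dict is untouched
    merged[call_site] = override
    blob = json.dumps(merged, separators=(",", ":"))
    return "assistant config set llm.callSites " + shlex.quote(blob)


current = {"mainAgent": {"profile": "balanced"}}
print(full_block_command(current, "memoryExtraction", {"profile": "cost-optimized"}))
```

For a plain profile change, the leaf-key form (`llm.callSites.<id>.profile`) remains the simpler option; the full-block route is only needed when the override carries several fields.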
profile sets provider/model/connection. You can still add effort, maxTokens, temperature, thinking, contextWindow alongside it:
{
  "profile": "cost-optimized",
  "maxTokens": 4096,
  "effort": "low",
  "temperature": 0,
  "thinking": { "enabled": false, "streamThinking": false }
}
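A quick lint over an override can catch the model-pinning mistake before it reaches the config. The allowed key set below is taken from the list above. Whether the platform accepts other keys is not covered here, so treat the check as a sketch with hypothetical names.

```python
# Tuning keys the text above says may sit alongside "profile".
TUNING_KEYS = {"effort", "maxTokens", "temperature", "thinking", "contextWindow"}


def lint_override(override):
    """Return a list of human-readable problems found in one call-site override."""
    problems = []
    if "profile" not in override:
        problems.append("missing 'profile'")
    if "model" in override:
        problems.append("pins 'model' directly; use 'profile' instead")
    unknown = sorted(set(override) - TUNING_KEYS - {"profile", "model"})
    if unknown:
        problems.append("unrecognized keys: " + ", ".join(unknown))
    return problems


print(lint_override({"profile": "cost-optimized", "maxTokens": 4096}))
print(lint_override({"model": "claude-haiku-4-5-20251001"}))
```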
This covers every known call site — nothing falls back to default. Copy, paste, apply:
assistant config set llm.callSites '{
"mainAgent": {"profile":"balanced"},
"subagentSpawn": {"profile":"balanced"},
"compactionAgent": {"profile":"balanced"},
"analyzeConversation": {"profile":"balanced"},
"patternScan": {"profile":"balanced"},
"narrativeRefinement": {"profile":"balanced"},
"memoryRouter": {"profile":"balanced","contextWindow":{"maxInputTokens":1000000}},
"heartbeatAgent": {"profile":"cost-optimized","maxTokens":2048,"effort":"low","temperature":0,"thinking":{"enabled":false,"streamThinking":false},"contextWindow":{"maxInputTokens":16000}},
"filingAgent": {"profile":"cost-optimized"},
"callAgent": {"profile":"cost-optimized"},
"proactiveArtifactDecision":{"profile":"cost-optimized"},
"proactiveArtifactBuild": {"profile":"cost-optimized"},
"memoryExtraction": {"profile":"cost-optimized"},
"memoryConsolidation": {"profile":"balanced"},
"memoryRetrieval": {"profile":"cost-optimized"},
"recall": {"profile":"cost-optimized","maxTokens":4096,"effort":"low","thinking":{"enabled":false,"streamThinking":false},"temperature":0},
"memoryV2Migration": {"profile":"cost-optimized"},
"memoryV2Sweep": {"profile":"cost-optimized"},
"memoryV2Consolidation": {"profile":"cost-optimized"},
"conversationSummarization":{"profile":"cost-optimized"},
"commitMessage": {"profile":"cost-optimized","maxTokens":120,"temperature":0.2,"effort":"low","thinking":{"enabled":false}},
"conversationStarters": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"replySuggestion": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"conversationTitle": {"profile":"cost-optimized"},
"identityIntro": {"profile":"cost-optimized"},
"emptyStateGreeting": {"profile":"cost-optimized"},
"guardianQuestionCopy": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"approvalCopy": {"profile":"cost-optimized"},
"approvalConversation": {"profile":"cost-optimized"},
"feedEventCopy": {"profile":"cost-optimized"},
"trustRuleSuggestion": {"profile":"cost-optimized"},
"notificationDecision": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"preferenceExtraction": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"interactionClassifier": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"styleAnalyzer": {"profile":"cost-optimized"},
"inviteInstructionGenerator":{"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"skillCategoryInference": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"meetConsentMonitor": {"profile":"cost-optimized"},
"meetChatOpportunity": {"profile":"cost-optimized"},
"inference": {"profile":"cost-optimized"}
}'
Then set the active (default) profile to balanced:
assistant config set llm.activeProfile balanced
This controls what the app shows as the selected profile in the UI, and matters because of a platform quirk: llm.activeProfile takes priority over llm.callSites.mainAgent in the resolver (inverted vs all other call sites). Setting both to balanced keeps them aligned.
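The precedence described above can be sketched as a small resolver. This illustrates the stated behavior only and is not the platform's actual resolver: the function name and dict shape are assumptions, and llm.default is simplified to a single profile name.

```python
def effective_profile(config, call_site):
    """Resolve the profile for a call site under the rules stated in the text:
    - mainAgent: llm.activeProfile wins over llm.callSites.mainAgent
    - any other call site: its override wins, else fall back to the default."""
    llm = config.get("llm", {})
    override = llm.get("callSites", {}).get(call_site, {}).get("profile")
    active = llm.get("activeProfile")
    if call_site == "mainAgent" and active is not None:
        return active
    if override is not None:
        return override
    return llm.get("default")  # simplified: treated here as a profile name


# Misaligned example: the mainAgent override says balanced, but activeProfile wins.
config = {"llm": {
    "activeProfile": "quality-optimized",
    "default": "balanced",
    "callSites": {
        "mainAgent": {"profile": "balanced"},
        "recall": {"profile": "cost-optimized"},
    },
}}
print(effective_profile(config, "mainAgent"))
print(effective_profile(config, "recall"))
```

The misaligned case is the reason to set both values to balanced: otherwise the main agent can silently run on a more expensive profile than its override suggests.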
Then verify:
assistant config get llm.callSites
assistant config get llm.activeProfile
Don't pin any call site to quality-optimized. Keep it available as a session:
# User types /model quality-optimized in chat, or:
assistant inference session open quality-optimized --ttl 30m
assistant inference session list
assistant inference session close
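To confirm that nothing is pinned to the expensive profile, a small check can scan the call-site block. A sketch only, assuming the block has been loaded into a dict from the output of `assistant config get llm.callSites`; the function name is hypothetical.

```python
def pinned_to_quality(call_sites):
    """Return the call sites whose override pins the quality-optimized profile."""
    return sorted(
        name for name, override in call_sites.items()
        if override.get("profile") == "quality-optimized"
    )


block = {
    "mainAgent": {"profile": "balanced"},
    "recall": {"profile": "quality-optimized"},
}
print(pinned_to_quality(block))
```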
If the user has a personal API key, wire it as a custom profile:
assistant keys set anthropic sk-ant-...
assistant inference providers connections create my-anthropic-key \
--provider anthropic \
--auth api_key \
--credential credential/anthropic/api_key
assistant config set llm.profiles.opus-personal '{"provider":"anthropic","model":"claude-opus-4-7","label":"Opus (Personal)","provider_connection":"my-anthropic-key"}'
assistant usage totals --range today
assistant usage breakdown --group-by call_site --range today
If a specific call site degrades, bump just that one back to balanced:
# e.g. if memory extraction quality drops:
assistant config set llm.callSites.memoryExtraction.profile balanced
assistant inference providers connections list
assistant inference providers connections get <name>
assistant inference providers connections create <name> --provider <p> --auth api_key --credential <vault-key>
assistant inference providers connections update <name> --auth platform
assistant inference providers connections delete <name>
Canonical connections seeded on every boot: anthropic-managed, openai-managed, gemini-managed (auth=platform, no key needed).
--group-by values: call_site | inference_profile | model | provider | conversation | actor
--range values: today | week | month | all, or an explicit --from/--to window in epoch milliseconds
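For an explicit window, the epoch-millisecond values can be computed rather than typed by hand. A minimal sketch, assuming --from/--to are passed in place of --range on the usage commands; the dates are arbitrary examples.

```python
from datetime import datetime, timezone


def epoch_ms(dt):
    """Convert a timezone-aware datetime to integer milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    return int(dt.timestamp() * 1000)


# Arbitrary one-week window in UTC.
start = epoch_ms(datetime(2024, 1, 1, tzinfo=timezone.utc))
end = epoch_ms(datetime(2024, 1, 8, tzinfo=timezone.utc))
print(f"assistant usage totals --from {start} --to {end}")
```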