llm-cost-optimizer
// Analyze and reduce LLM spend by mapping call-site overrides to managed profiles (Balanced / Quality / Speed). Covers spend analysis, profile assignment, and config correctness.
| name | llm-cost-optimizer |
| description | Analyze and reduce LLM spend by mapping call-site overrides to managed profiles (Balanced / Quality / Speed). Covers spend analysis, profile assignment, and config correctness. |
| metadata | {"emoji":"💸","vellum":{"display-name":"LLM Cost Optimizer"}} |
This skill walks through analyzing and reducing LLM spend on a Vellum assistant. There are three layers:
- Provider connections (anthropic-managed, my-personal-key)
- Managed profiles: balanced, quality-optimized, cost-optimized.
- Call-site overrides (llm.callSites.<id>): per-task model/profile pinning. Falls back to llm.default when absent.

UI labels for the three managed profiles:

- balanced → Balanced (Sonnet, good for agent loop)
- quality-optimized → Quality (Opus, for hard tasks)
- cost-optimized → Speed (Haiku, for utility/background tasks)

If llm.default is Opus (or any expensive model), every call site without an explicit override is billed at that rate. Don't rely on patching just a few overrides; use the complete turnkey blob in Step 5 to cover every call site at once.
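The fallback described above can be pictured with a small model. This is a simplified sketch, not the platform's resolver: the config shape (a nested `llm` dict whose `default` carries a `profile` key) and the function name `resolve_call_site` are assumptions made for illustration.

```python
from typing import Any


def resolve_call_site(config: dict[str, Any], call_site: str) -> dict[str, Any]:
    """Return the override for call_site, or llm.default when none is set.

    Hypothetical helper: models only the "falls back to llm.default" rule.
    """
    llm = config.get("llm", {})
    override = llm.get("callSites", {}).get(call_site)
    return override if override is not None else llm.get("default", {})


# Assumed config shape: an expensive default and a single explicit override.
config = {
    "llm": {
        "default": {"profile": "quality-optimized"},
        "callSites": {"mainAgent": {"profile": "balanced"}},
    }
}

for site in ("mainAgent", "conversationTitle", "commitMessage"):
    print(site, "->", resolve_call_site(config, site))
```

In this toy config only mainAgent is pinned, so the two utility call sites inherit the expensive default, which is the situation the full blob is meant to prevent.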
# Weekly totals
assistant usage totals --range week
# Break down by call site (most useful: shows what's expensive)
assistant usage breakdown --group-by call_site --range week
# Break down by model
assistant usage breakdown --group-by model --range week
# Break down by profile
assistant usage breakdown --group-by inference_profile --range week
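Once a call-site breakdown is in hand, ranking it by share of total spend makes the expensive call sites obvious. The sketch below assumes the breakdown has been copied into rows with hypothetical `call_site` and `cost_usd` fields; the CLI's actual output format is not documented here, and the numbers are made up for illustration.

```python
def rank_by_cost(rows: list[dict]) -> list[tuple[str, float, float]]:
    """Sort rows by cost (descending) and attach each row's share of the total."""
    total = sum(row["cost_usd"] for row in rows)
    ranked = sorted(rows, key=lambda row: row["cost_usd"], reverse=True)
    return [
        (row["call_site"], row["cost_usd"], row["cost_usd"] / total if total else 0.0)
        for row in ranked
    ]


# Illustrative numbers only, not real usage data.
rows = [
    {"call_site": "memoryExtraction", "cost_usd": 3.0},
    {"call_site": "mainAgent", "cost_usd": 12.0},
    {"call_site": "conversationTitle", "cost_usd": 1.0},
]

for name, cost, share in rank_by_cost(rows):
    print(f"{name:20s} ${cost:6.2f} {share:6.1%}")
```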
Check llm.default; if it points at Opus, that is your biggest risk:
assistant config get llm.default
assistant config get llm.callSites
assistant config get llm.profiles
assistant inference providers connections list
| Profile | Call Sites |
|---|---|
| balanced (Sonnet) | mainAgent, subagentSpawn, compactionAgent, analyzeConversation, patternScan, narrativeRefinement, memoryConsolidation, recall, callAgent, emptyStateGreeting, conversationStarters, identityIntro, proactiveArtifactBuild |
| cost-optimized (Haiku) | Everything else: memoryRouter (with 1M context override), memory extraction/retrieval, UI copy, classifiers, summarization, background tasks |
| quality-optimized (Opus) | Do not pin. Reserved for on-demand user escalation via /model |
Running assistant config set llm.callSites.<key> '{...}' with a JSON object replaces the entire llm.callSites block, not just that key.

- Safe: assistant config set llm.callSites.mainAgent.profile balanced
- Safe: set llm.callSites as a single JSON blob (see Step 5)
- Unsafe: assistant config set llm.callSites.memoryExtraction '{"profile":"cost-optimized"}' (wipes all other overrides)

❌ Wrong (shows "Custom" with empty provider/model in UI, won't track profile updates):
assistant config set llm.callSites.memoryExtraction.model claude-haiku-4-5-20251001
✅ Correct (shows "Speed" in UI):
assistant config set llm.callSites.memoryExtraction.profile cost-optimized
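To find overrides that pin a raw model instead of a profile, a small lint over the llm.callSites block can help. This is a sketch under assumptions: it takes the block as an already-parsed dict (how you obtain and parse it is left open), and the function name `find_model_only_pins` is hypothetical.

```python
def find_model_only_pins(call_sites: dict[str, dict]) -> list[str]:
    """Return call sites that set "model" without "profile".

    Per the guidance above, these show as "Custom" in the UI and will not
    track profile updates.
    """
    return sorted(
        name
        for name, override in call_sites.items()
        if "model" in override and "profile" not in override
    )


# Example block: one raw model pin, one profile-based override.
call_sites = {
    "memoryExtraction": {"model": "claude-haiku-4-5-20251001"},
    "mainAgent": {"profile": "balanced"},
}

print(find_model_only_pins(call_sites))
```

Each name it reports is a candidate for switching to the `.profile` form shown above.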
profile sets provider/model/connection. You can still add effort, maxTokens, temperature, thinking, contextWindow alongside it:
{
  "profile": "cost-optimized",
  "maxTokens": 4096,
  "effort": "low",
  "temperature": 0,
  "thinking": { "enabled": false, "streamThinking": false }
}
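Because writing a JSON object to a single call-site key replaces the whole llm.callSites block, one way to apply an object like the one above is to merge it into the existing block locally and write the block back in a single command. The sketch below assumes the current block is already available as a dict; `merged_call_sites_command` is a hypothetical helper that only builds the command string and does not run it.

```python
import json
import shlex


def merged_call_sites_command(current: dict, call_site: str, override: dict) -> str:
    """Build one `assistant config set llm.callSites ...` command.

    The current block is copied, one call site is replaced, and the whole
    block is emitted as a single shell-quoted JSON argument.
    """
    merged = dict(current)  # shallow copy so the caller's dict is untouched
    merged[call_site] = override
    payload = json.dumps(merged, separators=(",", ":"))
    return "assistant config set llm.callSites " + shlex.quote(payload)


current = {"mainAgent": {"profile": "balanced"}}
override = {
    "profile": "cost-optimized",
    "maxTokens": 4096,
    "effort": "low",
    "temperature": 0,
    "thinking": {"enabled": False, "streamThinking": False},
}

print(merged_call_sites_command(current, "memoryExtraction", override))
```

Using `shlex.quote` avoids hand-escaping the JSON, and the existing mainAgent override survives because the full block is written at once.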
This covers every known call site, so nothing falls back to default. Copy, paste, apply:
Note: The canonical shipped defaults live in assistant/src/config/call-site-defaults.ts. The blob below can be used to override a user's config, but call sites without explicit user overrides already resolve to the defaults defined in that file. If new call sites have been added since this skill was written, add them there (default to cost-optimized unless they involve reasoning or memory consolidation).
assistant config set llm.callSites '{
"mainAgent": {"profile":"balanced"},
"subagentSpawn": {"profile":"balanced"},
"compactionAgent": {"profile":"balanced"},
"analyzeConversation": {"profile":"balanced"},
"patternScan": {"profile":"balanced"},
"narrativeRefinement": {"profile":"balanced"},
"memoryRouter": {"profile":"cost-optimized","contextWindow":{"maxInputTokens":1000000}},
"heartbeatAgent": {"profile":"cost-optimized","maxTokens":2048,"effort":"low","temperature":0,"thinking":{"enabled":false,"streamThinking":false},"contextWindow":{"maxInputTokens":16000}},
"filingAgent": {"profile":"cost-optimized"},
"callAgent": {"profile":"balanced"},
"proactiveArtifactDecision":{"profile":"cost-optimized"},
"proactiveArtifactBuild": {"profile":"balanced"},
"memoryExtraction": {"profile":"cost-optimized"},
"memoryConsolidation": {"profile":"balanced"},
"memoryRetrieval": {"profile":"cost-optimized"},
"memoryRetrospective": {"profile":"cost-optimized"},
"recall": {"profile":"balanced","maxTokens":4096,"effort":"low","thinking":{"enabled":false,"streamThinking":false},"temperature":0},
"memoryV2Migration": {"profile":"cost-optimized"},
"memoryV2Sweep": {"profile":"cost-optimized"},
"memoryV2Consolidation": {"profile":"balanced"},
"conversationSummarization":{"profile":"cost-optimized"},
"commitMessage": {"profile":"cost-optimized","maxTokens":120,"temperature":0.2,"effort":"low","thinking":{"enabled":false}},
"conversationStarters": {"profile":"balanced","effort":"low","thinking":{"enabled":false}},
"replySuggestion": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"conversationTitle": {"profile":"cost-optimized"},
"identityIntro": {"profile":"balanced"},
"emptyStateGreeting": {"profile":"balanced"},
"guardianQuestionCopy": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"approvalCopy": {"profile":"cost-optimized"},
"approvalConversation": {"profile":"cost-optimized"},
"trustRuleSuggestion": {"profile":"cost-optimized"},
"notificationDecision": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"preferenceExtraction": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"interactionClassifier": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"styleAnalyzer": {"profile":"cost-optimized"},
"inviteInstructionGenerator":{"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"skillCategoryInference": {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
"meetConsentMonitor": {"profile":"cost-optimized"},
"meetChatOpportunity": {"profile":"cost-optimized"},
"inference": {"profile":"cost-optimized"}
}'
Then set the active (default) profile to balanced:
assistant config set llm.activeProfile balanced
This controls which profile the app shows as selected in the UI. It also matters because of a platform quirk: llm.activeProfile takes priority over llm.callSites.mainAgent in the resolver (inverted vs all other call sites). Setting both to balanced keeps them aligned.
Then verify:
assistant config get llm.callSites
assistant config get llm.activeProfile
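The values returned by these two commands can be checked mechanically against the rules in this guide. The sketch below is illustrative: it assumes both values are already parsed into Python objects, and `check_call_sites` and the `expected_sites` list are hypothetical names.

```python
from typing import Any


def check_call_sites(
    call_sites: dict[str, dict[str, Any]],
    active_profile: str,
    expected_sites: list[str],
) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    # Every expected call site should have an explicit override.
    for site in expected_sites:
        if site not in call_sites:
            problems.append(f"{site}: missing override")
    # Overrides should use a profile, and never pin quality-optimized.
    for site, override in call_sites.items():
        profile = override.get("profile")
        if profile is None:
            problems.append(f"{site}: no profile set")
        elif profile == "quality-optimized":
            problems.append(f"{site}: pinned to quality-optimized")
    # mainAgent and llm.activeProfile should agree.
    main_profile = call_sites.get("mainAgent", {}).get("profile")
    if main_profile != active_profile:
        problems.append(
            f"mainAgent profile {main_profile!r} differs from activeProfile {active_profile!r}"
        )
    return problems


sample = {
    "mainAgent": {"profile": "balanced"},
    "commitMessage": {"profile": "quality-optimized"},
}
for problem in check_call_sites(sample, "balanced", ["mainAgent", "recall"]):
    print(problem)
```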
Don't pin any call site to quality-optimized. Keep it available as a session:
# User types /model quality-optimized in chat, or:
assistant inference session open quality-optimized --ttl 30m
assistant inference session list
assistant inference session close
If the user has a personal API key, wire it as a custom profile:
# Collect the key securely; never paste it in chat
credential_store prompt --service anthropic --field api_key \
--label "Anthropic API Key" --placeholder "sk-ant-..."
assistant inference providers connections create my-anthropic-key \
--provider anthropic \
--auth api_key \
--credential credential/anthropic/api_key
assistant config set llm.profiles.opus-personal '{"provider":"anthropic","model":"claude-opus-4-8","label":"Opus (Personal)","provider_connection":"my-anthropic-key"}'
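Hand-writing the JSON argument is easy to get wrong, so it can help to generate the command. This is a minimal sketch: the field names are copied from the command above, while the helper `profile_set_command` is hypothetical and only builds a string without touching any config.

```python
import json
import shlex


def profile_set_command(
    name: str, provider: str, model: str, label: str, connection: str
) -> str:
    """Build an `assistant config set llm.profiles.<name> ...` command string."""
    payload = {
        "provider": provider,
        "model": model,
        "label": label,
        "provider_connection": connection,
    }
    return f"assistant config set llm.profiles.{name} {shlex.quote(json.dumps(payload))}"


print(
    profile_set_command(
        "opus-personal",
        "anthropic",
        "claude-opus-4-8",
        "Opus (Personal)",
        "my-anthropic-key",
    )
)
```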
assistant usage totals --range today
assistant usage breakdown --group-by call_site --range today
If a specific call site degrades, bump just that one back to balanced:
# e.g. if memory extraction quality drops:
assistant config set llm.callSites.memoryExtraction.profile balanced
assistant inference providers connections list
assistant inference providers connections get <name>
assistant inference providers connections create <name> --provider <p> --auth api_key --credential <vault-key>
assistant inference providers connections update <name> --auth platform
assistant inference providers connections delete <name>
Canonical connections seeded on every boot: anthropic-managed, openai-managed, gemini-managed (auth=platform, no key needed).
--group-by values: call_site | inference_profile | model | provider | conversation | actor
--range values: today | week | month | all, or explicit --from/--to epoch-ms
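For an explicit window, the --from/--to values are epoch milliseconds. The sketch below computes them from timezone-aware datetimes; the helper names are hypothetical, and it is an assumption that the two flags are passed together in this form.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time ambiguity")
    # Integer division of timedeltas avoids float rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def usage_window_flags(start: datetime, end: datetime) -> str:
    """Format a window as '--from <ms> --to <ms>'."""
    return f"--from {epoch_ms(start)} --to {epoch_ms(end)}"


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
end = start + timedelta(days=7)
print(usage_window_flags(start, end))
```

The resulting string can be appended to a usage command in place of --range.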