llm-cost-optimizer

Name: Llm Cost Optimizer
Author: vellum-ai

Analyze and reduce LLM spend by mapping call-site overrides to managed profiles (Balanced / Quality / Speed). Covers spend analysis, profile assignment, and config correctness.

Updated: May 13, 2026, 18:54

package.json

"author": "vellum-ai"

"repository": "vellum-ai/llm-cost-optimizer"

name	LLM Cost Optimizer
description	Analyze and reduce LLM spend by mapping call-site overrides to managed profiles (Balanced / Quality / Speed). Covers spend analysis, profile assignment, and config correctness.
metadata	{"vellum":{"emoji":"💸"}}

Overview

This skill walks through analyzing and reducing LLM spend on a Vellum assistant. There are three layers:

Provider connections — named auth configs (e.g. anthropic-managed, my-personal-key)
Model profiles — named presets (model + effort + thinking + contextWindow). Three managed defaults: balanced, quality-optimized, cost-optimized.
Call-site overrides (llm.callSites.<id>) — per-task model/profile pinning. Falls back to llm.default when absent.

UI labels for the three managed profiles:

balanced → Balanced (Sonnet, good for agent loop)
quality-optimized → Quality (Opus, for hard tasks)
cost-optimized → Speed (Haiku, for utility/background tasks)

🚨 Critical: unoverridden call sites fall back to `llm.default`

If llm.default points at Opus (or any other expensive model), every call site without an explicit override is billed at that rate. Patching a few overrides is not enough; use the complete turnkey blob in Step 5 to cover every call site at once.
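
The fallback described above can be pictured with a small Python sketch. It is a simplified model and not the platform's actual resolver: the function name `resolve_call_site` and the shape of `llm.default` as a `{"profile": ...}` object are assumptions made for illustration.

```python
from typing import Any


def resolve_call_site(config: dict[str, Any], call_site: str) -> dict[str, Any]:
    """Return the override for call_site, or llm.default when none exists.

    Hypothetical helper: it only models the documented fallback rule.
    """
    llm = config.get("llm", {})
    override = llm.get("callSites", {}).get(call_site)
    if override is not None:
        return override
    return llm.get("default", {})


# Illustrative config: an expensive default and a single override.
config = {
    "llm": {
        "default": {"profile": "quality-optimized"},
        "callSites": {"mainAgent": {"profile": "balanced"}},
    }
}

for site in ("mainAgent", "memoryExtraction"):
    print(site, "->", resolve_call_site(config, site)["profile"])
```

In this model only `mainAgent` escapes the expensive default; every site that has no entry, such as `memoryExtraction`, inherits it.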

Step 1 — Understand current spend

# Weekly totals
assistant usage totals --range week

# Break down by call site (most useful — shows what's expensive)
assistant usage breakdown --group-by call_site --range week

# Break down by model
assistant usage breakdown --group-by model --range week

# Break down by profile
assistant usage breakdown --group-by inference_profile --range week

Check llm.default — if it's pointing at Opus, that's your biggest risk:

assistant config get llm.default
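
Once a per-call-site breakdown is in hand, ranking it by share of total spend shows where to act first. The sketch below assumes the breakdown has already been turned into a list of rows with `call_site` and `cost_usd` keys; that row shape, the helper name `rank_by_spend`, and the sample numbers are all hypothetical, since the CLI's output format is not shown here.

```python
def rank_by_spend(rows: list[dict]) -> list[tuple[str, float, float]]:
    """Sort rows by cost, highest first, and attach each row's share of the total."""
    total = sum(row["cost_usd"] for row in rows)
    ranked = sorted(rows, key=lambda row: row["cost_usd"], reverse=True)
    return [
        (row["call_site"], row["cost_usd"], row["cost_usd"] / total if total else 0.0)
        for row in ranked
    ]


# Made-up sample rows for illustration only, not real measurements.
sample_rows = [
    {"call_site": "conversationTitle", "cost_usd": 1.0},
    {"call_site": "mainAgent", "cost_usd": 6.0},
    {"call_site": "memoryExtraction", "cost_usd": 3.0},
]

for call_site, cost, share in rank_by_spend(sample_rows):
    print(f"{call_site:<20} ${cost:>6.2f}  {share:.0%}")
```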

Step 2 — Read current overrides

assistant config get llm.callSites
assistant config get llm.profiles
assistant inference providers connections list

Step 3 — Recommended profile assignment

| Profile | Call sites |
| --- | --- |
| `balanced` (Sonnet) | `mainAgent`, `subagentSpawn`, `compactionAgent`, `analyzeConversation`, `patternScan`, `narrativeRefinement`, `memoryRouter`, `memoryConsolidation` |
| `cost-optimized` (Haiku) | Everything else: memory extraction/retrieval, UI copy, classifiers, summarization, background tasks |
| `quality-optimized` (Opus) | Do not pin. Reserved for on-demand user escalation via `/model` |
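
The assignment rule in the table is simple enough to express as a function: a fixed set of call sites gets `balanced`, everything else gets `cost-optimized`, and nothing is pinned to `quality-optimized`. This is a minimal sketch; the helper name `assign_profiles` is hypothetical, and the call-site names are taken from the table.

```python
import json

# Call sites the table assigns to the balanced profile.
BALANCED_SITES = [
    "mainAgent",
    "subagentSpawn",
    "compactionAgent",
    "analyzeConversation",
    "patternScan",
    "narrativeRefinement",
    "memoryRouter",
    "memoryConsolidation",
]


def assign_profiles(call_sites: list[str]) -> dict[str, dict[str, str]]:
    """Map each call site to a profile reference following the table's rule."""
    balanced = set(BALANCED_SITES)
    return {
        site: {"profile": "balanced" if site in balanced else "cost-optimized"}
        for site in call_sites
    }


sites = ["mainAgent", "memoryConsolidation", "memoryExtraction", "conversationTitle"]
print(json.dumps(assign_profiles(sites), indent=2))
```

The result contains profile references only, which keeps it compatible with the guidance in Step 4 about never pinning a model directly.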

Step 4 — Config gotchas

⚠️ JSON object value replaces the entire block

assistant config set llm.callSites.<key> '{...}' with a JSON object replaces the entire llm.callSites block, not just that key.

✅ Single leaf value (safe): assistant config set llm.callSites.mainAgent.profile balanced
✅ Multiple / object values: always set llm.callSites as a single JSON blob (see Step 5)
❌ Never do: assistant config set llm.callSites.memoryExtraction '{"profile":"cost-optimized"}' — wipes all other overrides
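
To make the difference concrete, the following sketch models the two kinds of write on a plain dictionary. It is an assumption-laden simulation of the behavior described above, not the CLI's implementation; in particular, `set_object` assumes the block ends up containing only the key that was written.

```python
import copy


def set_leaf(call_sites: dict, site: str, field: str, value) -> dict:
    """Model a single leaf write: only one field of one call site changes."""
    updated = copy.deepcopy(call_sites)
    updated.setdefault(site, {})[field] = value
    return updated


def set_object(call_sites: dict, site: str, obj: dict) -> dict:
    """Model the gotcha as described: an object write replaces the whole block.

    Assumption: the surviving block holds only the key that was written.
    """
    return {site: dict(obj)}


before = {
    "mainAgent": {"profile": "balanced"},
    "recall": {"profile": "cost-optimized"},
}

print(sorted(set_leaf(before, "memoryExtraction", "profile", "cost-optimized")))
print(sorted(set_object(before, "memoryExtraction", {"profile": "cost-optimized"})))
```

The leaf write keeps all three call sites, while the object write leaves only `memoryExtraction`, so every other site would fall back to the default.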

⚠️ Always use profile references — never direct model

❌ Wrong (shows "Custom" with empty provider/model in UI, won't track profile updates):

assistant config set llm.callSites.memoryExtraction.model claude-haiku-4-5-20251001

✅ Correct (shows "Speed" in UI):

assistant config set llm.callSites.memoryExtraction.profile cost-optimized

Profile + tuning fields can coexist

profile sets provider/model/connection. You can still add effort, maxTokens, temperature, thinking, contextWindow alongside it:

{
  "profile": "cost-optimized",
  "maxTokens": 4096,
  "effort": "low",
  "temperature": 0,
  "thinking": { "enabled": false, "streamThinking": false }
}
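
One way to think about such an override is as a profile lookup followed by a merge of the tuning fields. The sketch below assumes that tuning fields on the call site win over whatever the profile carries; the profile contents (including the placeholder model id) and the helper name `effective_settings` are hypothetical.

```python
# Hypothetical profile table; the model id is a placeholder, not a real identifier.
PROFILES = {
    "cost-optimized": {
        "provider": "anthropic",
        "model": "<haiku-model-id>",
        "provider_connection": "anthropic-managed",
    }
}


def effective_settings(override: dict, profiles: dict) -> dict:
    """Start from the referenced profile, then layer the call site's tuning fields."""
    base = dict(profiles[override["profile"]])
    tuning = {key: value for key, value in override.items() if key != "profile"}
    return {**base, **tuning}


override = {
    "profile": "cost-optimized",
    "maxTokens": 4096,
    "effort": "low",
    "temperature": 0,
    "thinking": {"enabled": False, "streamThinking": False},
}

print(effective_settings(override, PROFILES))
```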

Step 5 — Apply the complete turnkey blob

This covers every known call site — nothing falls back to default. Copy, paste, apply:

assistant config set llm.callSites '{
  "mainAgent":                {"profile":"balanced"},
  "subagentSpawn":            {"profile":"balanced"},
  "compactionAgent":          {"profile":"balanced"},
  "analyzeConversation":      {"profile":"balanced"},
  "patternScan":              {"profile":"balanced"},
  "narrativeRefinement":      {"profile":"balanced"},
  "memoryRouter":             {"profile":"balanced","contextWindow":{"maxInputTokens":1000000}},

  "heartbeatAgent":           {"profile":"cost-optimized","maxTokens":2048,"effort":"low","temperature":0,"thinking":{"enabled":false,"streamThinking":false},"contextWindow":{"maxInputTokens":16000}},
  "filingAgent":              {"profile":"cost-optimized"},
  "callAgent":                {"profile":"cost-optimized"},
  "proactiveArtifactDecision":{"profile":"cost-optimized"},
  "proactiveArtifactBuild":   {"profile":"cost-optimized"},

  "memoryExtraction":         {"profile":"cost-optimized"},
  "memoryConsolidation":      {"profile":"balanced"},
  "memoryRetrieval":          {"profile":"cost-optimized"},
  "recall":                   {"profile":"cost-optimized","maxTokens":4096,"effort":"low","thinking":{"enabled":false,"streamThinking":false},"temperature":0},
  "memoryV2Migration":        {"profile":"cost-optimized"},
  "memoryV2Sweep":            {"profile":"cost-optimized"},
  "memoryV2Consolidation":    {"profile":"cost-optimized"},

  "conversationSummarization":{"profile":"cost-optimized"},
  "commitMessage":            {"profile":"cost-optimized","maxTokens":120,"temperature":0.2,"effort":"low","thinking":{"enabled":false}},

  "conversationStarters":     {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
  "replySuggestion":          {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
  "conversationTitle":        {"profile":"cost-optimized"},
  "identityIntro":            {"profile":"cost-optimized"},
  "emptyStateGreeting":       {"profile":"cost-optimized"},
  "guardianQuestionCopy":     {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
  "approvalCopy":             {"profile":"cost-optimized"},
  "approvalConversation":     {"profile":"cost-optimized"},
  "feedEventCopy":            {"profile":"cost-optimized"},
  "trustRuleSuggestion":      {"profile":"cost-optimized"},

  "notificationDecision":     {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
  "preferenceExtraction":     {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},

  "interactionClassifier":    {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
  "styleAnalyzer":            {"profile":"cost-optimized"},
  "inviteInstructionGenerator":{"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
  "skillCategoryInference":   {"profile":"cost-optimized","effort":"low","thinking":{"enabled":false}},
  "meetConsentMonitor":       {"profile":"cost-optimized"},
  "meetChatOpportunity":      {"profile":"cost-optimized"},
  "inference":                {"profile":"cost-optimized"}
}'

Then set the active (default) profile to balanced:

assistant config set llm.activeProfile balanced

This controls which profile the app shows as selected in the UI. It also matters because of a platform quirk: in the resolver, llm.activeProfile takes priority over llm.callSites.mainAgent, the reverse of every other call site. Setting both to balanced keeps them aligned.
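
The priority inversion can be sketched as follows. This is a guess at the shape of the logic based only on the description above, not the real resolver: the function name `resolve_profile` is hypothetical, and the fallback to `activeProfile` for sites without an override is an assumption.

```python
def resolve_profile(llm: dict, call_site: str):
    """Model the described quirk: activeProfile wins for mainAgent only."""
    override = llm.get("callSites", {}).get(call_site, {}).get("profile")
    active = llm.get("activeProfile")
    if call_site == "mainAgent":
        # Inverted order: the active profile is consulted first.
        return active or override
    # Every other call site: the override is consulted first.
    return override or active


# Deliberately misaligned settings to show the effect.
llm = {
    "activeProfile": "quality-optimized",
    "callSites": {
        "mainAgent": {"profile": "balanced"},
        "recall": {"profile": "cost-optimized"},
    },
}

print("mainAgent ->", resolve_profile(llm, "mainAgent"))
print("recall    ->", resolve_profile(llm, "recall"))
```

Under this model a stale `activeProfile` silently overrides the `mainAgent` pin, which is why the two values are kept identical.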

Then verify:

assistant config get llm.callSites
assistant config get llm.activeProfile
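
The verification can also be automated with a small audit over the call-site block. The sketch below assumes the block has been loaded into a Python dictionary (how the CLI prints it is not shown here) and that a list of expected call sites is supplied; the helper name `audit_call_sites` is hypothetical.

```python
def audit_call_sites(call_sites: dict, expected: list[str]) -> list[str]:
    """Return human-readable problems found in a call-site block."""
    problems = []
    for site in expected:
        if site not in call_sites:
            problems.append(f"{site}: no override, falls back to the default")
    for site, entry in call_sites.items():
        if "model" in entry:
            problems.append(f"{site}: pins a model directly")
        if "profile" not in entry:
            problems.append(f"{site}: no profile reference")
        elif entry["profile"] == "quality-optimized":
            problems.append(f"{site}: pinned to quality-optimized")
    return problems


# Illustrative block containing three of the mistakes this guide warns about.
block = {
    "mainAgent": {"profile": "balanced"},
    "memoryExtraction": {"model": "claude-haiku-4-5-20251001"},
    "recall": {"profile": "quality-optimized"},
}

for problem in audit_call_sites(block, ["mainAgent", "recall", "conversationTitle"]):
    print(problem)
```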

Step 6 — Escalation path (on-demand Opus)

Don't pin any call site to quality-optimized. Keep it available on demand through a session instead:

# User types /model quality-optimized in chat, or:
assistant inference session open quality-optimized --ttl 30m
assistant inference session list
assistant inference session close

If the user has a personal API key, wire it as a custom profile:

assistant keys set anthropic sk-ant-...

assistant inference providers connections create my-anthropic-key \
  --provider anthropic \
  --auth api_key \
  --credential credential/anthropic/api_key

assistant config set llm.profiles.opus-personal '{"provider":"anthropic","model":"claude-opus-4-7","label":"Opus (Personal)","provider_connection":"my-anthropic-key"}'
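
Hand-writing JSON inside shell quotes is easy to get wrong, so the profile command can be generated instead. This sketch only builds the command string shown above and does not run it; the helper name `profile_set_command` is hypothetical, and the profile fields are copied from the example.

```python
import json
import shlex


def profile_set_command(name: str, profile: dict) -> str:
    """Build the config-set command with the JSON payload safely shell-quoted."""
    payload = json.dumps(profile, separators=(",", ":"))
    return "assistant config set llm.profiles." + name + " " + shlex.quote(payload)


command = profile_set_command(
    "opus-personal",
    {
        "provider": "anthropic",
        "model": "claude-opus-4-7",
        "label": "Opus (Personal)",
        "provider_connection": "my-anthropic-key",
    },
)
print(command)
```

Because `shlex.quote` handles the quoting, labels that contain spaces, parentheses, or apostrophes still arrive as one argument.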

Step 7 — Verify and monitor

assistant usage totals --range today
assistant usage breakdown --group-by call_site --range today

If a specific call site degrades, bump just that one back to balanced:

# e.g. if memory extraction quality drops:
assistant config set llm.callSites.memoryExtraction.profile balanced

Reference: provider connections

assistant inference providers connections list
assistant inference providers connections get <name>
assistant inference providers connections create <name> --provider <p> --auth api_key --credential <vault-key>
assistant inference providers connections update <name> --auth platform
assistant inference providers connections delete <name>

Canonical connections seeded on every boot: anthropic-managed, openai-managed, gemini-managed (auth=platform, no key needed).

Reference: usage breakdown group-by values

call_site | model | inference_profile (the values used in Step 1)

Reference: usage time ranges

today | week | month | all, or an explicit --from/--to pair in epoch milliseconds
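
Epoch milliseconds are awkward to compute by hand. A minimal sketch using Python's standard library is shown below; the helper name `to_epoch_ms` and the dates are arbitrary choices for illustration.

```python
from datetime import datetime, timezone


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time surprises")
    return round(moment.timestamp() * 1000)


# Arbitrary one-week window in UTC.
start = datetime(2026, 5, 1, tzinfo=timezone.utc)
end = datetime(2026, 5, 8, tzinfo=timezone.utc)

print(f"--from {to_epoch_ms(start)} --to {to_epoch_ms(end)}")
```

Requiring timezone-aware values is a deliberate choice: a naive datetime would be interpreted in the machine's local time zone and shift the window.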
