
compensation-benchmarking

This skill evaluates and selects the optimal model tier for each agent role based on performance benchmarks. Use when asked to choose models for agents, compare model performance, or optimize model-to-role assignments. Also consider when new models are released. Suggest when the user assigns models to agents without benchmarking.


Overview

Install command

npx skills add https://github.com/mittuled/skill-os --skill compensation-benchmarking

Copy and paste this command into Claude Code to install the skill

Source

mittuled/skill-os

Stars: 1

Forks: 0

Updated: April 22, 2026 at 06:43

File Explorer

8 files

SKILL.md


name	compensation-benchmarking
description	This skill evaluates and selects the optimal model tier for each agent role based on performance benchmarks. Use when asked to choose models for agents, compare model performance, or optimize model-to-role assignments. Also consider when new models are released. Suggest when the user assigns models to agents without benchmarking.
department	agent-operations
agent	agent-configuration-manager
version	1.0.0
complexity	medium
related-skills	["409a-valuation-commissioner","annual-comp-review-runner","technical-skills-programme"]
triggers	["benchmark agent models","compare compute costs","model cost benchmarking","agent compensation review","compute cost comparison"]

compensation-benchmarking

Agent: Agent Configuration Manager

L2 Agent Configuration Manager (1x), responsible for model selection per agent, compute budget allocation, context window sizing, tool access policies, and API key management.

Department ethos: ideal-agent-operations.md

Skill Description

Evaluates and selects the optimal model tier for each agent role based on performance benchmarks and cost-effectiveness analysis.

When to Use

When provisioning new agents and the appropriate model tier needs to be selected.
When new model versions are released and existing assignments should be re-evaluated.
When an agent is underperforming and a model upgrade may resolve the issue.

Workflow

1. Define Benchmark Suite: Create role-specific benchmark tasks that test the capabilities each agent role requires -- reasoning depth, accuracy, speed, instruction following, and domain knowledge. Deliverable: benchmark suite per agent role.
2. Run Benchmarks: Test each candidate model against the benchmark suite for the target role. Record accuracy, latency, cost per token, and qualitative output assessment. Deliverable: benchmark results matrix.
3. Analyze Cost-Performance Trade-offs: Plot performance against cost for each model-role combination. Identify the efficient frontier -- models that offer the best performance at each price point. Deliverable: cost-performance analysis with recommendations.
4. Select and Assign: Choose the optimal model for each role based on performance requirements and budget constraints. Document the rationale for each selection. Deliverable: model assignment decisions with justification.
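The frontier and selection steps of the workflow can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the `BenchmarkResult` type, the function names, the model names, and every number below are hypothetical placeholders, and accuracy is assumed to be the only performance metric being traded off against cost.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkResult:
    model: str          # hypothetical model name
    accuracy: float     # fraction of role-specific tasks passed, 0..1
    cost_per_1k: float  # cost per 1,000 tokens, in any consistent currency


def efficient_frontier(results: List[BenchmarkResult]) -> List[BenchmarkResult]:
    """Return the results that no cheaper-or-equal result beats on accuracy."""
    frontier = []
    best_accuracy = float("-inf")
    # Sweep from cheapest to most expensive; on a cost tie, the more
    # accurate model comes first so the weaker one is dropped.
    for r in sorted(results, key=lambda r: (r.cost_per_1k, -r.accuracy)):
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


def select_model(results: List[BenchmarkResult],
                 min_accuracy: float,
                 max_cost_per_1k: float) -> Optional[BenchmarkResult]:
    """Pick the cheapest frontier model meeting the accuracy and budget limits."""
    for r in efficient_frontier(results):
        if r.accuracy >= min_accuracy and r.cost_per_1k <= max_cost_per_1k:
            return r
    return None  # no candidate satisfies both constraints


# Illustrative numbers only, not real benchmark data.
results = [
    BenchmarkResult("model-small", accuracy=0.72, cost_per_1k=0.2),
    BenchmarkResult("model-medium", accuracy=0.88, cost_per_1k=1.0),
    BenchmarkResult("model-medium-b", accuracy=0.85, cost_per_1k=1.5),
    BenchmarkResult("model-large", accuracy=0.93, cost_per_1k=5.0),
]

frontier_names = [r.model for r in efficient_frontier(results)]
choice = select_model(results, min_accuracy=0.85, max_cost_per_1k=2.0)
print(frontier_names)
print(choice.model if choice else "no model meets the constraints")
```

In this sample, `model-medium-b` falls off the frontier because `model-medium` is both cheaper and more accurate. A real evaluation would likely fold latency and the qualitative assessment into the comparison as well.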

Anti-Patterns

Benchmarking on generic tasks: Testing models with generic benchmarks (e.g., MMLU) instead of role-specific tasks. Why: generic benchmark scores do not reliably predict how a model performs on the specific tasks the agent will execute in production.
Always choosing the biggest model: Defaulting to the most capable model for every role regardless of task complexity. Why: overqualified models waste budget; a simple extraction task does not need a frontier reasoning model.
One-time benchmarking: Running benchmarks only at initial setup and never re-evaluating as models evolve. Why: model capabilities change with updates, and a model that was the best choice six months ago may since have been surpassed.
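One way to guard against one-time benchmarking is a simple staleness check that flags an assignment for re-evaluation. The sketch below is an assumption-laden illustration: the function name `needs_rebenchmark` and the 90-day default are hypothetical, and the skill itself does not prescribe a review interval.

```python
from datetime import date, timedelta
from typing import Optional


def needs_rebenchmark(last_run: date,
                      today: date,
                      latest_release: Optional[date] = None,
                      max_age: timedelta = timedelta(days=90)) -> bool:
    """Return True when a role's model assignment should be re-evaluated.

    The 90-day default is an arbitrary placeholder, not a recommendation
    from the skill.
    """
    # A candidate model released after the last benchmark run triggers a re-run.
    if latest_release is not None and latest_release > last_run:
        return True
    # Otherwise re-run once the previous results are older than max_age.
    return today - last_run > max_age


print(needs_rebenchmark(date(2026, 1, 1), date(2026, 2, 1)))
print(needs_rebenchmark(date(2026, 1, 1), date(2026, 2, 1),
                        latest_release=date(2026, 1, 15)))
```

The release-date check corresponds to the "new model versions are released" trigger under When to Use, while the age check covers gradual drift between releases.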

Output

On success: Produces a model assignment report containing benchmark results, cost-performance analysis, and model-to-role assignments with documented rationale. Delivered to the VP Agent Operations.

On failure: Reports which roles could not be benchmarked (missing test cases, unavailable models), what partial results were obtained, and what is needed to complete the evaluation.
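The two outcomes can be captured in a single report structure. The following is a rough sketch under stated assumptions: the class names, field names, role names, and sample values are all hypothetical, and the skill does not define a concrete report schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoleAssignment:
    role: str
    model: str      # hypothetical model name
    rationale: str  # documented reason for the selection


@dataclass
class BlockedRole:
    role: str
    reason: str  # e.g. "missing test cases" or "unavailable models"
    needed: str  # what is required to complete the evaluation


@dataclass
class AssignmentReport:
    assignments: List[RoleAssignment] = field(default_factory=list)
    blocked: List[BlockedRole] = field(default_factory=list)

    @property
    def status(self) -> str:
        # Any role that could not be benchmarked makes the run a failure,
        # even if partial results exist for other roles.
        return "failure" if self.blocked else "success"

    def summary(self) -> str:
        lines = [f"status: {self.status}"]
        for a in self.assignments:
            lines.append(f"{a.role} -> {a.model}: {a.rationale}")
        for b in self.blocked:
            lines.append(f"{b.role} not benchmarked ({b.reason}); needed: {b.needed}")
        return "\n".join(lines)


# Illustrative values only.
report = AssignmentReport(
    assignments=[RoleAssignment("extractor", "model-small",
                                "meets accuracy target at lowest cost")],
    blocked=[BlockedRole("planner", "missing test cases",
                         "role-specific benchmark suite")],
)
print(report.summary())
```

Keeping partial assignments alongside the blocked roles means a failed run still reports what was obtained, as the failure case describes.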

Related Skills

409a-valuation-commissioner -- ROI assessment validates whether model selections deliver expected value.
annual-comp-review-runner -- Annual review re-evaluates model assignments.
technical-skills-programme -- Model changes trigger retraining requirements.
